Printed from https://ideas.repec.org/a/plo/pcbi00/1012135.html

Leveraging conformal prediction to annotate enzyme function space with limited false positives

Author

Listed:
  • Kerr Ding
  • Jiaqi Luo
  • Yunan Luo

Abstract

Machine learning (ML) is increasingly being used to guide biological discovery in biomedicine, for example by prioritizing promising small molecules in drug discovery. In those applications, ML models are used to predict the properties of biological systems, and researchers use these predictions to prioritize candidates as new biological hypotheses for downstream experimental validation. However, when applied to unseen situations, these models can be overconfident and produce a large number of false positives. One solution to this issue is to quantify the model’s prediction uncertainty and provide a set of hypotheses with a controlled false discovery rate (FDR) pre-specified by researchers. We propose CPEC, an ML framework for FDR-controlled biological discovery. We demonstrate its effectiveness using enzyme function annotation as a case study, simulating the discovery process of identifying the functions of less-characterized enzymes. CPEC integrates a deep learning model with a statistical tool known as conformal prediction, providing accurate and FDR-controlled function predictions for a given protein enzyme. Conformal prediction provides rigorous statistical guarantees to the predictive model and ensures that the expected FDR will not exceed a user-specified level with high probability. Evaluation experiments show that CPEC achieves reliable FDR control, better or comparable prediction performance at a lower FDR than existing methods, and accurate predictions for enzymes under-represented in the training data. We expect CPEC to be a useful tool for biological discovery applications where a high yield rate in validation experiments is desired but the experimental budget is limited.

Author summary

Machine learning (ML) models are increasingly being applied as predictors to generate biological hypotheses and guide biological discovery. However, when applied to unseen situations, ML models can be overconfident and produce a large number of false positive predictions, forcing researchers to trade off high yield rates against limited budgets. One solution is to quantify the model’s prediction uncertainty and generate predictions at a controlled false discovery rate (FDR) pre-specified by researchers. Here, we introduce CPEC, an ML framework designed for FDR-controlled biological discovery. Using enzyme function prediction as a case study, we simulate the process of function discovery for less-characterized enzymes. Leveraging a statistical framework, conformal prediction, CPEC provides rigorous statistical guarantees that the FDR of the model predictions will not surpass a user-specified level with high probability. Our results suggested that CPEC achieved reliable FDR control for enzymes under-represented in the training data. In the broader context of biological discovery applications, CPEC can be applied to generate high-confidence hypotheses and guide researchers to allocate experimental resources to the validation of hypotheses that are more likely to succeed.
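
The abstract describes calibrating a predictor so that the false discovery rate of its outputs stays below a user-chosen level. The Python sketch below illustrates that general idea with a generic conformal-style threshold calibration on a held-out calibration set. It is not the CPEC implementation, which this page does not describe in detail. The function names, the threshold grid, the finite-sample correction `(n * risk + 1) / (n + 1)`, and the toy scores are all assumptions made for illustration.

```python
import numpy as np


def false_discovery_proportion(scores, labels, lam):
    """Per-sample FDP when every label with score >= lam is predicted."""
    labels = np.asarray(labels).astype(bool)
    pred = np.asarray(scores) >= lam
    false_pos = (pred & ~labels).sum(axis=1)
    n_pred = pred.sum(axis=1)
    # An empty prediction set makes no discoveries, so its FDP is taken as 0.
    return false_pos / np.maximum(n_pred, 1)


def calibrate_threshold(cal_scores, cal_labels, alpha, grid=None):
    """Return the most permissive threshold reached before the bound exceeds alpha.

    Thresholds are scanned from strict (1.0) to permissive (0.0) and the scan
    stops at the first violation. Returns None if even the strictest threshold
    violates the bound, which happens when alpha < 1 / (n + 1).
    """
    if grid is None:
        grid = np.linspace(1.0, 0.0, 101)  # hypothetical grid, step 0.01
    n = np.asarray(cal_scores).shape[0]
    chosen = None
    for lam in grid:
        risk = false_discovery_proportion(cal_scores, cal_labels, lam).mean()
        # Assumed finite-sample correction: treat one unseen sample as worst case (FDP = 1).
        bound = (n * risk + 1.0) / (n + 1)
        if bound > alpha:
            break
        chosen = float(lam)
    return chosen


# Toy calibration set (made-up numbers): 4 proteins, 2 candidate function labels.
cal_scores = np.array([[0.905, 0.205],
                       [0.805, 0.605],
                       [0.705, 0.105],
                       [0.955, 0.405]])
cal_labels = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])

threshold = calibrate_threshold(cal_scores, cal_labels, alpha=0.25)
print("calibrated threshold:", threshold)
```

At prediction time, a new protein would be annotated with every label whose score is at least the calibrated threshold, and would be left unannotated if no label qualifies. Because the false discovery proportion is not guaranteed to decrease monotonically as the threshold rises, this early-stopping scan should be read as a conservative heuristic rather than a proof of the guarantee stated in the abstract.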

Suggested Citation

  • Kerr Ding & Jiaqi Luo & Yunan Luo, 2024. "Leveraging conformal prediction to annotate enzyme function space with limited false positives," PLOS Computational Biology, Public Library of Science, vol. 20(5), pages 1-21, May.
  • Handle: RePEc:plo:pcbi00:1012135
    DOI: 10.1371/journal.pcbi.1012135

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012135
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1012135&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pcbi.1012135?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Karel Weg & Erinc Merdivan & Marie Piraud & Holger Gohlke, 2025. "TopEC: prediction of Enzyme Commission classes by 3D graph neural networks and localized 3D protein descriptor," Nature Communications, Nature, vol. 16(1), pages 1-16, December.
    2. Wang, Zixuan & Chen, Zijian & Wang, Boyuan & Wu, Chuang & Zhou, Chao & Peng, Yang & Zhang, Xinyu & Ni, Zongming & Chung, Chi-yung & Chan, Ching-chuen & Yang, Jian & Zhao, Haitao, 2025. "Digital manufacturing of perovskite materials and solar cells," Applied Energy, Elsevier, vol. 377(PB).
    3. Ziqi Gao & Chenran Jiang & Jiawen Zhang & Xiaosen Jiang & Lanqing Li & Peilin Zhao & Huanming Yang & Yong Huang & Jia Li, 2023. "Hierarchical graph learning for protein–protein interaction," Nature Communications, Nature, vol. 14(1), pages 1-12, December.
    4. Pat Langley, 2019. "Scientific discovery, causal explanation, and process model induction," Mind & Society: Cognitive Studies in Economics and Social Sciences, Springer;Fondazione Rosselli, vol. 18(1), pages 43-56, June.
    5. Yinghui Chen & Yunxin Xu & Di Liu & Yaoguang Xing & Haipeng Gong, 2024. "An end-to-end framework for the prediction of protein structure and fitness from single sequence," Nature Communications, Nature, vol. 15(1), pages 1-17, December.
    6. Nan Zheng & Yongchao Cai & Zehua Zhang & Huimin Zhou & Yu Deng & Shuang Du & Mai Tu & Wei Fang & Xiaole Xia, 2025. "Tailoring industrial enzymes for thermostability and activity evolution by the machine learning-based iCASE strategy," Nature Communications, Nature, vol. 16(1), pages 1-13, December.
    7. Ziyi Zhou & Liang Zhang & Yuanxi Yu & Banghao Wu & Mingchen Li & Liang Hong & Pan Tan, 2024. "Enhancing efficiency of protein language models with minimal wet-lab data through few-shot learning," Nature Communications, Nature, vol. 15(1), pages 1-13, December.
    8. Stefanie Duller & Simone Vrbancic & Łukasz Szydłowski & Alexander Mahnert & Marcus Blohs & Michael Predl & Christina Kumpitsch & Verena Zrim & Christoph Högenauer & Tomasz Kosciolek & Ruth A. Schmitz , 2024. "Targeted isolation of Methanobrevibacter strains from fecal samples expands the cultivated human archaeome," Nature Communications, Nature, vol. 15(1), pages 1-16, December.
    9. Filippo Caschera & Gianluca Gazzola & Mark A Bedau & Carolina Bosch Moreno & Andrew Buchanan & James Cawse & Norman Packard & Martin M Hanczyc, 2010. "Automated Discovery of Novel Drug Formulations Using Predictive Iterated High Throughput Experimentation," PLOS ONE, Public Library of Science, vol. 5(1), pages 1-8, January.
    10. Yaan J. Jang & Qi-Qi Qin & Si-Yu Huang & Arun T. John Peter & Xue-Ming Ding & Benoît Kornmann, 2024. "Accurate prediction of protein function using statistics-informed graph networks," Nature Communications, Nature, vol. 15(1), pages 1-12, December.
    11. Kerr Ding & Michael Chin & Yunlong Zhao & Wei Huang & Binh Khanh Mai & Huanan Wang & Peng Liu & Yang Yang & Yunan Luo, 2024. "Machine learning-guided co-optimization of fitness and diversity facilitates combinatorial library design in enzyme engineering," Nature Communications, Nature, vol. 15(1), pages 1-13, December.
    12. Zixuan Fan & Yan Xu, 2024. "Predicting the Functional Changes in Protein Mutations Through the Application of BiLSTM and the Self-Attention Mechanism," Annals of Data Science, Springer, vol. 11(3), pages 1077-1094, June.
    13. Erevelles, Sunil & Fukawa, Nobuyuki & Swayne, Linda, 2016. "Big Data consumer analytics and the transformation of marketing," Journal of Business Research, Elsevier, vol. 69(2), pages 897-904.
    14. Samuel Miravet-Verde & Rocco Mazzolini & Carolina Segura-Morales & Alicia Broto & Maria Lluch-Senar & Luis Serrano, 2024. "ProTInSeq: transposon insertion tracking by ultra-deep DNA sequencing to identify translated large and small ORFs," Nature Communications, Nature, vol. 15(1), pages 1-17, December.
    15. Julia Koehler Leman & Pawel Szczerbiak & P. Douglas Renfrew & Vladimir Gligorijevic & Daniel Berenberg & Tommi Vatanen & Bryn C. Taylor & Chris Chandler & Stefan Janssen & Andras Pataki & Nick Carrier, 2023. "Sequence-structure-function relationships in the microbial protein universe," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    16. Marco Malatesta & Emanuele Fornasier & Martino Luigi Salvo & Angela Tramonti & Erika Zangelmi & Alessio Peracchi & Andrea Secchi & Eugenia Polverini & Gabriele Giachin & Roberto Battistutta & Roberto , 2024. "One substrate many enzymes virtual screening uncovers missing genes of carnitine biosynthesis in human and mouse," Nature Communications, Nature, vol. 15(1), pages 1-16, December.
    17. Shunshi Kohyama & Béla P. Frohn & Leon Babl & Petra Schwille, 2024. "Machine learning-aided design and screening of an emergent protein function in synthetic cells," Nature Communications, Nature, vol. 15(1), pages 1-14, December.
    18. Steve O'Hagan & Joshua Knowles & Douglas B Kell, 2012. "Exploiting Genomic Knowledge in Optimising Molecular Breeding Programmes: Algorithms from Evolutionary Computing," PLOS ONE, Public Library of Science, vol. 7(11), pages 1-14, November.
    19. William Mo & Christopher A. Vaiana & Chris J. Myers, 2024. "The need for adaptability in detection, characterization, and attribution of biosecurity threats," Nature Communications, Nature, vol. 15(1), pages 1-9, December.
