A systematic analysis of regression models for protein engineering

Authors

  • Richard Michael
  • Jacob Kæstel-Hansen
  • Peter Mørch Groth
  • Simon Bartels
  • Jesper Salomon
  • Pengfei Tian
  • Nikos S Hatzakis
  • Wouter Boomsma

Abstract

To optimize proteins for particular traits holds great promise for industrial and pharmaceutical purposes. Machine Learning is increasingly applied in this field to predict properties of proteins, thereby guiding the experimental optimization process. A natural question is: How much progress are we making with such predictions, and how important is the choice of regressor and representation? In this paper, we demonstrate that different assessment criteria for regressor performance can lead to dramatically different conclusions, depending on the choice of metric, and how one defines generalization. We highlight the fundamental issues of sample bias in typical regression scenarios and how this can lead to misleading conclusions about regressor performance. Finally, we make the case for the importance of calibrated uncertainty in this domain.

Author summary: Supervised machine learning is increasingly used to predict the function and properties of proteins. The performance obtained with these methods relies on a multitude of factors including how data is represented, how observations are distributed, how training is conducted, and how performance is measured. In this paper, we systematically assess the importance of these different components in a protein regression pipeline. We discuss the benefits of using representations extracted from protein language models, the impact of the choice of regression algorithm, and the role of uncertainty. Finally, to avoid misleading performance claims, we stress the need for carefully aligning the train/test setup to reflect the setting in which the prediction algorithm will ultimately be applied.

Suggested Citation

  • Richard Michael & Jacob Kæstel-Hansen & Peter Mørch Groth & Simon Bartels & Jesper Salomon & Pengfei Tian & Nikos S Hatzakis & Wouter Boomsma, 2024. "A systematic analysis of regression models for protein engineering," PLOS Computational Biology, Public Library of Science, vol. 20(5), pages 1-22, May.
  • Handle: RePEc:plo:pcbi00:1012061
    DOI: 10.1371/journal.pcbi.1012061

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012061
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1012061&type=printable
    Download Restriction: no

