Author
Listed:
- Shrishti Barethiya
- Jian Huang
- Clarice Stumpf
- Xiao Liu
- Hui Guan
- Jianhan Chen
Abstract
Understanding the protein sequence-to-function relationship is crucial for studies of genetic diseases, protein evolution, and protein engineering. This relationship is inherently complex because of multi-site, high-dimensional correlations and structural dynamics. Deep learning algorithms such as (graph) convolutional neural networks and, more recently, transformers have become very popular for learning the protein sequence-to-function mapping from deep mutational scanning data and available structures. However, it remains very challenging for these models to extrapolate accurately when predicting the functional effects of variants with positions or mutation types not seen in the training data. We propose that incorporating the physics of protein interactions and dynamics can be an effective approach to overcoming these extrapolation limitations. Specifically, we demonstrate that biophysics-based modeling can be used to quantify the energetic effects of mutations, and that incorporating these physical energetics directly within convolutional and graph convolutional neural networks can significantly improve positional and mutational extrapolation compared with models without biophysics-inspired features. Our results support the effectiveness of leveraging physical knowledge to overcome the limitation of data scarcity.
Author summary
Deep learning has fundamentally transformed science and research in recent years. Yet many problems in biophysics and biochemistry remain inaccessible to traditional deep learning because large training datasets are lacking. Incorporating physical principles in machine learning is arguably required to overcome data scarcity. In this work, we examine the effectiveness of incorporating biophysics-based features in deriving more reliable predictors of the effects of sequence variants on protein function. Our results show that including the energetics of mutational effects on protein stability can significantly improve the ability of machine learning models to predict novel mutations not seen in the training data set, especially mutations at novel sequence positions. Incorporating sequence evolutionary information from pre-trained protein large language models could further improve the predictive power. Our work thus provides an efficient framework for training better variant effect predictors from deep mutational scanning datasets. The resulting predictors can aid protein engineering and help prioritize the study of genetic variations in diseases.
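The extrapolation setting described in the abstract can be illustrated with a minimal sketch of how a deep mutational scanning dataset might be split so that the test set contains positions or mutations absent from training. This is not the authors' procedure: the variant label format (comma-separated substitutions such as "K2E,A1G"), the function names, and the rule that a variant is assigned to the test set when any of its substitutions belongs to a held-out group are all assumptions made for illustration.

```python
import random


def parse_substitution(sub):
    """Split a label like 'A24G' into (wild_type, position, mutant)."""
    return sub[0], int(sub[1:-1]), sub[-1]


def by_position(sub):
    """Group key for positional extrapolation: the sequence position."""
    return parse_substitution(sub)[1]


def by_mutation(sub):
    """Group key for mutational extrapolation: the exact substitution."""
    return sub


def extrapolation_split(variants, key, test_fraction=0.2, seed=0):
    """Hold out a random subset of groups (positions or substitutions).

    A variant goes to the test set if ANY of its substitutions falls in a
    held-out group, so the training set never sees a held-out group.
    (Hypothetical assignment rule, chosen for simplicity.)
    """
    groups = sorted({key(s) for v in variants for s in v.split(",")})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_held = max(1, int(len(groups) * test_fraction))
    held_out = set(groups[:n_held])

    train, test = [], []
    for v in variants:
        touches_held_out = any(key(s) in held_out for s in v.split(","))
        (test if touches_held_out else train).append(v)
    return train, test, held_out


# Toy variant labels, invented for illustration only.
variants = ["A1G", "A1C", "K2R", "K2E,A1G", "L3V", "L3I", "M4T", "M4V,L3V", "S5T"]
train, test, held_out = extrapolation_split(variants, by_position, test_fraction=0.4, seed=1)
```

Passing `by_mutation` instead of `by_position` holds out individual substitutions rather than whole positions. A random split of the same variants would usually leave every position represented in training, which is why it tends to overstate how well a model generalizes.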
Suggested Citation
Shrishti Barethiya & Jian Huang & Clarice Stumpf & Xiao Liu & Hui Guan & Jianhan Chen, 2026.
"Overcoming extrapolation challenges of deep learning by incorporating physics in protein sequence-function modeling,"
PLOS Computational Biology, Public Library of Science, vol. 22(3), pages 1-23, March.
Handle:
RePEc:plo:pcbi00:1013728
DOI: 10.1371/journal.pcbi.1013728