Authors
Listed:
- Lukas Huber (ETH Zurich)
- Tim Kucera (ETH Zurich; Swiss Institute of Bioinformatics; Max Planck Institute of Biochemistry)
- Simon Höllerer (ETH Zurich)
- Karsten Borgwardt (ETH Zurich; Swiss Institute of Bioinformatics; Max Planck Institute of Biochemistry)
- Sven Panke (ETH Zurich)
- Markus Jeschek (ETH Zurich; University of Regensburg; École Polytechnique Fédérale de Lausanne (EPFL))
Abstract
Protein engineering has recently been transformed by machine learning (ML) tools that predict structure from sequence with unprecedented precision. Predicting catalytic activity, however, remains challenging, which restricts our ability to design protein sequences with a desired catalytic function in silico. This predicament is mainly rooted in a lack of experimental methods capable of recording sequence-activity data in quantities sufficient for data-intensive ML techniques, and in the inefficiency of searches in the enormous sequence spaces inherent to proteins. Herein, we address both limitations in the context of engineering proteases with tailored substrate specificity. We introduce a DNA recorder for deep specificity profiling of proteases in Escherichia coli and demonstrate it by testing 29,716 candidate proteases against up to 134 substrates in parallel. The resulting sequence-activity data on approximately 600,000 protease-substrate pairs not only reveal key sequence determinants governing protease specificity, but also allow us to build a data-efficient deep learning model that accurately predicts protease sequences with desired on- and off-target activities. Moreover, we present epistasis-aware training set design as a generalizable strategy to streamline searches within enormous sequence spaces. This strategy strongly increases model accuracy at a given experimental effort and is thus likely to have implications for protein engineering far beyond proteases.
Suggested Citation
Lukas Huber & Tim Kucera & Simon Höllerer & Karsten Borgwardt & Sven Panke & Markus Jeschek, 2025.
"Data-driven protease engineering by DNA-recording and epistasis-aware machine learning,"
Nature Communications, Nature, vol. 16(1), pages 1-15, December.
Handle:
RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-60622-7
DOI: 10.1038/s41467-025-60622-7