Designing diverse and high-performance proteins with a large language model in the loop

My bibliography Save this article

Designing diverse and high-performance proteins with a large language model in the loop

Author

Listed:

Carlos A Gomez-Uribe
Japheth Gado
Meiirbek Islamov

Registered:

Abstract

We present a protein engineering approach to directed evolution with machine learning that integrates a new semi-supervised neural network fitness prediction model, Seq2Fitness, and an innovative optimization algorithm, biphasic annealing for diverse and adaptive sequence sampling (BADASS) to design sequences. Seq2Fitness leverages protein language models to predict fitness landscapes, combining evolutionary data with experimental labels, while BADASS efficiently explores these landscapes by dynamically adjusting temperature and mutation energies to prevent premature convergence and to generate diverse high-fitness sequences. Compared to alternative models, Seq2Fitness improves Spearman correlation with experimental fitness measurements, increasing from 0.34 to 0.55 on sequences containing mutations at positions entirely not seen during training. BADASS requires less memory and computation compared to gradient-based Markov Chain Monte Carlo methods, while generating more high-fitness and diverse sequences across two protein families. For both families, 100% of the top 10,000 sequences identified by BADASS exceed the wildtype in predicted fitness, whereas competing methods range from 3% to 99%, often producing far fewer than 10,000 sequences. BADASS also finds higher-fitness sequences at every cutoff (top 1, 100, and 10,000). Additionally, we provide a theoretical framework explaining BADASS’s underlying mechanism and behavior. While we focus on amino acid sequences, BADASS may generalize to other sequence spaces, such as DNA and RNA.Author summary: Designing proteins with enhanced properties is essential for many applications, from industrial enzymes to therapeutic molecules. However, traditional protein engineering methods often fail to explore the vast sequence space effectively, partly due to the rarity of high-fitness sequences. In this work, we introduce BADASS, an optimization algorithm that samples sequences from a probability distribution with mutation energies and a temperature parameter that are updated dynamically, alternating between cooling and heating phases, to discover high-fitness proteins while maintaining sequence diversity. This stands in contrast to traditional approaches like simulated annealing, which often converge on fewer and lower fitness solutions, and gradient-based Markov Chain Monte Carlo (MCMC), also converging on lower fitness solutions and at a significantly higher computational and memory cost. Our approach requires only forward model evaluations and no gradient computations, enabling the rapid design of high-performing proteins that can be validated in the lab, especially when combined with our Seq2Fitness models. BADASS represents a significant advancement in computational protein engineering, opening new possibilities for diverse applications. Our code is publicly available at https://github.com/SoluLearn/BADASS.

Suggested Citation

Carlos A Gomez-Uribe & Japheth Gado & Meiirbek Islamov, 2025. "Designing diverse and high-performance proteins with a large language model in the loop," PLOS Computational Biology, Public Library of Science, vol. 21(6), pages 1-21, June.

Handle: RePEc:plo:pcbi00:1013119
DOI: 10.1371/journal.pcbi.1013119

Download full text from publisher

More about this item

Statistics

Access and download statistics

Corrections

All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:plo:pcbi00:1013119. See general information about how to correct material in RePEc.

If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

We have no bibliographic references for this item. You can help adding them by using this form .

If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: ploscompbiol (email available below). General contact details of provider: https://journals.plos.org/ploscompbiol/ .

Please note that corrections may take a couple of weeks to filter through the various RePEc services.

IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.

Browse Econ Literature

More features

Designing diverse and high-performance proteins with a large language model in the loop

Author

Abstract

Suggested Citation

Download full text from publisher

More about this item

Statistics

Corrections

More services and features

MyIDEAS

Author registration

Rankings

RePEc Genealogy

RePEc Biblio

MPRA

New papers by email

EconAcademics

Plagiarism

About RePEc

RePEc home

Blog

Help/FAQ

RePEc team

Participating archives

Privacy statement

Help us

Corrections

Volunteers

Get papers listed

Open a RePEc archive

Get RePEc data