
The generative capacity of probabilistic protein sequence models

Author

Listed:
  • Francisco McGee (Temple University)
  • Sandro Hauri (Temple University)
  • Quentin Novinger (Temple University)
  • Slobodan Vucetic (Temple University)
  • Ronald M. Levy (Temple University)
  • Vincenzo Carnevale (Temple University)
  • Allan Haldane (Temple University)

Abstract

Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict mutation effects. Despite encouraging results, current model evaluation metrics leave unclear whether GPSMs faithfully reproduce the complex multi-residue mutational patterns observed in natural sequences due to epistasis. Here, we develop a set of sequence statistics to assess the “generative capacity” of three current GPSMs: the pairwise Potts Hamiltonian, the VAE, and the site-independent model. We show that the Potts model’s generative capacity is largest, as the higher-order mutational statistics generated by the model agree with those observed for natural sequences, while the VAE’s lies between the Potts and site-independent models. Importantly, our work provides a new framework for evaluating and interpreting GPSM accuracy which emphasizes the role of higher-order covariation and epistasis, with broader implications for probabilistic sequence models in general.
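
The contrast the abstract draws between the site-independent model and models that capture covariation can be illustrated with a minimal sketch. It is not the authors' method or their statistics: the toy alignment, the function names (`site_frequencies`, `sample_site_independent`, `covariation_score`), and the summary score are all assumptions made for illustration. The sketch fits a site-independent model to a tiny alignment in which two columns are perfectly coupled, samples from it, and compares a simple pairwise covariation measure, C_ij(a,b) = f_ij(a,b) - f_i(a) f_j(b), between the original and generated sequences.

```python
import random
from collections import Counter


def site_frequencies(msa):
    """Per-column residue frequencies f_i(a) of an alignment."""
    n = len(msa)
    return [
        {a: c / n for a, c in Counter(seq[i] for seq in msa).items()}
        for i in range(len(msa[0]))
    ]


def sample_site_independent(freqs, n_samples, rng):
    """Draw sequences by sampling every column independently from f_i(a)."""
    return [
        "".join(
            rng.choices(list(f.keys()), weights=list(f.values()))[0]
            for f in freqs
        )
        for _ in range(n_samples)
    ]


def covariation_score(msa, i, j):
    """Sum over residue pairs of |f_ij(a,b) - f_i(a) f_j(b)| for columns i, j."""
    n = len(msa)
    fi = Counter(seq[i] for seq in msa)
    fj = Counter(seq[j] for seq in msa)
    fij = Counter((seq[i], seq[j]) for seq in msa)
    return sum(
        abs(fij[(a, b)] / n - (fi[a] / n) * (fj[b] / n))
        for a in fi
        for b in fj
    )


# Hypothetical toy alignment: columns 0 and 1 always mutate together,
# column 2 varies independently of both.
natural = ["AAG", "CCG", "AAT", "CCT"] * 50

rng = random.Random(0)  # fixed seed so the sketch is reproducible
generated = sample_site_independent(site_frequencies(natural), 5000, rng)

print("natural   C(0,1):", covariation_score(natural, 0, 1))
print("generated C(0,1):", covariation_score(generated, 0, 1))
print("generated f_0(A):", site_frequencies(generated)[0]["A"])
```

In this toy case the generated sequences reproduce the single-site frequencies closely, while the covariation between the coupled columns drops to sampling noise. Measuring this kind of gap at the pairwise level and at higher orders is the motivation for the "generative capacity" statistics described in the abstract.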

Suggested Citation

  • Francisco McGee & Sandro Hauri & Quentin Novinger & Slobodan Vucetic & Ronald M. Levy & Vincenzo Carnevale & Allan Haldane, 2021. "The generative capacity of probabilistic protein sequence models," Nature Communications, Nature, vol. 12(1), pages 1-14, December.
  • Handle: RePEc:nat:natcom:v:12:y:2021:i:1:d:10.1038_s41467-021-26529-9
    DOI: 10.1038/s41467-021-26529-9

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-021-26529-9
    File Function: Abstract
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Shunshi Kohyama & Béla P. Frohn & Leon Babl & Petra Schwille, 2024. "Machine learning-aided design and screening of an emergent protein function in synthetic cells," Nature Communications, Nature, vol. 15(1), pages 1-14, December.
    2. Yasser Roudi & Sheila Nirenberg & Peter E Latham, 2009. "Pairwise Maximum Entropy Models for Studying Large Biological Systems: When They Can Work and When They Can't," PLOS Computational Biology, Public Library of Science, vol. 5(5), pages 1-18, May.
    3. Erik van Nimwegen, 2016. "Inferring Contacting Residues within and between Proteins: What Do the Probabilities Mean?," PLOS Computational Biology, Public Library of Science, vol. 12(5), pages 1-10, May.
    4. Jennifer L Lahti & Adam P Silverman & Jennifer R Cochran, 2009. "Interrogating and Predicting Tolerated Sequence Diversity in Protein Folds: Application to E. elaterium Trypsin Inhibitor-II Cystine-Knot Miniprotein," PLOS Computational Biology, Public Library of Science, vol. 5(9), pages 1-15, September.
    5. Amir Pandi & David Adam & Amir Zare & Van Tuan Trinh & Stefan L. Schaefer & Marie Burt & Björn Klabunde & Elizaveta Bobkova & Manish Kushwaha & Yeganeh Foroughijabbari & Peter Braun & Christoph Spahn, 2023. "Cell-free biosynthesis combined with deep learning accelerates de novo-development of antimicrobial peptides," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    6. Tiberiu Teşileanu & Lucy J Colwell & Stanislas Leibler, 2015. "Protein Sectors: Statistical Coupling Analysis versus Conservation," PLOS Computational Biology, Public Library of Science, vol. 11(2), pages 1-20, February.
    7. Shou-Wen Wang & Anne-Florence Bitbol & Ned S Wingreen, 2019. "Revealing evolutionary constraints on proteins through sequence analysis," PLOS Computational Biology, Public Library of Science, vol. 15(4), pages 1-16, April.
    8. Hugo Jacquin & Amy Gilson & Eugene Shakhnovich & Simona Cocco & Rémi Monasson, 2016. "Benchmarking Inverse Statistical Approaches for Protein Structure and Design with Exactly Solvable Models," PLOS Computational Biology, Public Library of Science, vol. 12(5), pages 1-18, May.
    9. Cheyenne Ziegler & Jonathan Martin & Claude Sinner & Faruck Morcos, 2023. "Latent generative landscapes as maps of functional diversity in protein sequence space," Nature Communications, Nature, vol. 14(1), pages 1-15, December.
    10. Erika Erickson & Japheth E. Gado & Luisana Avilán & Felicia Bratti & Richard K. Brizendine & Paul A. Cox & Raj Gill & Rosie Graham & Dong-Jin Kim & Gerhard König & William E. Michener & Saroj Poudel &, 2022. "Sourcing thermotolerant poly(ethylene terephthalate) hydrolase scaffolds from natural diversity," Nature Communications, Nature, vol. 13(1), pages 1-15, December.
    11. Xu, Xiu-Lian & Shi, Jin-Xuan & Wang, Jun & Li, Wenfei, 2021. "Long-range correlation and critical fluctuations in coevolution networks of protein sequences," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 562(C).
    12. Umberto Lupo & Damiano Sgarbossa & Anne-Florence Bitbol, 2022. "Protein language models trained on multiple sequence alignments learn phylogenetic relationships," Nature Communications, Nature, vol. 13(1), pages 1-11, December.
