
CodonTransformer: a multispecies codon optimizer using context-aware neural networks

Author

Listed:
  • Adibvafa Fallahpour

    (Vector Institute for Artificial Intelligence
    University of Toronto Scarborough; Department of Biological Science)

  • Vincent Gureghian

    (Quantitative and Synthetic Biology
    Institut de Biologie Paris-Seine)

  • Guillaume J. Filion

    (University of Toronto Scarborough; Department of Biological Science)

  • Ariel B. Lindner

    (Quantitative and Synthetic Biology
    Institut de Biologie Paris-Seine
    Biofoundry Alliance Sorbonne Université)

  • Amir Pandi

    (Quantitative and Synthetic Biology
    Institut de Biologie Paris-Seine
    Biofoundry Alliance Sorbonne Université)

Abstract

Degeneracy in the genetic code allows many possible DNA sequences to encode the same protein. Optimizing codon usage within a sequence to meet organism-specific preferences faces combinatorial explosion. Nevertheless, natural sequences optimized through evolution provide a rich source of data for machine learning algorithms to explore the underlying rules. Here, we introduce CodonTransformer, a multispecies deep learning model trained on over 1 million DNA-protein pairs from 164 organisms spanning all domains of life. The model demonstrates context-awareness thanks to its Transformer architecture and to our sequence representation strategy that combines organism, amino acid, and codon encodings. CodonTransformer generates host-specific DNA sequences with natural-like codon distribution profiles and with minimum negative cis-regulatory elements. This work introduces the strategy of Shared Token Representation and Encoding with Aligned Multi-masking (STREAM) and provides a codon optimization framework with a customizable open-access model and a user-friendly Google Colab interface.
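
The combinatorial explosion mentioned in the abstract can be made concrete with a small sketch. The snippet below is not taken from the paper or from the CodonTransformer code base. It assumes the standard genetic code, and the names `SYNONYMOUS_CODONS` and `count_encodings` are hypothetical, introduced only for illustration. It counts how many distinct DNA sequences encode a given protein by multiplying the number of synonymous codons at each position.

```python
from math import prod

# Number of synonymous codons per amino acid in the standard genetic code.
# "*" stands for the stop signal. (Illustrative table, not from the paper.)
SYNONYMOUS_CODONS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4, "*": 3,
}


def count_encodings(protein: str) -> int:
    """Return the number of DNA sequences that encode `protein`.

    Each position contributes independently, so the total is the product
    of the per-residue codon counts.
    """
    return prod(SYNONYMOUS_CODONS[aa] for aa in protein)


if __name__ == "__main__":
    # A five-residue peptide: M (1) * L (6) * S (6) * R (6) * stop (3)
    print(count_encodings("MLSR*"))  # 648
    # A 100-residue poly-leucine stretch already has 6**100 encodings
    print(count_encodings("L" * 100) == 6 ** 100)  # True
```

Because the count grows multiplicatively with protein length, exhaustive search over candidate sequences is impractical for typical proteins, which is the motivation the abstract gives for learning codon choice from natural sequences.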

Suggested Citation

  • Adibvafa Fallahpour & Vincent Gureghian & Guillaume J. Filion & Ariel B. Lindner & Amir Pandi, 2025. "CodonTransformer: a multispecies codon optimizer using context-aware neural networks," Nature Communications, Nature, vol. 16(1), pages 1-12, December.
  • Handle: RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-58588-7
    DOI: 10.1038/s41467-025-58588-7

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-025-58588-7
    File Function: Abstract
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Timothy Atkinson & Thomas D. Barrett & Scott Cameron & Bora Guloglu & Matthew Greenig & Charlie B. Tan & Louis Robinson & Alex Graves & Liviu Copoiu & Alexandre Laterre, 2025. "Protein sequence modelling with Bayesian flow networks," Nature Communications, Nature, vol. 16(1), pages 1-14, December.
    2. Sophia Vincoff & Shrey Goel & Kseniia Kholina & Rishab Pulugurta & Pranay Vure & Pranam Chatterjee, 2025. "FusOn-pLM: a fusion oncoprotein-specific language model via adjusted rate masking," Nature Communications, Nature, vol. 16(1), pages 1-11, December.
    3. Lucien F. Krapp & Fernando A. Meireles & Luciano A. Abriata & Jean Devillard & Sarah Vacle & Maria J. Marcaida & Matteo Dal Peraro, 2024. "Context-aware geometric deep learning for protein sequence design," Nature Communications, Nature, vol. 15(1), pages 1-10, December.
    4. Gi Bae Kim & Ha Rim Kim & Sang Yup Lee, 2025. "Comprehensive evaluation of the capacities of microbial cell factories," Nature Communications, Nature, vol. 16(1), pages 1-15, December.
    5. William Mo & Christopher A. Vaiana & Chris J. Myers, 2024. "The need for adaptability in detection, characterization, and attribution of biosecurity threats," Nature Communications, Nature, vol. 15(1), pages 1-9, December.
    6. Amato, Umberto & Antoniadis, Anestis & De Feis, Italia & Goude, Yannig & Lagache, Audrey, 2021. "Forecasting high resolution electricity demand data with additive models including smooth and jagged components," International Journal of Forecasting, Elsevier, vol. 37(1), pages 171-185.
    7. Mastroeni, Loretta & Mazzoccoli, Alessandro & Quaresima, Greta & Vellucci, Pierluigi, 2021. "Decoupling and recoupling in the crude oil price benchmarks: An investigation of similarity patterns," Energy Economics, Elsevier, vol. 94(C).
    8. Palistha Shrestha & Jeevan Kandel & Hilal Tayara & Kil To Chong, 2024. "Post-translational modification prediction via prompt-based fine-tuning of a GPT-2 model," Nature Communications, Nature, vol. 15(1), pages 1-13, December.
    9. Christoph J. Borner & Ingo Hoffmann & Jonas Krettek & Lars M. Kurzinger & Tim Schmitz, 2021. "Bitcoin: Like a Satellite or Always Hardcore? A Core-Satellite Identification in the Cryptocurrency Market," Papers 2105.12336, arXiv.org.
    10. Wei Yang & Derrick R. Hicks & Agnidipta Ghosh & Tristin A. Schwartze & Brian Coventry & Inna Goreshnik & Aza Allen & Samer F. Halabiya & Chan Johng Kim & Cynthia S. Hinck & David S. Lee & Asim K. Ber, 2025. "Design of high-affinity binders to immune modulating receptors for cancer immunotherapy," Nature Communications, Nature, vol. 16(1), pages 1-12, December.
    11. Vatsa, Puneet & Miljkovic, Tatjana & Miljkovic, Dragan, 2024. "Price discovery redux—Analyzing energy spot and futures prices using a dynamic programming approach," Energy Economics, Elsevier, vol. 140(C).
    12. Hanjo Odendaal & Monique Reid & Johann F. Kirsten, 2020. "Media‐Based Sentiment Indices as an Alternative Measure of Consumer Confidence," South African Journal of Economics, Economic Society of South Africa, vol. 88(4), pages 409-434, December.
    13. Krzysztof Dmytrow & Beata Bieszk-Stolorz, 2021. "Comparison of changes in the labour markets of post-communist countries with other EU member states," Equilibrium. Quarterly Journal of Economics and Economic Policy, Institute of Economic Research, vol. 16(4), pages 741-764, December.
    14. Yangchen Di & Mingyue Lu & Min Chen & Zhangjian Chen & Zaiyang Ma & Manzhu Yu, 2022. "A quantitative method for the similarity assessment of typhoon tracks," Natural Hazards: Journal of the International Society for the Prevention and Mitigation of Natural Hazards, Springer;International Society for the Prevention and Mitigation of Natural Hazards, vol. 112(1), pages 587-602, May.
    15. Meaghan S. Jankowski & Daniel Griffith & Divya G. Shastry & Jacqueline F. Pelham & Garrett M. Ginell & Joshua Thomas & Pankaj Karande & Alex S. Holehouse & Jennifer M. Hurley, 2024. "Disordered clock protein interactions and charge blocks turn an hourglass into a persistent circadian oscillator," Nature Communications, Nature, vol. 15(1), pages 1-18, December.
    16. Ying Huang & Chenyang Xue & Ruiqian Bu & Cang Wu & Jiachen Li & Jinqiu Zhang & Jinyu Chen & Zhaoying Shi & Yonglong Chen & Yong Wang & Zhongmin Liu, 2024. "Inhibition and transport mechanisms of the ABC transporter hMRP5," Nature Communications, Nature, vol. 15(1), pages 1-15, December.
    17. De Gregorio, Alessandro & Maria Iacus, Stefano, 2010. "Clustering of discretely observed diffusion processes," Computational Statistics & Data Analysis, Elsevier, vol. 54(2), pages 598-606, February.
    18. Szczepocki Piotr, 2019. "Clustering Companies Listed on the Warsaw Stock Exchange According to Time-Varying Beta," Econometrics. Advances in Applied Data Analysis, Sciendo, vol. 23(2), pages 63-79, June.
    19. Corey Ducharme & Bruno Agard & Martin Trépanier, 2024. "Improving demand forecasting for customers with missing downstream data in intermittent demand supply chains with supervised multivariate clustering," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 43(5), pages 1661-1681, August.
    20. Sokhna Dieng & Pierre Michel & Abdoulaye Guindo & Kankoe Sallah & El-Hadj Ba & Badara Cissé & Maria Patrizia Carrieri & Cheikh Sokhna & Paul Milligan & Jean Gaudart, 2020. "Application of Functional Data Analysis to Identify Patterns of Malaria Incidence, to Guide Targeted Control Strategies," IJERPH, MDPI, vol. 17(11), pages 1-23, June.
