RSGPT: a generative transformer model for retrosynthesis planning pre-trained on ten billion datapoints

Author

Listed:
  • Yafeng Deng

    (Tsinghua University
    Hangzhou Carbonsilicon AI Technology Co., Ltd)

  • Xinda Zhao

    (Hangzhou Carbonsilicon AI Technology Co., Ltd)

  • Hanyu Sun

    (Peking Union Medical College and Chinese Academy of Medical Sciences)

  • Yu Chen

    (Hangzhou Carbonsilicon AI Technology Co., Ltd)

  • Xiaorui Wang

    (Zhejiang University)

  • Xi Xue

    (Peking Union Medical College and Chinese Academy of Medical Sciences)

  • Liangning Li

    (Peking Union Medical College and Chinese Academy of Medical Sciences)

  • Jianfei Song

    (Hangzhou Carbonsilicon AI Technology Co., Ltd)

  • Chang-Yu Hsieh

    (Hangzhou Carbonsilicon AI Technology Co., Ltd
    Zhejiang University)

  • Tingjun Hou

    (Hangzhou Carbonsilicon AI Technology Co., Ltd
    Zhejiang University)

  • Xiandao Pan

    (Peking Union Medical College and Chinese Academy of Medical Sciences)

  • Taghrid Saad Alomar

    (Princess Nourah bint Abdulrahman University)

  • Xiangyang Ji

    (Tsinghua University)

  • Xiaojian Wang

    (Hangzhou Carbonsilicon AI Technology Co., Ltd
    Peking Union Medical College and Chinese Academy of Medical Sciences)

Abstract

Retrosynthesis planning is a crucial task in organic synthesis, and deep-learning methods have enhanced and accelerated this process. With the emergence of large language models, the demand for training data is rapidly increasing. However, available retrosynthesis data are limited to only millions of datapoints. Therefore, we pioneer the use of a template-based algorithm to generate chemical reaction data, producing over 10 billion reaction datapoints. A generative pre-trained transformer model is subsequently developed for template-free retrosynthesis planning by pre-training on these 10 billion generated datapoints. Inspired by the strategies of large language models, we introduce reinforcement learning to capture the relationships among products, reactants, and templates more accurately. Experiments demonstrate that our model achieves state-of-the-art performance on the benchmark, with a Top-1 accuracy of 63.4%, substantially outperforming previous models.
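
The Top-1 figure quoted in the abstract is a ranking metric, and a small sketch may help show how such a number is typically computed. The code below is an illustration only, not the authors' evaluation code: it assumes each prediction is a ranked list of dot-separated reactant SMILES strings that are already canonicalized, and that a prediction counts as correct when its set of reactants matches the reference exactly. The function names `normalize` and `top_k_accuracy` and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def normalize(reactants: str) -> Tuple[str, ...]:
    """Turn a dot-separated SMILES string into an order-independent tuple.

    Assumption: every molecule string is already canonicalized, so plain
    string comparison per molecule is meaningful.
    """
    return tuple(sorted(part for part in reactants.split(".") if part))


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    ground_truth: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of products whose reference reactants appear in the first k predictions."""
    if len(ranked_predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must have the same length")
    if not ground_truth:
        raise ValueError("need at least one example")
    hits = 0
    for candidates, truth in zip(ranked_predictions, ground_truth):
        target = normalize(truth)
        # Only the k highest-ranked candidates are allowed to score a hit.
        if any(normalize(c) == target for c in candidates[:k]):
            hits += 1
    return hits / len(ground_truth)


# Toy data, invented for illustration (not taken from the paper).
truth = ["CC(=O)O.OCC", "c1ccccc1Br.OB(O)c1ccccc1"]
preds = [
    # Correct at rank 1; the reactant order differs from the reference.
    ["OCC.CC(=O)O", "CC(=O)Cl.OCC"],
    # Wrong at rank 1, correct at rank 2.
    ["c1ccccc1I.OB(O)c1ccccc1", "c1ccccc1Br.OB(O)c1ccccc1"],
]

print(top_k_accuracy(preds, truth, k=1))  # 0.5
print(top_k_accuracy(preds, truth, k=2))  # 1.0
```

Under these assumptions, a Top-1 accuracy of 63.4% would mean that the highest-ranked prediction matches the reference reactants for 63.4% of the benchmark products.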

Suggested Citation

  • Yafeng Deng & Xinda Zhao & Hanyu Sun & Yu Chen & Xiaorui Wang & Xi Xue & Liangning Li & Jianfei Song & Chang-Yu Hsieh & Tingjun Hou & Xiandao Pan & Taghrid Saad Alomar & Xiangyang Ji & Xiaojian Wang, 2025. "RSGPT: a generative transformer model for retrosynthesis planning pre-trained on ten billion datapoints," Nature Communications, Nature, vol. 16(1), pages 1-14, December.
  • Handle: RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-62308-6
    DOI: 10.1038/s41467-025-62308-6

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-025-62308-6
    File Function: Abstract
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Yuqiang Han & Xiaoyang Xu & Chang-Yu Hsieh & Keyan Ding & Hongxia Xu & Renjun Xu & Tingjun Hou & Qiang Zhang & Huajun Chen, 2024. "Retrosynthesis prediction with an iterative string editing model," Nature Communications, Nature, vol. 15(1), pages 1-16, December.
    2. Yu Shee & Haote Li & Pengpeng Zhang & Andrea M. Nikolic & Wenxin Lu & H. Ray Kelly & Vidhyadhar Manee & Sanil Sreekumar & Frederic G. Buono & Jinhua J. Song & Timothy R. Newhouse & Victor S. Batista, 2024. "Site-specific template generative approach for retrosynthetic planning," Nature Communications, Nature, vol. 15(1), pages 1-10, December.
    3. Xuefeng Zhang & Haowei Lin & Muhan Zhang & Yuan Zhou & Jianzhu Ma, 2025. "A data-driven group retrosynthesis planning model inspired by neurosymbolic programming," Nature Communications, Nature, vol. 16(1), pages 1-17, December.
    4. Umit V. Ucak & Islambek Ashyrmamatov & Junsu Ko & Juyong Lee, 2022. "Retrosynthetic reaction pathway prediction through neural machine translation of atomic environments," Nature Communications, Nature, vol. 13(1), pages 1-10, December.
    5. Yu Wang & Chao Pang & Yuzhe Wang & Junru Jin & Jingjie Zhang & Xiangxiang Zeng & Ran Su & Quan Zou & Leyi Wei, 2023. "Retrosynthesis prediction with an interpretable deep-learning framework based on molecular assembly tasks," Nature Communications, Nature, vol. 14(1), pages 1-15, December.
    6. Weihe Zhong & Ziduo Yang & Calvin Yu-Chian Chen, 2023. "Retrosynthesis prediction using an end-to-end graph generative architecture for molecular graph editing," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    7. Jia-Min Lu & Hui-Feng Wang & Qi-Hang Guo & Jian-Wei Wang & Tong-Tong Li & Ke-Xin Chen & Meng-Ting Zhang & Jian-Bo Chen & Qian-Nuan Shi & Yi Huang & Shao-Wen Shi & Guang-Yong Chen & Jian-Zhang Pan & Zh, 2024. "Roboticized AI-assisted microfluidic photocatalytic synthesis and screening up to 10,000 reactions per day," Nature Communications, Nature, vol. 15(1), pages 1-13, December.
    8. Naudé, Wim, 2020. "Artificial Intelligence against COVID-19: An Early Review," IZA Discussion Papers 13110, Institute of Labor Economics (IZA).
    9. Xu Liu & Yihan Zhang & Yifan Xie & Ledu Wang & Liyu Gan & Jialei Li & Jiahe Li & Hongli Zhang & Linjiang Chen & Weiwei Shang & Jun Jiang & Gang Zou, 2025. "Design of circularly polarized phosphorescence materials guided by transfer learning," Nature Communications, Nature, vol. 16(1), pages 1-10, December.
    10. Mingyang Wang & Shuai Li & Jike Wang & Odin Zhang & Hongyan Du & Dejun Jiang & Zhenxing Wu & Yafeng Deng & Yu Kang & Peichen Pan & Dan Li & Xiaorui Wang & Xiaojun Yao & Tingjun Hou & Chang-Yu Hsieh, 2024. "ClickGen: Directed exploration of synthesizable chemical space via modular reactions and reinforcement learning," Nature Communications, Nature, vol. 15(1), pages 1-18, December.
    11. Wang, Zixuan & Chen, Zijian & Wang, Boyuan & Wu, Chuang & Zhou, Chao & Peng, Yang & Zhang, Xinyu & Ni, Zongming & Chung, Chi-yung & Chan, Ching-chuen & Yang, Jian & Zhao, Haitao, 2025. "Digital manufacturing of perovskite materials and solar cells," Applied Energy, Elsevier, vol. 377(PB).
    12. Lei Fang & Junren Li & Ming Zhao & Li Tan & Jian-Guang Lou, 2023. "Single-step retrosynthesis prediction by leveraging commonly preserved substructures," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    13. Nina Miolane, 2025. "The fifth era of science: Artificial scientific intelligence," PLOS Biology, Public Library of Science, vol. 23(6), pages 1-4, June.
    14. Huaisheng Tu & Haotian Liu & Tuqiang Pan & Wuping Xie & Zihao Ma & Fan Zhang & Pengbai Xu & Leiming Wu & Ou Xu & Yi Xu & Yuwen Qin, 2025. "Deep empirical neural network for optical phase retrieval over a scattering medium," Nature Communications, Nature, vol. 16(1), pages 1-9, December.
    15. Mochen Liao & Kai Lan & Yuan Yao, 2022. "Sustainability implications of artificial intelligence in the chemical industry: A conceptual framework," Journal of Industrial Ecology, Yale University, vol. 26(1), pages 164-182, February.
    16. Shingo Harada & Hiroki Takenaka & Tsubasa Ito & Haruki Kanda & Tetsuhiro Nemoto, 2024. "Valence-isomer selective cycloaddition reaction of cycloheptatrienes-norcaradienes," Nature Communications, Nature, vol. 15(1), pages 1-10, December.
    17. Jianbo Qiao & Junru Jin & Ding Wang & Saisai Teng & Junyu Zhang & Xuetong Yang & Yuhang Liu & Yu Wang & Lizhen Cui & Quan Zou & Ran Su & Leyi Wei, 2025. "A self-conformation-aware pre-training framework for molecular property prediction with substructure interpretability," Nature Communications, Nature, vol. 16(1), pages 1-16, December.
    18. Wenhao Gao & Priyanka Raghavan & Connor W. Coley, 2022. "Autonomous platforms for data-driven organic synthesis," Nature Communications, Nature, vol. 13(1), pages 1-4, December.
    19. Debesh Mishra & Biswajit Mohapatra & Abhaya Sanatan Satpathy & Kamalakanta Muduli & Binayak Mishra & Swagatika Mishra & Upma Paliwal, 2024. "The pandemic COVID-19 and associated challenges with implementation of artificial intelligence (AI) in Indian agriculture," International Journal of System Assurance Engineering and Management, Springer;The Society for Reliability, Engineering Quality and Operations Management (SREQOM),India, and Division of Operation and Maintenance, Lulea University of Technology, Sweden, vol. 15(6), pages 2715-2729, June.
    20. Itai Levin & Mengjie Liu & Christopher A. Voigt & Connor W. Coley, 2022. "Merging enzymatic and synthetic chemistry with computational synthesis planning," Nature Communications, Nature, vol. 13(1), pages 1-14, December.
