Printed from https://ideas.repec.org/a/nat/natcom/v16y2025i1d10.1038_s41467-025-62308-6.html

RSGPT: a generative transformer model for retrosynthesis planning pre-trained on ten billion datapoints

Authors
  • Yafeng Deng

    (Tsinghua University
    Hangzhou Carbonsilicon AI Technology Co., Ltd)

  • Xinda Zhao

    (Hangzhou Carbonsilicon AI Technology Co., Ltd)

  • Hanyu Sun

(Peking Union Medical College and Chinese Academy of Medical Sciences)

  • Yu Chen

    (Hangzhou Carbonsilicon AI Technology Co., Ltd)

  • Xiaorui Wang

    (Zhejiang University)

  • Xi Xue

(Peking Union Medical College and Chinese Academy of Medical Sciences)

  • Liangning Li

(Peking Union Medical College and Chinese Academy of Medical Sciences)

  • Jianfei Song

    (Hangzhou Carbonsilicon AI Technology Co., Ltd)

  • Chang-Yu Hsieh

    (Hangzhou Carbonsilicon AI Technology Co., Ltd
    Zhejiang University)

  • Tingjun Hou

    (Hangzhou Carbonsilicon AI Technology Co., Ltd
    Zhejiang University)

  • Xiandao Pan

(Peking Union Medical College and Chinese Academy of Medical Sciences)

  • Taghrid Saad Alomar

    (Princess Nourah bint Abdulrahman University)

  • Xiangyang Ji

    (Tsinghua University)

  • Xiaojian Wang

    (Hangzhou Carbonsilicon AI Technology Co., Ltd
Peking Union Medical College and Chinese Academy of Medical Sciences)

Abstract

Retrosynthesis planning is a crucial task in organic synthesis, and deep-learning methods have enhanced and accelerated this process. With the emergence of large language models, the demand for training data is rapidly increasing, yet available retrosynthesis data number only in the millions of reactions. We therefore pioneer the use of a template-based algorithm to generate chemical reaction data, producing over 10 billion reaction datapoints. A generative pre-trained transformer model for template-free retrosynthesis planning is subsequently developed by pre-training on the 10 billion generated datapoints. Inspired by the training strategies of large language models, we introduce reinforcement learning to capture the relationships among products, reactants, and templates more accurately. Experiments demonstrate that our model achieves state-of-the-art performance on the benchmark, with a Top-1 accuracy of 63.4%, substantially outperforming previous models.
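As context for the reported Top-1 accuracy: retrosynthesis models are conventionally scored by exact match of the predicted reactant set against the ground truth among the model's top-k outputs. The paper's evaluation code is not reproduced on this page, so the following is only a hypothetical Python sketch of that metric, assuming predictions and references are already canonicalized SMILES strings (real pipelines canonicalize with a cheminformatics toolkit such as RDKit first):

```python
def top_k_accuracy(predictions, references, k=1):
    """Fraction of targets whose ground-truth reactant set appears
    among the model's top-k predicted reactant sets.

    predictions: list of ranked candidate lists, one per target;
                 each candidate is a dot-separated SMILES string.
    references:  list of ground-truth dot-separated SMILES strings.
    """
    hits = 0
    for preds, ref in zip(predictions, references):
        # Compare reactant sets order-insensitively by sorting the
        # dot-separated components of each SMILES string.
        ref_set = tuple(sorted(ref.split(".")))
        if any(tuple(sorted(p.split("."))) == ref_set for p in preds[:k]):
            hits += 1
    return hits / len(references)


# Toy usage: the first target's top-1 prediction lists the same two
# reactants as the reference, just in a different order.
preds = [["CCO.CC(=O)O", "CCOC(C)=O"], ["c1ccccc1Br"]]
refs = ["CC(=O)O.CCO", "c1ccccc1Br"]
print(top_k_accuracy(preds, refs, k=1))  # -> 1.0
```

The order-insensitive comparison matters because a reaction's reactants have no canonical ordering in a SMILES string; without it, a correct prediction listed in a different order would be scored as a miss.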

Suggested Citation

  • Yafeng Deng & Xinda Zhao & Hanyu Sun & Yu Chen & Xiaorui Wang & Xi Xue & Liangning Li & Jianfei Song & Chang-Yu Hsieh & Tingjun Hou & Xiandao Pan & Taghrid Saad Alomar & Xiangyang Ji & Xiaojian Wang, 2025. "RSGPT: a generative transformer model for retrosynthesis planning pre-trained on ten billion datapoints," Nature Communications, Nature, vol. 16(1), pages 1-14, December.
  • Handle: RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-62308-6
    DOI: 10.1038/s41467-025-62308-6

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-025-62308-6
    File Function: Abstract
    Download Restriction: no

    File URL: https://libkey.io/10.1038/s41467-025-62308-6?utm_source=ideas
    LibKey link: if access is restricted and your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item



    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.