
Quantifying the reasoning abilities of LLMs on clinical cases

Author

Listed:
  • Pengcheng Qiu

    (Shanghai Jiao Tong University
    Shanghai Artificial Intelligence Laboratory)

  • Chaoyi Wu

    (Shanghai Jiao Tong University
    Shanghai Artificial Intelligence Laboratory)

  • Shuyu Liu

    (Shanghai Jiao Tong University)

  • Yanjie Fan

    (Xin Hua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine)

  • Weike Zhao

    (Shanghai Jiao Tong University
    Shanghai Artificial Intelligence Laboratory)

  • Zhuoxia Chen

    (China Mobile Communications Group Shanghai Co., Ltd.)

  • Hongfei Gu

    (China Mobile Communications Group Shanghai Co., Ltd.)

  • Chuanjin Peng

    (China Mobile Communications Group Shanghai Co., Ltd.)

  • Ya Zhang

    (Shanghai Jiao Tong University
    Shanghai Artificial Intelligence Laboratory)

  • Yanfeng Wang

    (Shanghai Jiao Tong University
    Shanghai Artificial Intelligence Laboratory)

  • Weidi Xie

    (Shanghai Jiao Tong University
    Shanghai Artificial Intelligence Laboratory)

Abstract

Recent advances in reasoning-enhanced large language models (LLMs) show promise, yet their application in professional medicine, especially the evaluation of their reasoning process, remains underexplored. We present MedR-Bench, a benchmark of 1453 structured patient cases with reference reasoning derived from clinical case reports, spanning 13 body systems and 10 specialties across common and rare diseases. Our evaluation framework covers three stages of care: examination recommendation, diagnostic decision-making, and treatment planning. To assess reasoning quality, we develop the Reasoning Evaluator, an automated scorer of written reasoning along efficiency, factual accuracy, and completeness. We evaluate seven state-of-the-art reasoning LLMs. Here we show that current models exceed 85% accuracy on simple diagnostic tasks when sufficient examination results are available, but performance drops on examination recommendation and treatment planning. Reasoning is generally factual, yet critical steps are often missing. Open-source models are closing the gap with proprietary systems, highlighting potential for more accessible, equitable clinical AI.
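
The abstract names three axes for the Reasoning Evaluator (efficiency, factual accuracy, completeness) but does not give their formulas. The sketch below is one plausible way to turn step-level judgments into three such scores. It is an illustration under stated assumptions, not the authors' implementation: the `Step` class, the `effective` and `factual` labels, the `score_reasoning` function, and the ratio definitions are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """One step of model-written reasoning with hypothetical judge labels."""
    text: str
    effective: bool  # assumed label: the step contributes new, relevant reasoning
    factual: bool    # assumed label: the step agrees with medical knowledge


def score_reasoning(
    steps: List[Step],
    reference_steps_hit: int,
    reference_steps_total: int,
) -> Tuple[float, float, float]:
    """Return (efficiency, factuality, completeness) as fractions in [0, 1].

    Assumed definitions, chosen for illustration only:
      efficiency   = effective steps / all written steps
      factuality   = factual effective steps / effective steps
      completeness = reference steps covered / all reference steps
    """
    effective = [s for s in steps if s.effective]
    efficiency = len(effective) / len(steps) if steps else 0.0
    factuality = (
        sum(s.factual for s in effective) / len(effective) if effective else 0.0
    )
    completeness = (
        reference_steps_hit / reference_steps_total if reference_steps_total else 0.0
    )
    return efficiency, factuality, completeness


# Toy example with made-up labels: 4 steps, 3 effective, 2 of those factual,
# and 2 of 5 reference reasoning steps covered.
example_steps = [
    Step("Restates the chief complaint.", effective=False, factual=True),
    Step("Links fever and cough to a respiratory infection.", effective=True, factual=True),
    Step("Suggests chest imaging to confirm.", effective=True, factual=True),
    Step("Cites an incorrect laboratory threshold.", effective=True, factual=False),
]
efficiency, factuality, completeness = score_reasoning(example_steps, 2, 5)
```

Under these assumed definitions, a response can score well on factuality while scoring low on completeness, which is the pattern the abstract reports: reasoning that is generally factual but missing critical steps.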

Suggested Citation

  • Pengcheng Qiu & Chaoyi Wu & Shuyu Liu & Yanjie Fan & Weike Zhao & Zhuoxia Chen & Hongfei Gu & Chuanjin Peng & Ya Zhang & Yanfeng Wang & Weidi Xie, 2025. "Quantifying the reasoning abilities of LLMs on clinical cases," Nature Communications, Nature, vol. 16(1), pages 1-14, December.
  • Handle: RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-64769-1
    DOI: 10.1038/s41467-025-64769-1

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-025-64769-1
    File Function: Abstract
    Download Restriction: no


    References listed on IDEAS

    1. Karan Singhal & Shekoofeh Azizi & Tao Tu & S. Sara Mahdavi & Jason Wei & Hyung Won Chung & Nathan Scales & Ajay Tanwani & Heather Cole-Lewis & Stephen Pfohl & Perry Payne & Martin Seneviratne & Paul G, 2023. "Publisher Correction: Large language models encode clinical knowledge," Nature, Nature, vol. 620(7973), pages 19-19, August.
    2. Pengcheng Qiu & Chaoyi Wu & Xiaoman Zhang & Weixiong Lin & Haicheng Wang & Ya Zhang & Yanfeng Wang & Weidi Xie, 2024. "Towards building multilingual language model for medicine," Nature Communications, Nature, vol. 15(1), pages 1-15, December.
    3. Daya Guo & Dejian Yang & Haowei Zhang & Junxiao Song & Peiyi Wang & Qihao Zhu & Runxin Xu & Ruoyu Zhang & Shirong Ma & Xiao Bi & Xiaokang Zhang & Xingkai Yu & Yu Wu & Z. F. Wu & Zhibin Gou & Zhihong S, 2025. "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning," Nature, Nature, vol. 645(8081), pages 633-638, September.
    4. Karan Singhal & Shekoofeh Azizi & Tao Tu & S. Sara Mahdavi & Jason Wei & Hyung Won Chung & Nathan Scales & Ajay Tanwani & Heather Cole-Lewis & Stephen Pfohl & Perry Payne & Martin Seneviratne & Paul G, 2023. "Large language models encode clinical knowledge," Nature, Nature, vol. 620(7972), pages 172-180, August.
    5. Chaoyi Wu & Xiaoman Zhang & Ya Zhang & Hui Hui & Yanfeng Wang & Weidi Xie, 2025. "Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data," Nature Communications, Nature, vol. 16(1), pages 1-22, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Maxime Griot & Coralie Hemptinne & Jean Vanderdonckt & Demet Yuksel, 2025. "Large Language Models lack essential metacognition for reliable medical reasoning," Nature Communications, Nature, vol. 16(1), pages 1-10, December.
    2. Arslon Ruziboev & Dilmurod Turimov & Jiyoun Kim & Wooseong Kim, 2025. "Multiclass Classification of Sarcopenia Severity in Korean Adults Using Machine Learning and Model Fusion Approaches," Mathematics, MDPI, vol. 13(18), pages 1-22, September.
    3. Ali Nemati & Mohammad Assadi Shalmani & Qiang Lu & Jake Luo, 2025. "Benchmarking Large Language Models from Open and Closed Source Models to Apply Data Annotation for Free-Text Criteria in Healthcare," Future Internet, MDPI, vol. 17(4), pages 1-27, March.
    4. Cheng-Yi Li & Kao-Jung Chang & Cheng-Fu Yang & Hsin-Yu Wu & Wenting Chen & Hritik Bansal & Ling Chen & Yi-Ping Yang & Yu-Chun Chen & Shih-Pin Chen & Shih-Jen Chen & Jiing-Feng Lirng & Kai-Wei Chang & , 2025. "Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation," Nature Communications, Nature, vol. 16(1), pages 1-14, December.
    5. Tingmingke Lu, 2025. "Maximum Hallucination Standards for Domain-Specific Large Language Models," Papers 2503.05481, arXiv.org.
    6. Zheng, Shuwen & Pan, Kai & Liu, Jie & Chen, Yunxia, 2024. "Empirical study on fine-tuning pre-trained large language models for fault diagnosis of complex systems," Reliability Engineering and System Safety, Elsevier, vol. 252(C).
    7. Xiangru Tang & Qiao Jin & Kunlun Zhu & Tongxin Yuan & Yichi Zhang & Wangchunshu Zhou & Meng Qu & Yilun Zhao & Jian Tang & Zhuosheng Zhang & Arman Cohan & Dov Greenbaum & Zhiyong Lu & Mark Gerstein, 2025. "Risks of AI scientists: prioritizing safeguarding over autonomy," Nature Communications, Nature, vol. 16(1), pages 1-11, December.
    8. Zhou, Zhen & Gu, Ziyuan & Qu, Xiaobo & Liu, Pan & Liu, Zhiyuan & Yu, Wenwu, 2024. "Urban mobility foundation model: A literature review and hierarchical perspective," Transportation Research Part E: Logistics and Transportation Review, Elsevier, vol. 192(C).
    9. Qingyu Chen & Yan Hu & Xueqing Peng & Qianqian Xie & Qiao Jin & Aidan Gilson & Maxwell B. Singer & Xuguang Ai & Po-Ting Lai & Zhizheng Wang & Vipina K. Keloth & Kalpana Raja & Jimin Huang & Huan He & , 2025. "Benchmarking large language models for biomedical natural language processing applications and recommendations," Nature Communications, Nature, vol. 16(1), pages 1-16, December.
    10. Zhenjia Chen & Zhenyuan Lin & Ji Yang & Cong Chen & Di Liu & Liuting Shan & Yuanyuan Hu & Tailiang Guo & Huipeng Chen, 2024. "Cross-layer transmission realized by light-emitting memristor for constructing ultra-deep neural network with transfer learning ability," Nature Communications, Nature, vol. 15(1), pages 1-12, December.
    11. Yujin Oh & Sangjoon Park & Hwa Kyung Byun & Yeona Cho & Ik Jae Lee & Jin Sung Kim & Jong Chul Ye, 2024. "LLM-driven multimodal target volume contouring in radiation oncology," Nature Communications, Nature, vol. 15(1), pages 1-14, December.
    12. Chen Gao & Xiaochong Lan & Nian Li & Yuan Yuan & Jingtao Ding & Zhilun Zhou & Fengli Xu & Yong Li, 2024. "Large language models empowered agent-based modeling and simulation: a survey and perspectives," Humanities and Social Sciences Communications, Palgrave Macmillan, vol. 11(1), pages 1-24, December.
    13. Juexiao Zhou & Xiaonan He & Liyuan Sun & Jiannan Xu & Xiuying Chen & Yuetan Chu & Longxi Zhou & Xingyu Liao & Bin Zhang & Shawn Afvari & Xin Gao, 2024. "Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4," Nature Communications, Nature, vol. 15(1), pages 1-12, December.
    14. Qin, Hongyi & Zhu, Yifan & Jiang, Yan & Luo, Siqi & Huang, Cui, 2024. "Examining the impact of personalization and carefulness in AI-generated health advice: Trust, adoption, and insights in online healthcare consultations experiments," Technology in Society, Elsevier, vol. 79(C).
    15. Ching-Nam Hang & Pei-Duo Yu & Roberto Morabito & Chee-Wei Tan, 2024. "Large Language Models Meet Next-Generation Networking Technologies: A Review," Future Internet, MDPI, vol. 16(10), pages 1-29, October.
    16. Chao-Chun Hsu & Ziad Obermeyer & Chenhao Tan, 2025. "A machine learning model using clinical notes to identify physician fatigue," Nature Communications, Nature, vol. 16(1), pages 1-10, December.
    17. Yang Zhao & Pu Wang & Yibo Zhao & Hongru Du & Hao Frank Yang, 2025. "SafeTraffic Copilot: adapting large language models for trustworthy traffic safety assessments and decision interventions," Nature Communications, Nature, vol. 16(1), pages 1-17, December.
    18. Ofir Ben Shoham & Nadav Rappoport, 2024. "CPLLM: Clinical prediction with large language models," PLOS Digital Health, Public Library of Science, vol. 3(12), pages 1-15, December.
    19. Sheng Wang & Fangyuan Zhao & Dechao Bu & Yunwei Lu & Ming Gong & Hongjie Liu & Zhaohui Yang & Xiaoxi Zeng & Zhiyuan Yuan & Baoping Wan & Jingbo Sun & Yang Wu & Lianhe Zhao & Xirun Wan & Wei Huang & Ta, 2025. "LINS: A general medical Q&A framework for enhancing the quality and credibility of LLM-generated responses," Nature Communications, Nature, vol. 16(1), pages 1-20, December.
    20. Venkat Ram Reddy Ganuthula & Krishna Kumar Balaraman, 2025. "The Paradox of Professional Input: How Expert Collaboration with AI Systems Shapes Their Future Value," Papers 2504.12654, arXiv.org.
