
Evaluation of large language models as a diagnostic tool for medical learners and clinicians using advanced prompting techniques

Author

Listed:
  • Karolina Gaebe
  • Benjamin van der Woerd

Abstract

Background: Large language models (LLMs) have demonstrated capabilities in natural language processing and critical reasoning. Studies investigating their potential use as healthcare diagnostic tools have largely relied on proprietary models like ChatGPT and have not explored the application of advanced prompt engineering techniques. This study aims to evaluate the diagnostic accuracy of three open-source LLMs and the role of prompt engineering using clinical scenarios.

Methods: We analyzed the performance of three open-source LLMs (llama-3.1-70b-versatile, llama-3.1-8b-instant, and mixtral-8x7b-32768) using advanced prompt engineering when answering Medscape Clinical Challenge questions. Responses were recorded and evaluated for correctness, accuracy, precision, specificity, and sensitivity. A sensitivity analysis was conducted by presenting the three LLMs with the challenge questions under basic prompting and by excluding cases with visual assets. Results were compared with previously published performance data on GPT-3.5.

Results: Llama-3.1-70b-versatile, llama-3.1-8b-instant, and mixtral-8x7b-32768 achieved correct responses in 79%, 65%, and 62% of cases, respectively, with llama-3.1-70b-versatile outperforming GPT-3.5 (74%). Diagnostic accuracy, precision, sensitivity, and specificity all exceeded the values previously reported for GPT-3.5. Results generated using advanced prompting strategies were superior to those based on basic prompting. The sensitivity analysis revealed similar trends when cases with visual assets were excluded.

Discussion: Using advanced prompting techniques, LLMs can generate clinically accurate responses. The study highlights the limitations of proprietary models like ChatGPT, particularly in terms of accessibility and reproducibility due to version deprecation. Future research should employ prompt engineering techniques and prioritize the use of open-source models to ensure research replicability.
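The performance figures above follow the standard confusion-matrix definitions of accuracy, precision, sensitivity, and specificity. The sketch below is illustrative only, not the authors' code: it assumes the three model IDs are reached through an OpenAI-compatible chat-completions endpoint (e.g., Groq's hosted API, where these model names are offered), assumes a GROQ_API_KEY environment variable, and uses a chain-of-thought style system prompt as a stand-in for the advanced prompting described in the abstract, whose exact wording is not given here.

    # Illustrative sketch only -- not the study's actual pipeline or prompts.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.groq.com/openai/v1",   # assumed OpenAI-compatible endpoint
        api_key=os.environ["GROQ_API_KEY"],          # assumed credential location
    )

    def diagnose(vignette: str, model: str = "llama-3.1-70b-versatile") -> str:
        """Ask a model for a single most-likely diagnosis for a clinical vignette."""
        response = client.chat.completions.create(
            model=model,
            temperature=0,  # deterministic output aids reproducibility
            messages=[
                {"role": "system",
                 "content": ("You are an expert clinician. Reason step by step through the "
                             "history, examination, and investigations, then give one final "
                             "answer on the last line prefixed with 'Diagnosis:'.")},
                {"role": "user", "content": vignette},
            ],
        )
        return response.choices[0].message.content

    def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
        """Confusion-matrix metrics of the kind reported for the graded responses."""
        return {
            "accuracy":    (tp + tn) / (tp + fp + tn + fn),
            "precision":   tp / (tp + fp) if (tp + fp) else 0.0,
            "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,  # recall
            "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        }

Tallying each graded case as a true/false positive/negative and passing the counts to binary_metrics would yield accuracy, precision, sensitivity, and specificity figures of the kind compared against GPT-3.5 above.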

Suggested Citation

  • Karolina Gaebe & Benjamin van der Woerd, 2025. "Evaluation of large language models as a diagnostic tool for medical learners and clinicians using advanced prompting techniques," PLOS ONE, Public Library of Science, vol. 20(8), pages 1-9, August.
  • Handle: RePEc:plo:pone00:0325803
    DOI: 10.1371/journal.pone.0325803

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0325803
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0325803&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pone.0325803?utm_source=ideas
    LibKey link: if access is restricted and your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item
    ---><---


    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:plo:pone00:0325803. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do so here. This allows you to link your profile to this item and to accept potential citations to this item that we are uncertain about.

    We have no bibliographic references for this item. You can help add them by using this form.

    If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: plosone (email available below). General contact details of provider: https://journals.plos.org/plosone/.

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.