Author
Listed:
- Constantine Tarabanis
- Shaan Khurshid
- Areti Karamanou
- Rodo Piperaki
- Lucas A Mavromatis
- Aris Hatzimemos
- Dimitrios Tachmatzidis
- Constantinos Bakogiannis
- Vassilios Vassilikos
- Patrick T Ellinor
- Lior Jankelson
- Evangelos Kalampokis
Abstract
Large language models (LLMs) are increasingly integrated into clinical workflows, yet their performance in cardiovascular medicine remains insufficiently evaluated. We aimed to evaluate the performance of open-weight and proprietary LLMs, with and without retrieval-augmented generation (RAG), on cardiology board-style questions and to benchmark them against the human average.

We tested 14 LLMs (6 open-weight, 8 proprietary) on 449 multiple-choice questions from the American College of Cardiology Self-Assessment Program (ACCSAP). Accuracy was measured as percent correct. RAG was implemented using a knowledge base of 123 guideline and textbook documents.

The open-weight model DeepSeek R1 achieved the highest accuracy at 86.9% (95% CI: 83.4–89.7%), outperforming proprietary models and the human average of 78%. GPT-4o (80.9%, 95% CI: 77.0–84.2%) and the commercial platform OpenEvidence (81.3%, 95% CI: 77.4–84.7%) performed similarly. Within model families, accuracy correlated positively with model size, but across families, substantial variability persisted among models with similar parameter counts. With RAG, all models improved, and open-weight models such as Mistral Large 2 (78.0%, 95% CI: 73.9–81.5%) performed comparably to proprietary alternatives such as GPT-4o.

Open-weight models can match or exceed proprietary systems in cardiovascular knowledge, with RAG particularly beneficial for smaller models. Given their transparency, configurability, and potential for local deployment, strategically augmented open-weight models represent viable, lower-cost alternatives for clinical applications. Open-weight LLMs demonstrate competency in cardiovascular medicine comparable to or exceeding that of proprietary models, with and without RAG depending on the model.

Author summary
In this work, we set out to understand how today's artificial intelligence systems perform when tested on the kind of questions cardiologists face during board examinations.
We compared a wide range of large language models, including both “open-weight” models and commercial “proprietary” ones, and also tested whether giving the models access to trusted cardiology textbooks and guidelines could improve their answers. We found that the best open-weight model actually outperformed all of the commercial models we tested, even exceeding the average score of practicing cardiologists. When we gave the models access to medical reference material, nearly all of them improved, with the biggest gains seen in the smaller and weaker models. This shows that careful design and support can allow smaller, more accessible systems to reach high levels of accuracy. Our results suggest that open-weight models, which can be used locally without sending sensitive patient information to outside servers, may be a safe and cost-effective alternative to commercial products. This matters because it could make powerful AI tools more widely available across hospitals and clinics, while also reducing risks related to privacy, transparency, and cost.
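The accuracy metric and confidence intervals reported above can be reproduced with a short sketch. The abstract does not state which interval method was used, so the Wilson score interval (a standard choice for binomial proportions) is assumed here, and the correct-answer count for DeepSeek R1 (390 of 449) is back-calculated from the reported 86.9%.

```python
from math import sqrt

def accuracy_pct(n_correct, n_total):
    """Percent correct, the accuracy metric used in the study."""
    return 100.0 * n_correct / n_total

def wilson_ci(n_correct, n_total, z=1.96):
    """95% Wilson score interval for a binomial proportion, in percent.

    Assumed method: the abstract does not name its CI procedure, but
    Wilson is a common choice and matches the reported intervals.
    """
    p = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half = z * sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
    return 100 * (center - half), 100 * (center + half)

# DeepSeek R1: 390/449 correct -> 86.9%, 95% CI approximately 83.4-89.7%
lo, hi = wilson_ci(390, 449)
```

Plugging in the other reported tallies (e.g. GPT-4o at 80.9% of 449) recovers the remaining intervals in the same way.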
Suggested Citation
Constantine Tarabanis & Shaan Khurshid & Areti Karamanou & Rodo Piperaki & Lucas A Mavromatis & Aris Hatzimemos & Dimitrios Tachmatzidis & Constantinos Bakogiannis & Vassilios Vassilikos & Patrick T Ellinor & Lior Jankelson & Evangelos Kalampokis, 2026.
"Cardiology knowledge assessment of retrieval-augmented open versus proprietary large language models,"
PLOS Digital Health, Public Library of Science, vol. 5(3), pages 1-11, March.
Handle:
RePEc:plo:pdig00:0001029
DOI: 10.1371/journal.pdig.0001029