Multidimensional evaluation of large language models on the AAP in-service examination: Assessing accuracy, calibration, and citation reliability

Authors
  • Prita Abhay Dhaimade
  • Robin Henderson

Abstract

Large language models (LLMs) have demonstrated rapid advancements in natural language understanding and generation, prompting their integration into biomedical research, clinical practice, and professional education. However, systematic evaluation of LLMs in specialty-specific domains such as dentistry and periodontology remains limited, particularly regarding multidimensional performance metrics. This study conducted a comprehensive assessment of commercially available LLMs (GPT-4.0, GPT-5.0, and Claude Sonnet 4.0) on the American Academy of Periodontology (AAP) In-Service Examination, focusing on response accuracy, self-assessed confidence calibration, citation validity, and hallucination prevalence. Models were evaluated on the 2024 AAP In-Service Examination (331 questions) using two formats: Full Test (all questions at once) and Individual Question (one at a time). Prompts were standardized; models selected answers, and GPT-5.0 and Claude Sonnet 4.0 also provided confidence ratings and citations. Citation validity was assessed using a human-in-the-loop protocol with expert review. Statistical analyses included chi-square tests, McNemar’s test, and logistic regression to assess accuracy, question fatigue, confidence calibration, and citation reliability. LLMs achieved high overall accuracy (78–87%), with the Individual Question format consistently yielding higher scores than Full Test, though differences were not statistically significant. Accuracy was highest in fact-dense domains (biochemistry, physiology, microbiology) and lowest in integrative domains (diagnosis, therapy). Significant question fatigue was observed in GPT-5.0 Full Test mode (OR = 0.997, p = 0.035) but not in Individual Question mode. Confidence scores predicted accuracy, with the strongest calibration in Individual Question mode. Citation analysis revealed frequent hallucinations, most of them critically erroneous, and citation validity was independent of answer accuracy. LLMs show promise as adjunctive tools for periodontal education, but their outputs, especially for complex reasoning and citations, require rigorous human review to ensure accuracy and safety.

Author summary: Artificial intelligence chatbots are rapidly entering medical education, yet we lack a comprehensive understanding of their reliability when students depend on them for learning. We developed a multidimensional evaluation framework to systematically assess AI performance beyond simple accuracy, examining how these systems behave across different medical topics, question types, and presentation formats. Using 331 real dental examination questions, we tested three major AI systems, analyzing not only correctness but also confidence calibration (whether AI confidence levels match actual accuracy), and implementing human-in-the-loop verification to check whether cited sources actually exist. Our findings highlight critical vulnerabilities in current AI systems. Most alarmingly, these chatbots fabricated nearly half of their citations while maintaining unwavering confidence in both correct and incorrect responses. This combination of overconfidence and misinformation means students cannot distinguish reliable from unreliable AI responses. Additionally, we documented progressive performance decline during sequential questioning, similar to human cognitive fatigue. While we know AI systems generate rather than retrieve information, our research demonstrates the real-world consequences of this limitation. As artificial intelligence integrates into education, healthcare diagnostics, and insurance decisions, these findings underscore the urgent need for better evaluation frameworks and user education about AI limitations.
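
The fatigue estimate in the abstract (OR = 0.997) is close to 1, so its practical size is easy to misjudge. The sketch below shows one possible reading. It assumes the odds ratio applies to each one-position step in question order within a logistic regression, which the abstract does not state explicitly. The function names and the 0.90 starting accuracy are hypothetical and chosen only for illustration; only the values 0.997 and 331 come from the abstract.

```python
def cumulative_odds_ratio(per_step_or: float, steps: int) -> float:
    """Odds ratio accumulated over `steps` one-unit increases in the predictor.

    In a logistic regression the per-unit odds ratios multiply, so the
    combined effect of `steps` units is per_step_or ** steps.
    """
    return per_step_or ** steps


def shifted_probability(baseline_p: float, odds_ratio: float) -> float:
    """Apply an odds ratio to a baseline probability and return the new probability."""
    odds = baseline_p / (1.0 - baseline_p) * odds_ratio
    return odds / (1.0 + odds)


PER_QUESTION_OR = 0.997  # reported in the abstract (GPT-5.0, Full Test mode)
N_QUESTIONS = 331        # reported exam length
BASELINE_P = 0.90        # hypothetical accuracy on the first question, NOT from the study

# Assumption: the odds ratio is per one-position step in question order,
# so the first and last questions are N_QUESTIONS - 1 steps apart.
total_or = cumulative_odds_ratio(PER_QUESTION_OR, N_QUESTIONS - 1)
final_p = shifted_probability(BASELINE_P, total_or)

print(f"cumulative odds ratio over {N_QUESTIONS - 1} steps: {total_or:.3f}")
print(f"implied accuracy on the last question: {final_p:.3f}")
```

Under these assumptions, a per-question odds ratio only slightly below 1 compounds into a sizeable drop in the odds of a correct answer by the end of a 331-question session. The implied accuracy depends entirely on the illustrative baseline and should not be read as a result of the study.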

Suggested Citation

  • Prita Abhay Dhaimade & Robin Henderson, 2026. "Multidimensional evaluation of large language models on the AAP in-service examination: Assessing accuracy, calibration, and citation reliability," PLOS Digital Health, Public Library of Science, vol. 5(5), pages 1-19, May.
  • Handle: RePEc:plo:pdig00:0001072
    DOI: 10.1371/journal.pdig.0001072

    Download full text from publisher

    File URL: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0001072
    Download Restriction: no

    File URL: https://journals.plos.org/digitalhealth/article/file?id=10.1371/journal.pdig.0001072&type=printable
    Download Restriction: no