Author
Listed:
- Juan Andres Medina Florez
- Shaina Raza
- Rashida Lynn Ansell
- Zahra Shakeri
- Brendan T Smith
- Elham Dolatabadi
Abstract
Understanding disparities in the prevalence of Post COVID-19 Condition (PCC) amongst vulnerable populations is crucial to improving care and addressing intersecting inequities. This study aims to develop a comprehensive framework for integrating social determinants of health (SDOH) into PCC research by leveraging natural language processing (NLP) techniques to analyze disparities and variations in SDOH representation within PCC case reports. Following construction of a PCC Case Report Corpus, comprising over 7,000 case reports from the LitCOVID repository, a subset of 709 reports was annotated with 26 core SDOH-related entity types using pre-trained named entity recognition (NER) models, human review, and data augmentation to improve quality, diversity and representation of entity types. An NLP pipeline integrating NER, natural language inference (NLI), trigram and frequency analyses was developed to extract and analyze these entities. Both encoder-only transformer models and RNN-based models were assessed for the NER objective. Fine-tuned encoder-only BERT models outperformed traditional RNN-based models in generalizability to distinct sentence structures and greater class sparsity, achieving a macro F1-score of 0.72 and macro AUC of 0.99 on a held-out generalization set. Exploratory analysis revealed variability in entity richness, with prevalent entities like condition, age, and access to care, and under-representation of sensitive categories like race and housing status. Trigram analysis highlighted frequent co-occurrences among entities, including age, gender, and condition.
The NLI objective (entailment and contradiction analysis) showed attributes like “Experienced violence or abuse” and “Has medical insurance” had high entailment rates (82.4%–80.3%), while attributes such as “Is female-identifying,” “Is married,” and “Has a terminal condition” exhibited high contradiction rates (70.8%–98.5%). Our results highlight the effectiveness of transformer-based NER in extracting SDOH information from case reports. However, the findings also expose critical gaps in the representation of marginalized groups within PCC-related academic case reports, e.g., across gender, insurance status, and age. This work underscores the need for standardized SDOH documentation and inclusive reporting practices to enable more equitable research and inform future health policy and AI model development.
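The abstract reports a macro F1-score and stresses class sparsity. As a rough illustration only, and not the authors' evaluation code, the sketch below computes macro F1 over flat per-token labels. The function name `macro_f1`, the label names, and the toy data are all hypothetical, and real NER evaluation is often done at the entity-span level rather than per token.

```python
from collections import defaultdict


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    tp = defaultdict(int)  # true positives per label
    fp = defaultdict(int)  # false positives per label
    fn = defaultdict(int)  # false negatives per label
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy labels: the rare class RACE is missed entirely.
gold = ["AGE", "CONDITION", "CONDITION", "RACE", "CONDITION", "AGE"]
pred = ["AGE", "CONDITION", "CONDITION", "CONDITION", "CONDITION", "AGE"]

score = macro_f1(gold, pred)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"macro F1 = {score:.3f}, accuracy = {accuracy:.3f}")
```

In this toy case five of six labels are correct, yet the macro average is pulled down because the single rare-class example is missed. This is why a macro-averaged score is a natural choice when some entity types, such as race or housing status, are sparse.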
Suggested Citation
Juan Andres Medina Florez & Shaina Raza & Rashida Lynn Ansell & Zahra Shakeri & Brendan T Smith & Elham Dolatabadi, 2025.
"Academic case reports lack diversity: Assessing the presence and diversity of sociodemographic and behavioral factors related to Post COVID-19 Condition,"
PLOS ONE, Public Library of Science, vol. 20(7), pages 1-22, July.
Handle:
RePEc:plo:pone00:0326668
DOI: 10.1371/journal.pone.0326668