IDEAS home Printed from https://ideas.repec.org/a/plo/pdig00/0000642.html
   My bibliography  Save this article

Conceptualizing bias in EHR data: A case study in performance disparities by demographic subgroups for a pediatric obesity incidence classifier

Author

Listed:
  • Elizabeth A Campbell
  • Saurav Bose
  • Aaron J Masino

Abstract

Electronic Health Records (EHRs) are increasingly used to develop machine learning models in predictive medicine. There has been limited research on utilizing machine learning methods to predict childhood obesity and related disparities in classifier performance among vulnerable patient subpopulations. In this work, classification models are developed to recognize pediatric obesity using temporal condition patterns obtained from patient EHR data in a U.S. study population. We trained four machine learning algorithms (Logistic Regression, Random Forest, Gradient Boosted Trees, and Neural Networks) to classify cases and controls as obesity positive or negative, and optimized hyperparameter settings through a bootstrapping methodology. To assess the classifiers for bias, we studied model performance by population subgroups then used permutation analysis to identify the most predictive features for each model and the demographic characteristics of patients with these features. Mean AUC-ROC values were consistent across classifiers, ranging from 0.72–0.80. Some evidence of bias was identified, although this was through the models performing better for minority subgroups (African Americans and patients enrolled in Medicaid). Permutation analysis revealed that patients from vulnerable population subgroups were over-represented among patients with the most predictive diagnostic patterns. We hypothesize that our models performed better on under-represented groups because the features more strongly associated with obesity were more commonly observed among minority patients. These findings highlight the complex ways that bias may arise in machine learning models and can be incorporated into future research to develop a thorough analytical approach to identify and mitigate bias that may arise from features and within EHR datasets when developing more equitable models.Author summary: Childhood obesity is a pressing health issue. Machine learning methods are useful tools to study and predict the condition. Electronic Health Record (EHR) data may be used in clinical research to develop solutions and improve outcomes for pressing health issues such as pediatric obesity. However, EHR data may contain biases that impact how machine learning models perform for marginalized patient subgroups. In this paper, we present a comprehensive framework of how bias may be present within EHR data and external sources of bias in the model development process. Our pediatric obesity case study describes a detailed exploration of a real-world machine learning model to contextualize how concepts related to EHR data and machine learning model bias occur in an applied setting. We describe how we evaluated our models for bias, and considered how these results are representative of health disparity issues related to pediatric obesity. Our paper adds to the limited body of literature on the use of machine learning methods to study pediatric obesity and investigates the potential pitfalls in using a machine learning approach when studying socially significant health issues.

Suggested Citation

  • Elizabeth A Campbell & Saurav Bose & Aaron J Masino, 2024. "Conceptualizing bias in EHR data: A case study in performance disparities by demographic subgroups for a pediatric obesity incidence classifier," PLOS Digital Health, Public Library of Science, vol. 3(10), pages 1-17, October.
  • Handle: RePEc:plo:pdig00:0000642
    DOI: 10.1371/journal.pdig.0000642
    as

    Download full text from publisher

    File URL: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000642
    Download Restriction: no

    File URL: https://journals.plos.org/digitalhealth/article/file?id=10.1371/journal.pdig.0000642&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pdig.0000642?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item
    ---><---

    References listed on IDEAS

    as
    1. Hee Yun Seol & Pragya Shrestha & Joy Fladager Muth & Chung-Il Wi & Sunghwan Sohn & Euijung Ryu & Miguel Park & Kathy Ihrke & Sungrim Moon & Katherine King & Philip Wheeler & Bijan Borah & James Moriar, 2021. "Artificial intelligence-assisted clinical decision support for childhood asthma management: A randomized clinical trial," PLOS ONE, Public Library of Science, vol. 16(8), pages 1-16, August.
    Full references (including those not matched with items on IDEAS)

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Trudeau, Kimberlee J. & Yang, Jichen & Di, Jiaming & Lu, Yi & Kraus, David R., 2023. "Predicting successful placements for youth in child welfare with machine learning," Children and Youth Services Review, Elsevier, vol. 153(C).

    More about this item

    Statistics

    Access and download statistics

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:plo:pdig00:0000642. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: digitalhealth (email available below). General contact details of provider: https://journals.plos.org/digitalhealth .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.