
Comparison of machine-learning and logistic regression models for prediction of 30-day unplanned readmission in electronic health records: A development and validation study

Author

Listed:
  • Masao Iwagami
  • Ryota Inokuchi
  • Eiryo Kawakami
  • Tomohide Yamada
  • Atsushi Goto
  • Toshiki Kuno
  • Yohei Hashimoto
  • Nobuaki Michihata
  • Tadahiro Goto
  • Tomohiro Shinozaki
  • Yu Sun
  • Yuta Taniguchi
  • Jun Komiyama
  • Kazuaki Uda
  • Toshikazu Abe
  • Nanako Tamiya

Abstract

It is expected, but not established, that machine-learning models can outperform regression models such as logistic regression (LR), especially as the number and types of predictor variables in electronic health records (EHRs) increase. We aimed to compare the predictive performance of gradient-boosted decision tree (GBDT), random forest (RF), deep neural network (DNN), and LR with the least absolute shrinkage and selection operator (LR-LASSO) for unplanned readmission. We used EHRs of patients discharged alive from 38 hospitals in 2015–2017 for derivation and in 2018 for validation, including basic characteristics; diagnosis, surgery, procedure, and drug codes; and blood-test results. The outcome was 30-day unplanned readmission. We created six data tables that differed in the number of binary variables (those present in ≥5% of patients, ≥1% of patients, or ≥10 patients), each with and without blood-test results. For each data table, we used the derivation data to fit the machine-learning and LR models and the validation data to evaluate the performance of each model. The incidence of the outcome was 6.8% (23,108/339,513 discharges) in the derivation dataset and 6.4% (7,507/118,074 discharges) in the validation dataset. For the first data table, with the smallest number of variables (102 variables present in ≥5% of patients, without blood-test results), the c-statistic was highest for GBDT (0.740), followed by RF (0.734), LR-LASSO (0.720), and DNN (0.664). For the last data table, with the largest number of variables (1543 variables present in ≥10 patients, including blood-test results), the c-statistic was highest for GBDT (0.764), followed by LR-LASSO (0.755), RF (0.751), and DNN (0.720); the difference between GBDT and LR-LASSO was small, and their 95% confidence intervals overlapped. In conclusion, GBDT generally outperformed LR-LASSO in predicting unplanned readmission, but the difference in c-statistic narrowed as the number of variables increased and blood-test results were added.

Author summary: Whether machine-learning models can outperform traditional statistical models, such as a logistic regression (LR) model, in predicting hospital readmission from electronic health records (EHRs) has been controversial. Therefore, this study systematically compared the performance of several machine-learning models and an LR model in predicting 30-day unplanned readmission. We created six data tables according to the number of binary predictor variables (those present in ≥5% of patients, ≥1% of patients, or ≥10 patients), with and without blood-test results, expecting that some machine-learning models would outperform the LR model more prominently as the data became richer. We found that the gradient-boosted decision tree (one of the machine-learning models) generally outperformed the LR model. However, contrary to our expectation, the difference in predictive performance between them was smaller in the last data table, which had the largest number of variables (1543 variables including blood-test results). Thus, this study concludes that the advantage of machine-learning methods over traditional statistical models may not grow as the information in EHRs becomes richer. Future studies should focus on other potential predictors in EHRs, such as images and processed natural language, to demonstrate the superior performance of machine-learning methods over traditional statistical models.
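
Two steps described above lend themselves to a small illustration: keeping only binary variables that enough patients have (the ≥5%, ≥1%, and ≥10-patient thresholds) and scoring a model by its c-statistic. The following is a minimal sketch under stated assumptions, not the authors' code. The function names, the column names, and the toy data are hypothetical, and the c-statistic is computed by direct pairwise comparison, which is only practical for small samples.

```python
from typing import Dict, List, Sequence


def select_by_prevalence(
    columns: Dict[str, Sequence[int]],
    min_fraction: float = 0.0,
    min_count: int = 0,
) -> List[str]:
    """Return names of binary (0/1) columns that enough patients have.

    A column is kept when its number of 1s is at least `min_count`
    and at least `min_fraction` of the number of rows.
    """
    kept = []
    for name, values in columns.items():
        positives = sum(values)
        if positives >= min_count and positives >= min_fraction * len(values):
            kept.append(name)
    return kept


def c_statistic(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """C-statistic (area under the ROC curve) by pairwise comparison.

    This is the share of (readmitted, not readmitted) pairs in which the
    readmitted discharge has the higher score, with ties counted as 0.5.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative outcome")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 10 patients, three binary code variables.
toy = {
    "dx_code_a": [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # 2 of 10 patients
    "rx_code_b": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # 1 of 10 patients
    "proc_code_c": [0] * 10,                      # no patients
}
common = select_by_prevalence(toy, min_fraction=0.05)  # the ">=5%" rule
frequent = select_by_prevalence(toy, min_count=2)      # a count-based rule

# Hypothetical outcomes and predicted risks for four discharges.
auc = c_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

On the toy data, the 5% rule keeps `dx_code_a` and `rx_code_b`, the count rule keeps only `dx_code_a`, and the c-statistic is 0.75 because three of the four outcome pairs are ranked correctly. For datasets of the size reported here, a rank-based implementation from an established library would be the more realistic choice.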

Suggested Citation

  • Masao Iwagami & Ryota Inokuchi & Eiryo Kawakami & Tomohide Yamada & Atsushi Goto & Toshiki Kuno & Yohei Hashimoto & Nobuaki Michihata & Tadahiro Goto & Tomohiro Shinozaki & Yu Sun & Yuta Taniguchi & J, 2024. "Comparison of machine-learning and logistic regression models for prediction of 30-day unplanned readmission in electronic health records: A development and validation study," PLOS Digital Health, Public Library of Science, vol. 3(8), pages 1-16, August.
  • Handle: RePEc:plo:pdig00:0000578
    DOI: 10.1371/journal.pdig.0000578

    Download full text from publisher

    File URL: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000578
    Download Restriction: no

    File URL: https://journals.plos.org/digitalhealth/article/file?id=10.1371/journal.pdig.0000578&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pdig.0000578?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item