
Using Tree-Based Machine Learning for Health Studies: Literature Review and Case Series

Authors
  • Liangyuan Hu

    (Department of Biostatistics and Epidemiology, Rutgers University, Piscataway, NJ 08854, USA)

  • Lihua Li

    (Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA)

Abstract

Tree-based machine learning methods have gained traction in statistics and data science, where they have been shown to outperform traditional analysis approaches on a variety of research questions. To encourage their uptake in health research, we review the methodological fundamentals of three key tree-based methods: random forests, extreme gradient boosting and Bayesian additive regression trees. We then present a series of case studies illustrating how these methods can be properly used to solve important health research problems in four domains: variable selection, estimation of causal effects, propensity score weighting and missing data. We show that the central idea behind using ensemble tree methods for these research questions is accurate prediction via flexible modeling. First, we apply ensemble tree methods to select important predictors of postoperative respiratory complications among early-stage lung cancer patients with resectable tumors. Second, we demonstrate how to use these methods to estimate the causal effects of popular surgical approaches on postoperative respiratory complications among lung cancer patients. Third, using the same data, we use the methods to estimate the inverse probability weights for a propensity score analysis of the comparative effectiveness of the surgical approaches. Finally, we demonstrate how random forests can be used to impute missing data, using data from the Study of Women’s Health Across the Nation. To conclude, tree-based methods are a flexible tool that should be used, with proper care, in health investigations.
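
As a rough illustration of the propensity score weighting step mentioned in the abstract, the sketch below turns estimated propensity scores into inverse probability weights and a normalized weighted difference in outcome rates. It is not the authors' implementation: the function names `ipw_weights` and `weighted_risk_difference` are hypothetical, the data are made up, and the propensity scores are assumed to have been estimated beforehand by some tree ensemble (for example a random forest).

```python
from typing import List, Sequence


def ipw_weights(treated: Sequence[int], propensity: Sequence[float]) -> List[float]:
    """Inverse probability weights: 1/e for treated units, 1/(1 - e) for controls.

    `propensity` holds each unit's estimated probability of treatment,
    assumed here to come from a previously fitted tree ensemble.
    """
    weights = []
    for t, e in zip(treated, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(1.0 / e if t == 1 else 1.0 / (1.0 - e))
    return weights


def weighted_risk_difference(
    outcome: Sequence[float],
    treated: Sequence[int],
    propensity: Sequence[float],
) -> float:
    """Normalized weighted mean outcome among treated minus that among controls."""
    w = ipw_weights(treated, propensity)
    num1 = sum(wi * y for wi, y, t in zip(w, outcome, treated) if t == 1)
    den1 = sum(wi for wi, t in zip(w, treated) if t == 1)
    num0 = sum(wi * y for wi, y, t in zip(w, outcome, treated) if t == 0)
    den0 = sum(wi for wi, t in zip(w, treated) if t == 0)
    return num1 / den1 - num0 / den0


# Toy data (made up for illustration only).
treated = [1, 1, 0, 0]
propensity = [0.8, 0.5, 0.2, 0.5]
outcome = [1, 0, 0, 1]

print(ipw_weights(treated, propensity))
print(weighted_risk_difference(outcome, treated, propensity))
```

Normalizing by the sum of weights within each arm keeps the estimate stable when a few units have propensity scores close to 0 or 1, which is one reason accurate propensity estimates matter for this kind of analysis.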

Suggested Citation

  • Liangyuan Hu & Lihua Li, 2022. "Using Tree-Based Machine Learning for Health Studies: Literature Review and Case Series," IJERPH, MDPI, vol. 19(23), pages 1-13, December.
  • Handle: RePEc:gam:jijerp:v:19:y:2022:i:23:p:16080-:d:990266

    Download full text from publisher

    File URL: https://www.mdpi.com/1660-4601/19/23/16080/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/1660-4601/19/23/16080/
    Download Restriction: no
