Printed from https://ideas.repec.org/a/spr/compst/v35y2020i3d10.1007_s00180-019-00929-4.html

Random forest with acceptance–rejection trees

Authors
  • Peter Calhoun

    (Jaeb Center for Health Research)

  • Melodie J. Hallett

    (San Diego State University)

  • Xiaogang Su

    (University of Texas)

  • Guy Cafri

    (Johnson & Johnson Medical Devices)

  • Richard A. Levine

    (San Diego State University)

  • Juanjuan Fan

    (San Diego State University)

Abstract

In this paper, we propose a new random forest method based on completely randomized splitting rules with an acceptance–rejection criterion for quality control. We show how the proposed acceptance–rejection (AR) algorithm can outperform the standard random forest (RF) algorithm and some of its variants, including extremely randomized (ER) trees and smooth sigmoid surrogate (SSS) trees. Twenty datasets were analyzed to compare prediction performance, and a simulated dataset was used to assess variable selection bias. In terms of prediction accuracy for classification problems, the proposed AR algorithm performed best, with ER second. For regression problems, RF and SSS performed best, followed by AR, with ER last. However, each algorithm was the most accurate for at least one study. We investigate scenarios where the AR algorithm can yield better predictive performance. In terms of variable importance, both RF and SSS demonstrated selection bias in favor of variables with many possible splits, while both ER and AR largely removed this bias.
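
The abstract does not spell out the acceptance–rejection criterion, so the following is only a rough sketch of the general idea of a completely randomized split guarded by a quality check. Everything specific here is an assumption made for illustration rather than the authors' method: the Gini-based gain, the `threshold` and `max_tries` parameters, the fallback to the best rejected candidate, and the function names `gini`, `impurity_decrease`, and `ar_split`.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(xs, ys, cut):
    """Decrease in Gini impurity from splitting on xs <= cut."""
    left = [y for x, y in zip(xs, ys) if x <= cut]
    right = [y for x, y in zip(xs, ys) if x > cut]
    if not left or not right:
        return 0.0  # a split that leaves one side empty gains nothing
    n = len(ys)
    return (gini(ys)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))


def ar_split(X, y, threshold=0.05, max_tries=20, rng=None):
    """Draw random (feature, cutpoint) pairs until one passes the check.

    Assumed rule (hypothetical): accept the first candidate whose
    impurity decrease is at least `threshold`; if none passes within
    `max_tries`, fall back to the best rejected candidate.
    Returns (feature_index, cutpoint, gain, accepted), or None if no
    candidate could be drawn at all.
    """
    rng = rng or random.Random()
    best = None
    n_features = len(X[0])
    for _ in range(max_tries):
        j = rng.randrange(n_features)      # random variable
        col = [row[j] for row in X]
        lo, hi = min(col), max(col)
        if lo == hi:
            continue                       # constant column, nothing to split
        cut = rng.uniform(lo, hi)          # random cutpoint
        gain = impurity_decrease(col, y, cut)
        if gain >= threshold:
            return j, cut, gain, True      # accepted
        if best is None or gain > best[2]:
            best = (j, cut, gain, False)   # remember best rejected draw
    return best
```

Because both the variable and the cutpoint are drawn at random rather than searched exhaustively, a sketch like this does not favor variables simply for having many candidate cutpoints, which is in the spirit of the reduced selection bias the abstract reports for ER and AR.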

Suggested Citation

  • Peter Calhoun & Melodie J. Hallett & Xiaogang Su & Guy Cafri & Richard A. Levine & Juanjuan Fan, 2020. "Random forest with acceptance–rejection trees," Computational Statistics, Springer, vol. 35(3), pages 983-999, September.
  • Handle: RePEc:spr:compst:v:35:y:2020:i:3:d:10.1007_s00180-019-00929-4
    DOI: 10.1007/s00180-019-00929-4

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s00180-019-00929-4
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s00180-019-00929-4?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Zemin Zheng & Jie Zhang & Yang Li, 2022. "L0-Regularized Learning for High-Dimensional Additive Hazards Regression," INFORMS Journal on Computing, INFORMS, vol. 34(5), pages 2762-2775, September.
    2. Jung-sik Hong & Hyeongyu Yeo & Nam-Wook Cho & Taeuk Ahn, 2018. "Identification of Core Suppliers Based on E-Invoice Data Using Supervised Machine Learning," JRFM, MDPI, vol. 11(4), pages 1-13, October.
    3. Shang-Ming Zhou & Fabiola Fernandez-Gutierrez & Jonathan Kennedy & Roxanne Cooksey & Mark Atkinson & Spiros Denaxas & Stefan Siebert & William G Dixon & Terence W O’Neill & Ernest Choy & Cathie Sudlow, 2016. "Defining Disease Phenotypes in Primary Care Electronic Health Records by a Machine Learning Approach: A Case Study in Identifying Rheumatoid Arthritis," PLOS ONE, Public Library of Science, vol. 11(5), pages 1-14, May.
    4. Youngjoo Cho & Debashis Ghosh, 2021. "Quantile-Based Subgroup Identification for Randomized Clinical Trials," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 13(1), pages 90-128, April.
    5. Ismael Ahrazem Dfuf & José Manuel Mira McWilliams & María Camino González Fernández, 2019. "Multi-Output Conditional Inference Trees Applied to the Electricity Market: Variable Importance Analysis," Energies, MDPI, vol. 12(6), pages 1-24, March.
    6. Han, Dongxiao & Huang, Jian & Lin, Yuanyuan & Shen, Guohao, 2022. "Robust post-selection inference of high-dimensional mean regression with heavy-tailed asymmetric or heteroskedastic errors," Journal of Econometrics, Elsevier, vol. 230(2), pages 416-431.
    7. Makariou, Despoina & Barrieu, Pauline & Chen, Yining, 2021. "A random forest based approach for predicting spreads in the primary catastrophe bond market," Insurance: Mathematics and Economics, Elsevier, vol. 101(PB), pages 140-162.
    8. Christine Porzelius & Martin Schumacher & Harald Binder, 2011. "The benefit of data-based model complexity selection via prediction error curves in time-to-event data," Computational Statistics, Springer, vol. 26(2), pages 293-302, June.
    9. Foucher Yohann & Danger Richard, 2012. "Time Dependent ROC Curves for the Estimation of True Prognostic Capacity of Microarray Data," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 11(6), pages 1-22, November.
    10. J. Choi & S. Ye & K. H. Eng & K. Korthauer & W. H. Bradley & J. S. Rader & C. Kendziorski, 2017. "IPI59: An Actionable Biomarker to Improve Treatment Response in Serous Ovarian Carcinoma Patients," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 9(1), pages 1-12, June.
    11. Rossella Tatoli & Luisa Lampignano & Rossella Donghia & Alfredo Niro & Fabio Castellana & Ilaria Bortone & Roberta Zupo & Sarah Tirelli & Madia Lozupone & Francesco Panza & Giovanni Alessio & Francesc, 2023. "Retinal Microvasculature and Neural Changes and Dietary Patterns in an Older Population in Southern Italy," IJERPH, MDPI, vol. 20(6), pages 1-17, March.
    12. Hoora Moradian & Denis Larocque & François Bellavance, 2017. "$L_1$ splitting rules in survival forests," Lifetime Data Analysis: An International Journal Devoted to Statistical Methods and Applications for Time-to-Event Data, Springer, vol. 23(4), pages 671-691, October.
    13. Yiwei Fan & Gang Wang & Xiaoling Lu & Gaobin Wang, 2019. "Distributed forecasting and ant colony optimization for the bike-sharing rebalancing problem with unserved demands," PLOS ONE, Public Library of Science, vol. 14(12), pages 1-26, December.
    14. Eiran Z Gorodeski & Emer Joyce & Benjamin T Gandesbery & Eugene H Blackstone & David O Taylor & W H Wilson Tang & Randall C Starling & Rory Hachamovitch, 2017. "Discordance between 'actual' and 'scheduled' check-in times at a heart failure clinic," PLOS ONE, Public Library of Science, vol. 12(11), pages 1-13, November.
    15. Makariou, Despoina & Barrieu, Pauline & Chen, Yining, 2021. "A random forest based approach for predicting spreads in the primary catastrophe bond market," LSE Research Online Documents on Economics 111529, London School of Economics and Political Science, LSE Library.
    16. Wei-Yin Loh, 2014. "Fifty Years of Classification and Regression Trees," International Statistical Review, International Statistical Institute, vol. 82(3), pages 329-348, December.
    17. Dine, Abdessamad & Larocque, Denis & Bellavance, François, 2009. "Multivariate trees for mixed outcomes," Computational Statistics & Data Analysis, Elsevier, vol. 53(11), pages 3795-3804, September.
    18. Mao, Xiaojun & Peng, Liuhua & Wang, Zhonglei, 2022. "Nonparametric feature selection by random forests and deep neural networks," Computational Statistics & Data Analysis, Elsevier, vol. 170(C).
    19. Julia Gilhodes & Florence Dalenc & Jocelyn Gal & Christophe Zemmour & Eve Leconte & Jean Marie Boher & Thomas Filleron, 2020. "Comparison of Variable Selection Methods for Time-to-Event Data in High-Dimensional Settings," Post-Print hal-02934793, HAL.
    20. Demir Djekic & Erika Fagman & Oskar Angerås & George Lappas & Kjell Torén & Göran Bergström & Annika Rosengren, 2020. "Social Support and Subclinical Coronary Artery Disease in Middle-Aged Men and Women: Findings from the Pilot of Swedish CArdioPulmonary bioImage Study," IJERPH, MDPI, vol. 17(3), pages 1-16, January.
