
Variable selection with ABC Bayesian forests

Authors

  • Yi Liu
  • Veronika Ročková
  • Yuexi Wang

Abstract

Few problems in statistics are as perplexing as variable selection in the presence of very many redundant covariates. The variable selection problem is most familiar in parametric environments such as the linear model or additive variants thereof. In this work, we abandon the linear model framework, which can be quite detrimental when the covariates impact the outcome in a non‐linear way, and turn to tree‐based methods for variable selection. Such variable screening is traditionally done by pruning down large trees or by ranking variables based on some importance measure. Despite being heavily used in practice, these ad hoc selection rules are not yet well understood from a theoretical point of view. We devise a Bayesian tree‐based probabilistic method and show that it is consistent for variable selection when the regression surface is a smooth mix of p > n covariates. These results are the first model selection consistency results for Bayesian forest priors. Probabilistic assessment of variable importance is made feasible by a spike‐and‐slab wrapper around sum‐of‐trees priors. Sampling from posterior distributions over trees is inherently very difficult. As an alternative to Markov chain Monte Carlo (MCMC), we propose approximate Bayesian computation (ABC) Bayesian forests, a new ABC sampling method based on data‐splitting that achieves a higher ABC acceptance rate. We show that the method is robust and successful at finding variables with high marginal inclusion probabilities. Our ABC algorithm provides a new avenue towards approximating the median probability model in non‐parametric setups where the marginal likelihood is intractable.
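
The idea in the abstract (draw a candidate variable subset from a spike-and-slab style prior, split the data, fit on one half, compare against the other half, and keep the best-matching draws) can be illustrated with a rough sketch. This is not the authors' algorithm: the Bayesian forest is replaced here by a plain k-nearest-neighbour regressor, the discrepancy is a held-out RMSE computed from point predictions rather than from simulated pseudo-data, and every name and default (knn_predict, abc_variable_selection, keep_frac, prior_incl, the simulated data) is a hypothetical choice made only for illustration.

```python
import math
import random


def knn_predict(train_x, train_y, query, cols, k=5):
    """Stand-in learner (NOT a Bayesian forest): k-NN regression on `cols` only."""
    if not cols:  # empty model: fall back to the training mean
        return sum(train_y) / len(train_y)
    dists = sorted(
        (sum((row[j] - query[j]) ** 2 for j in cols), y)
        for row, y in zip(train_x, train_y)
    )
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)


def abc_variable_selection(x, y, n_draws=300, keep_frac=0.1, prior_incl=0.25, rng=None):
    """ABC-style screening: returns (marginal inclusion frequencies, median model)."""
    rng = rng or random.Random(0)
    n, p = len(x), len(x[0])
    draws = []
    for _ in range(n_draws):
        # Prior draw: each covariate enters the candidate subset independently.
        subset = [j for j in range(p) if rng.random() < prior_incl]
        # Data splitting: a fresh random split into a fitting half and a comparison half.
        idx = list(range(n))
        rng.shuffle(idx)
        fit_idx, cmp_idx = idx[: n // 2], idx[n // 2:]
        fit_x = [x[i] for i in fit_idx]
        fit_y = [y[i] for i in fit_idx]
        # Discrepancy: RMSE between held-out responses and the stand-in predictions.
        sq = [(y[i] - knn_predict(fit_x, fit_y, x[i], subset)) ** 2 for i in cmp_idx]
        draws.append((math.sqrt(sum(sq) / len(sq)), subset))
    # Accept the fraction of draws with the smallest discrepancy.
    draws.sort(key=lambda d: d[0])
    n_keep = max(1, int(keep_frac * n_draws))
    accepted = [subset for _, subset in draws[:n_keep]]
    # Inclusion frequency of each covariate among accepted draws.
    pip = [sum(j in s for s in accepted) / n_keep for j in range(p)]
    # Median probability model: covariates included in at least half of accepted draws.
    median_model = [j for j in range(p) if pip[j] >= 0.5]
    return pip, median_model


def simulate(n=80, p=8, seed=1):
    """Toy data: only covariates 0 and 1 affect the response, both non-linearly."""
    rng = random.Random(seed)
    x = [[rng.uniform(-1, 1) for _ in range(p)] for _ in range(n)]
    y = [math.sin(math.pi * r[0]) + 2 * r[1] ** 2 + rng.gauss(0, 0.1) for r in x]
    return x, y


x, y = simulate()
pip, median_model = abc_variable_selection(x, y)
print([round(v, 2) for v in pip], median_model)
```

In this sketch the acceptance rule keeps a fixed fraction of draws instead of using a fixed tolerance, which is a common practical choice in ABC; a faithful implementation would fit a sum-of-trees model on the fitting half and simulate responses for the comparison half.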

Suggested Citation

  • Yi Liu & Veronika Ročková & Yuexi Wang, 2021. "Variable selection with ABC Bayesian forests," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 83(3), pages 453-481, July.
  • Handle: RePEc:bla:jorssb:v:83:y:2021:i:3:p:453-481
    DOI: 10.1111/rssb.12423


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Bhatnagar, Sahir R. & Lu, Tianyuan & Lovato, Amanda & Olds, David L. & Kobor, Michael S. & Meaney, Michael J. & O'Donnell, Kieran & Yang, Archer Y. & Greenwood, Celia M.T., 2023. "A sparse additive model for high-dimensional interactions with an exposure variable," Computational Statistics & Data Analysis, Elsevier, vol. 179(C).
    2. Gael M. Martin & David T. Frazier & Christian P. Robert, 2020. "Computing Bayes: Bayesian Computation from 1763 to the 21st Century," Monash Econometrics and Business Statistics Working Papers 14/20, Monash University, Department of Econometrics and Business Statistics.
    3. Bernardi, Mauro & Costola, Michele, 2019. "High-dimensional sparse financial networks through a regularised regression model," SAFE Working Paper Series 244, Leibniz Institute for Financial Research SAFE.
    4. Loann David Denis Desboulets, 2018. "A Review on Variable Selection in Regression Analysis," Econometrics, MDPI, vol. 6(4), pages 1-27, November.
    5. Henri Pesonen & Umberto Simola & Alvaro Köhn‐Luque & Henri Vuollekoski & Xiaoran Lai & Arnoldo Frigessi & Samuel Kaski & David T. Frazier & Worapree Maneesoonthorn & Gael M. Martin & Jukka Corander, 2023. "ABC of the future," International Statistical Review, International Statistical Institute, vol. 91(2), pages 243-268, August.
    6. Pedro Delicado & Daniel Peña, 2023. "Understanding complex predictive models with ghost variables," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 32(1), pages 107-145, March.
    7. Gael M. Martin & David T. Frazier & Christian P. Robert, 2021. "Approximating Bayes in the 21st Century," Monash Econometrics and Business Statistics Working Papers 24/21, Monash University, Department of Econometrics and Business Statistics.
    8. Posch, Konstantin & Arbeiter, Maximilian & Pilz, Juergen, 2020. "A novel Bayesian approach for variable selection in linear regression models," Computational Statistics & Data Analysis, Elsevier, vol. 144(C).
    9. Fabian Scheipl & Thomas Kneib & Ludwig Fahrmeir, 2013. "Penalized likelihood and Bayesian function selection in regression models," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 97(4), pages 349-385, October.
    10. Diego Vidaurre & Concha Bielza & Pedro Larrañaga, 2013. "A Survey of L1 Regression," International Statistical Review, International Statistical Institute, vol. 81(3), pages 361-387, December.
    11. Xia Zheng & Yaohua Rong & Ling Liu & Weihu Cheng, 2021. "A More Accurate Estimation of Semiparametric Logistic Regression," Mathematics, MDPI, vol. 9(19), pages 1-12, September.
    12. Du, Pang & Cheng, Guang & Liang, Hua, 2012. "Semiparametric regression models with additive nonparametric components and high dimensional parametric components," Computational Statistics & Data Analysis, Elsevier, vol. 56(6), pages 2006-2017.
    13. Oyebayo Ridwan Olaniran & Ali Rashash R. Alzahrani, 2023. "On the Oracle Properties of Bayesian Random Forest for Sparse High-Dimensional Gaussian Regression," Mathematics, MDPI, vol. 11(24), pages 1-29, December.
    14. Adel Javanmard & Jason D. Lee, 2020. "A flexible framework for hypothesis testing in high dimensions," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 82(3), pages 685-718, July.
    15. Sayanti Guha Majumdar & Anil Rai & Dwijesh Chandra Mishra, 2023. "Estimation of Error Variance in Genomic Selection for Ultrahigh Dimensional Data," Agriculture, MDPI, vol. 13(4), pages 1-16, April.
    16. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
    17. Fan, Jianqing & Jiang, Bai & Sun, Qiang, 2022. "Bayesian factor-adjusted sparse regression," Journal of Econometrics, Elsevier, vol. 230(1), pages 3-19.
    18. Hang Yu & Yuanjia Wang & Donglin Zeng, 2023. "A general framework of nonparametric feature selection in high‐dimensional data," Biometrics, The International Biometric Society, vol. 79(2), pages 951-963, June.
    19. Rina Friedberg & Julie Tibshirani & Susan Athey & Stefan Wager, 2018. "Local Linear Forests," Papers 1807.11408, arXiv.org, revised Sep 2020.
    20. Hui Xiao & Yiguo Sun, 2020. "Forecasting the Returns of Cryptocurrency: A Model Averaging Approach," JRFM, MDPI, vol. 13(11), pages 1-15, November.
