Printed from https://ideas.repec.org/a/eee/csdana/v60y2013icp50-69.html

A new variable selection approach using Random Forests

Author

Listed:
  • Hapfelmeier, A.
  • Ulm, K.

Abstract

Random Forests are frequently applied because they achieve high prediction accuracy and can identify informative variables. Several approaches for variable selection have been proposed to combine and intensify these qualities. An extensive review of the corresponding literature led to the development of a new approach that is based on the theoretical framework of permutation tests and meets important statistical properties. A comparison with eight other popular variable selection methods in three simulation studies and four real data applications indicated that the new approach can also be used to control the test-wise and family-wise error rates, provides higher power to distinguish relevant from irrelevant variables, and leads to models that are among the very best performing ones. In addition, it is equally applicable to regression and classification problems.
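
The abstract does not spell out the algorithm. As a rough illustration of the general idea only (a permutation test on a variable importance score, with a Bonferroni adjustment as one possible way to control the family-wise error rate), the following Python sketch may help. It is not the authors' implementation: all names (`abs_correlation`, `permutation_p_value`, `select_variables`) are hypothetical, the toy data are simulated, and the absolute Pearson correlation merely stands in for a Random Forest importance measure so that the example needs only the standard library.

```python
import random
from statistics import mean


def abs_correlation(columns, y, name):
    """Stand-in importance score (assumption): |Pearson r| of one predictor with y.

    A real application would refit a Random Forest on `columns` and return
    the importance of `name`; this placeholder keeps the sketch dependency-free.
    """
    x = columns[name]
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def permutation_p_value(columns, y, name, importance_fn, n_perm, rng):
    """Permutation p-value for one predictor.

    The predictor is shuffled to break its relation to y, the importance is
    recomputed, and the share of permuted scores at least as large as the
    observed score (with a +1 correction) is returned.
    """
    observed = importance_fn(columns, y, name)
    exceed = 0
    for _ in range(n_perm):
        permuted = dict(columns)          # shallow copy; other predictors untouched
        shuffled = list(columns[name])
        rng.shuffle(shuffled)
        permuted[name] = shuffled
        if importance_fn(permuted, y, name) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


def select_variables(columns, y, importance_fn, alpha=0.05, n_perm=200,
                     bonferroni=True, seed=0):
    """Keep predictors whose permutation p-value does not exceed the threshold.

    bonferroni=True divides alpha by the number of predictors (family-wise
    control); bonferroni=False applies alpha to each test separately.
    """
    rng = random.Random(seed)
    threshold = alpha / len(columns) if bonferroni else alpha
    p_values = {
        name: permutation_p_value(columns, y, name, importance_fn, n_perm, rng)
        for name in columns
    }
    selected = [name for name, p in p_values.items() if p <= threshold]
    return selected, p_values


# Toy data (simulated, not from the paper): y depends on x1 but not on noise.
data_rng = random.Random(1)
n = 100
x1 = [data_rng.gauss(0, 1) for _ in range(n)]
noise = [data_rng.gauss(0, 1) for _ in range(n)]
y = [2 * a + data_rng.gauss(0, 1) for a in x1]

selected, p_values = select_variables({"x1": x1, "noise": noise}, y, abs_correlation)
print(selected, p_values)
```

Passing the importance measure as a function keeps the test logic separate from the model, so a forest-based score could be swapped in for `abs_correlation` without touching the permutation loop. Refitting a forest for every permutation would of course be far more expensive than this placeholder.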

Suggested Citation

  • Hapfelmeier, A. & Ulm, K., 2013. "A new variable selection approach using Random Forests," Computational Statistics & Data Analysis, Elsevier, vol. 60(C), pages 50-69.
  • Handle: RePEc:eee:csdana:v:60:y:2013:i:c:p:50-69
    DOI: 10.1016/j.csda.2012.09.020

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167947312003490
    Download Restriction: Full text for ScienceDirect subscribers only.

    File URL: https://libkey.io/10.1016/j.csda.2012.09.020?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item


    References listed on IDEAS

    1. Strobl, Carolin & Boulesteix, Anne-Laure & Augustin, Thomas, 2007. "Unbiased split selection for classification trees based on the Gini Index," Computational Statistics & Data Analysis, Elsevier, vol. 52(1), pages 483-501, September.
    2. Archer, Kellie J. & Kimes, Ryan V., 2008. "Empirical characterization of random forest variable importance measures," Computational Statistics & Data Analysis, Elsevier, vol. 52(4), pages 2249-2260, January.
    3. van Wieringen, Wessel N. & Kun, David & Hampel, Regina & Boulesteix, Anne-Laure, 2009. "Survival prediction using gene expression data: A review and comparison," Computational Statistics & Data Analysis, Elsevier, vol. 53(5), pages 1590-1603, March.
    4. Willi Sauerbrei, 1999. "The Use of Resampling Methods to Simplify Regression Models in Medical Statistics," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 48(3), pages 313-329.
    5. Harrison, David Jr. & Rubinfeld, Daniel L., 1978. "Hedonic housing prices and the demand for clean air," Journal of Environmental Economics and Management, Elsevier, vol. 5(1), pages 81-102, March.

    Citations

    Cited by:

    1. Ingrida Vaiciulyte & Zivile Kalsyte & Leonidas Sakalauskas & Darius Plikynas, 2017. "Assessment of market reaction on the share performance on the basis of its visualization in 2D space," Journal of Business Economics and Management, Taylor & Francis Journals, vol. 18(2), pages 309-318, March.
    2. Hapfelmeier, Alexander & Hornung, Roman & Haller, Bernhard, 2023. "Efficient permutation testing of variable importance measures by the example of random forests," Computational Statistics & Data Analysis, Elsevier, vol. 181(C).
    3. Hapfelmeier, A. & Ulm, K., 2014. "Variable selection by Random Forests using data with missing values," Computational Statistics & Data Analysis, Elsevier, vol. 80(C), pages 129-139.
    4. Jin Li & Maggie Tran & Justy Siwabessy, 2016. "Selecting Optimal Random Forest Predictive Models: A Case Study on Predicting the Spatial Distribution of Seabed Hardness," PLOS ONE, Public Library of Science, vol. 11(2), pages 1-29, February.
    5. Zardad Khan & Asma Gul & Aris Perperoglou & Miftahuddin Miftahuddin & Osama Mahmoud & Werner Adler & Berthold Lausen, 2020. "Ensemble of optimal trees, random forest and random projection ensemble classification," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 14(1), pages 97-116, March.
    6. Saurabh Saxena & Darius Roman & Valentin Robu & David Flynn & Michael Pecht, 2021. "Battery Stress Factor Ranking for Accelerated Degradation Test Planning Using Machine Learning," Energies, MDPI, vol. 14(3), pages 1-17, January.
    7. Abellán, Joaquín & Baker, Rebecca M. & Coolen, Frank P.A. & Crossman, Richard J. & Masegosa, Andrés R., 2014. "Classification with decision trees from a nonparametric predictive inference perspective," Computational Statistics & Data Analysis, Elsevier, vol. 71(C), pages 789-802.
    8. Liangyuan Hu & Lihua Li, 2022. "Using Tree-Based Machine Learning for Health Studies: Literature Review and Case Series," IJERPH, MDPI, vol. 19(23), pages 1-13, December.
    9. Dogah, Kingsley E. & Premaratne, Gamini, 2018. "Sectoral exposure of financial markets to oil risk factors in BRICS countries," Energy Economics, Elsevier, vol. 76(C), pages 228-256.
    10. Weijun Wang & Dan Zhao & Liguo Fan & Yulong Jia, 2019. "Study on Icing Prediction of Power Transmission Lines Based on Ensemble Empirical Mode Decomposition and Feature Selection Optimized Extreme Learning Machine," Energies, MDPI, vol. 12(11), pages 1-21, June.
    11. Hermel Homburger & Manuel K Schneider & Sandra Hilfiker & Andreas Lüscher, 2014. "Inferring Behavioral States of Grazing Livestock from High-Frequency Position Data Alone," PLOS ONE, Public Library of Science, vol. 9(12), pages 1-22, December.
    12. Barbara Baranowska & Anna Kajdy & Paulina Pawlicka & Ernest Pokropek & Michał Rabijewski & Dorota Sys & Artur Pokropek, 2020. "What are the Critical Elements of Satisfaction and Experience in Labor and Childbirth—A Cross-Sectional Study," IJERPH, MDPI, vol. 17(24), pages 1-13, December.
    13. Fellinghauer, Bernd & Bühlmann, Peter & Ryffel, Martin & von Rhein, Michael & Reinhardt, Jan D., 2013. "Stable graphical model estimation with Random Forests for discrete, continuous, and mixed variables," Computational Statistics & Data Analysis, Elsevier, vol. 64(C), pages 132-152.
    14. Massimiliano Fessina & Giambattista Albora & Andrea Tacchella & Andrea Zaccaria, 2022. "Which products activate a product? An explainable machine learning approach," Papers 2212.03094, arXiv.org.
    15. Lkhagvadorj Munkhdalai & Tsendsuren Munkhdalai & Oyun-Erdene Namsrai & Jong Yun Lee & Keun Ho Ryu, 2019. "An Empirical Comparison of Machine-Learning Methods on Bank Client Credit Assessments," Sustainability, MDPI, vol. 11(3), pages 1-23, January.
    16. Michael Dadole Ubagan & Yun-Sik Lee & Taekjun Lee & Jinsol Hong & Il Hoi Kim & Sook Shin, 2021. "Settlement and Recruitment Potential of Four Invasive and One Indigenous Barnacles in South Korea and Their Future," Sustainability, MDPI, vol. 13(2), pages 1-14, January.
    17. Edward Gage & David Cooper, 2015. "The Influence of Land Cover, Vertical Structure, and Socioeconomic Factors on Outdoor Water Use in a Western US City," Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), Springer;European Water Resources Association (EWRA), vol. 29(10), pages 3877-3890, August.
    18. Bryan Keller, 2020. "Variable Selection for Causal Effect Estimation: Nonparametric Conditional Independence Testing With Random Forests," Journal of Educational and Behavioral Statistics, vol. 45(2), pages 119-142, April.
    19. Silke Janitza & Ender Celik & Anne-Laure Boulesteix, 2018. "A computationally fast variable importance test for random forests for high-dimensional data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(4), pages 885-915, December.
    20. Polasek, Tomas & Čadík, Martin, 2023. "Predicting photovoltaic power production using high-uncertainty weather forecasts," Applied Energy, Elsevier, vol. 339(C).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Daniel L. Chen & Markus Loecher, 2022. "Mood and the Malleability of Moral Reasoning: The Impact of Irrelevant Factors on Judicial Decisions," Working Papers hal-03864854, HAL.
    2. Hapfelmeier, A. & Ulm, K., 2014. "Variable selection by Random Forests using data with missing values," Computational Statistics & Data Analysis, Elsevier, vol. 80(C), pages 129-139.
    3. Lucija Muehlenbachs & Elisheba Spiller & Christopher Timmins, 2015. "The Housing Market Impacts of Shale Gas Development," American Economic Review, American Economic Association, vol. 105(12), pages 3633-3659, December.
    4. Jianhong Shi & Qian Yang & Xiongya Li & Weixing Song, 2017. "Effects of measurement error on a class of single-index varying coefficient regression models," Computational Statistics, Springer, vol. 32(3), pages 977-1001, September.
    5. Hapfelmeier, Alexander & Ulm, Kurt & Hothorn, Torsten & Riediger, Carina, 2014. "Estimation of a Predictor’s Importance by Random Forests When There Is Missing Data: RISK Prediction in Liver Surgery using Laboratory Data," The International Journal of Biostatistics, De Gruyter, vol. 10(2), pages 1-19, November.
    6. Binh Thai Pham & Chongchong Qi & Lanh Si Ho & Trung Nguyen-Thoi & Nadhir Al-Ansari & Manh Duc Nguyen & Huu Duy Nguyen & Hai-Bang Ly & Hiep Van Le & Indra Prakash, 2020. "A Novel Hybrid Soft Computing Model Using Random Forest and Particle Swarm Optimization for Estimation of Undrained Shear Strength of Soil," Sustainability, MDPI, vol. 12(6), pages 1-16, March.
    7. Smith, Michael & Kohn, Robert, 1996. "Nonparametric regression using Bayesian variable selection," Journal of Econometrics, Elsevier, vol. 75(2), pages 317-343, December.
    8. Villalonga, Belen, 2004. "Intangible resources, Tobin's q, and sustainability of performance differences," Journal of Economic Behavior & Organization, Elsevier, vol. 54(2), pages 205-230, June.
    9. Brockmeier, M., 1991. "Entwicklung und Aufhebung von Reinheitsgeboten im Nahrungsmittelbereich – Analyse und Bewertung," Proceedings “Schriften der Gesellschaft für Wirtschafts- und Sozialwissenschaften des Landbaues e.V.”, German Association of Agricultural Economists (GEWISOLA), vol. 27.
    10. Miles M Finney, 2017. "Air Quality and the Development of Los Angeles," The Review of Regional Studies, Southern Regional Science Association, vol. 47(3), pages 271-288, Fall.
    11. Terri Menke, 1987. "Economic Welfare and Urban Amenities Across Race-Sex Groups," Urban Studies, Urban Studies Journal Limited, vol. 24(2), pages 151-161, April.
    12. Suneel Babu Chatla, 2023. "Nonparametric inference for additive models estimated via simplified smooth backfitting," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 75(1), pages 71-97, February.
    13. Miller, Steve & Startz, Richard, 2019. "Feasible generalized least squares using support vector regression," Economics Letters, Elsevier, vol. 175(C), pages 28-31.
    14. Chunfang Zhao & Yingliang Wu & Yunfeng Chen & Guohua Chen, 2023. "Multiscale Effects of Hedonic Attributes on Airbnb Listing Prices Based on MGWR: A Case Study of Beijing, China," Sustainability, MDPI, vol. 15(2), pages 1-21, January.
    15. Umberto Amato & Anestis Antoniadis & Italia De Feis & Irene Gijbels, 2021. "Penalised robust estimators for sparse and high-dimensional linear models," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 30(1), pages 1-48, March.
    16. Prendergast, Luke A. & Li Wai Suen, Connie, 2011. "A new and practical influence measure for subsets of covariance matrix sample principal components with applications to high dimensional datasets," Computational Statistics & Data Analysis, Elsevier, vol. 55(1), pages 752-764, January.
    17. repec:asg:wpaper:1006 is not listed on IDEAS
    18. Tizheng Li & Xiaojuan Kang, 2022. "Variable selection of higher-order partially linear spatial autoregressive model with a diverging number of parameters," Statistical Papers, Springer, vol. 63(1), pages 243-285, February.
    19. Deac, Dan Stelian & Schebesch, Klaus Bruno, 2018. "Market Forecasts and Client Behavioral Data: Towards Finding Adequate Model Complexity," Studia Universitatis „Vasile Goldis” Arad – Economics Series, Sciendo, vol. 28(3), pages 50-75, September.
    20. James Hansen & James McDonald & Panayiotis Theodossiou & Brad Larsen, 2010. "Partially Adaptive Econometric Methods For Regression and Classification," Computational Economics, Springer;Society for Computational Economics, vol. 36(2), pages 153-169, August.
    21. Jörg Kalbfuß & Reto Odermatt & Alois Stutzer, 2018. "Medical marijuana laws and mental health in the United States," CEP Discussion Papers dp1546, Centre for Economic Performance, LSE.
