Robustness of random forests for regression

Author

Listed:
  • Marie-Hélène Roy
  • Denis Larocque

Abstract

In this paper, we empirically investigate the robustness of random forests for regression problems. We also investigate the performance of six variations of the original random forest method, all aimed at improving robustness. These variations are based on three main ideas: (1) robustifying the aggregation method, (2) robustifying the splitting criterion and (3) applying a robust transformation to the response. More precisely, with the first idea, we use the median (or weighted median), instead of the mean, to combine the predictions from the individual trees. With the second idea, we use least-absolute deviations from the median, instead of least-squares, as the splitting criterion. With the third idea, we build the trees using the ranks of the response instead of the original values. The competing methods are compared via a simulation study with artificial data using two different types of contamination and also with 13 real data sets. Our results show that all three ideas improve the robustness of the original random forest algorithm. However, a robust aggregation of the individual trees is generally more profitable than a robust splitting criterion.
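
The three ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, which the abstract does not describe: the function names (`aggregate`, `ls_cost`, `lad_cost`, `average_ranks`) and the toy numbers are assumptions made purely for illustration, and the weighted median variant is not covered.

```python
from statistics import mean, median


def aggregate(tree_predictions, method="mean"):
    """Idea 1: combine the per-tree predictions for one observation."""
    if method == "mean":
        return mean(tree_predictions)
    if method == "median":
        return median(tree_predictions)
    raise ValueError(f"unknown aggregation method: {method}")


def ls_cost(values):
    """Least-squares node cost: squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def lad_cost(values):
    """Idea 2: least-absolute-deviations node cost, measured from the median."""
    m = median(values)
    return sum(abs(v - m) for v in values)


def average_ranks(y):
    """Idea 3: replace responses by 1-based ranks; ties share the mean rank."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    ranks = [0.0] * len(y)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and y[order[j + 1]] == y[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


# Hypothetical predictions from five trees; the last one is contaminated.
preds = [10.0, 11.0, 9.0, 10.5, 250.0]
mean_pred = aggregate(preds, "mean")      # pulled towards the outlier
median_pred = aggregate(preds, "median")  # stays near the bulk of the trees

# Hypothetical responses in one node, with a single outlying value.
node = [1, 1, 5, 5, 100]
ls_value, lad_value = ls_cost(node), lad_cost(node)

# Ranks bound the influence of the extreme response 100.0.
ranked = average_ranks([3.0, 100.0, 1.0, 3.0])
```

In this toy setting the median of the tree predictions ignores the single contaminated tree, whereas the mean does not. A split search would choose the threshold that minimises the sum of the chosen cost over the two child nodes, and a forest grown on ranks would predict on the rank scale.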

Suggested Citation

  • Marie-Hélène Roy & Denis Larocque, 2012. "Robustness of random forests for regression," Journal of Nonparametric Statistics, Taylor & Francis Journals, vol. 24(4), pages 993-1006, December.
  • Handle: RePEc:taf:gnstxx:v:24:y:2012:i:4:p:993-1006
    DOI: 10.1080/10485252.2012.715161

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1080/10485252.2012.715161
    Download Restriction: Access to full text is restricted to subscribers.

    Citations

    Cited by:

    1. Ju, Xiaomeng & Salibián-Barrera, Matías, 2021. "Robust boosting for regression problems," Computational Statistics & Data Analysis, Elsevier, vol. 153(C).
    2. Meng Zhang & Jiatong Ling & Buyun Tang & Shaohua Dong & Laibin Zhang, 2022. "A Data-Driven Based Method for Pipeline Additional Stress Prediction Subject to Landslide Geohazards," Sustainability, MDPI, vol. 14(19), pages 1-16, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Yikalo H. Araya & Tarmo K. Remmel & Ajith H. Perera, 2016. "What governs the presence of residual vegetation in boreal wildfires?," Journal of Geographical Systems, Springer, vol. 18(2), pages 159-181, April.
    2. Baboota, Rahul & Kaur, Harleen, 2019. "Predictive analysis and modelling football results using machine learning approach for English Premier League," International Journal of Forecasting, Elsevier, vol. 35(2), pages 741-755.
    3. S Lessmann & M-C Sung & J E V Johnson, 2011. "Towards a methodology for measuring the true degree of efficiency in a speculative market," Journal of the Operational Research Society, Palgrave Macmillan;The OR Society, vol. 62(12), pages 2120-2132, December.
    4. Luu, Tung Duy & Fadili, Jalal & Chesneau, Christophe, 2019. "PAC-Bayesian risk bounds for group-analysis sparse regression by exponential weighting," Journal of Multivariate Analysis, Elsevier, vol. 171(C), pages 209-233.
    5. Sarah Mittlefehldt & Erin Bunting & Emily Huff & Joseph Welsh & Robert Goodwin, 2021. "New Methods for Assessing Sustainability of Wood-Burning Energy Facilities: Combining Historical and Spatial Approaches," Energies, MDPI, vol. 14(23), pages 1-18, November.
    6. Sachin Kumar & T. Gopi & N. Harikeerthana & Munish Kumar Gupta & Vidit Gaur & Grzegorz M. Krolczyk & ChuanSong Wu, 2023. "Machine learning techniques in additive manufacturing: a state of the art review on design, processes and production control," Journal of Intelligent Manufacturing, Springer, vol. 34(1), pages 21-55, January.
    7. Chun-Xia Zhang & Jiang-She Zhang & Sang-Woon Kim, 2016. "PBoostGA: pseudo-boosting genetic algorithm for variable ranking and selection," Computational Statistics, Springer, vol. 31(4), pages 1237-1262, December.
    8. Seyed Naghibi & Hamid Pourghasemi, 2015. "A Comparative Assessment Between Three Machine Learning Models and Their Performance Comparison by Bivariate and Multivariate Statistical Methods in Groundwater Potential Mapping," Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), Springer;European Water Resources Association (EWRA), vol. 29(14), pages 5217-5236, November.
    9. Steffen Q. Mueller, 2020. "Pre- and within-season attendance forecasting in Major League Baseball: a random forest approach," Applied Economics, Taylor & Francis Journals, vol. 52(41), pages 4512-4528, September.
    10. Döpke, Jörg & Fritsche, Ulrich & Pierdzioch, Christian, 2017. "Predicting recessions with boosted regression trees," International Journal of Forecasting, Elsevier, vol. 33(4), pages 745-759.
    11. Bemah Ibrahim & Isaac Ahenkorah & Anthony Ewusi, 2022. "Explainable Risk Assessment of Rockbolts’ Failure in Underground Coal Mines Based on Categorical Gradient Boosting and SHapley Additive exPlanations (SHAP)," Sustainability, MDPI, vol. 14(19), pages 1-16, September.
    12. Dthenifer Cordeiro Santana & Regimar Garcia dos Santos & Pedro Henrique Neves da Silva & Hemerson Pistori & Larissa Pereira Ribeiro Teodoro & Nerison Luis Poersch & Gileno Brito de Azevedo & Glauce Ta, 2023. "Machine Learning Methods for Woody Volume Prediction in Eucalyptus," Sustainability, MDPI, vol. 15(14), pages 1-11, July.
    13. Hubáček, Ondřej & Šír, Gustav, 2023. "Beating the market with a bad predictive model," International Journal of Forecasting, Elsevier, vol. 39(2), pages 691-719.
    14. Biau, Gérard & Devroye, Luc & Dujmović, Vida & Krzyżak, Adam, 2012. "An affine invariant k-nearest neighbor regression estimate," Journal of Multivariate Analysis, Elsevier, vol. 112(C), pages 24-34.
    15. Mendez, Guillermo & Lohr, Sharon, 2011. "Estimating residual variance in random forest regression," Computational Statistics & Data Analysis, Elsevier, vol. 55(11), pages 2937-2950, November.
    16. John Martin & Sona Taheri & Mali Abdollahian, 2024. "Optimizing Ensemble Learning to Reduce Misclassification Costs in Credit Risk Scorecards," Mathematics, MDPI, vol. 12(6), pages 1, March.
    17. Vanesa Mateo-Pérez & Marina Corral-Bobadilla & Francisco Ortega-Fernández & Vicente Rodríguez-Montequín, 2021. "Determination of Water Depth in Ports Using Satellite Data Based on Machine Learning Algorithms," Energies, MDPI, vol. 14(9), pages 1-22, April.
    18. Saeid SHABANI, 2017. "Modelling and mapping of soil damage caused by harvesting in Caspian forests (Iran) using CART and RF data mining techniques," Journal of Forest Science, Czech Academy of Agricultural Sciences, vol. 63(9), pages 425-432.
    19. Schlembach, Christoph & Schmidt, Sascha L. & Schreyer, Dominik & Wunderlich, Linus, 2022. "Forecasting the Olympic medal distribution – A socioeconomic machine learning model," Technological Forecasting and Social Change, Elsevier, vol. 175(C).
    20. Wunderlich, Fabian & Memmert, Daniel, 2020. "Are betting returns a useful measure of accuracy in (sports) forecasting?," International Journal of Forecasting, Elsevier, vol. 36(2), pages 713-722.
