On the overestimation of random forest’s out-of-bag error

Author

Silke Janitza
Roman Hornung

Abstract

The ensemble method random forests has become a popular classification tool in bioinformatics and related fields. The out-of-bag error is an error estimation technique often used to evaluate the accuracy of a random forest and to select appropriate values for tuning parameters, such as the number of candidate predictors that are randomly drawn for a split, referred to as mtry. However, for binary classification problems with metric predictors it has been shown that the out-of-bag error can overestimate the true prediction error depending on the choices of random forests parameters. Based on simulated and real data this paper aims to identify settings for which this overestimation is likely. It is, moreover, questionable whether the out-of-bag error can be used in classification tasks for selecting tuning parameters like mtry, because the overestimation is seen to depend on the parameter mtry. The simulation-based and real-data based studies with metric predictor variables performed in this paper show that the overestimation is largest in balanced settings and in settings with few observations, a large number of predictor variables, small correlations between predictors and weak effects. There was hardly any impact of the overestimation on tuning parameter selection. However, although the prediction performance of random forests was not substantially affected when using the out-of-bag error for tuning parameter selection in the present studies, one cannot be sure that this applies to all future data. For settings with metric predictor variables it is therefore strongly recommended to use stratified subsampling with sampling fractions that are proportional to the class sizes for both tuning parameter selection and error estimation in random forests. This yielded less biased estimates of the true prediction error. In unbalanced settings, in which there is a strong interest in predicting observations from the smaller classes well, sampling the same number of observations from each class is a promising alternative.
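
The two sampling schemes named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `stratified_subsample`, its parameters, and the default sampling fraction of 0.632 are assumptions made for illustration only. For a single tree, it draws observations without replacement within each class, either in proportion to the class sizes or with the same number from every class, and returns the in-bag indices together with the out-of-bag indices on which that tree's error would be evaluated.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction=0.632, balanced=False, rng=None):
    """Split observation indices into in-bag and out-of-bag sets for one tree.

    labels   : sequence of class labels, one per observation
    fraction : share of each class drawn without replacement
               (0.632 is an assumed default, not taken from the abstract)
    balanced : if True, draw the same number of observations from every
               class, with the number determined by the smallest class
    rng      : optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()

    # Group observation indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    smallest = min(len(members) for members in by_class.values())

    in_bag = []
    for members in by_class.values():
        if balanced:
            # Same number of observations from each class.
            n_draw = max(1, round(fraction * smallest))
        else:
            # Sampling fraction proportional to the class size.
            n_draw = max(1, round(fraction * len(members)))
        in_bag.extend(rng.sample(members, n_draw))

    # Observations not drawn for this tree are its out-of-bag observations.
    in_bag_set = set(in_bag)
    out_of_bag = [i for i in range(len(labels)) if i not in in_bag_set]
    return sorted(in_bag), out_of_bag


# Hypothetical unbalanced data set: 80 observations in class 0, 20 in class 1.
labels = [0] * 80 + [1] * 20
in_bag, out_of_bag = stratified_subsample(labels, rng=random.Random(0))
in_bag_bal, out_of_bag_bal = stratified_subsample(
    labels, balanced=True, rng=random.Random(0)
)
```

With proportional fractions, the in-bag sample keeps the class ratio of the full data; with `balanced=True`, the majority class contributes far more out-of-bag observations than the minority class. In a full random forest this draw would be repeated for every tree, and each observation's out-of-bag prediction would be aggregated over the trees for which it was not in-bag.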

Suggested Citation

Silke Janitza & Roman Hornung, 2018. "On the overestimation of random forest’s out-of-bag error," PLOS ONE, Public Library of Science, vol. 13(8), pages 1-31, August.

Handle: RePEc:plo:pone00:0201904
DOI: 10.1371/journal.pone.0201904

Citations

Cited by:

Hapfelmeier, Alexander & Hornung, Roman & Haller, Bernhard, 2023. "Efficient permutation testing of variable importance measures by the example of random forests," Computational Statistics & Data Analysis, Elsevier, vol. 181(C).
Brédy, Jhemson & Gallichand, Jacques & Celicourt, Paul & Gumiere, Silvio José, 2020. "Water table depth forecasting in cranberry fields using two decision-tree-modeling approaches," Agricultural Water Management, Elsevier, vol. 233(C).
Ponomarenko, Alexey & Tatarintsev, Stas, 2023. "Incorporating financial development indicators into early warning systems," The Journal of Economic Asymmetries, Elsevier, vol. 27(C).
- Alexey Ponomarenko & Stas Tatarintsev, 2020. "Incorporating financial development indicators into early warning systems," Bank of Russia Working Paper Series wps58, Bank of Russia.

Most related items

These are the items that most often cite the same works as this one and are cited by the same works as this one.

Wei, Pengfei & Lu, Zhenzhou & Song, Jingwen, 2015. "Variable importance analysis: A comprehensive review," Reliability Engineering and System Safety, Elsevier, vol. 142(C), pages 399-432.
Xianguo Ren & Haiqing Tian & Kai Zhao & Dapeng Li & Ziqing Xiao & Yang Yu & Fei Liu, 2022. "Research on pH Value Detection Method during Maize Silage Secondary Fermentation Based on Computer Vision," Agriculture, MDPI, vol. 12(10), pages 1-17, October.
Dinesh Reddy Vangumalli & Konstantinos Nikolopoulos & Konstantia Litsiou, 2019. "Clustering, Forecasting and Cluster Forecasting: using k-medoids, k-NNs and random forests for cluster selection," Working Papers 19016, Bangor Business School, Prifysgol Bangor University (Cymru / Wales).
Florian Marcel Nuţă & Alina Cristina Nuţă & Cristina Gabriela Zamfir & Stefan-Mihai Petrea & Dan Munteanu & Dragos Sebastian Cristea, 2021. "National Carbon Accounting—Analyzing the Impact of Urbanization and Energy-Related Factors upon CO 2 Emissions in Central–Eastern European Countries by Using Machine Learning Algorithms and Panel Data," Energies, MDPI, vol. 14(10), pages 1-23, May.
Michel Fuino & Andrey Ugarte Montero & Joël Wagner, 2022. "On the drivers of potential customers' interest in long‐term care insurance: Evidence from Switzerland," Risk Management and Insurance Review, American Risk and Insurance Association, vol. 25(3), pages 271-302, September.
Lauric A Ferrat & Marc Goodfellow & John R Terry, 2018. "Classifying dynamic transitions in high dimensional neural mass models: A random forest approach," PLOS Computational Biology, Public Library of Science, vol. 14(3), pages 1-27, March.
Sim Aaron & Tsagkrasoulis Dimosthenis & Montana Giovanni, 2013. "Random forests on distance matrices for imaging genetics studies," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 12(6), pages 757-786, December.
Cihan Şahin, 2023. "Predicting base station return on investment in the telecommunications industry: Machine‐learning approaches," Intelligent Systems in Accounting, Finance and Management, John Wiley & Sons, Ltd., vol. 30(1), pages 29-40, January.
Maria Angela Echeverry-Galvis & Jennifer K Peterson & Rajmonda Sulo-Caceres, 2014. "The Social Nestwork: Tree Structure Determines Nest Placement in Kenyan Weaverbird Colonies," PLOS ONE, Public Library of Science, vol. 9(2), pages 1-7, February.
