Printed from https://ideas.repec.org/a/plo/pone00/0201904.html

On the overestimation of random forest’s out-of-bag error

Author

Listed:
  • Silke Janitza
  • Roman Hornung

Abstract

The ensemble method random forests has become a popular classification tool in bioinformatics and related fields. The out-of-bag error is an error estimation technique often used to evaluate the accuracy of a random forest and to select appropriate values for tuning parameters, such as the number of candidate predictors that are randomly drawn for a split, referred to as mtry. However, for binary classification problems with metric predictors it has been shown that the out-of-bag error can overestimate the true prediction error depending on the choices of random forests parameters. Based on simulated and real data this paper aims to identify settings for which this overestimation is likely. It is, moreover, questionable whether the out-of-bag error can be used in classification tasks for selecting tuning parameters like mtry, because the overestimation is seen to depend on the parameter mtry. The simulation-based and real-data based studies with metric predictor variables performed in this paper show that the overestimation is largest in balanced settings and in settings with few observations, a large number of predictor variables, small correlations between predictors and weak effects. There was hardly any impact of the overestimation on tuning parameter selection. However, although the prediction performance of random forests was not substantially affected when using the out-of-bag error for tuning parameter selection in the present studies, one cannot be sure that this applies to all future data. For settings with metric predictor variables it is therefore strongly recommended to use stratified subsampling with sampling fractions that are proportional to the class sizes for both tuning parameter selection and error estimation in random forests. This yielded less biased estimates of the true prediction error. 
In unbalanced settings, in which there is a strong interest in predicting observations from the smaller classes well, sampling the same number of observations from each class is a promising alternative.
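
The two sampling schemes recommended in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code; the function names, the default fraction of 0.632 (roughly the share of distinct observations in a bootstrap sample) and the use of the standard-library random module are assumptions made for illustration only. The first function draws observations without replacement within each class, either proportionally to class size or with the same number from each class; the second computes an out-of-bag error from the votes that each observation received from trees that did not train on it.

```python
import random
from collections import Counter, defaultdict


def stratified_subsample(labels, fraction=0.632, per_class=None, rng=None):
    """Draw in-bag indices without replacement, class by class.

    fraction:  share of each class to draw (proportional stratification).
               The default 0.632 is an illustrative assumption.
    per_class: if given, draw this many observations from every class
               instead (the balanced alternative for unbalanced data).
    Returns (in_bag, out_of_bag) as sorted lists of indices.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    in_bag = []
    for idx in by_class.values():
        k = per_class if per_class is not None else round(fraction * len(idx))
        k = max(1, min(k, len(idx)))  # at least one, at most the class size
        in_bag.extend(rng.sample(idx, k))

    in_bag_set = set(in_bag)
    out_of_bag = [i for i in range(len(labels)) if i not in in_bag_set]
    return sorted(in_bag), out_of_bag


def oob_error(labels, oob_votes):
    """Out-of-bag error by majority vote.

    oob_votes[i] holds the predicted classes for observation i from all
    trees for which i was out-of-bag. Observations without votes are skipped.
    """
    wrong = 0
    counted = 0
    for y, votes in zip(labels, oob_votes):
        if not votes:
            continue
        predicted = Counter(votes).most_common(1)[0][0]
        wrong += predicted != y
        counted += 1
    return wrong / counted if counted else float("nan")
```

In a forest built this way, each tree would be fitted on its own in_bag indices and would contribute votes only for its out_of_bag indices. Because every tree then sees the same class proportions as the full data, the proportional variant corresponds to the stratified subsampling described in the abstract.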

Suggested Citation

  • Silke Janitza & Roman Hornung, 2018. "On the overestimation of random forest’s out-of-bag error," PLOS ONE, Public Library of Science, vol. 13(8), pages 1-31, August.
  • Handle: RePEc:plo:pone00:0201904
    DOI: 10.1371/journal.pone.0201904

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0201904
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0201904&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pone.0201904?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item


    Citations


    Cited by:

    1. Hapfelmeier, Alexander & Hornung, Roman & Haller, Bernhard, 2023. "Efficient permutation testing of variable importance measures by the example of random forests," Computational Statistics & Data Analysis, Elsevier, vol. 181(C).
    2. Brédy, Jhemson & Gallichand, Jacques & Celicourt, Paul & Gumiere, Silvio José, 2020. "Water table depth forecasting in cranberry fields using two decision-tree-modeling approaches," Agricultural Water Management, Elsevier, vol. 233(C).
    3. Ponomarenko, Alexey & Tatarintsev, Stas, 2023. "Incorporating financial development indicators into early warning systems," The Journal of Economic Asymmetries, Elsevier, vol. 27(C).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Wei, Pengfei & Lu, Zhenzhou & Song, Jingwen, 2015. "Variable importance analysis: A comprehensive review," Reliability Engineering and System Safety, Elsevier, vol. 142(C), pages 399-432.
    2. Xianguo Ren & Haiqing Tian & Kai Zhao & Dapeng Li & Ziqing Xiao & Yang Yu & Fei Liu, 2022. "Research on pH Value Detection Method during Maize Silage Secondary Fermentation Based on Computer Vision," Agriculture, MDPI, vol. 12(10), pages 1-17, October.
    3. Dinesh Reddy Vangumalli & Konstantinos Nikolopoulos & Konstantia Litsiou, 2019. "Clustering, Forecasting and Cluster Forecasting: using k-medoids, k-NNs and random forests for cluster selection," Working Papers 19016, Bangor Business School, Prifysgol Bangor University (Cymru / Wales).
    4. Florian Marcel Nuţă & Alina Cristina Nuţă & Cristina Gabriela Zamfir & Stefan-Mihai Petrea & Dan Munteanu & Dragos Sebastian Cristea, 2021. "National Carbon Accounting—Analyzing the Impact of Urbanization and Energy-Related Factors upon CO 2 Emissions in Central–Eastern European Countries by Using Machine Learning Algorithms and Panel Data," Energies, MDPI, vol. 14(10), pages 1-23, May.
    5. Michel Fuino & Andrey Ugarte Montero & Joël Wagner, 2022. "On the drivers of potential customers' interest in long‐term care insurance: Evidence from Switzerland," Risk Management and Insurance Review, American Risk and Insurance Association, vol. 25(3), pages 271-302, September.
    6. Lauric A Ferrat & Marc Goodfellow & John R Terry, 2018. "Classifying dynamic transitions in high dimensional neural mass models: A random forest approach," PLOS Computational Biology, Public Library of Science, vol. 14(3), pages 1-27, March.
    7. Sim Aaron & Tsagkrasoulis Dimosthenis & Montana Giovanni, 2013. "Random forests on distance matrices for imaging genetics studies," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 12(6), pages 757-786, December.
    8. Cihan Şahin, 2023. "Predicting base station return on investment in the telecommunications industry: Machine‐learning approaches," Intelligent Systems in Accounting, Finance and Management, John Wiley & Sons, Ltd., vol. 30(1), pages 29-40, January.
    9. Maria Angela Echeverry-Galvis & Jennifer K Peterson & Rajmonda Sulo-Caceres, 2014. "The Social Nestwork: Tree Structure Determines Nest Placement in Kenyan Weaverbird Colonies," PLOS ONE, Public Library of Science, vol. 9(2), pages 1-7, February.
