
An Introspective Comparison of Random Forest-Based Classifiers for the Analysis of Cluster-Correlated Data by Way of RF++

Author

Listed:
  • Yuliya V Karpievitch
  • Elizabeth G Hill
  • Anthony P Leclerc
  • Alan R Dabney
  • Jonas S Almeida

Abstract

Many mass spectrometry-based studies, as well as other biological experiments, produce cluster-correlated data. Failure to account for correlation among observations may cause a classification algorithm to overfit the training data and produce overoptimistic estimated error rates, and may make subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject's set of replicate samples, which reduces the dataset size and loses information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), a forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were more severely affected by the reduction in sample size, which led to poorer classification and variable selection accuracy. Perhaps most importantly, our results suggest that it is reasonable to use URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data.
Source code and stand-alone compiled command-line and graphical user interface (GUI) versions of RF++ for Windows and Linux, as well as a user manual (Supplementary File S2), are available for download at http://sourceforge.org/projects/rfpp/ under the GNU public license.
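
The abstract does not spell out the resampling details, so the following Python sketch is only one plausible reading of the two ideas it names, not the RF++ implementation. It assumes that subject-level bootstrapping draws subjects (rather than individual replicates) with replacement and places all replicates of a drawn subject in the bag together, and it assumes that the post-processing step for subject-level classification is a majority vote over each subject's replicate predictions. The function names `subject_level_bootstrap` and `subject_level_vote` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def subject_level_bootstrap(subject_ids, rng):
    """Return (in_bag_rows, out_of_bag_rows) for one tree.

    Assumption: subjects are sampled with replacement, and every
    replicate row of a sampled subject enters the bag together, so no
    subject contributes rows to both the bag and the out-of-bag set.
    """
    rows_by_subject = defaultdict(list)
    for row, subject in enumerate(subject_ids):
        rows_by_subject[subject].append(row)

    subjects = sorted(rows_by_subject)
    # Draw as many subjects as there are distinct subjects.
    drawn = [rng.choice(subjects) for _ in subjects]

    in_bag = [row for s in drawn for row in rows_by_subject[s]]
    drawn_set = set(drawn)
    out_of_bag = [
        row
        for s in subjects
        if s not in drawn_set
        for row in rows_by_subject[s]
    ]
    return in_bag, out_of_bag


def subject_level_vote(subject_ids, replicate_predictions):
    """Collapse replicate-level predictions to one label per subject.

    Assumption: a simple majority vote; ties are resolved arbitrarily.
    """
    votes = defaultdict(Counter)
    for subject, label in zip(subject_ids, replicate_predictions):
        votes[subject][label] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in votes.items()}


# Toy data: three subjects with 2, 3, and 1 replicate rows.
ids = ["a", "a", "b", "b", "b", "c"]
in_bag, out_of_bag = subject_level_bootstrap(ids, random.Random(0))
subject_labels = subject_level_vote(ids, [1, 1, 0, 0, 1, 1])
```

Because replicates of one subject never straddle the bag and the out-of-bag set in this sketch, an out-of-bag error estimate is computed only on subjects the tree has not seen, which is the kind of leakage the abstract reports as inflating the apparent accuracy of URF forests.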

Suggested Citation

  • Yuliya V Karpievitch & Elizabeth G Hill & Anthony P Leclerc & Alan R Dabney & Jonas S Almeida, 2009. "An Introspective Comparison of Random Forest-Based Classifiers for the Analysis of Cluster-Correlated Data by Way of RF++," PLOS ONE, Public Library of Science, vol. 4(9), pages 1-10, September.
  • Handle: RePEc:plo:pone00:0007087
    DOI: 10.1371/journal.pone.0007087

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0007087
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0007087&type=printable
    Download Restriction: no


    Citations



    Cited by:

    1. Oliver Hümbelin & Lukas Hobi & Robert Fluder, 2021. "Rich Cities, Poor Countryside? Social Structure of the Poor and Poverty Risks in Urban and Rural Places in an Affluent Country. An Administrative Data based Analysis using Random Forest," University of Bern Social Sciences Working Papers 40, University of Bern, Department of Social Sciences, revised 10 Nov 2021.
    2. Werner Adler & Sergej Potapov & Berthold Lausen, 2011. "Classification of repeated measurements data using tree-based ensemble methods," Computational Statistics, Springer, vol. 26(2), pages 355-369, June.
    3. Adler, Werner & Brenning, Alexander & Potapov, Sergej & Schmid, Matthias & Lausen, Berthold, 2011. "Ensemble classification of paired data," Computational Statistics & Data Analysis, Elsevier, vol. 55(5), pages 1933-1941, May.
    4. Honoria Ocagli & Daniele Bottigliengo & Giulia Lorenzoni & Danila Azzolina & Aslihan S. Acar & Silvia Sorgato & Lucia Stivanello & Mario Degan & Dario Gregori, 2021. "A Machine Learning Approach for Investigating Delirium as a Multifactorial Syndrome," IJERPH, MDPI, vol. 18(13), pages 1-13, July.
