Adapting Prediction Error Estimates for Biased Complexity Selection in High-Dimensional Bootstrap Samples

Author

Listed:
  • Binder Harald

    (Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg)

  • Schumacher Martin

    (Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg)

Abstract

The bootstrap allows efficient evaluation of the prediction performance of statistical techniques without having to set aside data for validation. This is especially important for high-dimensional data, e.g., arising from microarrays, because the number of observations there is often limited. To avoid overoptimism, the statistical technique under evaluation has to be applied to every bootstrap sample in the same manner as it would be used on new data. This includes the selection of complexity, e.g., the number of boosting steps for gradient boosting algorithms. Using gradient boosting, we demonstrate in a simulation study that complexity selection in conventional bootstrap samples, drawn with replacement, is severely biased in many scenarios. This translates into a considerable bias of prediction error estimates, often underestimating the amount of information that can be extracted from high-dimensional data. Potential remedies for this complexity selection bias, such as using a fixed level of complexity or sampling without replacement, are investigated, and it is shown that the latter works well in many settings. We focus on high-dimensional binary response data, with bootstrap .632+ estimates of the Brier score for performance evaluation, and on censored time-to-event data, with .632+ prediction error curve estimates. The latter, combined with the modified bootstrap procedure, is then applied to an example with microarray data from patients with diffuse large B-cell lymphoma.
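
The two ingredients named in the abstract, resampling with or without replacement and the .632+ combination of error estimates, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, the subsample fraction of 0.632 for sampling without replacement is an assumption not given in the abstract, and the .632+ weighting follows the standard textbook definition (relative overfitting rate R, weight 0.632 / (1 - 0.368 R)).

```python
import random


def draw_indices(n, with_replacement, rng, fraction=0.632):
    """Return (in_bag, out_of_bag) index lists for one resample of n observations."""
    if with_replacement:
        # Conventional bootstrap: n draws, so duplicates are expected.
        in_bag = [rng.randrange(n) for _ in range(n)]
    else:
        # Assumption: subsample of size round(fraction * n) without replacement,
        # so that every in-bag observation is distinct.
        in_bag = rng.sample(range(n), round(fraction * n))
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


def brier(y, p):
    """Mean squared difference between binary outcomes y and predicted probabilities p."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)


def no_information_brier(y, p):
    """Brier score when predictions are unrelated to outcomes:
    the average over all pairings of an outcome y_i with a prediction p_j."""
    return sum((yi - pj) ** 2 for yi in y for pj in p) / (len(y) * len(p))


def err632plus(apparent, oob, no_info):
    """Combine apparent and out-of-bag error into a .632+ estimate."""
    oob_capped = min(oob, no_info)
    if oob_capped > apparent and no_info > apparent:
        # Relative overfitting rate, between 0 and 1.
        r = (oob_capped - apparent) / (no_info - apparent)
    else:
        r = 0.0
    weight = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - weight) * apparent + weight * oob_capped
```

In a full evaluation, the model (including its complexity selection) would be refitted on each in-bag set, the out-of-bag Brier scores averaged over resamples, and the result passed to `err632plus` together with the apparent error and the no-information error. With no overfitting (`oob == apparent`) the estimate equals the apparent error, and as the out-of-bag error approaches the no-information error the weight moves fully onto the out-of-bag error.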

Suggested Citation

  • Binder Harald & Schumacher Martin, 2008. "Adapting Prediction Error Estimates for Biased Complexity Selection in High-Dimensional Bootstrap Samples," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 7(1), pages 1-28, March.
  • Handle: RePEc:bpj:sagmbi:v:7:y:2008:i:1:n:12
    DOI: 10.2202/1544-6115.1346

    Citations

    Cited by:

    1. Lore Zumeta-Olaskoaga & Maximilian Weigert & Jon Larruskain & Eder Bikandi & Igor Setuain & Josean Lekue & Helmut Küchenhoff & Dae-Jin Lee, 2023. "Prediction of sports injuries in football: a recurrent time-to-event approach using regularized Cox models," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 107(1), pages 101-126, March.
    2. Stefanie Hieke & Axel Benner & Richard F Schlenk & Martin Schumacher & Lars Bullinger & Harald Binder, 2016. "Identifying Prognostic SNPs in Clinical Cohorts: Complementing Univariate Analyses by Resampling and Multivariable Modeling," PLOS ONE, Public Library of Science, vol. 11(5), pages 1-18, May.
    3. Sill, Martin & Hielscher, Thomas & Becker, Natalia & Zucknick, Manuela, 2014. "c060: Extended Inference with Lasso and Elastic-Net Regularized Cox and Generalized Linear Models," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 62(i05).
    4. Bernd Bischl & Julia Schiffner & Claus Weihs, 2013. "Benchmarking local classification methods," Computational Statistics, Springer, vol. 28(6), pages 2599-2619, December.
    5. Christine Porzelius & Martin Schumacher & Harald Binder, 2011. "The benefit of data-based model complexity selection via prediction error curves in time-to-event data," Computational Statistics, Springer, vol. 26(2), pages 293-302, June.
    6. Mogensen, Ulla B. & Ishwaran, Hemant & Gerds, Thomas A., 2012. "Evaluating Random Forests for Survival Analysis Using Prediction Error Curves," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 50(i11).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
    2. Gerhard Tutz & Moritz Berger, 2018. "Tree-structured modelling of categorical predictors in generalized additive regression," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(3), pages 737-758, September.
    3. Mittnik, Stefan & Robinzonov, Nikolay & Spindler, Martin, 2015. "Stock market volatility: Identifying major drivers and the nature of their impact," Journal of Banking & Finance, Elsevier, vol. 58(C), pages 1-14.
    4. Wang Zhu & Wang C.Y., 2010. "Buckley-James Boosting for Survival Analysis with High-Dimensional Biomarker Data," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 9(1), pages 1-33, June.
    5. Martijn Kagie & Michiel Van Wezel, 2007. "Hedonic price models and indices based on boosting applied to the Dutch housing market," Intelligent Systems in Accounting, Finance and Management, John Wiley & Sons, Ltd., vol. 15(3‐4), pages 85-106, July.
    6. Hofner, Benjamin & Mayr, Andreas & Schmid, Matthias, 2016. "gamboostLSS: An R Package for Model Building and Variable Selection in the GAMLSS Framework," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 74(i01).
    7. Marra, Giampiero & Wood, Simon N., 2011. "Practical variable selection for generalized additive models," Computational Statistics & Data Analysis, Elsevier, vol. 55(7), pages 2372-2387, July.
    8. Robin Van Oirbeek & Emmanuel Lesaffre, 2018. "An Investigation of the Discriminatory Ability of the Clustering Effect of the Frailty Survival Model," Biostatistics and Biometrics Open Access Journal, Juniper Publishers Inc., vol. 6(3), pages 87-98, April.
    9. Ziwei Mei & Zhentao Shi & Peter C. B. Phillips, 2022. "The boosted HP filter is more general than you might think," Cowles Foundation Discussion Papers 2348, Cowles Foundation for Research in Economics, Yale University.
    10. R. Lehmann & K. Wohlrabe, 2016. "Looking into the black box of boosting: the case of Germany," Applied Economics Letters, Taylor & Francis Journals, vol. 23(17), pages 1229-1233, November.
    11. Kim, Hyun Hak & Swanson, Norman R., 2014. "Forecasting financial and macroeconomic variables using data reduction methods: New empirical evidence," Journal of Econometrics, Elsevier, vol. 178(P2), pages 352-367.
    12. Matthias Schmid & Thomas Hielscher & Thomas Augustin & Olaf Gefeller, 2011. "A Robust Alternative to the Schemper–Henderson Estimator of Prediction Error," Biometrics, The International Biometric Society, vol. 67(2), pages 524-535, June.
    13. Wolfgang Nierhaus & Timo Wollmershäuser, 2016. "ifo Konjunkturumfragen und Konjunkturanalyse: Band II," ifo Forschungsberichte, ifo Institute - Leibniz Institute for Economic Research at the University of Munich, number 72.
    14. Fabio Trojani, 2007. "Accurate Short-Term Yield Curve Forecasting using Functional Gradient Descent," Journal of Financial Econometrics, Oxford University Press, vol. 5(4), pages 591-623, Fall.
    15. Stefanie Hieke & Axel Benner & Richard F Schlenk & Martin Schumacher & Lars Bullinger & Harald Binder, 2016. "Identifying Prognostic SNPs in Clinical Cohorts: Complementing Univariate Analyses by Resampling and Multivariable Modeling," PLOS ONE, Public Library of Science, vol. 11(5), pages 1-18, May.
    16. Panagiotelis, Anastasios & Gamakumara, Puwasala & Athanasopoulos, George & Hyndman, Rob J., 2023. "Probabilistic forecast reconciliation: Properties, evaluation and score optimisation," European Journal of Operational Research, Elsevier, vol. 306(2), pages 693-706.
    17. Robert Lehmann & Klaus Wohlrabe, 2017. "Boosting and regional economic forecasting: the case of Germany," Letters in Spatial and Resource Sciences, Springer, vol. 10(2), pages 161-175, July.
    18. Luciani, Matteo, 2014. "Forecasting with approximate dynamic factor models: The role of non-pervasive shocks," International Journal of Forecasting, Elsevier, vol. 30(1), pages 20-29.
    19. Ben Taieb, Souhaib & Hyndman, Rob J., 2014. "A gradient boosting approach to the Kaggle load forecasting competition," International Journal of Forecasting, Elsevier, vol. 30(2), pages 382-394.
    20. Klaus Wohlrabe & Teresa Buchen, 2014. "Assessing the Macroeconomic Forecasting Performance of Boosting: Evidence for the United States, the Euro Area and Germany," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 33(4), pages 231-242, July.
