Printed from https://ideas.repec.org/a/eee/csdana/v113y2017icp19-37.html

Gradient boosting for high-dimensional prediction of rare events

Authors

  • Blagus, Rok
  • Lusa, Lara

Abstract

In clinical research the goal is often to correctly estimate the probability of an event. For this purpose, several characteristics of the patients are measured and used to develop a prediction model, which can then be used to predict class membership for future patients. Ensemble classifiers combine many different classifiers and can be useful because combining a set of classifiers can yield more accurate predictions. Gradient boosting is an ensemble classifier that has been shown to perform well when the number of variables exceeds the number of samples (high-dimensional data); however, it has not been evaluated for the prediction of rare events. It is demonstrated that gradient boosting suffers from severe rare events bias, correctly classifying only a small proportion of samples from the rare class. The bias can be removed by using subsampling in combination with an appropriate amount of shrinkage, but only for a specific number of boosting iterations and for the binomial loss function. It is shown that the number of boosting iterations at which the rare events bias is removed cannot be estimated efficiently from the training data when the sample size is small. Therefore, several corrections for the rare events bias of gradient boosting are proposed and evaluated using simulated and real high-dimensional data. It is demonstrated that the proposed corrections successfully remove the rare events bias and outperform the other ensemble classifiers that were considered. The great flexibility and high interpretability of the proposed methods are also illustrated.
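
The ingredients named in the abstract (binomial loss, shrinkage, subsampling, and the number of boosting iterations) can be illustrated with a minimal sketch. This is not the authors' implementation and it does not reproduce their proposed corrections. It is a generic, simplified stochastic gradient boosting loop with decision stumps. The function names, the simulated data, and the parameter values (`nu`, `subsample`, `n_iter`, 4 events among 40 samples, 50 variables) are assumptions chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, r, idx):
    """Least-squares stump (feature, threshold, left mean, right mean) fitted to r on rows idx."""
    best = None
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in idx})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r[i] for i in idx if X[i][j] <= t]
            right = [r[i] for i in idx if X[i][j] > t]
            if not left or not right:
                continue
            ml = sum(left) / len(left)
            mr = sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # no valid split: fall back to a constant fit
        m = sum(r[i] for i in idx) / len(idx)
        return (0, float("inf"), m, m)
    return best[1:]


def boost(X, y, n_iter=50, nu=0.1, subsample=0.5, seed=0):
    """Simplified gradient boosting for binomial loss with shrinkage nu and row subsampling."""
    rng = random.Random(seed)
    n = len(y)
    prior = sum(y) / n
    f0 = math.log(prior / (1.0 - prior))  # start at the log-odds of the event rate
    F = [f0] * n
    stumps = []
    m = max(2, int(round(subsample * n)))
    for _ in range(n_iter):
        # negative gradient of the binomial loss: y - p
        r = [y[i] - sigmoid(F[i]) for i in range(n)]
        idx = rng.sample(range(n), m)  # plain random subsample without replacement
        j, t, ml, mr = fit_stump(X, r, idx)
        stumps.append((j, t, ml, mr))
        for i in range(n):
            F[i] += nu * (ml if X[i][j] <= t else mr)  # shrunken update
    return {"f0": f0, "nu": nu, "stumps": stumps}


def predict_proba(model, x, n_iter=None):
    """Estimated event probability using only the first n_iter boosting iterations."""
    f = model["f0"]
    for j, t, ml, mr in model["stumps"][:n_iter]:
        f += model["nu"] * (ml if x[j] <= t else mr)
    return sigmoid(f)


def class_accuracies(model, X, y, n_iter):
    """Return (accuracy on the rare class, accuracy on the common class) at threshold 0.5."""
    hits = {0: 0, 1: 0}
    for x, label in zip(X, y):
        pred = 1 if predict_proba(model, x, n_iter) > 0.5 else 0
        hits[label] += int(pred == label)
    n1 = sum(y)
    return hits[1] / n1, hits[0] / (len(y) - n1)


# Hypothetical high-dimensional rare-event data: 40 samples, 50 variables, 4 events.
data_rng = random.Random(1)
n, p, n_events = 40, 50, 4
y = [1] * n_events + [0] * (n - n_events)
X = [[data_rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
for i in range(n_events):
    X[i][0] += 1.5  # only the first variable carries signal

model = boost(X, y, n_iter=50, nu=0.1, subsample=0.5, seed=0)
for m_stop in (0, 10, 50):
    print(m_stop, class_accuracies(model, X, y, m_stop))
```

With zero iterations every sample receives the event rate as its predicted probability, so no rare-class sample is classified correctly at a 0.5 threshold. Evaluating the class-specific accuracies at several stopping points shows how the choice of the number of iterations interacts with shrinkage and subsampling. The accuracies here are computed on the training data and are not an estimate of predictive performance.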

Suggested Citation

  • Blagus, Rok & Lusa, Lara, 2017. "Gradient boosting for high-dimensional prediction of rare events," Computational Statistics & Data Analysis, Elsevier, vol. 113(C), pages 19-37.
  • Handle: RePEc:eee:csdana:v:113:y:2017:i:c:p:19-37
    DOI: 10.1016/j.csda.2016.07.016

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167947316301803
    Download Restriction: Full text for ScienceDirect subscribers only.



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Riccardo De Bin & Vegard Grødem Stikbakke, 2023. "A boosting first-hitting-time model for survival analysis in high-dimensional settings," Lifetime Data Analysis: An International Journal Devoted to Statistical Methods and Applications for Time-to-Event Data, Springer, vol. 29(2), pages 420-440, April.
    2. Wang Zhu & Wang C.Y., 2010. "Buckley-James Boosting for Survival Analysis with High-Dimensional Biomarker Data," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 9(1), pages 1-33, June.
    3. Xin Fang & Bo Fang & Chunfang Wang & Tian Xia & Matteo Bottai & Fang Fang & Yang Cao, 2019. "Comparison of Frequentist and Bayesian Generalized Additive Models for Assessing the Association between Daily Exposure to Fine Particles and Respiratory Mortality: A Simulation Study," IJERPH, MDPI, vol. 16(5), pages 1-20, March.
    4. Marra, Giampiero & Wood, Simon N., 2011. "Practical variable selection for generalized additive models," Computational Statistics & Data Analysis, Elsevier, vol. 55(7), pages 2372-2387, July.
    5. Ulrich, Matthias & Jahnke, Hermann & Langrock, Roland & Pesch, Robert & Senge, Robin, 2021. "Distributional regression for demand forecasting in e-grocery," European Journal of Operational Research, Elsevier, vol. 294(3), pages 831-842.
    6. D J Hand & F Zhou, 2010. "Evaluating models for classifying customers in retail banking collections," Journal of the Operational Research Society, Palgrave Macmillan;The OR Society, vol. 61(10), pages 1540-1547, October.
    7. Wang Chamont & Gevertz Jana L., 2016. "Finding causative genes from high-dimensional data: an appraisal of statistical and machine learning approaches," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 15(4), pages 321-347, August.
    8. Belitz, Christiane & Lang, Stefan, 2008. "Simultaneous selection of variables and smoothing parameters in structured additive regression models," Computational Statistics & Data Analysis, Elsevier, vol. 53(1), pages 61-81, September.
    9. Groll Andreas & Kneib Thomas & Mayr Andreas & Schauberger Gunther, 2018. "On the dependency of soccer scores – a sparse bivariate Poisson model for the UEFA European football championship 2016," Journal of Quantitative Analysis in Sports, De Gruyter, vol. 14(2), pages 65-79, June.
    10. Stefanie Hieke & Axel Benner & Richard F Schlenk & Martin Schumacher & Lars Bullinger & Harald Binder, 2016. "Identifying Prognostic SNPs in Clinical Cohorts: Complementing Univariate Analyses by Resampling and Multivariable Modeling," PLOS ONE, Public Library of Science, vol. 11(5), pages 1-18, May.
    11. Groll, Andreas & Hambuckers, Julien & Kneib, Thomas & Umlauf, Nikolaus, 2019. "LASSO-type penalization in the framework of generalized additive models for location, scale and shape," Computational Statistics & Data Analysis, Elsevier, vol. 140(C), pages 59-73.
    12. Kneib, Thomas & Silbersdorff, Alexander & Säfken, Benjamin, 2023. "Rage Against the Mean – A Review of Distributional Regression Approaches," Econometrics and Statistics, Elsevier, vol. 26(C), pages 99-123.
    13. Faisal Zahid & Gerhard Tutz, 2013. "Multinomial logit models with implicit variable selection," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 7(4), pages 393-416, December.
    14. Bernardi, Mauro & Bottone, Marco & Petrella, Lea, 2018. "Bayesian quantile regression using the skew exponential power distribution," Computational Statistics & Data Analysis, Elsevier, vol. 126(C), pages 92-111.
    15. Zhao, Weihua & Lian, Heng & Song, Xinyuan, 2017. "Composite quantile regression for correlated data," Computational Statistics & Data Analysis, Elsevier, vol. 109(C), pages 15-33.
    16. Hendrik van der Wurp & Andreas Groll, 2023. "Introducing LASSO-type penalisation to generalised joint regression modelling for count data," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 107(1), pages 127-151, March.
    17. Colin Griesbach & Andreas Mayr & Elisabeth Bergherr, 2023. "Variable Selection and Allocation in Joint Models via Gradient Boosting Techniques," Mathematics, MDPI, vol. 11(2), pages 1-16, January.
    18. Gilbert, Ciaran & Browell, Jethro & McMillan, David, 2021. "Probabilistic access forecasting for improved offshore operations," International Journal of Forecasting, Elsevier, vol. 37(1), pages 134-150.
    19. Boyao Zhang & Tobias Hepp & Sonja Greven & Elisabeth Bergherr, 2022. "Adaptive step-length selection in gradient boosting for Gaussian location and scale models," Computational Statistics, Springer, vol. 37(5), pages 2295-2332, November.
    20. Simon N. Wood, 2020. "Inference and computation with generalized additive models and their extensions," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 29(2), pages 307-339, June.
