
A scalable bootstrap for massive data

Author

Listed:
  • Ariel Kleiner
  • Ameet Talwalkar
  • Purnamrita Sarkar
  • Michael I. Jordan

Abstract

The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large data sets, which are increasingly prevalent, the calculation of bootstrap-based quantities can be prohibitively demanding computationally. Although variants such as subsampling and the m out of n bootstrap can be used in principle to reduce the cost of bootstrap computations, these methods are generally not robust to the specification of tuning parameters (such as the number of subsampled data points), and, in contrast with the bootstrap, they often require knowledge of the estimator's convergence rate. As an alternative, we introduce the ‘bag of little bootstraps’ (BLB), a new procedure that incorporates features of both the bootstrap and subsampling to yield a robust, computationally efficient means of assessing the quality of estimators. The BLB is well suited to modern parallel and distributed computing architectures and furthermore retains the generic applicability and statistical efficiency of the bootstrap. We demonstrate the BLB's favourable statistical performance via a theoretical analysis elucidating the procedure's properties, as well as a simulation study comparing the BLB with the bootstrap, the m out of n bootstrap and subsampling. In addition, we present results from a large-scale distributed implementation of the BLB demonstrating its computational superiority on massive data, a method for adaptively selecting the BLB's tuning parameters, an empirical study applying the BLB to several real data sets and an extension of the BLB to time series data.
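
The abstract does not spell out the steps of the BLB, so the following is only a minimal sketch of the procedure as it is commonly described: draw a few small subsets without replacement, resample each one up to the full sample size using multinomial counts, assess the estimator within each subset, and average the assessments. The function names (`blb_standard_error`, `weighted_mean`), the default values (`gamma=0.7`, 10 subsets, 50 resamples) and the choice of the standard error as the quality measure are assumptions made for illustration, not details taken from this record.

```python
import numpy as np


def blb_standard_error(data, estimator, gamma=0.7, n_subsets=10,
                       n_resamples=50, seed=0):
    """Sketch of a bag-of-little-bootstraps standard error (assumed defaults).

    `estimator(values, counts)` must accept the distinct subset values and
    integer counts saying how often each value appears in a resample.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    b = int(n ** gamma)  # subset size b = n^gamma, much smaller than n

    per_subset = []
    for _ in range(n_subsets):
        # Subsampling step: b distinct points drawn without replacement.
        subset = data[rng.choice(n, size=b, replace=False)]

        estimates = []
        for _ in range(n_resamples):
            # Bootstrap step: a resample of nominal size n is represented
            # by multinomial counts over the b subset points, so only b
            # values are ever stored or touched.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates.append(estimator(subset, counts))

        # Quality assessment within this subset (here: standard error).
        per_subset.append(np.std(estimates, ddof=1))

    # Average the per-subset assessments.
    return float(np.mean(per_subset))


def weighted_mean(values, counts):
    """Sample mean of a resample given as values plus multiplicities."""
    return float(np.average(values, weights=counts))
```

For example, calling `blb_standard_error(data, weighted_mean)` on a one-dimensional NumPy array estimates the standard error of the sample mean. Because each subset is processed independently, the outer loop is the natural place to distribute work across machines, which is the property the abstract highlights.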

Suggested Citation

  • Ariel Kleiner & Ameet Talwalkar & Purnamrita Sarkar & Michael I. Jordan, 2014. "A scalable bootstrap for massive data," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 76(4), pages 795-816, September.
  • Handle: RePEc:bla:jorssb:v:76:y:2014:i:4:p:795-816

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1111/rssb.2014.76.issue-4
    Download Restriction: Access to full text is restricted to subscribers.

    Citations

    Cited by:

    1. Batuhan Özkan & Coşkun Parim & Erhan Çene, 2023. "Predicting Countries’ Development Levels Using the Decision Tree and Random Forest Methods," EKOIST Journal of Econometrics and Statistics, Istanbul University, Faculty of Economics, vol. 0(38), pages 87-104, June.
    2. Guangbao Guo & Yue Sun & Xuejun Jiang, 2020. "A partitioned quasi-likelihood for distributed statistical inference," Computational Statistics, Springer, vol. 35(4), pages 1577-1596, December.
    3. Milica Maricic & Jose A. Egea & Veljko Jeremic, 2019. "A Hybrid Enhanced Scatter Search—Composite I-Distance Indicator (eSS-CIDI) Optimization Approach for Determining Weights Within Composite Indicators," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 144(2), pages 497-537, July.
    4. Badruddoza, Syed & Amin, Modhurima & McCluskey, Jill, 2019. "Assessing the Importance of an Attribute in a Demand System: Structural Model versus Machine Learning," Working Papers 2019-5, School of Economic Sciences, Washington State University.
    5. Vaughan, Gregory, 2020. "Efficient big data model selection with applications to fraud detection," International Journal of Forecasting, Elsevier, vol. 36(3), pages 1116-1127.
    6. Xingcai Zhou & Zhaoyang Jing & Chao Huang, 2024. "Distributed Bootstrap Simultaneous Inference for High-Dimensional Quantile Regression," Mathematics, MDPI, vol. 12(5), pages 1-54, February.
    7. Benjamin Lu & Jia Wan & Derek Ouyang & Jacob Goldin & Daniel E. Ho, 2024. "Quantifying the Uncertainty of Imputed Demographic Disparity Estimates: The Dual Bootstrap," NBER Chapters, in: Race, Ethnicity, and Economic Statistics for the 21st Century, National Bureau of Economic Research, Inc.
    8. Gérard Biau & Erwan Scornet, 2016. "A random forest guided tour," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 25(2), pages 197-227, June.
    9. Yang, Xinfeng & Yan, Xiaodong & Huang, Jian, 2019. "High-dimensional integrative analysis with homogeneity and sparsity recovery," Journal of Multivariate Analysis, Elsevier, vol. 174(C).
    10. Shi, Chengchun & Lu, Wenbin & Song, Rui, 2018. "A massive data framework for M-estimators with cubic-rate," LSE Research Online Documents on Economics 102111, London School of Economics and Political Science, LSE Library.
    11. Xuejun Ma & Shaochen Wang & Wang Zhou, 2022. "Statistical inference in massive datasets by empirical likelihood," Computational Statistics, Springer, vol. 37(3), pages 1143-1164, July.
    12. Olhede, Sofia C. & Wolfe, Patrick J., 2018. "The future of statistics and data science," Statistics & Probability Letters, Elsevier, vol. 136(C), pages 46-50.
    13. Wang, Xiaoqian & Kang, Yanfei & Hyndman, Rob J. & Li, Feng, 2023. "Distributed ARIMA models for ultra-long time series," International Journal of Forecasting, Elsevier, vol. 39(3), pages 1163-1184.
    14. Dean Eckles & Maurits Kaptein, 2019. "Bootstrap Thompson Sampling and Sequential Decision Problems in the Behavioral Sciences," SAGE Open, vol. 9(2), pages 21582440198, June.
    15. Lee, JooChul & Wang, HaiYing & Schifano, Elizabeth D., 2020. "Online updating method to correct for measurement error in big data streams," Computational Statistics & Data Analysis, Elsevier, vol. 149(C).
    16. Changgee Chang & Zhiqi Bu & Qi Long, 2023. "CEDAR: communication efficient distributed analysis for regressions," Biometrics, The International Biometric Society, vol. 79(3), pages 2357-2369, September.
    17. Amalan Mahendran & Helen Thompson & James M. McGree, 2023. "A model robust subsampling approach for Generalised Linear Models in big data settings," Statistical Papers, Springer, vol. 64(4), pages 1137-1157, August.
    18. Baihua He & Yanyan Liu & Guosheng Yin & Yuanshan Wu, 2023. "Model aggregation for doubly divided data with large size and large dimension," Computational Statistics, Springer, vol. 38(1), pages 509-529, March.
    19. Villoria, Nelson B. & Liu, Jing, 2018. "Using spatially explicit data to improve our understanding of land supply responses: An application to the cropland effects of global sustainable irrigation in the Americas," Land Use Policy, Elsevier, vol. 75(C), pages 411-419.
    20. Fang, Jianglin, 2023. "A split-and-conquer variable selection approach for high-dimensional general semiparametric models with massive data," Journal of Multivariate Analysis, Elsevier, vol. 194(C).
    21. Beate Franke & Jean-François Plante & Ribana Roscher & En-shiun Annie Lee & Cathal Smyth & Armin Hatefi & Fuqi Chen & Einat Gil & Alexander Schwing & Alessandro Selvitella & Michael M. Hoffman & Roger, 2016. "Statistical Inference, Learning and Models in Big Data," International Statistical Review, International Statistical Institute, vol. 84(3), pages 371-389, December.
    22. Zhang, Likun & Castillo, Enrique del & Berglund, Andrew J. & Tingley, Martin P. & Govind, Nirmal, 2020. "Computing confidence intervals from massive data via penalized quantile smoothing splines," Computational Statistics & Data Analysis, Elsevier, vol. 144(C).
    23. Tang, Lu & Zhou, Ling & Song, Peter X.-K., 2020. "Distributed simultaneous inference in generalized linear models via confidence distribution," Journal of Multivariate Analysis, Elsevier, vol. 176(C).
    24. Ma, Xuejun & Wang, Shaochen & Zhou, Wang, 2021. "Testing multivariate quantile by empirical likelihood," Journal of Multivariate Analysis, Elsevier, vol. 182(C).
    25. Mercè Crosas & Gary King & James Honaker & Latanya Sweeney, 2015. "Automating Open Science for Big Data," The ANNALS of the American Academy of Political and Social Science, vol. 659(1), pages 260-273, May.
