Printed from https://ideas.repec.org/a/spr/stpapr/v65y2024i2d10.1007_s00362-022-01386-w.html

A review on design inspired subsampling for big data

Authors

  • Jun Yu (Beijing Institute of Technology)
  • Mingyao Ai (Peking University)
  • Zhiqiang Ye (Peking University)

Abstract

Subsampling focuses on selecting a subsample that can efficiently sketch the information of the original data in terms of statistical inference. It provides a powerful tool for big data analysis and has gained the attention of data scientists in recent years. In this review, some state-of-the-art subsampling methods inspired by statistical design are summarized. Three types of designs, namely optimal design, orthogonal design, and space-filling design, have shown great potential in subsampling for different objectives. The relationships between experimental designs and the related subsampling approaches are discussed. Specifically, two major families of design-inspired subsampling techniques are presented. The first aims to select a subsample in accordance with some optimal design criteria. The second tries to find a subsample that meets some design requirements, including balancing, orthogonality, and uniformity. Simulated and real data examples are provided to compare these methods empirically.
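
To make the first family concrete, here is a minimal sketch, not taken from the reviewed paper, of selecting a subsample guided by an optimal design criterion. It assumes a linear model with an intercept and uses a simple extreme-value heuristic motivated by D-optimality (larger determinant of the information matrix). The function names, the simulated data, and the subsample size are all hypothetical choices made for illustration.

```python
import numpy as np


def extreme_value_subsample(X, k):
    """Hypothetical heuristic: for each covariate, take the rows with the
    smallest and largest values among the rows not selected so far."""
    n, p = X.shape
    r = k // (2 * p)  # rows taken from each tail of each covariate
    if r < 1:
        raise ValueError("k must be at least 2 * number of covariates")
    available = np.ones(n, dtype=bool)
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)
        order = idx[np.argsort(X[idx, j])]  # available rows sorted by column j
        picked = np.concatenate([order[:r], order[-r:]])
        chosen.append(picked)
        available[picked] = False  # never select the same row twice
    return np.concatenate(chosen)


def log_det_information(X):
    """Log-determinant of Z'Z, where Z is X with an intercept column."""
    Z = np.column_stack([np.ones(len(X)), X])
    _, logdet = np.linalg.slogdet(Z.T @ Z)
    return logdet


# Simulated full data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 3))
k = 60

design_idx = extreme_value_subsample(X, k)
random_idx = rng.choice(len(X), size=k, replace=False)

design_logdet = log_det_information(X[design_idx])
random_logdet = log_det_information(X[random_idx])
print("log det, design-inspired subsample:", design_logdet)
print("log det, uniform random subsample: ", random_logdet)
```

Comparing the two printed log-determinants shows how much information about the regression coefficients each subsample of the same size carries under the D-optimality criterion. A uniform random subsample serves as the baseline.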

Suggested Citation

  • Jun Yu & Mingyao Ai & Zhiqiang Ye, 2024. "A review on design inspired subsampling for big data," Statistical Papers, Springer, vol. 65(2), pages 467-510, April.
  • Handle: RePEc:spr:stpapr:v:65:y:2024:i:2:d:10.1007_s00362-022-01386-w
    DOI: 10.1007/s00362-022-01386-w

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s00362-022-01386-w
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Min Ren & Shengli Zhao & Mingqiu Wang & Xinbei Zhu, 2024. "Robust optimal subsampling based on weighted asymmetric least squares," Statistical Papers, Springer, vol. 65(4), pages 2221-2251, June.
    2. Jun Yu & HaiYing Wang, 2022. "Subdata selection algorithm for linear model discrimination," Statistical Papers, Springer, vol. 63(6), pages 1883-1906, December.
    3. Yue Chao & Lei Huang & Xuejun Ma & Jiajun Sun, 2024. "Optimal subsampling for modal regression in massive data," Metrika: International Journal for Theoretical and Applied Statistics, Springer, vol. 87(4), pages 379-409, May.
    4. Tao Zou & Xian Li & Xuan Liang & Hansheng Wang, 2021. "On the Subbagging Estimation for Massive Data," Papers 2103.00631, arXiv.org.
    5. Laurent Ferrara & Anna Simoni, 2023. "When are Google Data Useful to Nowcast GDP? An Approach via Preselection and Shrinkage," Journal of Business & Economic Statistics, Taylor & Francis Journals, vol. 41(4), pages 1188-1202, October.
    6. Hongjian Shi & Mathias Drton & Marc Hallin & Fang Han, 2023. "Semiparametrically Efficient Tests of Multivariate Independence Using Center-Outward Quadrant, Spearman, and Kendall Statistics," Working Papers ECARES 2023-03, ULB -- Universite Libre de Bruxelles.
    7. Jun Yu & Jiaqi Liu & HaiYing Wang, 2023. "Information-based optimal subdata selection for non-linear models," Statistical Papers, Springer, vol. 64(4), pages 1069-1093, August.
    8. Gunsilius, Florian F., 2023. "A condition for the identification of multivariate models with binary instruments," Journal of Econometrics, Elsevier, vol. 235(1), pages 220-238.
    9. Florian F Gunsilius, 2025. "A primer on optimal transport for causal inference with observational data," Papers 2503.07811, arXiv.org, revised Mar 2025.
    10. Tianzhen Wang & Haixiang Zhang, 2022. "Optimal subsampling for multiplicative regression with massive data," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 76(4), pages 418-449, November.
    11. Alberto González-Sanz & Marc Hallin & Bodhisattva Sen, 2023. "Monotone Measure-Preserving Maps in Hilbert Spaces: Existence, Uniqueness, and Stability," Working Papers ECARES 2023-10, ULB -- Universite Libre de Bruxelles.
    12. Deng, Jiayi & Huang, Danyang & Ding, Yi & Zhu, Yingqiu & Jing, Bingyi & Zhang, Bo, 2024. "Subsampling spectral clustering for stochastic block models in large-scale networks," Computational Statistics & Data Analysis, Elsevier, vol. 189(C).
    13. Olivier Paul Faugeras & Ludger Rüschendorf, 2021. "Functional, randomized and smoothed multivariate quantile regions," Post-Print hal-03352330, HAL.
    14. Hudecová, Šárka & Šiman, Miroslav, 2024. "Stochastic hyperplane-based ranks and their use in multivariate portmanteau tests," Journal of Multivariate Analysis, Elsevier, vol. 204(C).
    15. Marcel Klatt & Axel Munk & Yoav Zemel, 2022. "Limit laws for empirical optimal solutions in random linear programs," Annals of Operations Research, Springer, vol. 315(1), pages 251-278, August.
    16. Sokbae Lee & Serena Ng, 2020. "An Econometric Perspective on Algorithmic Subsampling," Annual Review of Economics, Annual Reviews, vol. 12(1), pages 45-80, August.
    17. Marc Hallin & Hang Liu, 2022. "Center-outward Rank- and Sign-based VARMA Portmanteau Tests," Working Papers ECARES 2022-27, ULB -- Universite Libre de Bruxelles.
    18. Serena Ng & Susannah Scanlan, 2023. "Constructing High Frequency Economic Indicators by Imputation," Papers 2303.01863, arXiv.org, revised Oct 2023.
    19. Alfred Galichon, 2021. "The Unreasonable Effectiveness of Optimal Transport in Economics," SciencePo Working papers Main hal-03936221, HAL.
    20. Bing Guo & Xiao-Rong Li & Min-Qian Liu & Xue Yang, 2023. "Construction of orthogonal general sliced Latin hypercube designs," Statistical Papers, Springer, vol. 64(3), pages 987-1014, June.
