IDEAS home Printed from https://ideas.repec.org/a/bla/istatr/v89y2021i2p382-401.html
   My bibliography  Save this article

Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference

Author

Listed:
  • Jae‐Kwang Kim
  • Siu‐Ming Tam

Abstract

The statistical challenges in using big data for making valid statistical inference in the finite population have been well documented in literature. These challenges are due primarily to statistical bias arising from under‐coverage in the big data source to represent the population of interest and measurement errors in the variables available in the data set. By stratifying the population into a big data stratum and a missing data stratum, we can estimate the missing data stratum by using a fully responding probability sample and hence the population as a whole by using a data integration estimator. By expressing the data integration estimator as a regression estimator, we can handle measurement errors in the variables in big data and also in the probability sample. We also propose a fully nonparametric classification method for identifying the overlapping units and develop a bias‐corrected data integration estimator under misclassification errors. Finally, we develop a two‐step regression data integration estimator to deal with measurement errors in the probability sample. An advantage of the approach advocated in this paper is that we do not have to make unrealistic missing‐at‐random assumptions for the methods to work. The proposed method is applied to the real data example using 2015–2016 Australian Agricultural Census data.

Suggested Citation

  • Jae‐Kwang Kim & Siu‐Ming Tam, 2021. "Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference," International Statistical Review, International Statistical Institute, vol. 89(2), pages 382-401, August.
  • Handle: RePEc:bla:istatr:v:89:y:2021:i:2:p:382-401
    DOI: 10.1111/insr.12434
    as

    Download full text from publisher

    File URL: https://doi.org/10.1111/insr.12434
    Download Restriction: no

    File URL: https://libkey.io/10.1111/insr.12434?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item
    ---><---

    References listed on IDEAS

    as
    1. Siu-Ming Tam & Frederic Clarke, 2015. "Big Data, Official Statistics and Some Initiatives by the Australian Bureau of Statistics," International Statistical Review, International Statistical Institute, vol. 83(3), pages 436-448, December.
    2. Li‐Chun Zhang, 2012. "Topics of statistical theory for register‐based statistics and data integration," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 66(1), pages 41-63, February.
    3. David J. Hand, 2018. "Statistical challenges of administrative and transaction data," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(3), pages 555-605, June.
    4. Jae Kwang Kim & J. N. K. Rao, 2009. "A unified approach to linearization variance estimation from survey data after imputation for item nonresponse," Biometrika, Biometrika Trust, vol. 96(4), pages 917-932.
    5. Jae Kwang Kim & Mingue Park, 2010. "Calibration Estimation in Survey Sampling," International Statistical Review, International Statistical Institute, vol. 78(1), pages 21-39, April.
    6. Niels Keiding & Thomas A. Louis, 2016. "Perils and potentials of self-selected entry to epidemiological studies and surveys," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 179(2), pages 319-376, February.
    Full references (including those not matched with items on IDEAS)

    Citations

    Citations are extracted by the CitEc Project, subscribe to its RSS feed for this item.
    as


    Cited by:

    1. Medous, Estelle & Goga, Camelia & Ruiz-Gazen, Anne & Beaumont, Jean-François & Dessertaine, Alain & Puech, Pauline, 2022. "QR Prediction for Statistical Data Integration," TSE Working Papers 22-1344, Toulouse School of Economics (TSE).
    2. Chien-Min Huang & F. Jay Breidt, 2023. "A dual-frame approach for estimation with respondent-driven samples," METRON, Springer;Sapienza Università di Roma, vol. 81(1), pages 65-81, April.
    3. Ieva Burakauskaitė & Andrius Čiginas, 2023. "An Approach to Integrating a Non-Probability Sample in the Population Census," Mathematics, MDPI, vol. 11(8), pages 1-14, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Lothian Jack & Holmberg Anders & Seyb Allyson, 2019. "An Evolutionary Schema for Using “it-is-what-it-is” Data in Official Statistics," Journal of Official Statistics, Sciendo, vol. 35(1), pages 137-165, March.
    2. John L. Czajka & Mathew Stange, "undated". "Transparency in the Reporting of Quality for Integrated Data: A Review of International Standards and Guidelines," Mathematica Policy Research Reports 984e8919667b48ab9aabcbbcb, Mathematica Policy Research.
    3. Serena Pattaro & Nick Bailey & Chris Dibben, 2020. "Using Linked Longitudinal Administrative Data to Identify Social Disadvantage," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 147(3), pages 865-895, February.
    4. David J. Hand, 2018. "Statistical challenges of administrative and transaction data," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(3), pages 555-605, June.
    5. Jae Kwang Kim & Zhonglei Wang & Zhengyuan Zhu & Nathan B. Cruze, 2018. "Combining Survey and Non-survey Data for Improved Sub-area Prediction Using a Multi-level Model," Journal of Agricultural, Biological and Environmental Statistics, Springer;The International Biometric Society;American Statistical Association, vol. 23(2), pages 175-189, June.
    6. Peter G. M. van der Heijden & Maarten Cruyff & Paul A. Smith & Christine Bycroft & Patrick Graham & Nathaniel Matheson‐Dunning, 2022. "Multiple system estimation using covariates having missing values and measurement error: Estimating the size of the Māori population in New Zealand," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(1), pages 156-177, January.
    7. Bakker Bart F.M. & Heijden Peter G.M. van der & Scholtus Sander, 2015. "Preface," Journal of Official Statistics, Sciendo, vol. 31(3), pages 349-355, September.
    8. Fulvia Cerroni & Grazia Di Bella & Lorena Galiè, 2014. "Evaluating administrative data quality as inputof the statistical production process," Rivista di statistica ufficiale, ISTAT - Italian National Institute of Statistics - (Rome, ITALY), vol. 16(1-2), pages 117-146.
    9. Jonas F. Schenkel & Li‐Chun Zhang, 2022. "Adjusting misclassification using a second classifier with an external validation sample," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(4), pages 1882-1902, October.
    10. Fabrizio Antolini & Laura Grassini, 2020. "Methodological problems in the economic measurement of tourism: the need for new sources of information," Quality & Quantity: International Journal of Methodology, Springer, vol. 54(5), pages 1769-1780, December.
    11. Elżbieta Gołata, 2016. "Shift In Methodology And Population Census Quality," Statistics in Transition New Series, Polish Statistical Association, vol. 17(4), pages 631-658, December.
    12. Iacus Stefano M. & Salini Silvia & Siletti Elena & Porro Giuseppe, 2020. "Controlling for Selection Bias in Social Media Indicators through Official Statistics: a Proposal," Journal of Official Statistics, Sciendo, vol. 36(2), pages 315-338, June.
    13. Denis Devaud & Yves Tillé, 2019. "Deville and Särndal’s calibration: revisiting a 25-years-old successful optimization problem," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 28(4), pages 1033-1065, December.
    14. Stephanie Coffey, PhD. & Jaya Damineni & John Eltinge, PhD. & Anup Mathur, PhD. & Kayla Varela & Allison Zotti, 2023. "Some Open Questions on Multiple-Source Extensions of Adaptive-Survey Design Concepts and Methods," Working Papers 23-03, Center for Economic Studies, U.S. Census Bureau.
    15. Laureti Tiziana & Polidoro Federico, 2022. "Using Scanner Data for Computing Consumer Spatial Price Indexes at Regional Level: An Empirical Application for Grocery Products in Italy," Journal of Official Statistics, Sciendo, vol. 38(1), pages 23-56, March.
    16. Li-Chun Zhang & Ib Thomsen & Øyvin Kleven, 2013. "On the Use of Auxiliary and Paradata for Dealing With Non-sampling Errors in Household Surveys," International Statistical Review, International Statistical Institute, vol. 81(2), pages 270-288, August.
    17. Gelein, Brigitte & Haziza, David & Causeur, David, 2014. "Preserving relationships between variables with MIVQUE based imputation for missing survey data," Journal of Multivariate Analysis, Elsevier, vol. 131(C), pages 197-208.
    18. Paul Allin & David J. Hand, 2017. "New statistics for old?—measuring the wellbeing of the UK," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 180(1), pages 3-43, January.
    19. Ton de Waal & Arnout van Delden & Sander Scholtus, 2020. "Multi‐source Statistics: Basic Situations and Methods," International Statistical Review, International Statistical Institute, vol. 88(1), pages 203-228, April.
    20. Elżbieta Gołata, 2015. "Sae Education Challenges To Academics And Nsi," Statistics in Transition New Series, Polish Statistical Association, vol. 16(4), pages 611-630, December.

    More about this item

    Statistics

    Access and download statistics

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:bla:istatr:v:89:y:2021:i:2:p:382-401. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Wiley Content Delivery (email available below). General contact details of provider: https://edirc.repec.org/data/isiiinl.html .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.