IDEAS home Printed from https://ideas.repec.org/a/bla/istatr/v89y2021i2p382-401.html

Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference

Author

Listed:
  • Jae‐Kwang Kim
  • Siu‐Ming Tam

Abstract

The statistical challenges in using big data for making valid statistical inference in the finite population have been well documented in the literature. These challenges are due primarily to statistical bias arising from under‐coverage of the population of interest by the big data source and from measurement errors in the variables available in the data set. By stratifying the population into a big data stratum and a missing data stratum, we can estimate the missing data stratum by using a fully responding probability sample, and hence estimate the population as a whole by using a data integration estimator. By expressing the data integration estimator as a regression estimator, we can handle measurement errors in the variables in big data and also in the probability sample. We also propose a fully nonparametric classification method for identifying the overlapping units and develop a bias‐corrected data integration estimator under misclassification errors. Finally, we develop a two‐step regression data integration estimator to deal with measurement errors in the probability sample. An advantage of the approach advocated in this paper is that we do not have to make unrealistic missing‐at‐random assumptions for the methods to work. The proposed method is applied to a real data example using 2015–2016 Australian Agricultural Census data.
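The stratification idea in the abstract can be illustrated with a small sketch. It is not the authors' code and it covers only the simplest case: no measurement error, and an overlap indicator that is observed correctly for every sampled unit. Under those assumptions, the population total is estimated as the observed total of the big data stratum plus a design-weighted total over the sampled units that fall outside the big data source. The function names, the toy population, and the inclusion probabilities (0.8 and 0.3) are all hypothetical choices made for illustration.

```python
import random


def data_integration_total(big_data_y, sample_y, sample_weights, sample_in_big_data):
    """Estimate a population total from a big data source plus a probability sample.

    The big data stratum is summed directly. The missing data stratum is
    estimated from the sampled units NOT found in the big data source,
    each multiplied by its design weight.
    """
    big_data_total = sum(big_data_y)
    missing_stratum_total = sum(
        weight * value
        for value, weight, in_big in zip(sample_y, sample_weights, sample_in_big_data)
        if not in_big
    )
    return big_data_total + missing_stratum_total


def simulate(seed=0, population_size=10_000, sample_size=500):
    """Toy comparison on a hypothetical population (all settings are assumptions)."""
    rng = random.Random(seed)
    y = [rng.gauss(50, 10) for _ in range(population_size)]
    # Inclusion in the big data source depends on y itself, so the
    # missing-at-random assumption fails by construction.
    in_big = [rng.random() < (0.8 if value > 50 else 0.3) for value in y]
    big_data_y = [value for value, flag in zip(y, in_big) if flag]

    # Simple random sample without replacement; every design weight is N / n.
    sample_ids = rng.sample(range(population_size), sample_size)
    weights = [population_size / sample_size] * sample_size

    estimate = data_integration_total(
        big_data_y,
        [y[i] for i in sample_ids],
        weights,
        [in_big[i] for i in sample_ids],
    )
    # Naive alternative: scale the big data mean up to the population size.
    naive = population_size * sum(big_data_y) / len(big_data_y)
    return sum(y), naive, estimate


if __name__ == "__main__":
    true_total, naive, estimate = simulate()
    print(f"true total:       {true_total:,.0f}")
    print(f"naive big data:   {naive:,.0f}")
    print(f"data integration: {estimate:,.0f}")
```

Because the sample is only used to estimate the part of the population that the big data source misses, the sketch needs no model for how units enter the big data source. The regression, misclassification, and two-step extensions described in the abstract are not covered here.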

Suggested Citation

  • Jae‐Kwang Kim & Siu‐Ming Tam, 2021. "Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference," International Statistical Review, International Statistical Institute, vol. 89(2), pages 382-401, August.
  • Handle: RePEc:bla:istatr:v:89:y:2021:i:2:p:382-401
    DOI: 10.1111/insr.12434


    Cited by:

    1. Medous, Estelle & Goga, Camelia & Ruiz-Gazen, Anne & Beaumont, Jean-François & Dessertaine, Alain & Puech, Pauline, 2022. "QR Prediction for Statistical Data Integration," TSE Working Papers 22-1344, Toulouse School of Economics (TSE).
    2. Ieva Burakauskaitė & Andrius Čiginas, 2023. "An Approach to Integrating a Non-Probability Sample in the Population Census," Mathematics, MDPI, vol. 11(8), pages 1-14, April.
    3. Natalia Golini & Paolo Righi, 2024. "Integrating probability and big non-probability samples data to produce Official Statistics," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 33(2), pages 555-580, April.
    4. Chien-Min Huang & F. Jay Breidt, 2023. "A dual-frame approach for estimation with respondent-driven samples," METRON, Springer;Sapienza Università di Roma, vol. 81(1), pages 65-81, April.
    5. Riccardo D’Alberto & Meri Raggi, 2024. "Integrating rather than collecting: statistical matching in the data flood era," Statistical Papers, Springer, vol. 65(4), pages 2135-2163, June.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Lothian Jack & Holmberg Anders & Seyb Allyson, 2019. "An Evolutionary Schema for Using “it-is-what-it-is” Data in Official Statistics," Journal of Official Statistics, Sciendo, vol. 35(1), pages 137-165, March.
    2. Serena Pattaro & Nick Bailey & Chris Dibben, 2020. "Using Linked Longitudinal Administrative Data to Identify Social Disadvantage," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 147(3), pages 865-895, February.
    3. David J. Hand, 2018. "Statistical challenges of administrative and transaction data," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(3), pages 555-605, June.
    4. Fulvia Cerroni & Grazia Di Bella & Lorena Galiè, 2014. "Evaluating administrative data quality as input of the statistical production process," Rivista di statistica ufficiale, ISTAT - Italian National Institute of Statistics - (Rome, ITALY), vol. 16(1-2), pages 117-146.
    5. Fabrizio Antolini & Laura Grassini, 2020. "Methodological problems in the economic measurement of tourism: the need for new sources of information," Quality & Quantity: International Journal of Methodology, Springer, vol. 54(5), pages 1769-1780, December.
    6. Denis Devaud & Yves Tillé, 2019. "Deville and Särndal’s calibration: revisiting a 25-years-old successful optimization problem," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 28(4), pages 1033-1065, December.
    7. Stephanie Coffey, PhD. & Jaya Damineni & John Eltinge, PhD. & Anup Mathur, PhD. & Kayla Varela & Allison Zotti, 2023. "Some Open Questions on Multiple-Source Extensions of Adaptive-Survey Design Concepts and Methods," Working Papers 23-03, Center for Economic Studies, U.S. Census Bureau.
    8. Ton de Waal & Arnout van Delden & Sander Scholtus, 2020. "Multi‐source Statistics: Basic Situations and Methods," International Statistical Review, International Statistical Institute, vol. 88(1), pages 203-228, April.
    9. Yingli Pan & Wen Cai & Zhan Liu, 2022. "Inference for non-probability samples under high-dimensional covariate-adjusted superpopulation model," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 31(4), pages 955-979, October.
    10. J. N. K. Rao, 2021. "On Making Valid Inferences by Integrating Data from Surveys and Other Sources," Sankhya B: The Indian Journal of Statistics, Springer;Indian Statistical Institute, vol. 83(1), pages 242-272, May.
    11. Xiaojun Mao & Zhonglei Wang & Shu Yang, 2023. "Matrix completion under complex survey sampling," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 75(3), pages 463-492, June.
    12. James Jackson & Robin Mitra & Brian Francis & Iain Dove, 2022. "Using saturated count models for user‐friendly synthesis of large confidential administrative databases," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(4), pages 1613-1643, October.
    13. van Delden Arnout & Lorenc Boris & Struijs Peter & Zhang Li-Chun, 2018. "Letter to the Editor," Journal of Official Statistics, Sciendo, vol. 34(2), pages 573-580, June.
    14. Beręsewicz Maciej, 2019. "Correlates of Representation Errors in Internet Data Sources for Real Estate Market," Journal of Official Statistics, Sciendo, vol. 35(3), pages 509-529, September.
    15. Sixia Chen & David Haziza, 2017. "Multiply robust imputation procedures for zero-inflated distributions in surveys," METRON, Springer;Sapienza Università di Roma, vol. 75(3), pages 333-343, December.
    16. Shixiao Zhang & Peisong Han & Changbao Wu, 2023. "Calibration Techniques Encompassing Survey Sampling, Missing Data Analysis and Causal Inference," International Statistical Review, International Statistical Institute, vol. 91(2), pages 165-192, August.
    17. Ashley L. Buchanan & Michael G. Hudgens & Stephen R. Cole & Katie R. Mollan & Paul E. Sax & Eric S. Daar & Adaora A. Adimora & Joseph J. Eron & Michael J. Mugavero, 2018. "Generalizing evidence from randomized trials using inverse probability of sampling weights," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(4), pages 1193-1209, October.
    18. Lili Yu & Yichuan Zhao, 2022. "A Bootstrap Method for a Multiple-Imputation Variance Estimator in Survey Sampling," Stats, MDPI, vol. 5(4), pages 1-11, November.
    19. Lingxiao Wang & Barry I. Graubard & Hormuzd A. Katki & Yan Li, 2020. "Improving external validity of epidemiologic cohort analyses: a kernel weighting approach," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 183(3), pages 1293-1311, June.
    20. Damião N. Da Silva & Li‐Chun Zhang, 2021. "A calibrated imputation method for secondary data analysis of survey data," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 48(1), pages 25-41, March.
