Printed from https://ideas.repec.org/a/bla/istatr/v89y2021i2p382-401.html

Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference

Author

Listed:
  • Jae‐Kwang Kim
  • Siu‐Ming Tam

Abstract

The statistical challenges in using big data to make valid statistical inference for a finite population have been well documented in the literature. These challenges are due primarily to statistical bias arising from under‐coverage of the population of interest by the big data source and from measurement errors in the variables available in the data set. By stratifying the population into a big data stratum and a missing data stratum, we can estimate the missing data stratum by using a fully responding probability sample, and hence estimate the population as a whole by using a data integration estimator. By expressing the data integration estimator as a regression estimator, we can handle measurement errors in the variables in the big data and also in the probability sample. We also propose a fully nonparametric classification method for identifying the overlapping units and develop a bias‐corrected data integration estimator under misclassification errors. Finally, we develop a two‐step regression data integration estimator to deal with measurement errors in the probability sample. An advantage of the approach advocated in this paper is that we do not have to make unrealistic missing‐at‐random assumptions for the methods to work. The proposed method is applied to a real data example using 2015–2016 Australian Agricultural Census data.
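
The stratification idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes no measurement error, that overlap between the probability sample and the big data source is identified without misclassification, and that the estimator of the population total takes the form "big data stratum total plus design-weighted sample total over units outside the big data source". The names `SampleUnit` and `data_integration_total` and the toy numbers are hypothetical.

```python
from typing import Iterable, NamedTuple


class SampleUnit(NamedTuple):
    """One unit of the probability sample (hypothetical structure)."""
    weight: float       # design weight d_i (inverse inclusion probability)
    in_big_data: bool   # delta_i: True if the unit also appears in the big data source
    y: float            # study variable, assumed here to be measured without error


def data_integration_total(big_data_y: Iterable[float],
                           sample: Iterable[SampleUnit]) -> float:
    """Estimate a population total from two strata.

    Big data stratum: summed directly, since every unit in it is observed.
    Missing data stratum: estimated by the weighted sum of sampled units
    that are NOT in the big data source.
    """
    big_stratum_total = sum(big_data_y)
    missing_stratum_estimate = sum(
        u.weight * u.y for u in sample if not u.in_big_data
    )
    return big_stratum_total + missing_stratum_estimate


# Toy population of 10 units with y = 1..10 (true total 55).
population = list(range(1, 11))

# Assumed coverage: the big data source only captures units with y > 4,
# so inclusion depends on y itself (not missing at random).
big_data = [y for y in population if y > 4]

# Assumed design: a sample of 5 out of 10 units, each with weight 10 / 5 = 2.
sampled_y = [1, 3, 6, 8, 10]
sample = [SampleUnit(weight=2.0, in_big_data=(y > 4), y=float(y)) for y in sampled_y]

estimate = data_integration_total(big_data, sample)
print(estimate)  # 45 from the big data stratum + 2 * (1 + 3) from the sample = 53.0

# With a census as the "sample" (all weights equal to 1) the total is recovered exactly.
census = [SampleUnit(weight=1.0, in_big_data=(y > 4), y=float(y)) for y in population]
census_estimate = data_integration_total(big_data, census)
```

In this sketch the big data stratum contributes no sampling error, so only the missing data stratum is estimated from the sample, and no model for how units enter the big data source is needed.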

Suggested Citation

  • Jae‐Kwang Kim & Siu‐Ming Tam, 2021. "Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference," International Statistical Review, International Statistical Institute, vol. 89(2), pages 382-401, August.
  • Handle: RePEc:bla:istatr:v:89:y:2021:i:2:p:382-401
    DOI: 10.1111/insr.12434

    Download full text from publisher

    File URL: https://doi.org/10.1111/insr.12434
    Download Restriction: no


    References listed on IDEAS

    1. Jae Kwang Kim & J. N. K. Rao, 2009. "A unified approach to linearization variance estimation from survey data after imputation for item nonresponse," Biometrika, Biometrika Trust, vol. 96(4), pages 917-932.
    2. Jae Kwang Kim & Mingue Park, 2010. "Calibration Estimation in Survey Sampling," International Statistical Review, International Statistical Institute, vol. 78(1), pages 21-39, April.
    3. David J. Hand, 2018. "Statistical challenges of administrative and transaction data," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(3), pages 555-605, June.
    4. Li‐Chun Zhang, 2012. "Topics of statistical theory for register‐based statistics and data integration," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 66(1), pages 41-63, February.
    5. Niels Keiding & Thomas A. Louis, 2016. "Perils and potentials of self-selected entry to epidemiological studies and surveys," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 179(2), pages 319-376, February.
    6. Siu-Ming Tam & Frederic Clarke, 2015. "Big Data, Official Statistics and Some Initiatives by the Australian Bureau of Statistics," International Statistical Review, International Statistical Institute, vol. 83(3), pages 436-448, December.

    Citations


    Cited by:

    1. Ieva Burakauskaitė & Andrius Čiginas, 2023. "An Approach to Integrating a Non-Probability Sample in the Population Census," Mathematics, MDPI, vol. 11(8), pages 1-14, April.
    2. Medous, Estelle & Goga, Camelia & Ruiz-Gazen, Anne & Beaumont, Jean-François & Dessertaine, Alain & Puech, Pauline, 2022. "QR Prediction for Statistical Data Integration," TSE Working Papers 22-1344, Toulouse School of Economics (TSE).
    3. Chien-Min Huang & F. Jay Breidt, 2023. "A dual-frame approach for estimation with respondent-driven samples," METRON, Springer;Sapienza Università di Roma, vol. 81(1), pages 65-81, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Lothian Jack & Holmberg Anders & Seyb Allyson, 2019. "An Evolutionary Schema for Using “it-is-what-it-is” Data in Official Statistics," Journal of Official Statistics, Sciendo, vol. 35(1), pages 137-165, March.
    2. John L. Czajka & Mathew Stange, "undated". "Transparency in the Reporting of Quality for Integrated Data: A Review of International Standards and Guidelines," Mathematica Policy Research Reports 984e8919667b48ab9aabcbbcb, Mathematica Policy Research.
    3. Serena Pattaro & Nick Bailey & Chris Dibben, 2020. "Using Linked Longitudinal Administrative Data to Identify Social Disadvantage," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 147(3), pages 865-895, February.
    4. David J. Hand, 2018. "Statistical challenges of administrative and transaction data," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(3), pages 555-605, June.
    5. Jae Kwang Kim & Zhonglei Wang & Zhengyuan Zhu & Nathan B. Cruze, 2018. "Combining Survey and Non-survey Data for Improved Sub-area Prediction Using a Multi-level Model," Journal of Agricultural, Biological and Environmental Statistics, Springer;The International Biometric Society;American Statistical Association, vol. 23(2), pages 175-189, June.
    6. Ana Beatriz Galvão & James Mitchell, 2023. "Real‐Time Perceptions of Historical GDP Data Uncertainty," Oxford Bulletin of Economics and Statistics, Department of Economics, University of Oxford, vol. 85(3), pages 457-481, June.
    7. Lingxiao Wang & Barry I. Graubard & Hormuzd A. Katki & Yan Li, 2020. "Improving external validity of epidemiologic cohort analyses: a kernel weighting approach," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 183(3), pages 1293-1311, June.
    8. Peter G. M. van der Heijden & Maarten Cruyff & Paul A. Smith & Christine Bycroft & Patrick Graham & Nathaniel Matheson‐Dunning, 2022. "Multiple system estimation using covariates having missing values and measurement error: Estimating the size of the Māori population in New Zealand," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(1), pages 156-177, January.
    9. Damião N. Da Silva & Li‐Chun Zhang, 2021. "A calibrated imputation method for secondary data analysis of survey data," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 48(1), pages 25-41, March.
    10. Jiayin Zheng & Yingye Zheng & Li Hsu, 2022. "Re‐calibrating pure risk integrating individual data from two‐phase studies with external summary statistics," Biometrics, The International Biometric Society, vol. 78(4), pages 1515-1529, December.
    11. Skinner, Chris J., 2017. "Comments on the Rao and Fuller (2017) paper," LSE Research Online Documents on Economics 86537, London School of Economics and Political Science, LSE Library.
    12. Bakker Bart F.M. & Heijden Peter G.M. van der & Scholtus Sander, 2015. "Preface," Journal of Official Statistics, Sciendo, vol. 31(3), pages 349-355, September.
    13. Fulvia Cerroni & Grazia Di Bella & Lorena Galiè, 2014. "Evaluating administrative data quality as input of the statistical production process," Rivista di statistica ufficiale, ISTAT - Italian National Institute of Statistics - (Rome, ITALY), vol. 16(1-2), pages 117-146.
    14. David McConnell & Conor Hickey & Norma Bargary & Lea Trela-Larsen & Cathal Walsh & Michael Barry & Roisin Adams, 2021. "Understanding the Challenges and Uncertainties of Seroprevalence Studies for SARS-CoV-2," IJERPH, MDPI, vol. 18(9), pages 1-19, April.
    15. Jonas F. Schenkel & Li‐Chun Zhang, 2022. "Adjusting misclassification using a second classifier with an external validation sample," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(4), pages 1882-1902, October.
    16. Marušić Zrinka & Kožul Marijana & Brozović Ivana, 2020. "Measuring non-commercial tourism traffic in Croatia: Challenges of using administrative data," Croatian Review of Economic, Business and Social Statistics, Sciendo, vol. 6(2), pages 69-81, December.
    17. Fabrizio Antolini & Laura Grassini, 2020. "Methodological problems in the economic measurement of tourism: the need for new sources of information," Quality & Quantity: International Journal of Methodology, Springer, vol. 54(5), pages 1769-1780, December.
    18. Hamori, Shigeyuki & Motegi, Kaiji & Zhang, Zheng, 2019. "Calibration estimation of semiparametric copula models with data missing at random," Journal of Multivariate Analysis, Elsevier, vol. 173(C), pages 85-109.
    19. Gołata Elżbieta, 2016. "Shift in Methodology and Population Census Quality," Statistics in Transition New Series, Polish Statistical Association, vol. 17(4), pages 631-658, December.
    20. Takumi Saegusa, 2020. "Confidence bands for a distribution function with merged data from multiple sources," Statistics in Transition New Series, Polish Statistical Association, vol. 21(4), pages 144-158, August.
