
Finding Needles in Haystacks: Multiple-Imputation Record Linkage Using Machine Learning

Author

Listed:
  • John M. Abowd
  • Joelle Abramowitz
  • Margaret C. Levenstein
  • Kristin McCue
  • Dhiren Patki
  • Trivellore Raghunathan
  • Ann M. Rodgers
  • Matthew D. Shapiro
  • Nada Wasi
  • Dawn Zinsser

Abstract

This paper considers the problem of record linkage between a household-level survey and an establishment-level frame in the absence of unique identifiers. Linkage between frames in this setting is challenging because the distribution of employment across establishments is highly skewed. To address these difficulties, this paper develops a probabilistic record linkage methodology that combines machine learning (ML) with multiple imputation (MI). This ML-MI methodology is applied to link survey respondents in the Health and Retirement Study to their workplaces in the Census Business Register. The linked data reveal new evidence that non-sampling errors in household survey data are correlated with respondents’ workplace characteristics.
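
The abstract does not spell out the mechanics of the ML-MI procedure. The following is a minimal sketch of the general idea, not the authors' implementation. It assumes a fitted model has already assigned each survey respondent a set of candidate establishments with match probabilities. Each implicate draws one link per respondent in proportion to those probabilities, and the completed-data estimates are combined with Rubin's rules. The names `draw_links`, `rubin_combine`, and `mi_mean_size`, and the input layout, are hypothetical.

```python
import random
import statistics


def draw_links(candidates, rng):
    """Draw one candidate establishment per respondent.

    `candidates` maps a respondent id to a list of
    (establishment_id, match_probability) pairs (hypothetical layout).
    """
    links = {}
    for respondent, options in candidates.items():
        ids = [est_id for est_id, _ in options]
        weights = [prob for _, prob in options]
        # Sample proportionally to the model-based match probabilities.
        links[respondent] = rng.choices(ids, weights=weights, k=1)[0]
    return links


def rubin_combine(estimates, variances):
    """Combine m completed-data estimates using Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)       # average within-implicate variance
    between = statistics.variance(estimates)   # between-implicate variance (divisor m - 1)
    total = within + (1 + 1 / m) * between     # total variance
    return q_bar, total


def mi_mean_size(candidates, sizes, m=5, seed=0):
    """Estimate mean establishment size across m imputed linkages.

    `sizes` maps an establishment id to its employment (assumed input).
    Requires at least two respondents and m >= 2.
    """
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        links = draw_links(candidates, rng)
        values = [sizes[est_id] for est_id in links.values()]
        estimates.append(statistics.fmean(values))
        # Variance of the sample mean within this implicate.
        variances.append(statistics.variance(values) / len(values))
    return rubin_combine(estimates, variances)
```

Under these assumptions, uncertainty about which establishment is the true match shows up as between-implicate variance, so analyses of the linked data do not treat any single link as certain.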

Suggested Citation

  • John M. Abowd & Joelle Abramowitz & Margaret C. Levenstein & Kristin McCue & Dhiren Patki & Trivellore Raghunathan & Ann M. Rodgers & Matthew D. Shapiro & Nada Wasi & Dawn Zinsser, 2021. "Finding Needles in Haystacks: Multiple-Imputation Record Linkage Using Machine Learning," Working Papers 21-35, Center for Economic Studies, U.S. Census Bureau.
  • Handle: RePEc:cen:wpaper:21-35

    Download full text from publisher

    File URL: https://www2.census.gov/ces/wp/2021/CES-WP-21-35.pdf
    File Function: First version, 2021
    Download Restriction: no

    References listed on IDEAS

    1. P. Lahiri & Michael D. Larsen, 2005. "Regression Analysis With Linked Data," Journal of the American Statistical Association, American Statistical Association, vol. 100, pages 222-230, March.
    2. Brown, Charles & Medoff, James, 1989. "The Employer Size-Wage Effect," Journal of Political Economy, University of Chicago Press, vol. 97(5), pages 1027-1059, October.
    3. John M. Abowd & Bryce E. Stephens & Lars Vilhuber & Fredrik Andersson & Kevin L. McKinney & Marc Roemer & Simon Woodcock, 2009. "The LEHD Infrastructure Files and the Creation of the Quarterly Workforce Indicators," NBER Chapters, in: Producer Dynamics: New Evidence from Micro Data, pages 149-230, National Bureau of Economic Research, Inc.
    4. Roee Gutman & Christopher C. Afendulis & Alan M. Zaslavsky, 2013. "A Bayesian Procedure for File Linking to Analyze End-of-Life Medical Costs," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 108(501), pages 34-47, March.
    5. John M. Abowd & Martha H. Stinson, 2013. "Estimating Measurement Error in Annual Job Earnings: A Comparison of Survey and Administrative Data," The Review of Economics and Statistics, MIT Press, vol. 95(5), pages 1451-1467, December.
    6. Oi, Walter Y. & Idson, Todd L., 1999. "Firm size and wages," Handbook of Labor Economics, in: O. Ashenfelter & D. Card (ed.), Handbook of Labor Economics, edition 1, volume 3, chapter 33, pages 2165-2214, Elsevier.
    7. Rebecca C. Steorts & Rob Hall & Stephen E. Fienberg, 2016. "A Bayesian Approach to Graphical Record Linkage and Deduplication," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(516), pages 1660-1672, October.
    8. Martha J. Bailey & Connor Cole & Morgan Henderson & Catherine Massey, 2020. "How Well Do Automated Linking Methods Perform? Lessons from US Historical Data," Journal of Economic Literature, American Economic Association, vol. 58(4), pages 997-1044, December.
    9. Nicholas Bloom & Fatih Guvenen & Benjamin S. Smith & Jae Song & Till von Wachter, 2018. "The Disappearing Large-Firm Wage Premium," AEA Papers and Proceedings, American Economic Association, vol. 108, pages 317-322, May.
    10. Hui Zou & Trevor Hastie, 2005. "Addendum: Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(5), pages 768-768, November.
    11. Timothy Dunne & J. Bradford Jensen & Mark J. Roberts, 2009. "Producer Dynamics: New Evidence from Micro Data," NBER Books, National Bureau of Economic Research, Inc, number dunn05-1, March.
    12. Hui Zou & Trevor Hastie, 2005. "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(2), pages 301-320, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. John M. Abowd & Joelle Abramowitz & Margaret C. Levenstein & Kristin McCue & Dhiren Patki & Trivellore Raghunathan & Ann M. Rodgers & Matthew D. Shapiro & Nada Wasi, 2019. "Optimal Probabilistic Record Linkage: Best Practice for Linking Employers in Survey and Administrative Data," Working Papers 19-08, Center for Economic Studies, U.S. Census Bureau.
    2. Nicholas Bloom & Scott Ohlmacher & Cristina Tello-Trillo & Melanie Wallskog, 2021. "Pay, Productivity and Management," Working Papers 21-31, Center for Economic Studies, U.S. Census Bureau.
    3. Hartmut Egger & Elke Jahn & Stefan Kornitzky, 2021. "How Does the Position in Business Group Hierarchies Affect Workers’ Wages?," Working Papers 213, Bavarian Graduate Program in Economics (BGPE).
    4. Henry Hyatt & Erika McEntarfer & John Haltiwanger, 2014. "Cyclical Reallocation of Workers Across Large and Small Employers," 2014 Meeting Papers 735, Society for Economic Dynamics.
    5. Tania Babina & Wenting Ma & Christian Moser & Paige Ouimet & Rebecca Zarutskie, 2019. "Pay, Employment, and Dynamics of Young Firms," Working Papers 19-23, Center for Economic Studies, U.S. Census Bureau.
    6. Emin Dinlersoz & Henry Hyatt & Hubert Janicki, 2019. "Who Works for Whom? Worker Sorting in a Model of Entrepreneurship with Heterogeneous Labor Markets," Review of Economic Dynamics, Elsevier for the Society for Economic Dynamics, vol. 34, pages 244-266, October.
    7. Brianna Cardiff-Hicks & Francine Lafontaine & Kathryn Shaw, 2015. "Do Large Modern Retailers Pay Premium Wages?," ILR Review, Cornell University, ILR School, vol. 68(3), pages 633-665, May.
    8. Melanie Jones & Ezgi Kaya, 2023. "The UK gender pay gap: Does firm size matter?," Economica, London School of Economics and Political Science, vol. 90(359), pages 937-952, July.
    9. Ouimet, Paige & Zarutskie, Rebecca, 2014. "Who works for startups? The relation between firm age, employee age, and growth," Journal of Financial Economics, Elsevier, vol. 112(3), pages 386-407.
    10. Jahn, Elke & Egger, Hartmut & Kornitzky, Stefan, 2021. "Does the Position in Business Group Hierarchies Affect Workers' Wages?," VfS Annual Conference 2021 (Virtual Conference): Climate Economics 242374, Verein für Socialpolitik / German Economic Association.
    11. Egger, Hartmut & Jahn, Elke & Kornitzky, Stefan, 2022. "How does the position in business group hierarchies affect workers’ wages?," Journal of Economic Behavior & Organization, Elsevier, vol. 194(C), pages 244-263.
    12. Jaime Arellano-Bover, 2024. "Career Consequences of Firm Heterogeneity for Young Workers: First Job and Firm Size," Journal of Labor Economics, University of Chicago Press, vol. 42(2), pages 549-589.
    13. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
    14. Oxana Babecka Kucharcukova & Jan Bruha, 2016. "Nowcasting the Czech Trade Balance," Working Papers 2016/11, Czech National Bank.
    15. Carstensen, Kai & Heinrich, Markus & Reif, Magnus & Wolters, Maik H., 2020. "Predicting ordinary and severe recessions with a three-state Markov-switching dynamic factor model," International Journal of Forecasting, Elsevier, vol. 36(3), pages 829-850.
    16. Hou-Tai Chang & Ping-Huai Wang & Wei-Fang Chen & Chen-Ju Lin, 2022. "Risk Assessment of Early Lung Cancer with LDCT and Health Examinations," IJERPH, MDPI, vol. 19(8), pages 1-12, April.
    17. Margherita Giuzio, 2017. "Genetic algorithm versus classical methods in sparse index tracking," Decisions in Economics and Finance, Springer;Associazione per la Matematica, vol. 40(1), pages 243-256, November.
    18. Henrekson, Magnus & Johansson, Dan, 2010. "Firm Growth, Institutions and Structural Transformation," Ratio Working Papers 150, The Ratio Institute.
    19. Nicolaj N. Mühlbach, 2020. "Tree-based Synthetic Control Methods: Consequences of moving the US Embassy," CREATES Research Papers 2020-04, Department of Economics and Business Economics, Aarhus University.
    20. Wang, Qiao & Zhou, Wei & Cheng, Yonggang & Ma, Gang & Chang, Xiaolin & Miao, Yu & Chen, E, 2018. "Regularized moving least-square method and regularized improved interpolating moving least-square method with nonsingular moment matrices," Applied Mathematics and Computation, Elsevier, vol. 325(C), pages 120-145.

    More about this item

    JEL classification:

    • C13 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General - - - Estimation: General
    • C18 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General - - - Methodological Issues: General
    • C81 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - Methodology for Collecting, Estimating, and Organizing Microeconomic Data; Data Access
