
Linking individuals across historical sources: A fully automated approach

Author

Listed:
  • Ran Abramitzky
  • Roy Mill
  • Santiago Pérez

Abstract

Linking individuals across historical datasets relies on information such as name and age that is both non-unique and prone to enumeration and transcription errors. These errors make it impossible to find the correct match with certainty. In the first part of the paper, we suggest a fully automated probabilistic method for linking historical datasets that enables researchers to create samples at the frontier of minimizing type I errors (false positives) and type II errors (false negatives). The first step guides researchers in the choice of which variables to use for linking. The second step uses the Expectation-Maximization (EM) algorithm, a standard tool in statistics, to compute the probability that each pair of records corresponds to the same individual. The third step suggests how to use these estimated probabilities to choose which records to use in the analysis. In the second part of the paper, we apply the method to link historical population censuses in the US and Norway, and use these samples to estimate measures of intergenerational occupational mobility. The estimates using our method are remarkably similar to those obtained using the IPUMS method, which relies on hand linking to create a training sample. We provide R code and a Stata command that implement this method.
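The second and third steps described in the abstract can be illustrated with a small sketch. This is not the authors' R code or Stata command. It assumes a simplified setting in which each candidate pair of records is summarized by binary agreement indicators on three hypothetical fields (for example first name, last name, and age), and in which agreements are independent across fields given match status. The function names, the starting values, and the 0.9 cutoff are all assumptions made for illustration.

```python
import random


def _posterior(pattern, p_match, m, u):
    """P(pair is a match | agreement pattern) under the current parameters."""
    like_m, like_u = p_match, 1.0 - p_match
    for agree, mj, uj in zip(pattern, m, u):
        like_m *= mj if agree else 1.0 - mj
        like_u *= uj if agree else 1.0 - uj
    return like_m / (like_m + like_u)


def em_match_probabilities(patterns, n_iter=100, eps=1e-6):
    """Fit a two-class mixture (match / non-match) by EM.

    patterns: list of tuples of 0/1 agreement indicators, one per candidate pair.
    Returns (posterior match probabilities, p_match, m, u).
    """
    n, k = len(patterns), len(patterns[0])
    p_match = 0.1        # assumed starting share of true matches
    m = [0.9] * k        # P(field agrees | match), assumed starting values
    u = [0.1] * k        # P(field agrees | non-match), assumed starting values

    def clamp(x):
        # Keep probabilities strictly inside (0, 1) so likelihoods stay positive.
        return min(max(x, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: posterior match probability for every candidate pair.
        post = [_posterior(g, p_match, m, u) for g in patterns]
        # M-step: re-estimate the parameters from the weighted agreement counts.
        total = max(sum(post), eps)
        rest = max(n - total, eps)
        p_match = clamp(total / n)
        for j in range(k):
            m[j] = clamp(sum(w * g[j] for w, g in zip(post, patterns)) / total)
            u[j] = clamp(sum((1.0 - w) * g[j] for w, g in zip(post, patterns)) / rest)

    # Final E-step so the returned probabilities use the last parameter update.
    post = [_posterior(g, p_match, m, u) for g in patterns]
    return post, p_match, m, u


def simulate_patterns(n_pairs=2000, share_match=0.2, seed=0):
    """Synthetic agreement patterns (illustrative data, not from the paper)."""
    rng = random.Random(seed)
    patterns = []
    for _ in range(n_pairs):
        is_match = rng.random() < share_match
        p_agree = 0.9 if is_match else 0.1
        patterns.append(tuple(int(rng.random() < p_agree) for _ in range(3)))
    return patterns


if __name__ == "__main__":
    patterns = simulate_patterns()
    post, p_match, m, u = em_match_probabilities(patterns)
    # Third step, in its simplest form: keep pairs above an assumed cutoff.
    kept = [i for i, p in enumerate(post) if p >= 0.9]
    print(f"estimated match share: {p_match:.3f}, pairs kept: {len(kept)}")
```

Raising the cutoff trades fewer false positives for more false negatives, which is the type I versus type II trade-off the abstract refers to. A real application would compare names and ages with graded distance measures rather than binary agreement.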

Suggested Citation

  • Ran Abramitzky & Roy Mill & Santiago Pérez, 2020. "Linking individuals across historical sources: A fully automated approach," Historical Methods: A Journal of Quantitative and Interdisciplinary History, Taylor & Francis Journals, vol. 53(2), pages 94-111, April.
  • Handle: RePEc:taf:vhimxx:v:53:y:2020:i:2:p:94-111
    DOI: 10.1080/01615440.2018.1543034



    Citations

    Cited by:

    1. Hector Blanco & Noémie Sportiche, 2025. "There Goes the Neighborhood? The Local Impacts of State Policies That Override Municipal Zoning," CESifo Working Paper Series 12140, CESifo.
    2. Valerie Michelman & Joseph Price & Seth D Zimmerman, 2022. "Old Boys’ Clubs and Upward Mobility Among the Educational Elite [Do Immigrants Assimilate More Slowly Today Than in the Past?]," The Quarterly Journal of Economics, Oxford University Press, vol. 137(2), pages 845-909.
    3. A‐Sung Hong, 2024. "Beyond the finish line: How losing in patent race drives post‐race innovation," Strategic Management Journal, Wiley Blackwell, vol. 45(5), pages 968-993, May.
    4. Narciso, Gaia & Severgnini, Battista, 2023. "The deep roots of rebellion," Journal of Development Economics, Elsevier, vol. 160(C).
    5. Zhu, Ziming, 2022. "Like father like son? Intergenerational immobility in England, 1851-1911," LSE Research Online Documents on Economics 117588, London School of Economics and Political Science, LSE Library.
    6. Bennett, Robert J. & Montebruno, Piero & Van Lieshout, Carry & Smith, Harry, 2022. "Business entry and exit: career changes of proprietors in England and Wales (1851-81) using record-linkage," LSE Research Online Documents on Economics 113867, London School of Economics and Political Science, LSE Library.
    7. Tyler Anbinder & Dylan Connor & Cormac Ó Gráda & Simone Wegge, 2021. "The Problem of False Positives in Automated Census Linking: Evidence from Nineteenth-Century New York's Irish Immigrants," Working Papers 202114, School of Economics, University College Dublin.
    8. Michele Baggio & Metin Cosgel, 2023. "Racial Diversity and Team Performance: Evidence from the American Offshore Whaling Industry," Working papers 2023-04, University of Connecticut, Department of Economics, revised Feb 2024.
    9. Joseph Price & Kasey Buckles & Jacob Van Leeuwen & Isaac Riley, 2019. "Combining Family History and Machine Learning to Link Historical Records," NBER Working Papers 26227, National Bureau of Economic Research, Inc.
    10. Bergeaud, Antonin & Verluise, Cyril, 2024. "A new dataset to study a century of innovation in Europe and in the US," Research Policy, Elsevier, vol. 53(1).
    11. repec:osf:socarx:q79ye_v1 is not listed on IDEAS
    12. Yannick Dupraz & Andreas Ferrara, 2025. "Fatherless: The Long-Term Effects of Losing a Father in the U.S. Civil War," Journal of Human Resources, University of Wisconsin Press, vol. 60(4), pages 1126-1174.
    13. Dahl, Christian M. & Johansen, Torben S.D. & Sørensen, Emil N. & Wittrock, Simon, 2023. "HANA: A handwritten name database for offline handwritten text recognition," Explorations in Economic History, Elsevier, vol. 87(C).
    14. Anna Aizer & Shari Eli & Adriana Lleras-Muney & Keyoung Lee, 2020. "Do Youth Employment Programs Work? Evidence from the New Deal," NBER Working Papers 27103, National Bureau of Economic Research, Inc.
    15. Luque de Haro, Víctor A. & Pujadas-Mora, Joana M. & García-Gómez, José J., 2021. "Inequality in mortality in pre-industrial southern Europe during an epidemic episode: socio-economic determinants (eighteenth - nineteenth centuries Spain)," Economics & Human Biology, Elsevier, vol. 40(C).
    16. Eric S. M. Protzer & Sultan Orazbayev & Andres Gomez-Lievano & Matte Hartog & Frank Neffke, 2024. "A New Algorithm to Efficiently Match U.S. Census Records and Balance Representativity with Match Quality," Growth Lab Working Papers 238, Harvard's Growth Lab.
    17. Price, Joseph & Buckles, Kasey & Van Leeuwen, Jacob & Riley, Isaac, 2021. "Combining family history and machine learning to link historical records: The Census Tree data set," Explorations in Economic History, Elsevier, vol. 80(C).
    18. Alexander, Monica, 2018. "Deaths without denominators: using a matched dataset to study mortality patterns in the United States," SocArXiv q79ye, Center for Open Science.
    19. Zhu, Ziming, 2022. "Like father like son? Intergenerational immobility in England, 1851-1911," Economic History Working Papers 117588, London School of Economics and Political Science, Department of Economic History.

    More about this item

    JEL classification:

    • C10 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General - - - General
    • J01 - Labor and Demographic Economics - - General - - - Labor Economics: General
    • J10 - Labor and Demographic Economics - - Demographic Economics - - - General
    • N00 - Economic History - - General - - - General
