Printed from https://ideas.repec.org/a/plo/pone00/0283811.html

No ground truth? No problem: Improving administrative data linking using active learning and a little bit of guile

Authors

  • Sarah Tahamont
  • Zubin Jelveh
  • Melissa McNeill
  • Shi Yan
  • Aaron Chalfin
  • Benjamin Hansen

Abstract

While linking records across large administrative datasets (“big data”) has the potential to revolutionize empirical social science research, many administrative data files do not have common identifiers and are thus not designed to be linked to others. To address this problem, researchers have developed probabilistic record linkage algorithms, which use statistical patterns in identifying characteristics to perform linking tasks. Naturally, the accuracy of a candidate linking algorithm can be substantially improved when the algorithm has access to “ground-truth” examples, that is, matches which can be validated using institutional knowledge or auxiliary data. Unfortunately, the cost of obtaining these examples is typically high, often requiring a researcher to manually review pairs of records in order to make an informed judgement about whether they are a match. When a pool of ground-truth information is unavailable, researchers can use “active learning” algorithms for linking, which ask the user to provide ground-truth information for select candidate pairs. In this paper, we investigate the value of providing ground-truth examples via active learning for linking performance. We confirm the popular intuition that data linking can be dramatically improved with the availability of ground-truth examples. But critically, in many real-world applications, only a relatively small number of tactically selected ground-truth examples are needed to obtain most of the achievable gains. With a modest investment in ground truth, researchers using a readily available off-the-shelf tool can approximate the performance of a supervised learning algorithm that has access to a large database of ground-truth examples.
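
As a rough illustration of the active learning idea described in the abstract, the following Python sketch uses uncertainty sampling: a simple classifier is trained on a handful of labeled candidate pairs, and the reviewer is then asked only about the pair the classifier is least sure of. This is not the authors' method or the tool they evaluate. The record layout, the toy data, the feature choices, and the names `features`, `match_prob`, `fit`, `active_learn`, and `reviewer` are all assumptions made for illustration.

```python
import math
from difflib import SequenceMatcher

def features(a, b):
    """Comparison vector for two (name, birth_year, ...) records."""
    name_sim = SequenceMatcher(None, a[0].lower(), b[0].lower()).ratio()
    same_year = 1.0 if a[1] == b[1] else 0.0
    return [1.0, name_sim, same_year]  # leading 1.0 acts as the intercept

def match_prob(w, x):
    """Logistic score: estimated probability that the pair is a match."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, y, steps=2000, lr=1.0):
    """Logistic regression fitted by plain batch gradient ascent."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, t in zip(X, y):
            err = t - match_prob(w, x)
            grad = [g + err * xi for g, xi in zip(grad, x)]
        w = [wi + lr * g / len(X) for wi, g in zip(w, grad)]
    return w

def active_learn(pairs, oracle, seed, budget):
    """Uncertainty sampling: repeatedly ask the oracle about the pair
    whose current match probability is closest to 0.5."""
    X = [features(a, b) for a, b in pairs]
    labeled = {i: oracle(*pairs[i]) for i in seed}
    w = fit([X[i] for i in labeled], [labeled[i] for i in labeled])
    for _ in range(budget):
        pool = [i for i in range(len(pairs)) if i not in labeled]
        if not pool:
            break
        query = min(pool, key=lambda i: abs(match_prob(w, X[i]) - 0.5))
        labeled[query] = oracle(*pairs[query])  # one manual review
        w = fit([X[i] for i in labeled], [labeled[i] for i in labeled])
    return w, labeled

# Toy candidate pairs (hypothetical). The third field is a hidden person
# id that only the oracle consults, standing in for a human reviewer.
PAIRS = [
    (("Jonathan Smith", 1980, 1), ("Jonathon Smith", 1980, 1)),
    (("Maria Garcia", 1975, 2), ("Maria Garcia", 1975, 2)),
    (("Maria Garcia", 1975, 2), ("Mario Garza", 1975, 3)),
    (("Li Wei", 1990, 4), ("Lee Way", 1962, 5)),
    (("Robert Brown", 1969, 6), ("Rob Brown", 1969, 6)),
    (("Ann Lee", 1988, 7), ("Tom Baker", 1951, 8)),
    (("Katherine Jones", 1983, 9), ("Catherine Jones", 1983, 9)),
    (("David Kim", 1992, 10), ("Daniel King", 1992, 11)),
]

def reviewer(a, b):
    """Simulated ground truth: 1.0 if both records share a person id."""
    return 1.0 if a[2] == b[2] else 0.0

# Seed with one known match (index 1) and one known non-match (index 5),
# then spend a budget of three additional manual reviews.
weights, labels = active_learn(PAIRS, reviewer, seed=[1, 5], budget=3)
```

After the call, `labels` holds the five reviewed pairs and `match_prob(weights, features(a, b))` scores any remaining candidate pair. The sketch assumes the seed contains at least one match and one non-match; a real linking task would also need blocking and richer comparison features.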

Suggested Citation

  • Sarah Tahamont & Zubin Jelveh & Melissa McNeill & Shi Yan & Aaron Chalfin & Benjamin Hansen, 2023. "No ground truth? No problem: Improving administrative data linking using active learning and a little bit of guile," PLOS ONE, Public Library of Science, vol. 18(4), pages 1-17, April.
  • Handle: RePEc:plo:pone00:0283811
    DOI: 10.1371/journal.pone.0283811

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0283811
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0283811&type=printable
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Sarah Tahamont & Zubin Jelveh & Aaron Chalfin & Shi Yan & Benjamin Hansen, 2019. "Administrative Data Linking and Statistical Power Problems in Randomized Experiments," NBER Working Papers 25657, National Bureau of Economic Research, Inc.
    2. Anna Aizer & Shari Eli & Adriana Lleras-Muney & Keyoung Lee, 2020. "Do Youth Employment Programs Work? Evidence from the New Deal," NBER Working Papers 27103, National Bureau of Economic Research, Inc.
    3. Weill, Laurent, 2011. "How corruption affects bank lending in Russia," Economic Systems, Elsevier, vol. 35(2), pages 230-243, June.
    4. Haikun Zhu, 2018. "Social Stability and Resource Allocation within Business Groups," Working Papers Series 79, Institute for New Economic Thinking.
    5. repec:spo:wpmain:info:hdl:2441/ismjpe8i38qaqpf7c0hldeicl is not listed on IDEAS
    6. Scott Gehlbach & Konstantin Sonin & Ekaterina Zhuravskaya, 2010. "Businessman Candidates," American Journal of Political Science, John Wiley & Sons, vol. 54(3), pages 718-736, July.
    7. Diegmann, André & Pohlan, Laura & Weber, Andrea, 2024. "Do Politicians Affect Firm Outcomes? Evidence from Connections to the German Federal Parliament," IZA Discussion Papers 17031, Institute of Labor Economics (IZA).
    8. Qi‐an Chen & Shuxiang Tang & Yuan Xu, 2022. "Do government subsidies and financing constraints play a dominant role in the effect of state ownership on corporate innovation? Evidence from China," Managerial and Decision Economics, John Wiley & Sons, Ltd., vol. 43(8), pages 3698-3714, December.
    9. Francis, David C. & Kubinec, Robert, 2022. "Beyond Political Connections: A Measurement Model Approach to Estimating Firm-level Political Influence in 41 Economies," Policy Research Working Paper Series 10119, The World Bank.
    10. Qin, Wei & Liang, Quanxi & Jiao, Yan & Lu, Meiting & Shan, Yaowen, 2022. "Social trust and dividend payouts: Evidence from China," Pacific-Basin Finance Journal, Elsevier, vol. 72(C).
    11. Abdul‐Rahman Khokhar & Hesam Shahriari, 2022. "Is the SEC captured? Evidence from political connectedness and SEC enforcement actions," Accounting and Finance, Accounting and Finance Association of Australia and New Zealand, vol. 62(2), pages 2725-2756, June.
    12. Marcel Fafchamps & Julien Labonne, 2017. "Do Politicians’ Relatives Get Better Jobs? Evidence from Municipal Elections," The Journal of Law, Economics, and Organization, Oxford University Press, vol. 33(2), pages 268-300.
    13. Carvalho, Augusto & Guimaraes, Bernardo, 2018. "State-controlled companies and political risk: Evidence from the 2014 Brazilian election," Journal of Public Economics, Elsevier, vol. 159(C), pages 66-78.
    14. Deniz Igan & Prachi Mishra & Thierry Tressel, 2012. "A Fistful of Dollars: Lobbying and the Financial Crisis," NBER Macroeconomics Annual, University of Chicago Press, vol. 26(1), pages 195-230.
    15. Ayadi, Rym & Arbak, Emrah & Ben-Naceur, Sami & De Groen, Willem Pieter, 2013. "Determinants of Financial Development across the Mediterranean," CEPS Papers 7770, Centre for European Policy Studies.
    16. Anne-Laure Delatte & Adrien Matray & Noémie Pinardon-Touati, 2020. "Private Credit Under Political Influence: Evidence from France," Working Papers 2020-56, Princeton University, Economics Department.
    17. Ghulam Shabbir & Mumtaz Anwar & Shahid Adil, 2016. "Corruption, Political Stability and Economic Growth," The Pakistan Development Review, Pakistan Institute of Development Economics, vol. 55(4), pages 689-702.
    18. Hjalmarsson, Randi & Machin, Stephen & Pinotti, Paolo, 2024. "Crime and the labor market," Handbook of Labor Economics, Elsevier.
    19. Marcela Eslava & Xavier Freixas, 2021. "Public Development Banks and Credit Market Imperfections," Journal of Money, Credit and Banking, Blackwell Publishing, vol. 53(5), pages 1121-1149, August.
    20. Liu, Li & Liu, Qigui & Tian, Gary & Wang, Peipei, 2018. "Government connections and the persistence of profitability: Evidence from Chinese listed firms," Emerging Markets Review, Elsevier, vol. 36(C), pages 110-129.
    21. Yu-Hong Ai & Di-Yun Peng & Huan-Huan Xiong, 2021. "Impact of Environmental Regulation Intensity on Green Technology Innovation: From the Perspective of Political and Business Connections," Sustainability, MDPI, vol. 13(9), pages 1-23, April.
