Printed from https://ideas.repec.org/p/boc/scon18/31.html

dtalink: Faster probabilistic record linking and deduplication methods in Stata for large data files

Author

  • Keith Kranker (Mathematica Policy Research)

Abstract

Stata users often need to link records from two or more data files, or find duplicates within data files. Probabilistic linking methods are often used when the file(s) do not have reliable or unique identifiers, causing deterministic linking methods (such as Stata's merge or duplicates commands) to fail. For example, one might need to link files that only include inconsistently spelled names, dates of birth with typos or missing data, and addresses that change over time. Probabilistic linkage methods score each potential pair of records on the probability that the two records match, so that pairs with higher overall scores indicate a better match than pairs with lower scores. Two user-written Stata commands for probabilistic linking exist (reclink and reclink2), but they do not scale efficiently. dtalink is a new program that offers streamlined probabilistic linking methods implemented in parallelized Mata code. Significant speed improvements make it practical to implement probabilistic linking methods on large administrative data files (files with many rows or matching variables), and new features offer more flexible scoring and many-to-many matching techniques. The presentation introduces dtalink, discusses useful tips and tricks, and provides an example of linking Medicaid and birth certificate data.
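The pair-scoring idea described in the abstract can be illustrated with a minimal Python sketch. This is not dtalink's implementation or syntax. The field names, the weights, the cutoff, and the helper names `score_pair` and `link` are all hypothetical, chosen only to show how per-field agreement and disagreement weights might add up to an overall score for each candidate pair.

```python
from itertools import product

# Hypothetical weights (assumption, not from dtalink): points added when a
# field agrees, points subtracted when it disagrees.
WEIGHTS = {
    "last_name": (7.0, -4.0),
    "birth_date": (9.0, -6.0),
    "zip_code": (3.0, -1.0),
}


def score_pair(a, b, weights=WEIGHTS):
    """Sum agreement/disagreement weights over the matching fields.

    A field that is missing (None) in either record contributes nothing.
    """
    total = 0.0
    for field, (agree, disagree) in weights.items():
        x, y = a.get(field), b.get(field)
        if x is None or y is None:
            continue
        total += agree if x == y else disagree
    return total


def link(file_a, file_b, cutoff, weights=WEIGHTS):
    """Score every cross-file pair and keep pairs scoring at or above cutoff.

    Returns (score, index_in_a, index_in_b) tuples, best score first.
    """
    pairs = []
    for (i, a), (j, b) in product(enumerate(file_a), enumerate(file_b)):
        s = score_pair(a, b, weights)
        if s >= cutoff:
            pairs.append((s, i, j))
    return sorted(pairs, key=lambda t: -t[0])


# Made-up example records with a changed address, a spelling variant,
# and a missing date of birth.
file_a = [
    {"last_name": "garcia", "birth_date": "1980-01-02", "zip_code": "02139"},
    {"last_name": "smith", "birth_date": None, "zip_code": "60601"},
]
file_b = [
    {"last_name": "garcia", "birth_date": "1980-01-02", "zip_code": "02140"},
    {"last_name": "smyth", "birth_date": "1975-05-05", "zip_code": "60601"},
]

matches = link(file_a, file_b, cutoff=10)
```

The sketch compares every record in one file with every record in the other, so the number of comparisons grows with the product of the file sizes. It also uses exact equality, so a spelling variant such as "smith" versus "smyth" counts as a full disagreement.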

Suggested Citation

  • Keith Kranker, 2018. "dtalink: Faster probabilistic record linking and deduplication methods in Stata for large data files," 2018 Stata Conference 31, Stata Users Group.
  • Handle: RePEc:boc:scon18:31

    Download full text from publisher

    File URL: http://fmwww.bc.edu/repec/scon2018/columbus18_Kranker.pdf
    Download Restriction: no

    Citations

    Cited by:

    1. Downes, Henry & Phillips, David C. & Sullivan, James X., 2022. "The effect of emergency financial assistance on healthcare use," Journal of Public Economics, Elsevier, vol. 208(C).
