IDEAS home Printed from https://ideas.repec.org/a/taf/amstat/v76y2022i4p384-393.html

A Practical Approach to Proper Inference with Linked Data

Authors

Listed:
  • Andee Kaplan
  • Brenda Betancourt
  • Rebecca C. Steorts

Abstract

Entity resolution (ER), comprising record linkage and deduplication, is the process of merging noisy databases in the absence of unique identifiers to remove duplicate entities. One major challenge of analysis with linked data is identifying a representative record among determined matches to pass to an inferential or predictive task, referred to as the downstream task. Additionally, incorporating uncertainty from ER in the downstream task is critical to ensure proper inference. To bridge the gap between ER and the downstream task in an analysis pipeline, we propose five methods to choose a representative (or canonical) record from linked data, referred to as canonicalization. Our methods are scalable in the number of records, appropriate in general data scenarios, and provide natural error propagation via a Bayesian canonicalization stage. The proposed methodology is evaluated on three simulated datasets and one application – determining the relationship between demographic information and party affiliation in voter registration data from the North Carolina State Board of Elections. We first perform Bayesian ER and evaluate our proposed methods for canonicalization before considering the downstream tasks of linear and logistic regression. Bayesian canonicalization methods are empirically shown to improve downstream inference in both settings through prediction and coverage.
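The canonicalization step described in the abstract can be pictured with a small sketch. The abstract does not name the five proposed methods, so the two strategies below (a random pick and a field-wise composite) are illustrative assumptions and not the authors' implementation. The function names, the toy records, and the single fixed set of cluster labels are all hypothetical. A Bayesian treatment would repeat such a step over posterior samples of the linkage rather than one point estimate.

```python
import random
from collections import Counter, defaultdict
from statistics import mean


def group_by_cluster(records, labels):
    """Group records by their (assumed known) entity label."""
    clusters = defaultdict(list)
    for rec, lab in zip(records, labels):
        clusters[lab].append(rec)
    return clusters


def canonicalize_random(records, labels, seed=0):
    """Pick one record uniformly at random from each cluster."""
    rng = random.Random(seed)
    clusters = group_by_cluster(records, labels)
    return {lab: rng.choice(recs) for lab, recs in clusters.items()}


def canonicalize_composite(records, labels):
    """Build one synthetic record per cluster: the mean of numeric
    fields and the most common value of all other fields."""
    clusters = group_by_cluster(records, labels)
    canonical = {}
    for lab, recs in clusters.items():
        rep = {}
        for field in recs[0]:
            values = [r[field] for r in recs]
            if all(isinstance(v, (int, float)) for v in values):
                rep[field] = mean(values)
            else:
                rep[field] = Counter(values).most_common(1)[0][0]
        canonical[lab] = rep
    return canonical


# Hypothetical toy data: three noisy copies of one voter, plus one singleton.
records = [
    {"name": "Ana Diaz", "age": 41, "party": "DEM"},
    {"name": "Ana Diaz", "age": 42, "party": "DEM"},
    {"name": "Anna Diaz", "age": 43, "party": "REP"},
    {"name": "Bo Chen", "age": 30, "party": "UNA"},
]
labels = [0, 0, 0, 1]  # assumed output of an entity resolution step

composite = canonicalize_composite(records, labels)
picked = canonicalize_random(records, labels)
```

Either dictionary holds one record per resolved entity, which is the form a downstream regression would consume. Because this sketch conditions on a single linkage, it does not propagate ER uncertainty, which the abstract identifies as critical for proper inference.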

Suggested Citation

  • Andee Kaplan & Brenda Betancourt & Rebecca C. Steorts, 2022. "A Practical Approach to Proper Inference with Linked Data," The American Statistician, Taylor & Francis Journals, vol. 76(4), pages 384-393, October.
  • Handle: RePEc:taf:amstat:v:76:y:2022:i:4:p:384-393
    DOI: 10.1080/00031305.2022.2041482

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1080/00031305.2022.2041482
    Download Restriction: Access to full text is restricted to subscribers.
