Improving Probabilistic Record Linkage Using Statistical Prediction Models

Authors
  • Angelo Moretti
  • Natalie Shlomo

Abstract

Record linkage brings together information from records in two or more data sources that are believed to belong to the same statistical unit based on a common set of matching variables. Matching variables, however, can appear with errors and variations and the challenge is to link statistical units that are subject to error. We provide an overview of record linkage techniques and specifically investigate the classic Fellegi and Sunter probabilistic record linkage framework to assess whether the decision rule for classifying pairs into sets of matches and non‐matches can be improved by incorporating a statistical prediction model. We also study whether the enhanced linkage rule can provide better results in terms of preserving associations between variables in the linked data file that are not used in the matching procedure. A simulation study and an application based on real data are used to evaluate the methods.
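The classic Fellegi and Sunter decision rule mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' enhanced rule and does not include their prediction model. The m- and u-probabilities, the field names, and the two thresholds below are all hypothetical values chosen for illustration; in practice the probabilities are usually estimated from the data and the thresholds are set from tolerated error rates.

```python
import math

# Hypothetical parameters (assumptions, not taken from the article):
# m = P(field agrees | pair is a true match)
# u = P(field agrees | pair is a true non-match)
M_PROBS = {"surname": 0.95, "birth_year": 0.90, "postcode": 0.85}
U_PROBS = {"surname": 0.01, "birth_year": 0.05, "postcode": 0.02}


def compare(rec_a, rec_b):
    """Return an agreement pattern: True where the two records agree exactly."""
    return {field: rec_a[field] == rec_b[field] for field in M_PROBS}


def match_weight(agreement):
    """Sum the log2 likelihood ratios over all matching variables."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_PROBS[field], U_PROBS[field]
        if agrees:
            total += math.log2(m / u)               # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))   # disagreement weight
    return total


def classify(weight, upper=5.0, lower=0.0):
    """Three-way rule with hypothetical thresholds `upper` and `lower`."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"  # typically sent to clerical review


a = {"surname": "Rossi", "birth_year": 1980, "postcode": "M13"}
b = {"surname": "Rossi", "birth_year": 1981, "postcode": "M14"}
print(classify(match_weight(compare(a, b))))
```

Pairs whose weight falls between the two thresholds are the ones a prediction model could in principle help to resolve, although how the article does this is not described on this page.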

Suggested Citation

  • Angelo Moretti & Natalie Shlomo, 2023. "Improving Probabilistic Record Linkage Using Statistical Prediction Models," International Statistical Review, International Statistical Institute, vol. 91(3), pages 368-394, December.
  • Handle: RePEc:bla:istatr:v:91:y:2023:i:3:p:368-394
    DOI: 10.1111/insr.12535

    Download full text from publisher

    File URL: https://doi.org/10.1111/insr.12535
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Michael S. Rendall & Bonnie Ghosh-Dastidar & Margaret M. Weden & Zafar Nazarov, 2011. "Multiple Imputation for Combined-Survey Estimation With Incomplete Regressors In One But Not Both Surveys," Working Papers WR-887-1, RAND Corporation.
    2. Okay Gunes, 2017. "Analysis of Households' Decision Using Full Demand Elasticity Estimates: an Estimation on Turkish Data," Université Paris1 Panthéon-Sorbonne (Post-Print and Working Papers) halshs-01491970, HAL.
    3. Ahfock, Daniel & Pyne, Saumyadipta & Lee, Sharon X. & McLachlan, Geoffrey J., 2016. "Partial identification in the statistical matching problem," Computational Statistics & Data Analysis, Elsevier, vol. 104(C), pages 79-90.
    4. Clinton P. McCully, 2013. "Integration of Micro and Macro Data on Consumer Income and Expenditures," BEA Working Papers 0101, Bureau of Economic Analysis.
    5. Kiesl, Hans & Rässler, Susanne, 2006. "How valid can data fusion be?," IAB-Discussion Paper 200615, Institut für Arbeitsmarkt- und Berufsforschung (IAB), Nürnberg [Institute for Employment Research, Nuremberg, Germany].
    6. Vo, Thanh Huan & Chauvet, Guillaume & Happe, André & Oger, Emmanuel & Paquelet, Stéphane & Garès, Valérie, 2023. "Extending the Fellegi-Sunter record linkage model for mixed-type data with application to the French national health data system," Computational Statistics & Data Analysis, Elsevier, vol. 179(C).
    7. Okay Gunes, 2017. "Analysis of Households' Decision Using Full Demand Elasticity Estimates: an Estimation on Turkish Data," Post-Print halshs-01491970, HAL.
    8. Ray Chambers & Andrea Diniz da Silva, 2020. "Improved secondary analysis of linked data: a framework and an illustration," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 183(1), pages 37-59, January.
    9. Michael S. Rendall & Bonnie Ghosh-Dastidar & Margaret M. Weden & Elizabeth H. Baker & Zafar Nazarov, 2013. "Multiple Imputation for Combined-survey Estimation With Incomplete Regressors in One but Not Both Surveys," Sociological Methods & Research, vol. 42(4), pages 483-530, November.
    10. Claramunt González, Juan & van Delden, Arnout & de Waal, Ton, 2023. "Assessment of the effect of constraints in a new multivariate mixed method for statistical matching," Computational Statistics & Data Analysis, Elsevier, vol. 177(C).
    11. Li‐Chun Zhang & Tiziana Tuoto, 2021. "Linkage‐data linear regression," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 184(2), pages 522-547, April.
    12. Okay Gunes, 2017. "Analysis of Households' Decision Using Full Demand Elasticity Estimates: an Estimation on Turkish Data," Documents de travail du Centre d'Economie de la Sorbonne 17017, Université Panthéon-Sorbonne (Paris 1), Centre d'Economie de la Sorbonne.
    13. François Gardes, 2021. "On the value of time and human life," Documents de travail du Centre d'Economie de la Sorbonne 21023, Université Panthéon-Sorbonne (Paris 1), Centre d'Economie de la Sorbonne.
    14. François Gardes, 2021. "A Solution to the Estimation of an Enlarged GDP Including Domestic Production: An Estimation on Micro Data," Post-Print halshs-03325362, HAL.
    15. Joost Ginkel & Pieter Kroonenberg, 2014. "Using Generalized Procrustes Analysis for Multiple Imputation in Principal Component Analysis," Journal of Classification, Springer;The Classification Society, vol. 31(2), pages 242-269, July.
    16. Salmani, Yasamin & Partovi, Fariborz Y., 2021. "Channel-level resource allocation decision in multichannel retailing: A U.S. multichannel company application," Journal of Retailing and Consumer Services, Elsevier, vol. 63(C).
    17. Peter van de Ven & Anne Harrison & Barbara Fraumeni & Dennis Fixler & David Johnson & Andrew Craig & Kevin Furlong, 2017. "A Consistent Data Series to Evaluate Growth and Inequality in the National Accounts," Review of Income and Wealth, International Association for Research in Income and Wealth, vol. 63, pages 437-459, December.
    18. Van den Poel, Dirk & Lariviere, Bart, 2004. "Customer attrition analysis for financial services using proportional hazard models," European Journal of Operational Research, Elsevier, vol. 157(1), pages 196-217, August.
    19. Norah Alyabs & Sy Han Chiou, 2022. "The Missing Indicator Approach for Accelerated Failure Time Model with Covariates Subject to Limits of Detection," Stats, MDPI, vol. 5(2), pages 1-13, May.
    20. Chenyang Gu & Roee Gutman, 2017. "Combining item response theory with multiple imputation to equate health assessment questionnaires," Biometrics, The International Biometric Society, vol. 73(3), pages 990-998, September.
