
A Novel Suite of Methods for Mixture Based Record Linkage

Authors

  • Diego Zardetto
  • Monica Scannapieco

    (Italian National Institute of Statistics)

Abstract

Record Linkage (RL) aims at identifying pairs of records that come from different sources and represent the same real-world object. Although several methods have been proposed to tackle RL problems, none of them seems to be both fully automated and very effective. In this paper we present a novel suite of methods that possesses both of these properties. We adopt a mixture-model based approach, which structures an RL process into two consecutive tasks. First, mixture parameters are estimated by fitting the model to observed distance measures between pairs. Then, a probabilistic clustering of the pairs into Matches and Unmatches is obtained by exploiting the fitted model. In particular, we use a mixture model with component densities belonging to the Beta parametric family and we fit it by means of an original perturbation-like technique. Moreover, we solve the clustering problem according to both Maximum Likelihood and Minimum Cost objectives. To accomplish this task, optimal decision rules fulfilling one-to-one matching constraints are searched for by a purposefully designed evolutionary algorithm. We present several experiments on real data that validate our methods and show their excellent effectiveness.
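
The two-task structure described in the abstract (fit a two-component Beta mixture to pair distances, then cluster pairs by their posterior probability) can be illustrated with a minimal Python sketch. It is not the authors' method. The fitting step here is a plain EM loop with a weighted method-of-moments update, swapped in for the paper's perturbation-like technique, and the decision step is a simple 0.5 posterior threshold with no one-to-one matching constraint and no evolutionary search. All function names, starting values, and the synthetic data are assumptions made for illustration, and distances are assumed to be scaled to [0, 1] with Matches concentrated near 0.

```python
import math
import random


def _clip(d, eps=1e-6):
    """Keep a distance strictly inside (0, 1) so logarithms stay finite."""
    return min(max(d, eps), 1 - eps)


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), computed in log space."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def weighted_beta_moments(xs, weights):
    """Method-of-moments Beta parameters from weighted observations."""
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, xs)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, xs)) / total
    # Keep the variance inside the range a Beta distribution can have.
    var = min(max(var, 1e-6), 0.999 * mean * (1 - mean))
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


def posterior_match(distances, weight, match, unmatch):
    """Posterior probability that each pair is a Match under the mixture."""
    out = []
    for d in distances:
        x = _clip(d)
        pm = weight * beta_pdf(x, *match)
        pu = (1 - weight) * beta_pdf(x, *unmatch)
        out.append(pm / (pm + pu) if pm + pu > 0 else 0.5)
    return out


def fit_beta_mixture(distances, n_iter=200):
    """Approximate EM fit; returns (match weight, match params, unmatch params)."""
    xs = [_clip(d) for d in distances]
    weight = 0.5
    # Assumed starting values: Matches near 0, Unmatches near 1.
    match, unmatch = (2.0, 5.0), (5.0, 2.0)
    for _ in range(n_iter):
        resp = posterior_match(xs, weight, match, unmatch)  # E-step
        if not 1e-9 < sum(resp) < len(xs) - 1e-9:
            break  # one component has emptied out; stop updating
        weight = sum(resp) / len(xs)  # M-step (moment matching, not exact ML)
        match = weighted_beta_moments(xs, resp)
        unmatch = weighted_beta_moments(xs, [1 - r for r in resp])
    return weight, match, unmatch


if __name__ == "__main__":
    # Synthetic distances, for illustration only (not data from the paper).
    random.seed(0)
    data = [random.betavariate(2, 8) for _ in range(300)]
    data += [random.betavariate(8, 2) for _ in range(700)]
    weight, match, unmatch = fit_beta_mixture(data)
    labels = [p > 0.5 for p in posterior_match(data, weight, match, unmatch)]
    print(weight, match, unmatch, sum(labels))
```

Thresholding the posterior at 0.5 treats every pair independently, so a record may be assigned to several partners. Enforcing the one-to-one constraint mentioned in the abstract would require an additional global optimization step over the candidate pairs.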

Suggested Citation

  • Diego Zardetto & Monica Scannapieco, 2010. "A Novel Suite of Methods for Mixture Based Record Linkage," Rivista di statistica ufficiale, ISTAT - Italian National Institute of Statistics - (Rome, ITALY), vol. 12(2-3), pages 31-58, October.
  • Handle: RePEc:isa:journl:v:12:y:2010:i:2-3:p:31-58

    Download full text from publisher

    File URL: http://www.istat.it/it/files/2011/09/2-3_2010_2.pdf
    Download Restriction: no

    More about this item

    Keywords

    Record linkage; Mixture parameters

    JEL classification:

    • C81 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - Methodology for Collecting, Estimating, and Organizing Microeconomic Data; Data Access
    • C89 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - Other
