
Estimating the success of re-identifications in incomplete datasets using generative models

Author

Listed:
  • Luc Rocher

    (Université catholique de Louvain
    Imperial College London)

  • Julien M. Hendrickx

    (Université catholique de Louvain)

  • Yves-Alexandre de Montjoye

    (Imperial College London)

Abstract

While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.
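To make the notion of "individual uniqueness" in the abstract concrete, here is a minimal sketch in Python. It is not the authors' copula-based model. It only illustrates the quantity being estimated under a simplifying assumption: records are drawn independently from one distribution, and the frequency p of a given record is already known. The function names and the toy data are hypothetical.

```python
from collections import Counter


def uniqueness_probability(p: float, n: int) -> float:
    """Probability that a record with population frequency p is unique
    among n individuals drawn independently from the same distribution
    (assumption: i.i.d. draws; none of the other n - 1 people match)."""
    return (1.0 - p) ** (n - 1)


def empirical_uniqueness(records) -> float:
    """Fraction of records whose attribute tuple appears exactly once."""
    counts = Counter(records)
    return sum(1 for r in records if counts[r] == 1) / len(records)


# Hypothetical toy data: (ZIP code, birth year, gender)
sample = [
    ("02139", 1985, "F"),
    ("02139", 1985, "F"),
    ("02139", 1990, "M"),
    ("10001", 1972, "F"),
]

print(empirical_uniqueness(sample))             # share of unique records in the sample
print(uniqueness_probability(1e-9, 1_000_000))  # a rare record in a large population
```

A record that is unique in a sample is not necessarily unique in the full population. That gap is why a model of the population distribution is needed to estimate p for incomplete datasets.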

Suggested Citation

  • Luc Rocher & Julien M. Hendrickx & Yves-Alexandre de Montjoye, 2019. "Estimating the success of re-identifications in incomplete datasets using generative models," Nature Communications, Nature, vol. 10(1), pages 1-9, December.
  • Handle: RePEc:nat:natcom:v:10:y:2019:i:1:d:10.1038_s41467-019-10933-3
    DOI: 10.1038/s41467-019-10933-3

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-019-10933-3
    File Function: Abstract
    Download Restriction: no


    Citations

    Cited by:

    1. Tesary Lin & Sanjog Misra, 2022. "Frontiers: The Identity Fragmentation Bias," Marketing Science, INFORMS, vol. 41(3), pages 433-440, May.
    2. Ron S. Jarmin & John M. Abowd & Robert Ashmead & Ryan Cumings-Menon & Nathan Goldschlag & Michael B. Hawes & Sallie Ann Keller & Daniel Kifer & Philip Leclerc & Jerome P. Reiter & Rolando A. Rodrígue, 2023. "An in-depth examination of requirements for disclosure risk assessment," Proceedings of the National Academy of Sciences, Proceedings of the National Academy of Sciences, vol. 120(43), pages 2220558120-, October.
    3. Ratul Das Chaudhury & Chongwoo Choe, 2023. "Digital Privacy: GDPR and Its Lessons for Australia," Australian Economic Review, The University of Melbourne, Melbourne Institute of Applied Economic and Social Research, vol. 56(2), pages 204-220, June.
    4. John R. J. Thompson & Longlong Feng & R. Mark Reesor & Chuck Grace, 2021. "Know Your Clients’ Behaviours: A Cluster Analysis of Financial Transactions," JRFM, MDPI, vol. 14(2), pages 1-29, January.
    5. Miren Gutierrez & John Bryant, 2022. "The Fading Gloss of Data Science: Towards an Agenda that Faces the Challenges of Big Data for Development and Humanitarian Action," Development, Palgrave Macmillan;Society for International Development, vol. 65(1), pages 80-93, March.
    6. Till Koebe & Alejandra Arias-Salazar & Timo Schmid, 2023. "Releasing survey microdata with exact cluster locations and additional privacy safeguards," Palgrave Communications, Palgrave Macmillan, vol. 10(1), pages 1-13, December.
    7. James Steele & Matthew Wade & Robert J. Copeland & Stuart Stokes & Rachel Stokes & Steven Mann, 2021. "The National ReferAll Database: An Open Dataset of Exercise Referral Schemes Across the UK," IJERPH, MDPI, vol. 18(9), pages 1-17, April.
    8. Heng Xu & Nan Zhang, 2022. "Implications of Data Anonymization on the Statistical Evidence of Disparity," Management Science, INFORMS, vol. 68(4), pages 2600-2618, April.
    9. Atabey, Ayça & Pothong, Kruakae & Livingstone, Sonia, 2023. "Glossary of terms relating to children’s digital lives," LSE Research Online Documents on Economics 119728, London School of Economics and Political Science, LSE Library.
    10. Carlo Giacomo Leo & Maria Rosaria Tumolo & Saverio Sabina & Riccardo Colella & Virginia Recchia & Giuseppe Ponzini & Dimitrios Ioannis Fotiadis & Antonella Bodini & Pierpaolo Mincarone, 2022. "Health Technology Assessment for In Silico Medicine: Social, Ethical and Legal Aspects," IJERPH, MDPI, vol. 19(3), pages 1-13, January.
    11. German Data Forum RatSWD (ed.), 2020. "Data collection using new information technology," RatSWD Output Series, German Data Forum (RatSWD), volume 6, number 6-6en.
    12. Se-Ra Oh & Young-Duk Seo & Euijong Lee & Young-Gab Kim, 2021. "A Comprehensive Survey on Security and Privacy for Electronic Health Data," IJERPH, MDPI, vol. 18(18), pages 1-48, September.
    13. Jeongwook Lee & Joon Jin Song & Yongku Kim & Jung In Seo, 2020. "Estimation and Prediction of Record Values Using Pivotal Quantities and Copulas," Mathematics, MDPI, vol. 8(10), pages 1-16, October.
    14. Anastasia Roukouni & Gonçalo Homem de Almeida Correia, 2020. "Evaluation Methods for the Impacts of Shared Mobility: Classification and Critical Review," Sustainability, MDPI, vol. 12(24), pages 1-22, December.
    15. Rehse, Dominik & Tremöhlen, Felix, 2020. "Fostering participation in digital public health interventions: The case of digital contact tracing," ZEW Discussion Papers 20-076, ZEW - Leibniz Centre for European Economic Research.
    16. Sevgi Arca & Rattikorn Hewett, 2021. "Analytics on Anonymity for Privacy Retention in Smart Health Data," Future Internet, MDPI, vol. 13(11), pages 1-20, October.
