Printed from https://ideas.repec.org/a/vrs/offsta/v36y2020i1p89-115n5.html

A Probabilistic Procedure for Anonymisation, for Assessing the Risk of Re-identification and for the Analysis of Perturbed Data Sets

Authors

  • Harvey Goldstein (Graduate School of Education, University of Bristol, Bristol, BS8 1JA, UK)

  • Natalie Shlomo (University of Manchester, Social Statistics, Humanities Bridgeford Street, Manchester, M13 9PL, UK)

Abstract

The requirement to anonymise data sets that are to be released for secondary analysis should be balanced by the need to allow their analysis to provide efficient and consistent parameter estimates. The proposal in this article is to integrate the process of anonymisation and data analysis. The first stage uses the addition of random noise with known distributional properties to some or all variables in a released (already pseudonymised) data set, in which the values of some identifying and sensitive variables for data subjects of interest are also available to an external ‘attacker’ who wishes to identify those data subjects in order to interrogate their records in the data set. The second stage of the analysis consists of specifying the model of interest so that parameter estimation accounts for the added noise. Where the characteristics of the noise are made available to the analyst by the data provider, we propose a new method that allows a valid analysis. This is formally a measurement error model and we describe a Bayesian MCMC algorithm that recovers consistent estimates of the true model parameters. A new method for handling categorical data is presented. The article shows how an appropriate noise distribution can be determined.
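The two stages described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' Bayesian MCMC algorithm: it swaps in a simple method-of-moments correction for a single continuous covariate, and it assumes Normal additive noise whose standard deviation is disclosed by the data provider. All function names, the simulated data, and the parameter values (`beta`, `noise_sd`, sample size) are hypothetical choices made for illustration only.

```python
import random


def cov(u, v):
    """Sample covariance of two equal-length sequences."""
    n = len(u)
    mu = sum(u) / n
    mv = sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)


def simulate(n=20000, beta=2.0, noise_sd=1.0, seed=0):
    """Stage 1 (illustrative): perturb a covariate with noise of known SD.

    Returns the released covariate w = x + noise and the outcome y.
    The true covariate x is never released.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]          # true values
    y = [beta * xi + rng.gauss(0.0, 0.5) for xi in x]    # outcome of interest
    w = [xi + rng.gauss(0.0, noise_sd) for xi in x]      # perturbed release
    return w, y


def naive_slope(w, y):
    """Ordinary least-squares slope that ignores the added noise."""
    return cov(w, y) / cov(w, w)


def corrected_slope(w, y, noise_sd):
    """Stage 2 (illustrative): subtract the known noise variance from var(w).

    This is a moment-based measurement-error correction, used here as a
    stand-in for a full measurement-error model fitted by MCMC.
    """
    return cov(w, y) / (cov(w, w) - noise_sd ** 2)


if __name__ == "__main__":
    w, y = simulate()
    print("naive slope:    ", naive_slope(w, y))
    print("corrected slope:", corrected_slope(w, y, noise_sd=1.0))
```

In this setup the true covariate and the noise both have unit variance, so the naive slope is attenuated towards about half of the true `beta`, while the corrected slope should land close to `beta` itself. The correction is only possible because the noise variance is known to the analyst, which is the condition the abstract places on a valid analysis.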

Suggested Citation

  • Goldstein Harvey & Shlomo Natalie, 2020. "A Probabilistic Procedure for Anonymisation, for Assessing the Risk of Re-identification and for the Analysis of Perturbed Data Sets," Journal of Official Statistics, Sciendo, vol. 36(1), pages 89-115, March.
  • Handle: RePEc:vrs:offsta:v:36:y:2020:i:1:p:89-115:n:5
    DOI: 10.2478/jos-2020-0005

    Download full text from publisher

    File URL: https://doi.org/10.2478/jos-2020-0005
    Download Restriction: no



