
An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets

Authors

  • Drechsler, Jörg
  • Reiter, Jerome P.

Abstract

When intense redaction is needed to protect the confidentiality of data subjects' identities and sensitive attributes, statistical agencies can use synthetic data approaches. To create synthetic data, the agency replaces identifying or sensitive values with draws from statistical models estimated from the confidential data. Many agencies are reluctant to implement this idea because (i) the quality of the generated data depends strongly on the quality of the underlying models, and (ii) developing effective synthesis models can be a labor-intensive and difficult task. Recently, there have been suggestions that agencies use nonparametric methods from the machine learning literature to generate synthetic data. These methods can estimate non-linear relationships that might otherwise be missed and can be run with minimal tuning, thus considerably reducing burdens on the agency. Four synthesizers based on machine learning algorithms (classification and regression trees, bagging, random forests, and support vector machines) are evaluated in terms of their potential to preserve analytical validity while reducing disclosure risks. The evaluation is based on a repeated sampling simulation with a subset of the 2002 Uganda census public use sample data. The simulation suggests that synthesizers based on regression trees can result in synthetic datasets that provide reliable estimates and low disclosure risks, and that these synthesizers can be implemented easily by statistical agencies.
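
The abstract does not spell out how a tree-based synthesizer works, so the following is only a minimal sketch of one commonly described form of the idea, not the authors' implementation. It assumes a single sensitive numeric variable is replaced: a regression tree is grown on the confidential data, and each record's sensitive value is replaced by a random draw from the observed values in that record's leaf. All names (`grow_tree`, `find_leaf`, `synthesize`, `min_leaf`, the toy age and income data) are hypothetical and chosen for illustration.

```python
import random
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the mean; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def grow_tree(rows, targets, min_leaf=5):
    """Grow a regression tree by greedy SSE splits; leaves store observed targets.

    Assumes `rows` is a non-empty list of equal-length numeric tuples.
    """
    best = None  # (cost, feature index, threshold)
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[j] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[j] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue  # keep at least min_leaf donor values per leaf
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, threshold)
    # Stop when no admissible split reduces the within-node SSE.
    if best is None or best[0] >= sse(targets):
        return {"leaf": list(targets)}
    _, j, threshold = best
    go_left = [r[j] <= threshold for r in rows]
    return {
        "feature": j,
        "threshold": threshold,
        "left": grow_tree([r for r, g in zip(rows, go_left) if g],
                          [t for t, g in zip(targets, go_left) if g], min_leaf),
        "right": grow_tree([r for r, g in zip(rows, go_left) if not g],
                           [t for t, g in zip(targets, go_left) if not g], min_leaf),
    }


def find_leaf(tree, row):
    """Walk the tree and return the observed target values in the row's leaf."""
    while "leaf" not in tree:
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]


def synthesize(rows, targets, min_leaf=5, seed=0):
    """Replace each target with a random draw from the donors in its leaf."""
    rng = random.Random(seed)
    tree = grow_tree(rows, targets, min_leaf)
    return [rng.choice(find_leaf(tree, row)) for row in rows]


# Hypothetical toy data: income jumps at age 40, with small within-group variation.
toy_rows = [(age,) for age in range(20, 60)]
toy_income = [(100 if age < 40 else 300) + age % 5 for (age,) in toy_rows]
synthetic_income = synthesize(toy_rows, toy_income, min_leaf=5, seed=0)
print(list(zip(toy_rows[:3], synthetic_income[:3])))
```

In this sketch the `min_leaf` parameter is the only tuning knob: larger leaves mean each synthetic value is drawn from more donors, which would be expected to lower disclosure risk at some cost in fidelity. A real synthesizer would also need to handle categorical variables and the sequential synthesis of several variables, which this example leaves out.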

Suggested Citation

  • Drechsler, Jörg & Reiter, Jerome P., 2011. "An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets," Computational Statistics & Data Analysis, Elsevier, vol. 55(12), pages 3232-3243, December.
  • Handle: RePEc:eee:csdana:v:55:y:2011:i:12:p:3232-3243

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167947311002076
    Download Restriction: Full text for ScienceDirect subscribers only.

    References listed on IDEAS

    1. Woodcock, Simon D. & Benedetto, Gary, 2009. "Distribution-preserving statistical disclosure limitation," Computational Statistics & Data Analysis, Elsevier, vol. 53(12), pages 4228-4242, October.
    2. Skinner, Chris J. & Shlomo, Natalie, 2008. "Assessing identification risk in survey microdata using log-linear models," LSE Research Online Documents on Economics 39112, London School of Economics and Political Science, LSE Library.
    3. Shim, Jooyong & Sohn, Insuk & Kim, Sujong & Lee, Jae Won & Green, Paul E. & Hwang, Changha, 2009. "Selecting marker genes for cancer classification using supervised weighted kernel clustering and the support vector machine," Computational Statistics & Data Analysis, Elsevier, vol. 53(5), pages 1736-1742, March.
    4. Jörg Drechsler, 2012. "New data dissemination approaches in old Europe -- synthetic datasets for a German establishment survey," Journal of Applied Statistics, Taylor & Francis Journals, vol. 39(2), pages 243-265, April.
    5. Iacus, Stefano M. & Porro, Giuseppe, 2007. "Missing data imputation, matching and other applications of random recursive partitioning," Computational Statistics & Data Analysis, Elsevier, vol. 52(2), pages 773-789, October.
    6. Reiter, Jerome P. & Raghunathan, Trivellore E., 2007. "The Multiple Adaptations of Multiple Imputation," Journal of the American Statistical Association, American Statistical Association, vol. 102, pages 1462-1471, December.
    7. Jerome P. Reiter, 2005. "Releasing multiply imputed, synthetic public use microdata: an illustration and empirical study," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 168(1), pages 185-205, January.
    8. Drechsler, Jörg & Reiter, Jerome P., 2010. "Sampling With Synthesis: A New Approach for Releasing Public Use Census Microdata," Journal of the American Statistical Association, American Statistical Association, vol. 105(492), pages 1347-1357.
    9. Drechsler, Jörg & Dundler, Agnes & Bender, Stefan & Rässler, Susanne & Zwick, Thomas, 2007. "A new approach for disclosure control in the IAB Establishment Panel : multiple imputation for a better data access," IAB-Discussion Paper 200711, Institut für Arbeitsmarkt- und Berufsforschung (IAB), Nürnberg [Institute for Employment Research, Nuremberg, Germany].
    10. Choi, Hosik & Yeo, Donghwa & Kwon, Sunghoon & Kim, Yongdai, 2011. "Gene selection and prediction for cancer classification using support vector machines with a reject option," Computational Statistics & Data Analysis, Elsevier, vol. 55(5), pages 1897-1908, May.
    11. Skinner, Chris & Shlomo, Natalie, 2008. "Assessing Identification Risk in Survey Microdata Using Log-Linear Models," Journal of the American Statistical Association, American Statistical Association, vol. 103(483), pages 989-1001.
    12. John M. Abowd & Simon D. Woodcock, 2004. "Multiply-Imputing Confidential Characteristics and File Links in Longitudinal Linked Data," Longitudinal Employer-Household Dynamics Technical Papers 2004-04, Center for Economic Studies, U.S. Census Bureau.

    Citations

    Citations are extracted by the CitEc Project.

    Cited by:

    1. Jörg Drechsler, 2015. "Multiple Imputation of Multilevel Missing Data—Rigor Versus Simplicity," Journal of Educational and Behavioral Statistics, vol. 40(1), pages 69-95, February.
    2. Joshua Snoke & Gillian M. Raab & Beata Nowok & Chris Dibben & Aleksandra Slavkovic, 2018. "General and specific utility measures for synthetic data," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(3), pages 663-688, June.
    3. Stefan Wimmer & Robert Finger, 2023. "A note on synthetic data for replication purposes in agricultural economics," Journal of Agricultural Economics, Wiley Blackwell, vol. 74(1), pages 316-323, February.
    4. James Jackson & Robin Mitra & Brian Francis & Iain Dove, 2022. "Using saturated count models for user‐friendly synthesis of large confidential administrative databases," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(4), pages 1613-1643, October.
    5. Jordan C. Stanley & Evan S. Totty, 2024. "Synthetic Data and Social Science Research: Accuracy Assessments and Practical Considerations from the SIPP Synthetic Beta," NBER Chapters, in: Data Privacy Protection and the Conduct of Applied Research: Methods, Approaches and their Consequences, National Bureau of Economic Research, Inc.
    6. Hang J. Kim & Jerome P. Reiter & Alan F. Karr, 2018. "Simultaneous edit-imputation and disclosure limitation for business establishment data," Journal of Applied Statistics, Taylor & Francis Journals, vol. 45(1), pages 63-82, January.
    7. Daniel Manrique‐Vallier & Jingchen Hu, 2018. "Bayesian non‐parametric generation of fully synthetic multivariate categorical data in the presence of structural zeros," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(3), pages 635-647, June.
    8. Nowok, Beata & Raab, Gillian M. & Dibben, Chris, 2016. "synthpop: Bespoke Creation of Synthetic Data in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 74(i11).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Hang J. Kim & Jerome P. Reiter & Alan F. Karr, 2018. "Simultaneous edit-imputation and disclosure limitation for business establishment data," Journal of Applied Statistics, Taylor & Francis Journals, vol. 45(1), pages 63-82, January.
    2. Andrés F. Barrientos & Alexander Bolton & Tom Balmat & Jerome P. Reiter & John M. de Figueiredo & Ashwin Machanavajjhala & Yan Chen & Charles Kneifel & Mark DeLong, 2017. "A Framework for Sharing Confidential Research Data, Applied to Investigating Differential Pay by Race in the U. S. Government," NBER Working Papers 23534, National Bureau of Economic Research, Inc.
    3. Loong Bronwyn & Rubin Donald B., 2017. "Multiply-Imputed Synthetic Data: Advice to the Imputer," Journal of Official Statistics, Sciendo, vol. 33(4), pages 1005-1019, December.
    4. Jörg Drechsler, 2015. "Multiple Imputation of Multilevel Missing Data—Rigor Versus Simplicity," Journal of Educational and Behavioral Statistics, vol. 40(1), pages 69-95, February.
    5. Jerome P. Reiter, 2009. "Using Multiple Imputation to Integrate and Disseminate Confidential Microdata," International Statistical Review, International Statistical Institute, vol. 77(2), pages 179-195, August.
    6. Klein Martin & Sinha Bimal, 2013. "Statistical Analysis of Noise-Multiplied Data Using Multiple Imputation," Journal of Official Statistics, Sciendo, vol. 29(3), pages 425-465, June.
    7. Woodcock, Simon D. & Benedetto, Gary, 2009. "Distribution-preserving statistical disclosure limitation," Computational Statistics & Data Analysis, Elsevier, vol. 53(12), pages 4228-4242, October.
    8. Gerd Ronning, 2014. "Vertraulichkeit und Verfügbarkeit von Mikrodaten," IAW Discussion Papers 101, Institut für Angewandte Wirtschaftsforschung (IAW).
    9. James Jackson & Robin Mitra & Brian Francis & Iain Dove, 2022. "Using saturated count models for user‐friendly synthesis of large confidential administrative databases," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(4), pages 1613-1643, October.
    10. Humera Razzak & Christian Heumann, 2019. "Hybrid Multiple Imputation In A Large Scale Complex Survey," Statistics in Transition New Series, Polish Statistical Association, vol. 20(4), pages 33-58, December.
    11. Razzak Humera & Heumann Christian, 2019. "Hybrid Multiple Imputation In A Large Scale Complex Survey," Statistics in Transition New Series, Statistics Poland, vol. 20(4), pages 33-58, December.
    12. Yi Qian & Hui Xie, 2013. "Drive More Effective Data-Based Innovations: Enhancing the Utility of Secure Databases," NBER Working Papers 19586, National Bureau of Economic Research, Inc.
    13. Joseph W. Sakshaug & Trivellore E. Raghunathan, 2014. "Generating synthetic data to produce public-use microdata for small geographic areas based on complex sample survey data with application to the National Health Interview Survey," Journal of Applied Statistics, Taylor & Francis Journals, vol. 41(10), pages 2103-2122, October.
    14. Christine M. O'Keefe & James O. Chipperfield, 2013. "A Summary of Attack Methods and Confidentiality Protection Measures for Fully Automated Remote Analysis Systems," International Statistical Review, International Statistical Institute, vol. 81(3), pages 426-455, December.
    15. Nowok, Beata & Raab, Gillian M. & Dibben, Chris, 2016. "synthpop: Bespoke Creation of Synthetic Data in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 74(i11).
    16. Javier Miranda & Lars Vilhuber, 2016. "Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics," Working Papers 16-10, Center for Economic Studies, U.S. Census Bureau.
    17. Prada, Sergio I & Gonzalez, Claudia & Borton, Joshua & Fernandes-Huessy, Johannes & Holden, Craig & Hair, Elizabeth & Mulcahy, Tim, 2011. "Avoiding disclosure of individually identifiable health information: a literature review," MPRA Paper 35463, University Library of Munich, Germany.
    18. Cinzia Carota & Maurizio Filippone & Silvia Polettini, 2022. "Assessing Bayesian Semi‐Parametric Log‐Linear Models: An Application to Disclosure Risk Estimation," International Statistical Review, International Statistical Institute, vol. 90(1), pages 165-183, April.
    19. Christine N. Kohnen & Jerome P. Reiter, 2009. "Multiple imputation for combining confidential data owned by two agencies," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 172(2), pages 511-528, April.
    20. Shlomo, Natalie & Skinner, Chris, 2022. "Measuring risk of re-identification in microdata: state-of-the art and new directions," LSE Research Online Documents on Economics 117168, London School of Economics and Political Science, LSE Library.
