Printed from https://ideas.repec.org/a/eee/csdana/v55y2011i12p3232-3243.html

An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets

Authors

Listed:
  • Drechsler, Jörg
  • Reiter, Jerome P.

Abstract

When intense redaction is needed to protect the confidentiality of data subjects' identities and sensitive attributes, statistical agencies can use synthetic data approaches. To create synthetic data, the agency replaces identifying or sensitive values with draws from statistical models estimated from the confidential data. Many agencies are reluctant to implement this idea because (i) the quality of the generated data depends strongly on the quality of the underlying models, and (ii) developing effective synthesis models can be a labor-intensive and difficult task. Recently, there have been suggestions that agencies use nonparametric methods from the machine learning literature to generate synthetic data. These methods can estimate non-linear relationships that might otherwise be missed and can be run with minimal tuning, thus considerably reducing burdens on the agency. Four synthesizers based on machine learning algorithms (classification and regression trees, bagging, random forests, and support vector machines) are evaluated in terms of their potential to preserve analytical validity while reducing disclosure risks. The evaluation is based on a repeated sampling simulation with a subset of the 2002 Uganda census public use sample data. The simulation suggests that synthesizers based on regression trees can result in synthetic datasets that provide reliable estimates and low disclosure risks, and that these synthesizers can be implemented easily by statistical agencies.
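
The abstract does not spell out how a tree-based synthesizer operates, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a single numeric sensitive variable: a small regression tree is grown on the confidential records, and each record's sensitive value is replaced with a random draw from the observed values in the same leaf. The function names (`build_tree`, `leaf_values`, `synthesize`), the `min_leaf` parameter, and the toy `age`/`income` records are all hypothetical and chosen for illustration.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def build_tree(rows, target, predictors, min_leaf=5):
    """Grow a small regression tree; each leaf keeps its observed target values."""
    ys = [r[target] for r in rows]
    best = None  # (sse after split, predictor, threshold)
    for p in predictors:
        # Candidate thresholds: every distinct value except the largest.
        for t in sorted({r[p] for r in rows})[:-1]:
            left = [r[target] for r in rows if r[p] <= t]
            right = [r[target] for r in rows if r[p] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, p, t)
    # Stop when no admissible split reduces the squared error.
    if best is None or best[0] >= sse(ys):
        return {"leaf": ys}
    _, p, t = best
    return {
        "var": p,
        "cut": t,
        "left": build_tree([r for r in rows if r[p] <= t], target, predictors, min_leaf),
        "right": build_tree([r for r in rows if r[p] > t], target, predictors, min_leaf),
    }

def leaf_values(tree, row):
    """Walk the tree and return the target values stored in the row's leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["var"]] <= tree["cut"] else tree["right"]
    return tree["leaf"]

def synthesize(rows, target, predictors, min_leaf=5, seed=0):
    """Replace each record's target with a random draw from its leaf (assumed scheme)."""
    rng = random.Random(seed)
    tree = build_tree(rows, target, predictors, min_leaf)
    return [{**r, target: rng.choice(leaf_values(tree, r))} for r in rows]

# Toy "confidential" data (hypothetical): income rises with age, plus noise.
noise = random.Random(1)
confidential = [{"age": a, "income": 100 * a + noise.randint(-50, 50)}
                for a in range(20, 60)]
synthetic = synthesize(confidential, target="income", predictors=["age"])
```

In this sketch the predictors are released unchanged and only the sensitive variable is replaced. Larger `min_leaf` values give each draw more donors, which would tend to trade some fidelity for lower disclosure risk. Drawing from leaf donors means every synthetic value is a real observed value, so a production synthesizer would likely need further safeguards that are not shown here.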

Suggested Citation

  • Drechsler, Jörg & Reiter, Jerome P., 2011. "An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets," Computational Statistics & Data Analysis, Elsevier, vol. 55(12), pages 3232-3243, December.
  • Handle: RePEc:eee:csdana:v:55:y:2011:i:12:p:3232-3243

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167947311002076
    Download Restriction: Full text for ScienceDirect subscribers only.

    References listed on IDEAS

    1. Woodcock, Simon D. & Benedetto, Gary, 2009. "Distribution-preserving statistical disclosure limitation," Computational Statistics & Data Analysis, Elsevier, vol. 53(12), pages 4228-4242, October.
    2. Shim, Jooyong & Sohn, Insuk & Kim, Sujong & Lee, Jae Won & Green, Paul E. & Hwang, Changha, 2009. "Selecting marker genes for cancer classification using supervised weighted kernel clustering and the support vector machine," Computational Statistics & Data Analysis, Elsevier, vol. 53(5), pages 1736-1742, March.
    3. Jörg Drechsler, 2012. "New data dissemination approaches in old Europe -- synthetic datasets for a German establishment survey," Journal of Applied Statistics, Taylor & Francis Journals, vol. 39(2), pages 243-265, April.
    4. Iacus, Stefano M. & Porro, Giuseppe, 2007. "Missing data imputation, matching and other applications of random recursive partitioning," Computational Statistics & Data Analysis, Elsevier, vol. 52(2), pages 773-789, October.
    5. Reiter, Jerome P. & Raghunathan, Trivellore E., 2007. "The Multiple Adaptations of Multiple Imputation," Journal of the American Statistical Association, American Statistical Association, vol. 102, pages 1462-1471, December.
    6. Jerome P. Reiter, 2005. "Releasing multiply imputed, synthetic public use microdata: an illustration and empirical study," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 168(1), pages 185-205, January.
    7. Drechsler, Jörg & Reiter, Jerome P., 2010. "Sampling With Synthesis: A New Approach for Releasing Public Use Census Microdata," Journal of the American Statistical Association, American Statistical Association, vol. 105(492), pages 1347-1357.
    8. Drechsler, Jörg & Dundler, Agnes & Bender, Stefan & Rässler, Susanne & Zwick, Thomas, 2007. "A new approach for disclosure control in the IAB Establishment Panel : multiple imputation for a better data access," IAB Discussion Paper 200711, Institut für Arbeitsmarkt- und Berufsforschung (IAB), Nürnberg [Institute for Employment Research, Nuremberg, Germany].
    9. Choi, Hosik & Yeo, Donghwa & Kwon, Sunghoon & Kim, Yongdai, 2011. "Gene selection and prediction for cancer classification using support vector machines with a reject option," Computational Statistics & Data Analysis, Elsevier, vol. 55(5), pages 1897-1908, May.
    10. Skinner, Chris & Shlomo, Natalie, 2008. "Assessing Identification Risk in Survey Microdata Using Log-Linear Models," Journal of the American Statistical Association, American Statistical Association, vol. 103(483), pages 989-1001.
    11. John M. Abowd & Simon D. Woodcock, 2004. "Multiply-Imputing Confidential Characteristics and File Links in Longitudinal Linked Data," Longitudinal Employer-Household Dynamics Technical Papers 2004-04, Center for Economic Studies, U.S. Census Bureau.

    Citations

    Citations are extracted by the CitEc Project.

    Cited by:

    1. Joshua Snoke & Gillian M. Raab & Beata Nowok & Chris Dibben & Aleksandra Slavkovic, 2018. "General and specific utility measures for synthetic data," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(3), pages 663-688, June.
    2. Hang J. Kim & Jerome P. Reiter & Alan F. Karr, 2018. "Simultaneous edit-imputation and disclosure limitation for business establishment data," Journal of Applied Statistics, Taylor & Francis Journals, vol. 45(1), pages 63-82, January.
    3. Daniel Manrique‐Vallier & Jingchen Hu, 2018. "Bayesian non‐parametric generation of fully synthetic multivariate categorical data in the presence of structural zeros," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(3), pages 635-647, June.
    4. Nowok, Beata & Raab, Gillian M. & Dibben, Chris, 2016. "synthpop: Bespoke Creation of Synthetic Data in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 74(i11).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Hang J. Kim & Jerome P. Reiter & Alan F. Karr, 2018. "Simultaneous edit-imputation and disclosure limitation for business establishment data," Journal of Applied Statistics, Taylor & Francis Journals, vol. 45(1), pages 63-82, January.
    2. Andrés F. Barrientos & Alexander Bolton & Tom Balmat & Jerome P. Reiter & John M. de Figueiredo & Ashwin Machanavajjhala & Yan Chen & Charles Kneifel & Mark DeLong, 2017. "A Framework for Sharing Confidential Research Data, Applied to Investigating Differential Pay by Race in the U. S. Government," NBER Working Papers 23534, National Bureau of Economic Research, Inc.
    3. Jerome P. Reiter, 2009. "Using Multiple Imputation to Integrate and Disseminate Confidential Microdata," International Statistical Review, International Statistical Institute, vol. 77(2), pages 179-195, August.
    4. Klein Martin & Sinha Bimal, 2013. "Statistical Analysis of Noise-Multiplied Data Using Multiple Imputation," Journal of Official Statistics, Sciendo, vol. 29(3), pages 425-465, June.
    5. Gerd Ronning, 2014. "Vertraulichkeit und Verfügbarkeit von Mikrodaten," IAW Discussion Papers 101, Institut für Angewandte Wirtschaftsforschung (IAW).
    6. Joseph W. Sakshaug & Trivellore E. Raghunathan, 2014. "Generating synthetic microdata to estimate small area statistics in the American Community Survey," Statistics in Transition new series, Główny Urząd Statystyczny (Polska), vol. 15(3), pages 341-368, June.
    7. Humera Razzak & Christian Heumann, 2019. "Hybrid Multiple Imputation In A Large Scale Complex Survey," Statistics in Transition New Series, Polish Statistical Association, vol. 20(4), pages 33-58, December.
    8. Reiter, Jerome P., 2008. "Selecting the number of imputed datasets when using multiple imputation for missing data and disclosure limitation," Statistics & Probability Letters, Elsevier, vol. 78(1), pages 15-20, January.
    9. Yi Qian & Hui Xie, 2013. "Drive More Effective Data-Based Innovations: Enhancing the Utility of Secure Databases," NBER Working Papers 19586, National Bureau of Economic Research, Inc.
    10. M. Jahangir Alam & Benoit Dostie & Jörg Drechsler & Lars Vilhuber, 2020. "Applying data synthesis for longitudinal business data across three countries," Statistics in Transition New Series, Polish Statistical Association, vol. 21(4), pages 212-236, August.
    11. Joseph W. Sakshaug & Trivellore E. Raghunathan, 2014. "Generating synthetic data to produce public-use microdata for small geographic areas based on complex sample survey data with application to the National Health Interview Survey," Journal of Applied Statistics, Taylor & Francis Journals, vol. 41(10), pages 2103-2122, October.
    12. Javier Miranda & Lars Vilhuber, 2016. "Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics," Working Papers 16-10, Center for Economic Studies, U.S. Census Bureau.
    13. Drechsler, Jörg, 2011. "Methodenreport: Synthetische Scientific-Use-Files der Welle 2007 des IAB-Betriebspanels," FDZ Methodenreport 201101_de, Institut für Arbeitsmarkt- und Berufsforschung (IAB), Nürnberg [Institute for Employment Research, Nuremberg, Germany].
    14. Christine N. Kohnen & Jerome P. Reiter, 2009. "Multiple imputation for combining confidential data owned by two agencies," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 172(2), pages 511-528, April.
    15. Reiter, Jerome P. & Oganian, Anna & Karr, Alan F., 2009. "Verification servers: Enabling analysts to assess the quality of inferences from public use data," Computational Statistics & Data Analysis, Elsevier, vol. 53(4), pages 1475-1482, February.
    16. Yi Qian & Hui Xie, 2015. "Drive More Effective Data-Based Innovations: Enhancing the Utility of Secure Databases," Management Science, INFORMS, vol. 61(3), pages 520-541, March.
    17. Dettmann, E. & Becker, C. & Schmeißer, C., 2011. "Distance functions for matching in small samples," Computational Statistics & Data Analysis, Elsevier, vol. 55(5), pages 1942-1960, May.
    18. Satkartar K. Kinney & Jerome P. Reiter & Javier Miranda, 2014. "Improving The Synthetic Longitudinal Business Database," Working Papers 14-12, Center for Economic Studies, U.S. Census Bureau.
    19. Eurosystem Household Finance and Consumption Network, 2013. "The Eurosystem Household Finance and Consumption Survey - Methodological report," Statistics Paper Series 1, European Central Bank.
    20. Ainara González de San Román & Yolanda F. Rebollo‐Sanz, 2018. "An Estimation Of Worker And Firm Effects With Censored Data," Bulletin of Economic Research, Wiley Blackwell, vol. 70(4), pages 459-482, October.
