An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets
When intense redaction is needed to protect the confidentiality of data subjects' identities and sensitive attributes, statistical agencies can use synthetic data approaches. To create synthetic data, the agency replaces identifying or sensitive values with draws from statistical models estimated from the confidential data. Many agencies are reluctant to implement this idea because (i) the quality of the generated data depends strongly on the quality of the underlying models, and (ii) developing effective synthesis models can be a labor-intensive and difficult task. Recently, there have been suggestions that agencies use nonparametric methods from the machine learning literature to generate synthetic data. These methods can estimate non-linear relationships that might otherwise be missed and can be run with minimal tuning, thus considerably reducing burdens on the agency. Four synthesizers based on machine learning algorithms (classification and regression trees, bagging, random forests, and support vector machines) are evaluated in terms of their potential to preserve analytical validity while reducing disclosure risks. The evaluation is based on a repeated sampling simulation with a subset of the 2002 Uganda census public use sample data. The simulation suggests that synthesizers based on regression trees can result in synthetic datasets that provide reliable estimates and low disclosure risks, and that these synthesizers can be implemented easily by statistical agencies.
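To make the idea of a tree-based synthesizer concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It grows a small regression tree for one sensitive variable on the confidential records, then replaces each record's sensitive value with a value drawn from the observed values in the leaf that record falls into. The function names (`grow_tree`, `synthesize`), the `min_leaf` parameter, the toy `age`/`income` data, and the choice to sample uniformly within a leaf are all assumptions made for illustration.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def grow_tree(rows, target, predictors, min_leaf=5):
    """Grow a regression tree; each leaf keeps its observed target values."""
    ys = [r[target] for r in rows]
    best = None  # (SSE after split, predictor, threshold)
    for p in predictors:
        # Candidate cuts: every observed value except the largest.
        for cut in sorted({r[p] for r in rows})[:-1]:
            left = [r[target] for r in rows if r[p] <= cut]
            right = [r[target] for r in rows if r[p] > cut]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, p, cut)
    # Stop when no admissible split reduces the within-node SSE.
    if best is None or best[0] >= sse(ys):
        return {"leaf": ys}
    _, p, cut = best
    left_rows = [r for r in rows if r[p] <= cut]
    right_rows = [r for r in rows if r[p] > cut]
    return {
        "var": p,
        "cut": cut,
        "left": grow_tree(left_rows, target, predictors, min_leaf),
        "right": grow_tree(right_rows, target, predictors, min_leaf),
    }


def leaf_values(tree, row):
    """Walk the tree and return the donor values of the leaf containing row."""
    while "leaf" not in tree:
        side = "left" if row[tree["var"]] <= tree["cut"] else "right"
        tree = tree[side]
    return tree["leaf"]


def synthesize(rows, target, predictors, rng, min_leaf=5):
    """Return copies of rows with target replaced by a draw from its leaf."""
    tree = grow_tree(rows, target, predictors, min_leaf)
    return [{**r, target: rng.choice(leaf_values(tree, r))} for r in rows]


# Toy confidential data (hypothetical): income jumps at age 40.
noise = random.Random(0)
data = [
    {"age": a, "income": (100 if a < 40 else 300) + noise.randint(-5, 5)}
    for a in range(20, 60)
]
syn = synthesize(data, "income", ["age"], random.Random(1))
```

In this sketch the predictor values are left unchanged and only `income` is synthesized, so the step in income at age 40 survives in `syn` without anyone having specified that relationship in advance. A production synthesizer would also need to handle categorical variables, sequential synthesis of several variables, and a principled within-leaf sampling scheme, none of which are covered here.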
RePEc handle: RePEc:eee:csdana:v:55:y:2011:i:12:p:3232-3243