An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets
When intense redaction is needed to protect the confidentiality of data subjects' identities and sensitive attributes, statistical agencies can use synthetic data approaches. To create synthetic data, the agency replaces identifying or sensitive values with draws from statistical models estimated from the confidential data. Many agencies are reluctant to implement this idea because (i) the quality of the generated data depends strongly on the quality of the underlying models, and (ii) developing effective synthesis models can be a labor-intensive and difficult task. Recently, there have been suggestions that agencies use nonparametric methods from the machine learning literature to generate synthetic data. These methods can estimate non-linear relationships that might otherwise be missed and can be run with minimal tuning, thus considerably reducing burdens on the agency. Four synthesizers based on machine learning algorithms-classification and regression trees, bagging, random forests, and support vector machines-are evaluated in terms of their potential to preserve analytical validity while reducing disclosure risks. The evaluation is based on a repeated sampling simulation with a subset of the 2002 Uganda census public use sample data. The simulation suggests that synthesizers based on regression trees can result in synthetic datasets that provide reliable estimates and low disclosure risks, and that these synthesizers can be implemented easily by statistical agencies.
If you experience problems downloading a file, check if you have the proper application to view it first. In case of further problems read the IDEAS help page. Note that these files are not on the IDEAS site. Please be patient as the files may be large.
As the access to this document is restricted, you may want to look for a different version under "Related research" (further below) or search for a different version of it.
References listed on IDEAS
Please report citation or reference errors to , or , if you are the registered author of the cited work, log in to your RePEc Author Service profile, click on "citations" and make appropriate adjustments.:
- Iacus, Stefano M. & Porro, Giuseppe, 2007. "Missing data imputation, matching and other applications of random recursive partitioning," Computational Statistics & Data Analysis, Elsevier, vol. 52(2), pages 773-789, October.
- Skinner, Chris & Shlomo, Natalie, 2008. "Assessing Identification Risk in Survey Microdata Using Log-Linear Models," Journal of the American Statistical Association, American Statistical Association, vol. 103(483), pages 989-1001.
- Drechsler, JÃ¶rg & Reiter, Jerome P., 2010. "Sampling With Synthesis: A New Approach for Releasing Public Use Census Microdata," Journal of the American Statistical Association, American Statistical Association, vol. 105(492), pages 1347-1357.
- Jörg Drechsler, 2012. "New data dissemination approaches in old Europe -- synthetic datasets for a German establishment survey," Journal of Applied Statistics, Taylor & Francis Journals, vol. 39(2), pages 243-265, April.
- Simon D. Woodcock & Gary Benedetto, 2006.
"Distribution Preserving Statistical Disclosure Limitation,"
Longitudinal Employer-Household Dynamics Technical Papers
2006-04, Center for Economic Studies, U.S. Census Bureau.
- Woodcock, Simon D. & Benedetto, Gary, 2009. "Distribution-preserving statistical disclosure limitation," Computational Statistics & Data Analysis, Elsevier, vol. 53(12), pages 4228-4242, October.
- Simon D. Woodcock & Gary Benedetto, 2007. "Distribution-Preserving Statistical Disclosure Limitation," Discussion Papers dp07-15, Department of Economics, Simon Fraser University.
- Woodcock, Simon & Benedetto, Gary, 2006. "Distribution-Preserving Statistical Disclosure Limitation," MPRA Paper 155, University Library of Munich, Germany.
- Choi, Hosik & Yeo, Donghwa & Kwon, Sunghoon & Kim, Yongdai, 2011. "Gene selection and prediction for cancer classification using support vector machines with a reject option," Computational Statistics & Data Analysis, Elsevier, vol. 55(5), pages 1897-1908, May.
- Reiter, Jerome P. & Raghunathan, Trivellore E., 2007. "The Multiple Adaptations of Multiple Imputation," Journal of the American Statistical Association, American Statistical Association, vol. 102, pages 1462-1471, December.
- John M. Abowd & Simon D. Woodcock, 2004. "Multiply-Imputing Confidential Characteristics and File Links in Longitudinal Linked Data," Longitudinal Employer-Household Dynamics Technical Papers 2004-04, Center for Economic Studies, U.S. Census Bureau.
- Drechsler, Jörg & Dundler, Agnes & Bender, Stefan & Rässler, Susanne & Zwick, Thomas, 2007. "A new approach for disclosure control in the IAB Establishment Panel : multiple imputation for a better data access," IAB Discussion Paper 200711, Institut für Arbeitsmarkt- und Berufsforschung (IAB), Nürnberg [Institute for Employment Research, Nuremberg, Germany].
- Shim, Jooyong & Sohn, Insuk & Kim, Sujong & Lee, Jae Won & Green, Paul E. & Hwang, Changha, 2009. "Selecting marker genes for cancer classification using supervised weighted kernel clustering and the support vector machine," Computational Statistics & Data Analysis, Elsevier, vol. 53(5), pages 1736-1742, March.
- Jerome P. Reiter, 2005. "Releasing multiply imputed, synthetic public use microdata: an illustration and empirical study," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 168(1), pages 185-205.
When requesting a correction, please mention this item's handle: RePEc:eee:csdana:v:55:y:2011:i:12:p:3232-3243. See general information about how to correct material in RePEc.
For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: (Zhang, Lei)
If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.
If references are entirely missing, you can add them using this form.
If the full references list an item that is present in RePEc, but the system did not link to it, you can help with this form.
If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your profile, as there may be some citations waiting for confirmation.
Please note that corrections may take a couple of weeks to filter through the various RePEc services.