Printed from https://ideas.repec.org/a/spr/compst/v40y2025i7d10.1007_s00180-025-01643-0.html

Synthetic data generation method providing enhanced covariance matrix estimation

Author

Listed:
  • Seungkyu Kim

    (Seoul National University)

  • Johan Lim

    (Seoul National University)

  • Donghyeon Yu

    (Inha University)

Abstract

Synthetic data generation is an important tool for protecting data confidentiality. Various synthetic data generators have been developed in the literature, most of them for general purposes: they aim to generate data whose distribution matches that of the original data set, and the synthesized data are then used for whatever purpose each user has in mind. However, such data may not be well suited to every purpose. In this paper, we study synthetic data generation tailored to a specific purpose. We are particularly interested in covariance matrix estimation, which is a key part of many multivariate statistical analyses. To this end, we first establish the connection between the sequential regression model and the modified Cholesky decomposition. We then devise a new synthetic data generator, named SynCov, that controls the error variances of the sequential regression model. We show that the sample covariance matrix of the synthetic data generated by SynCov is equivalent to a shrinkage covariance matrix estimator, which reduces estimation error in Frobenius norm. Our comprehensive numerical study shows that SynCov performs better than other synthetic data generation methods in covariance matrix estimation. Finally, we apply SynCov to two real data examples: (i) the estimation of the covariance matrix of the (selected) variables of the Los Angeles City Employee Payroll data and (ii) the classification of the Taiwanese Bankruptcy Data.
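
The link between sequential regression and the modified Cholesky decomposition mentioned in the abstract can be illustrated with a small sketch. This is not the authors' SynCov procedure. It only shows the textbook factorization Sigma = L D L^T, where L is unit lower triangular and the diagonal of D holds the error variances of regressing each variable on its predecessors. The function names, the example matrix, and the 0.9 scaling factor are all assumptions chosen for illustration; the abstract does not say how SynCov chooses the error variances.

```python
def modified_cholesky(sigma):
    """Factor sigma = L * diag(d) * L^T with L unit lower triangular.

    d[j] is the error variance of regressing variable j on variables
    0..j-1; L[i][j] is the weight of the j-th regression error in variable i.
    """
    p = len(sigma)
    L = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    d = [0.0] * p
    for j in range(p):
        d[j] = sigma[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, p):
            s = sum(L[i][k] * L[j][k] * d[k] for k in range(j))
            L[i][j] = (sigma[i][j] - s) / d[j]
    return L, d


def rebuild(L, d):
    """Recompose the covariance matrix L * diag(d) * L^T."""
    p = len(d)
    return [
        [sum(L[i][k] * d[k] * L[j][k] for k in range(p)) for j in range(p)]
        for i in range(p)
    ]


# Hypothetical 3x3 covariance matrix (illustrative values only).
sigma = [
    [4.0, 2.0, 1.0],
    [2.0, 3.0, 1.0],
    [1.0, 1.0, 2.0],
]

L, d = modified_cholesky(sigma)
rebuilt = rebuild(L, d)  # recovers sigma up to rounding

# Changing the error variances changes the implied covariance matrix.
# The factor 0.9 is arbitrary and is NOT the rule used in the paper.
d_adjusted = [d[0]] + [0.9 * v for v in d[1:]]
adjusted = rebuild(L, d_adjusted)

print("error variances:", d)
print("adjusted covariance:", adjusted)
```

For the first two variables, L[1][0] equals the usual regression slope sigma[1][0] / sigma[0][0] and d[1] equals the residual variance, which is the sequential regression reading of the factorization. Keeping L fixed and altering only d is one simple way to see how control over error variances translates into a modified covariance matrix.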

Suggested Citation

  • Seungkyu Kim & Johan Lim & Donghyeon Yu, 2025. "Synthetic data generation method providing enhanced covariance matrix estimation," Computational Statistics, Springer, vol. 40(7), pages 4007-4035, September.
  • Handle: RePEc:spr:compst:v:40:y:2025:i:7:d:10.1007_s00180-025-01643-0
    DOI: 10.1007/s00180-025-01643-0

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s00180-025-01643-0
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s00180-025-01643-0?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    References listed on IDEAS

    1. Xiaoning Kang & Xinwei Deng & Kam‐Wah Tsui & Mohsen Pourahmadi, 2020. "On variable ordination of modified Cholesky decomposition for estimating time‐varying covariance matrices," International Statistical Review, International Statistical Institute, vol. 88(3), pages 616-641, December.
    2. Drechsler, Jörg & Reiter, Jerome P., 2011. "An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets," Computational Statistics & Data Analysis, Elsevier, vol. 55(12), pages 3232-3243, December.
    3. Hang J. Kim & Jörg Drechsler & Katherine J. Thompson, 2021. "Synthetic microdata for establishment surveys under informative sampling," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 184(1), pages 255-281, January.
    4. Jared S. Murray & Jerome P. Reiter, 2016. "Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models With Local Dependence," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(516), pages 1466-1479, October.
    5. Jianhua Z. Huang & Naiping Liu & Mohsen Pourahmadi & Linxu Liu, 2006. "Covariance matrix selection and estimation via penalised normal likelihood," Biometrika, Biometrika Trust, vol. 93(1), pages 85-98, March.
    6. Xiaoning Kang & Chaoping Xie & Mingqiu Wang, 2020. "A Cholesky-based estimation for large-dimensional covariance matrices," Journal of Applied Statistics, Taylor & Francis Journals, vol. 47(6), pages 1017-1030, April.
    7. Rajaratnam, Bala & Salzman, Julia, 2013. "Best permutation analysis," Journal of Multivariate Analysis, Elsevier, vol. 121(C), pages 193-223.
    8. Reiter, Jerome P. & Raghunathan, Trivellore E., 2007. "The Multiple Adaptations of Multiple Imputation," Journal of the American Statistical Association, American Statistical Association, vol. 102, pages 1462-1471, December.
    9. Jerome P. Reiter, 2005. "Releasing multiply imputed, synthetic public use microdata: an illustration and empirical study," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 168(1), pages 185-205, January.
    10. Adam J. Rothman & Elizaveta Levina & Ji Zhu, 2010. "A new approach to Cholesky-based covariance regularization in high dimensions," Biometrika, Biometrika Trust, vol. 97(3), pages 539-550.
    11. Nowok, Beata & Raab, Gillian M. & Dibben, Chris, 2016. "synthpop: Bespoke Creation of Synthetic Data in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 74(i11).
    12. Gérard Letac & Hélène Massam, 2004. "All Invariant Moments of the Wishart Distribution," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 31(2), pages 295-318, June.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Kang, Xiaoning & Wang, Mingqiu, 2021. "Ensemble sparse estimation of covariance structure for exploring genetic disease data," Computational Statistics & Data Analysis, Elsevier, vol. 159(C).
    2. Joshua Snoke & Gillian M. Raab & Beata Nowok & Chris Dibben & Aleksandra Slavkovic, 2018. "General and specific utility measures for synthetic data," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(3), pages 663-688, June.
    3. Gao, Zhenguo & Wang, Xinye & Kang, Xiaoning, 2023. "Ensemble LDA via the modified Cholesky decomposition," Computational Statistics & Data Analysis, Elsevier, vol. 188(C).
    4. Stefan Wimmer & Robert Finger, 2023. "A note on synthetic data for replication purposes in agricultural economics," Journal of Agricultural Economics, Wiley Blackwell, vol. 74(1), pages 316-323, February.
    5. Hang J. Kim & Jerome P. Reiter & Alan F. Karr, 2018. "Simultaneous edit-imputation and disclosure limitation for business establishment data," Journal of Applied Statistics, Taylor & Francis Journals, vol. 45(1), pages 63-82, January.
    6. Klein Martin & Sinha Bimal, 2013. "Statistical Analysis of Noise-Multiplied Data Using Multiple Imputation," Journal of Official Statistics, Sciendo, vol. 29(3), pages 425-465, June.
    7. Qiu, Yumou & Chen, Songxi, 2012. "Test for Bandedness of High Dimensional Covariance Matrices with Bandwidth Estimation," MPRA Paper 46242, University Library of Munich, Germany.
    8. Lam, Clifford, 2020. "High-dimensional covariance matrix estimation," LSE Research Online Documents on Economics 101667, London School of Economics and Political Science, LSE Library.
    9. Woodcock, Simon D. & Benedetto, Gary, 2009. "Distribution-preserving statistical disclosure limitation," Computational Statistics & Data Analysis, Elsevier, vol. 53(12), pages 4228-4242, October.
    10. Wang, Luheng & Chen, Zhao & Wang, Christina Dan & Li, Runze, 2020. "Ultrahigh dimensional precision matrix estimation via refitted cross validation," Journal of Econometrics, Elsevier, vol. 215(1), pages 118-130.
    11. James Jackson & Robin Mitra & Brian Francis & Iain Dove, 2022. "Using saturated count models for user‐friendly synthesis of large confidential administrative databases," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(4), pages 1613-1643, October.
    12. Humera Razzak & Christian Heumann, 2019. "Hybrid Multiple Imputation In A Large Scale Complex Survey," Statistics in Transition New Series, Polish Statistical Association, vol. 20(4), pages 33-58, December.
    13. Razzak Humera & Heumann Christian, 2019. "Hybrid Multiple Imputation In A Large Scale Complex Survey," Statistics in Transition New Series, Statistics Poland, vol. 20(4), pages 33-58, December.
    14. Lopes, Hedibert F. & McCulloch, Robert E. & Tsay, Ruey S., 2022. "Parsimony inducing priors for large scale state–space models," Journal of Econometrics, Elsevier, vol. 230(1), pages 39-61.
    15. Andrés F. Barrientos & Alexander Bolton & Tom Balmat & Jerome P. Reiter & John M. de Figueiredo & Ashwin Machanavajjhala & Yan Chen & Charles Kneifel & Mark DeLong, 2017. "A Framework for Sharing Confidential Research Data, Applied to Investigating Differential Pay by Race in the U. S. Government," NBER Working Papers 23534, National Bureau of Economic Research, Inc.
    16. Yi Qian & Hui Xie, 2013. "Drive More Effective Data-Based Innovations: Enhancing the Utility of Secure Databases," NBER Working Papers 19586, National Bureau of Economic Research, Inc.
    17. Joseph W. Sakshaug & Trivellore E. Raghunathan, 2014. "Generating synthetic data to produce public-use microdata for small geographic areas based on complex sample survey data with application to the National Health Interview Survey," Journal of Applied Statistics, Taylor & Francis Journals, vol. 41(10), pages 2103-2122, October.
    18. Nowok, Beata & Raab, Gillian M. & Dibben, Chris, 2016. "synthpop: Bespoke Creation of Synthetic Data in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 74(i11).
    19. Yi, Feng & Zou, Hui, 2013. "SURE-tuned tapering estimation of large covariance matrices," Computational Statistics & Data Analysis, Elsevier, vol. 58(C), pages 339-351.
    20. Yumou Qiu & Song Xi Chen, 2015. "Bandwidth Selection for High-Dimensional Covariance Matrix Estimation," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 110(511), pages 1160-1174, September.
