
Impacts of Data Synthesis: A Metric for Quantifiable Data Standards and Performances

Authors
  • Gunjan Chandra

    (Biomimetics and Intelligent Systems Group, Faculty of Information Technology and Electrical Engineering, University of Oulu, Pentti Kaiteran katu 1, 90570 Oulu, Finland)

  • Pekka Siirtola

    (Biomimetics and Intelligent Systems Group, Faculty of Information Technology and Electrical Engineering, University of Oulu, Pentti Kaiteran katu 1, 90570 Oulu, Finland)

  • Satu Tamminen

    (Biomimetics and Intelligent Systems Group, Faculty of Information Technology and Electrical Engineering, University of Oulu, Pentti Kaiteran katu 1, 90570 Oulu, Finland)

  • Mikael J. Knip

    (Pediatric Research Center, Children’s Hospital, University of Helsinki and Helsinki University Hospital, Yliopistonkatu 4, 00100 Helsinki, Finland;
    Research Program for Clinical and Molecular Metabolism, Faculty of Medicine, University of Helsinki, Yliopistonkatu 3, 00014 Helsinki, Finland)

  • Riitta Veijola

    (Department of Paediatrics, University of Oulu, Oulu University Hospital, Kajaanintie 50, 90220 Oulu, Finland)

  • Juha Röning

    (Biomimetics and Intelligent Systems Group, Faculty of Information Technology and Electrical Engineering, University of Oulu, Pentti Kaiteran katu 1, 90570 Oulu, Finland)

Abstract

Clinical data analysis could lead to breakthroughs. However, clinical data contain sensitive information about participants that could be misused for unethical activities such as blackmail, identity theft, mass surveillance, or social engineering. Data anonymization is a standard step during data collection, before sharing, to reduce the risk of disclosure. However, conventional data anonymization techniques are not foolproof and also limit the opportunity for personalized evaluations. Much research has been done on synthetic data generation using generative adversarial networks and many other machine learning methods; however, these methods are either not free to use or limited in capacity. This study evaluates the performance of synthpop, an emerging R package that produces synthetic data, as an alternative approach to data anonymization. This paper establishes data standards derived from the original data set, based on the utilities and the quality of information, and measures variations in the synthetic data set to evaluate the performance of the data synthesis process. The methods for assessing the utility of the synthetic data set can be broadly divided into two approaches: general utility and specific utility. General utility assesses whether the synthetic data resemble the original data set overall in their statistical properties and multivariate relationships. Specific utility, in turn, assesses how closely a fitted model's performance on the synthetic data matches its performance on the original data. The quality of information is assessed by comparing variations in entropy (in bits) and in mutual information with the response variables between the original and synthetic data sets.
The study shows that the synthetic data passed all utility tests, with no statistically significant difference from the original data, and preserved not only the utilities but also the complexity of the original data set according to the data standard established in this study. Therefore, synthpop fulfills all the requirements and opens up a wide range of opportunities for the research community, including easy data sharing and information protection.
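
The information-quality comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (the study itself uses the R package synthpop): the function names `entropy_bits`, `mutual_information_bits`, and `information_profile` are hypothetical, the toy values are made up for illustration, and discrete variables are assumed. It computes Shannon entropy in bits and the mutual information between a feature and a response, then reports how far the synthetic values deviate from the original ones.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information_bits(xs, ys):
    """Mutual information I(X;Y) in bits, using H(X) + H(Y) - H(X,Y)."""
    joint = list(zip(xs, ys))
    return entropy_bits(xs) + entropy_bits(ys) - entropy_bits(joint)


def information_profile(data):
    """Entropy of the feature and its mutual information with the response."""
    return {
        "entropy": entropy_bits(data["feature"]),
        "mutual_information": mutual_information_bits(
            data["feature"], data["response"]
        ),
    }


# Made-up toy data (an assumption for illustration, not from the study).
original = {
    "feature": [0, 0, 1, 1, 0, 1, 0, 1],
    "response": [0, 0, 1, 1, 0, 1, 1, 0],
}
synthetic = {
    "feature": [0, 1, 1, 0, 0, 1, 1, 0],
    "response": [0, 1, 1, 0, 0, 1, 0, 1],
}

profile_orig = information_profile(original)
profile_syn = information_profile(synthetic)

# Absolute deviation of each measure between original and synthetic data.
deviation = {k: abs(profile_orig[k] - profile_syn[k]) for k in profile_orig}
print(deviation)
```

Small deviations in both measures would suggest that the synthetic data carry roughly the same amount of information about the response as the original data. Continuous variables would need to be binned first, which this sketch leaves out.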

Suggested Citation

  • Gunjan Chandra & Pekka Siirtola & Satu Tamminen & Mikael J. Knip & Riitta Veijola & Juha Röning, 2022. "Impacts of Data Synthesis: A Metric for Quantifiable Data Standards and Performances," Data, MDPI, vol. 7(12), pages 1-26, December.
  • Handle: RePEc:gam:jdataj:v:7:y:2022:i:12:p:178-:d:1000188

    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/7/12/178/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/7/12/178/
    Download Restriction: no

    References listed on IDEAS

    1. Thijs Devriendt & Pascal Borry & Mahsa Shabani, 2021. "Factors that influence data sharing through data sharing platforms: A qualitative study on the views and experiences of cohort holders and platform developers," PLOS ONE, Public Library of Science, vol. 16(7), pages 1-14, July.
    2. Nowok, Beata & Raab, Gillian M. & Dibben, Chris, 2016. "synthpop: Bespoke Creation of Synthetic Data in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 74(i11).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Dominik Bietsch & Robert Stahlbock & Stefan Voß, 2023. "Synthetic Data as a Proxy for Real-World Electronic Health Records in the Patient Length of Stay Prediction," Sustainability, MDPI, vol. 15(18), pages 1-30, September.
    2. James Jackson & Robin Mitra & Brian Francis & Iain Dove, 2022. "Using saturated count models for user‐friendly synthesis of large confidential administrative databases," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(4), pages 1613-1643, October.
    3. Daiho Uhm & Sunghae Jun, 2022. "Zero-Inflated Patent Data Analysis Using Generating Synthetic Samples," Future Internet, MDPI, vol. 14(7), pages 1-11, July.
    4. Felix Ritchie & Jim Smith, 2019. "Confidentiality and linked data," Papers 1907.06465, arXiv.org.
    5. Joshua Snoke & Gillian M. Raab & Beata Nowok & Chris Dibben & Aleksandra Slavkovic, 2018. "General and specific utility measures for synthetic data," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(3), pages 663-688, June.
    6. Wesley J. Marrero & Mariel S. Lavieri & Jeremy B. Sussman, 2021. "Optimal cholesterol treatment plans and genetic testing strategies for cardiovascular diseases," Health Care Management Science, Springer, vol. 24(1), pages 1-25, March.
    7. Jahangir Alam M. & Dostie Benoit & Drechsler Jörg & Vilhuber Lars, 2020. "Applying data synthesis for longitudinal business data across three countries," Statistics in Transition New Series, Polish Statistical Association, vol. 21(4), pages 212-236, August.
    8. Erik D. Mueller & J. S. Onésimo Sandoval & Srikanth P. Mudigonda & Michael Elliott, 2019. "Extending cluster-based ensemble learning through synthetic population generation for modeling disparities in health insurance coverage across Missouri," Journal of Computational Social Science, Springer, vol. 2(2), pages 271-291, July.
    9. Asunur Cezar & Srinivasan Raghunathan & Sumit Sarkar, 2020. "Adversarial Classification: Impact of Agents’ Faking Cost on Firms and Agents," Production and Operations Management, Production and Operations Management Society, vol. 29(12), pages 2789-2807, December.
    10. Speidel, Matthias & Drechsler, Jörg & Jolani, Shahab, 2018. "R package hmi: a convenient tool for hierarchical multiple imputation and beyond," IAB-Discussion Paper 201816, Institut für Arbeitsmarkt- und Berufsforschung (IAB), Nürnberg [Institute for Employment Research, Nuremberg, Germany].
    11. Lau Lilleholt & Ingo Zettler & Cornelia Betsch & Robert Böhm, 2023. "Development and validation of the pandemic fatigue scale," Nature Communications, Nature, vol. 14(1), pages 1-19, December.
    12. Stefan Wimmer & Robert Finger, 2023. "A note on synthetic data for replication purposes in agricultural economics," Journal of Agricultural Economics, Wiley Blackwell, vol. 74(1), pages 316-323, February.
