
On the Quality of Synthetic Generated Tabular Data

Author

Listed:
  • Erica Espinosa

    (Department of Mathematics Engineering, Politecnico di Milano, 20133 Milan, Italy)

  • Alvaro Figueira

    (Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal; INESCTEC, 4200-465 Porto, Portugal)

Abstract

Class imbalance is a common issue when developing classification models. To tackle this problem, synthetic data have recently been generated to augment the minority class and bolster its representation. However, evaluating the suitability of such generated data is crucial to ensure that they align with the original data distribution. Utility measures quantify how similar the distribution of the generated data is to the original one. For tabular data, various evaluation methods assess different characteristics of the generated data. In this study, we collected utility measures and categorized them by the type of analysis they perform. We then applied these measures to synthetic data generated from two well-known datasets, Adults Income and Liar+, using five well-known generative models: Borderline SMOTE, DataSynthesizer, CTGAN, CopulaGAN, and REaLTabFormer. The measurements proved informative, indicating that if one synthetic dataset is superior to another in terms of utility measures, it will be more effective as an augmentation for the minority class in classification tasks.
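
To make the idea of a utility measure concrete, the following is a minimal sketch of one common family of column-wise comparisons: a two-sample Kolmogorov-Smirnov statistic for numeric columns and a total variation distance for categorical columns, each turned into a similarity score. The abstract does not name the specific measures the authors collected, so this choice of measures is an assumption, and the function names, column names, and toy rows are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of a numeric column (0 = identical, 1 = disjoint)."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # those points is enough to find the maximum gap.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def total_variation(real, synth):
    """Total variation distance between the category frequencies of a
    categorical column (0 = identical, 1 = no shared categories)."""
    p, q = Counter(real), Counter(synth)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p[c] / len(real) - q[c] / len(synth)) for c in categories)


def column_similarity(real_rows, synth_rows, numeric_columns):
    """Per-column similarity (1 - distance) for rows given as dicts.
    Columns not listed in numeric_columns are treated as categorical."""
    scores = {}
    for col in real_rows[0]:
        real = [row[col] for row in real_rows]
        synth = [row[col] for row in synth_rows]
        if col in numeric_columns:
            dist = ks_statistic(real, synth)
        else:
            dist = total_variation(real, synth)
        scores[col] = 1.0 - dist
    return scores


# Hypothetical toy rows; real use would load the original and generated tables.
real_rows = [
    {"age": 25, "workclass": "Private"},
    {"age": 38, "workclass": "Private"},
    {"age": 52, "workclass": "Self-emp"},
    {"age": 44, "workclass": "Gov"},
]
synth_rows = [
    {"age": 27, "workclass": "Private"},
    {"age": 40, "workclass": "Self-emp"},
    {"age": 50, "workclass": "Self-emp"},
    {"age": 45, "workclass": "Gov"},
]
scores = column_similarity(real_rows, synth_rows, numeric_columns={"age"})
print(scores)
```

Scores near 1 suggest that a column's marginal distribution is well preserved. Marginal checks of this kind say nothing about relationships between columns, which is one reason to combine several types of measure rather than rely on a single score.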

Suggested Citation

  • Erica Espinosa & Alvaro Figueira, 2023. "On the Quality of Synthetic Generated Tabular Data," Mathematics, MDPI, vol. 11(15), pages 1-18, July.
  • Handle: RePEc:gam:jmathe:v:11:y:2023:i:15:p:3278-:d:1202744

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/11/15/3278/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/11/15/3278/
    Download Restriction: no

    References listed on IDEAS

    1. Saksham Jain & Gautam Seth & Arpit Paruthi & Umang Soni & Girish Kumar, 2022. "Synthetic data augmentation for surface defect detection and classification using deep learning," Journal of Intelligent Manufacturing, Springer, vol. 33(4), pages 1007-1020, April.
    2. Joshua Snoke & Gillian M. Raab & Beata Nowok & Chris Dibben & Aleksandra Slavkovic, 2018. "General and specific utility measures for synthetic data," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(3), pages 663-688, June.

    Citations


    Cited by:

    1. Yue Li & Qingyu Hu & Guilan Xie & Gong Chen, 2023. "Prediction of the Health Status of Older Adults Using Oversampling and Neural Network," Mathematics, MDPI, vol. 11(24), pages 1-33, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Claire McKay Bowen & Fang Liu & Bingyue Su, 2021. "Differentially private data release via statistical election to partition sequentially," METRON, Springer;Sapienza Università di Roma, vol. 79(1), pages 1-31, April.
    2. James Jackson & Robin Mitra & Brian Francis & Iain Dove, 2022. "Using saturated count models for user‐friendly synthesis of large confidential administrative databases," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(4), pages 1613-1643, October.
    3. Daiho Uhm & Sunghae Jun, 2022. "Zero-Inflated Patent Data Analysis Using Generating Synthetic Samples," Future Internet, MDPI, vol. 14(7), pages 1-11, July.
    4. Hang J. Kim & Jörg Drechsler & Katherine J. Thompson, 2021. "Synthetic microdata for establishment surveys under informative sampling," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 184(1), pages 255-281, January.
    5. Songling Huang & Lisha Peng & Hongyu Sun & Shisong Li, 2023. "Deep Learning for Magnetic Flux Leakage Detection and Evaluation of Oil & Gas Pipelines: A Review," Energies, MDPI, vol. 16(3), pages 1-27, January.
