IDEAS home Printed from https://ideas.repec.org/a/spr/waterr/v37y2023i15d10.1007_s11269-023-03650-6.html
   My bibliography  Save this article

Artificial Intelligence Generated Synthetic Datasets as the Remedy for Data Scarcity in Water Quality Index Estimation

Author

Listed:
  • Min Yan Chia

    (Universiti Tunku Abdul Rahman)

  • Chai Hoon Koo

    (Universiti Tunku Abdul Rahman)

  • Yuk Feng Huang

    (Universiti Tunku Abdul Rahman)

  • Wei Chan

    (Universiti Tunku Abdul Rahman)

  • Jia Yin Pang

    (Universiti Tunku Abdul Rahman)

Abstract

Water quality index (WQI) has been utilised in many countries and regions as a numeric representation of the condition of water resources. However, the computation of the WQI involves a host of water quality variables. Although machine learning models are proven to be a promising tool to estimate WQI with lesser inputs, sufficient data or samples must be collected so that the machine learning models can be trained well. This exhibits a great challenge in places where there has been a lack of data collection infrastructure to meet the needs of machine learning models. Data scarcity is a major issue to be tackled. This study covered two major rivers that served as water intakes in Peninsular Malaysia (Selangor River and Skudai River), where four synthetic data generation methods, namely the conditional tabular generative adversarial network (CTGAN), the tabular variational autoencoder (TVAE), the Gaussian copula (GC) and the copula generative adversarial network (CopulaGAN), were used to synthesise datasets based on the real dataset. By using the pairwise correlation difference (PCD), Kullback-Leibler divergence (KLD) and the Kolmogorov-Smirnov (KS) test, the best synthetic datasets were selected for the two rivers. The CopulaGAN1 and the CopulaGAN2 yielded the best small and large synthetic datasets at Selangor River, scoring the lowest PCD, KLD and KS statistics. For the Skudai River, the TVAE1 and TVAE2 were chosen. The real and synthetic datasets were used to train the back-propagation neural network (BPNN) for the WQI estimation. Based on the various evaluation metrics, it was proven that increasing the size of training data using the synthetic data method had a positive impact on the performance of the BPNN. The BPNN trained with the CopulaGAN2 (at Selangor River) and the TVAE2 (at Skudai River) yielded more accurate estimations compared to those derived from the actual and smaller datasets.

Suggested Citation

  • Min Yan Chia & Chai Hoon Koo & Yuk Feng Huang & Wei Chan & Jia Yin Pang, 2023. "Artificial Intelligence Generated Synthetic Datasets as the Remedy for Data Scarcity in Water Quality Index Estimation," Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), Springer;European Water Resources Association (EWRA), vol. 37(15), pages 6183-6198, December.
  • Handle: RePEc:spr:waterr:v:37:y:2023:i:15:d:10.1007_s11269-023-03650-6
    DOI: 10.1007/s11269-023-03650-6
    as

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11269-023-03650-6
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s11269-023-03650-6?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item
    ---><---

    As the access to this document is restricted, you may want to search for a different version of it.

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:spr:waterr:v:37:y:2023:i:15:d:10.1007_s11269-023-03650-6. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    We have no bibliographic references for this item. You can help adding them by using this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Sonal Shukla or Springer Nature Abstracting and Indexing (email available below). General contact details of provider: http://www.springer.com .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.