IDEAS home Printed from https://ideas.repec.org/a/gam/jsusta/v13y2021i11p6318-d567774.html

Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach

Author

Listed:
  • Rafael Rodríguez

    (Instituto de Mecánica de los Fluidos e Ingeniería Ambiental (IMFIA), Facultad de Ingeniería, Universidad de la República, Montevideo 11300, Uruguay)

  • Marcos Pastorini

    (Instituto de Computación (InCo), Facultad de Ingeniería, Universidad de la República, Montevideo 11300, Uruguay)

  • Lorena Etcheverry

    (Instituto de Computación (InCo), Facultad de Ingeniería, Universidad de la República, Montevideo 11300, Uruguay)

  • Christian Chreties

    (Instituto de Mecánica de los Fluidos e Ingeniería Ambiental (IMFIA), Facultad de Ingeniería, Universidad de la República, Montevideo 11300, Uruguay)

  • Mónica Fossati

    (Instituto de Mecánica de los Fluidos e Ingeniería Ambiental (IMFIA), Facultad de Ingeniería, Universidad de la República, Montevideo 11300, Uruguay)

  • Alberto Castro

    (Instituto de Computación (InCo), Facultad de Ingeniería, Universidad de la República, Montevideo 11300, Uruguay)

  • Angela Gorgoglione

    (Instituto de Mecánica de los Fluidos e Ingeniería Ambiental (IMFIA), Facultad de Ingeniería, Universidad de la República, Montevideo 11300, Uruguay)

Abstract

Monitoring surface-water quality, followed by water-quality modeling and analysis, is essential for generating effective strategies in surface-water-resource management. However, worldwide, and particularly in developing countries, water-quality studies are limited by the lack of complete and reliable datasets of surface-water-quality variables. In this context, several statistical and machine-learning models were assessed for imputing water-quality data at six monitoring stations located in the Santa Lucía Chico River (Uruguay), a mixed lotic and lentic river system. The main challenges of this study are the high percentage of missing data (between 50% and 70%) and the high temporal and spatial variability of the water-quality variables. The competing algorithms implement univariate and multivariate imputation methods: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR). According to the results, more than 76% of the imputation outcomes are considered “satisfactory” (Nash-Sutcliffe efficiency, NSE > 0.45). Imputation performed better at the monitoring stations located inside the reservoir than at those positioned along the mainstream. IDW gave the best imputation results, followed by RFR, HR, and SVR. The approach proposed in this study is expected to help water-resource researchers and managers augment water-quality datasets and overcome the missing-data issue, which should enable more water-quality studies in the future.
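To make the two quantitative ideas in the abstract concrete, the following is a minimal Python sketch of inverse distance weighting and of the NSE score, read here as the Nash-Sutcliffe efficiency in its standard form. It is not the authors' implementation: the function names `idw_impute` and `nse`, the power parameter of 2, and the use of station-to-station distances as weights are all assumptions made for illustration.

```python
from math import isnan, nan


def idw_impute(values, distances, power=2.0):
    """Estimate a missing reading at a target station (hypothetical helper).

    values    -- readings from neighbouring stations on the same date
                 (float('nan') where the neighbour is also missing)
    distances -- distance from the target station to each neighbour
    power     -- exponent of the inverse-distance weight (assumed to be 2)
    """
    numerator = 0.0
    denominator = 0.0
    for value, distance in zip(values, distances):
        if isnan(value):
            continue  # skip neighbours that have no reading on this date
        if distance == 0:
            return value  # co-located station: use its reading directly
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    # If every neighbour is missing, the value cannot be imputed.
    return numerator / denominator if denominator > 0 else nan


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squares about the observed mean."""
    pairs = [(o, s) for o, s in zip(observed, simulated)
             if not (isnan(o) or isnan(s))]
    if not pairs:
        raise ValueError("no overlapping observations to score")
    mean_observed = sum(o for o, _ in pairs) / len(pairs)
    sse = sum((o - s) ** 2 for o, s in pairs)
    sst = sum((o - mean_observed) ** 2 for o, _ in pairs)
    if sst == 0:
        raise ValueError("observations are constant; NSE is undefined")
    return 1.0 - sse / sst
```

Under this sketch, an imputation run would count as "satisfactory" in the abstract's sense when `nse(held_out_observations, imputed_values) > 0.45`, where the held-out observations are known values that were masked before imputation. An NSE of 1 means a perfect match, and an NSE of 0 means the imputation does no better than the observed mean.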

Suggested Citation

  • Rafael Rodríguez & Marcos Pastorini & Lorena Etcheverry & Christian Chreties & Mónica Fossati & Alberto Castro & Angela Gorgoglione, 2021. "Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach," Sustainability, MDPI, vol. 13(11), pages 1-17, June.
  • Handle: RePEc:gam:jsusta:v:13:y:2021:i:11:p:6318-:d:567774

    Download full text from publisher

    File URL: https://www.mdpi.com/2071-1050/13/11/6318/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2071-1050/13/11/6318/
    Download Restriction: no

    Citations

    Cited by:

    1. Daeryong Park & Myoung-Jin Um & Momcilo Markus & Kichul Jung & Laura Keefer & Siddhartha Verma, 2021. "Insights from an Evaluation of Nitrate Load Estimation Methods in the Midwestern United States," Sustainability, MDPI, vol. 13(13), pages 1-23, July.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Johannes Berens & Kerstin Schneider & Simon Görtz & Simon Oster & Julian Burghoff, 2018. "Early Detection of Students at Risk – Predicting Student Dropouts Using Administrative Student Data and Machine Learning Methods," CESifo Working Paper Series 7259, CESifo.
    2. Arif Jamal Siddiqui & Sadaf Jahan & Maqsood Ahmed Siddiqui & Andleeb Khan & Mohammed Merae Alshahrani & Riadh Badraoui & Mohd Adnan, 2023. "Targeting Monoamine Oxidase B for the Treatment of Alzheimer’s and Parkinson’s Diseases Using Novel Inhibitors Identified Using an Integrated Approach of Machine Learning and Computer-Aided Drug Desig," Mathematics, MDPI, vol. 11(6), pages 1-17, March.
    3. Chetan Badgujar & Sanjoy Das & Dania Martinez Figueroa & Daniel Flippo, 2023. "Application of Computational Intelligence Methods in Agricultural Soil–Machine Interaction: A Review," Agriculture, MDPI, vol. 13(2), pages 1-39, January.
    4. Hui Zou & Zhihong Zou & Xiaojing Wang, 2015. "An Enhanced K-Means Algorithm for Water Quality Analysis of The Haihe River in China," IJERPH, MDPI, vol. 12(11), pages 1-14, November.
    5. Odile Carisse & Mamadou Lamine Fall, 2021. "Decision Trees to Forecast Risks of Strawberry Powdery Mildew Caused by Podosphaera aphanis," Agriculture, MDPI, vol. 11(1), pages 1-16, January.
    6. Orkida Ilollari & Petraq Papajorgji & Adrian Civici & Howard Moskowitz, 2022. "Measuring Client’s Feelings on Mobile Banking," Review of Applied Socio-Economic Research, Pro Global Science Association, vol. 23(1), pages 28-39, June.
    7. Junlong Zhang & Youbin He & Yuan Zhang & Weifeng Li & Junjie Zhang, 2022. "Well-Logging-Based Lithology Classification Using Machine Learning Methods for High-Quality Reservoir Identification: A Case Study of Baikouquan Formation in Mahu Area of Junggar Basin, NW China," Energies, MDPI, vol. 15(10), pages 1-15, May.
    8. Muhammad Islam & Muhammad Usman & Azhar Mahmood & Aaqif Afzaal Abbasi & Oh-Young Song, 2020. "Predictive analytics framework for accurate estimation of child mortality rates for Internet of Things enabled smart healthcare systems," International Journal of Distributed Sensor Networks, vol. 16(5), pages 15501477209, May.
    9. Danijel Jevtic & Romain Deleze & Joerg Osterrieder, 2022. "AI for trading strategies," Papers 2208.07168, arXiv.org.
    10. Bohumil Kába, 2011. "Exploratory analysis of selected indicators of the Czech Republic regional labour markets," Acta Universitatis Agriculturae et Silviculturae Mendelianae Brunensis, Mendel University Press, vol. 59(4), pages 123-128.
    11. Yotsaphat Kittichotsatsawat & Varattaya Jangkrajarng & Korrakot Yaibuathet Tippayawong, 2021. "Enhancing Coffee Supply Chain towards Sustainable Growth with Big Data and Modern Agricultural Technologies," Sustainability, MDPI, vol. 13(8), pages 1-20, April.
    12. Peláez-Rodríguez, C. & Pérez-Aracil, J. & Fister, D. & Prieto-Godino, L. & Deo, R.C. & Salcedo-Sanz, S., 2022. "A hierarchical classification/regression algorithm for improving extreme wind speed events prediction," Renewable Energy, Elsevier, vol. 201(P2), pages 157-178.
    13. Zonlehoua Coulibali & Athyna Nancy Cambouris & Serge-Étienne Parent, 2020. "Site-specific machine learning predictive fertilization models for potato crops in Eastern Canada," PLOS ONE, Public Library of Science, vol. 15(8), pages 1-32, August.
    14. Antiopi Panteli & Basilis Boutsinas & Ioannis Giannikos, 2021. "On solving the multiple p-median problem based on biclustering," Operational Research, Springer, vol. 21(1), pages 775-799, March.
    15. Lynn Wu & Lorin Hitt & Bowen Lou, 2020. "Data Analytics, Innovation, and Firm Productivity," Management Science, INFORMS, vol. 66(5), pages 2017-2039, May.
