
A refined approach for evaluating small datasets via binary classification using machine learning

Authors
  • Steffen Steinert
  • Verena Ruf
  • David Dzsotjan
  • Nicolas Großmann
  • Albrecht Schmidt
  • Jochen Kuhn
  • Stefan Küchemann

Abstract

Classical statistical analysis of data can be complemented or replaced with data analysis based on machine learning. However, in certain disciplines, such as education research, studies are frequently limited to small datasets, which raises several questions regarding biases and coincidentally positive results. In this study, we present a refined approach for evaluating the performance of a binary classification based on machine learning for small datasets. The approach includes a non-parametric permutation test as a method to quantify the probability of the results generalising to new data. Furthermore, we found that a repeated nested cross-validation is almost free of biases and yields reliable results that are only slightly dependent on chance. Considering the advantages of several evaluation metrics, we suggest a combination of more than one metric to train and evaluate machine learning classifiers. In the specific case that both classes are equally important, the Matthews correlation coefficient exhibits the lowest bias and chance for coincidentally good results. The results indicate that it is essential to avoid several biases when analysing small datasets using machine learning.
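To make the two central ideas of the abstract concrete, the sketch below computes the Matthews correlation coefficient (MCC) and uses it in a label-permutation test. It is only an illustration under stated assumptions, not the authors' pipeline: the nearest-class-mean classifier, the function names (`mcc`, `cv_mcc`, `permutation_p_value`), and the toy data are all hypothetical, and a single k-fold cross-validation stands in for the repeated nested cross-validation the paper recommends.

```python
import math
import random


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: MCC is set to 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cv_mcc(x, y, k, rng):
    """Pooled out-of-fold MCC of a toy nearest-class-mean classifier.

    Assumption: one feature per sample and plain k-fold CV (no inner
    tuning loop, no repetition), purely to keep the sketch short.
    """
    idx = list(range(len(x)))
    rng.shuffle(idx)
    preds = [0] * len(x)
    for fold in range(k):
        test = idx[fold::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        # "Training": one mean per class present in the training fold.
        means = {}
        for label in (0, 1):
            vals = [x[i] for i in train if y[i] == label]
            if vals:
                means[label] = sum(vals) / len(vals)
        # Prediction: the class whose training mean is closest.
        for i in test:
            preds[i] = min(means, key=lambda c: abs(x[i] - means[c]))
    return mcc(y, preds)


def permutation_p_value(x, y, k=5, n_perm=200, seed=0):
    """Share of label permutations scoring at least as well as the real labels."""
    rng = random.Random(seed)
    observed = cv_mcc(x, y, k, rng)
    at_least_as_good = 0
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break any real link between x and y
        if cv_mcc(x, y_perm, k, rng) >= observed:
            at_least_as_good += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (at_least_as_good + 1) / (n_perm + 1)


# Hypothetical toy data: one feature, two well-separated classes.
x_toy = [0.1, 0.3, 0.2, 0.4, 0.35, 0.15, 0.9, 1.1, 1.0, 0.8, 1.2, 0.95]
y_toy = [0] * 6 + [1] * 6
score, p_value = permutation_p_value(x_toy, y_toy, k=3)
print(f"cross-validated MCC = {score:.2f}, permutation p-value = {p_value:.3f}")
```

A small p-value here means that few random relabellings reach the cross-validated score obtained with the true labels, which is the sense in which a permutation test guards against coincidentally good results on small datasets. Because the classifier is re-fitted for every permutation, the null distribution reflects the whole train-and-evaluate procedure rather than a single set of fixed predictions.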

Suggested Citation

  • Steffen Steinert & Verena Ruf & David Dzsotjan & Nicolas Großmann & Albrecht Schmidt & Jochen Kuhn & Stefan Küchemann, 2024. "A refined approach for evaluating small datasets via binary classification using machine learning," PLOS ONE, Public Library of Science, vol. 19(5), pages 1-21, May.
  • Handle: RePEc:plo:pone00:0301276
    DOI: 10.1371/journal.pone.0301276

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0301276
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0301276&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Andrius Vabalas & Emma Gowen & Ellen Poliakoff & Alexander J Casson, 2019. "Machine learning algorithm validation with a limited sample size," PLOS ONE, Public Library of Science, vol. 14(11), pages 1-20, November.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Li-Dunn Chen & Michael A Caprio & Devin M Chen & Andrew J Kouba & Carrie K Kouba, 2024. "Enhancing predictive performance for spectroscopic studies in wildlife science through a multi-model approach: A case study for species classification of live amphibians," PLOS Computational Biology, Public Library of Science, vol. 20(2), pages 1-24, February.
    2. Ephrem Habyarimana & Faheem S Baloch, 2021. "Machine learning models based on remote and proximal sensing as potential methods for in-season biomass yields prediction in commercial sorghum fields," PLOS ONE, Public Library of Science, vol. 16(3), pages 1-23, March.
    3. Leandro C. Hermida & E. Michael Gertz & Eytan Ruppin, 2022. "Predicting cancer prognosis and drug response from the tumor microbiome," Nature Communications, Nature, vol. 13(1), pages 1-15, December.
    4. Jonathan C. M. Wan & Dennis Stephens & Lingqi Luo & James R. White & Caitlin M. Stewart & Benoît Rousseau & Dana W. Y. Tsui & Luis A. Diaz, 2022. "Genome-wide mutational signatures in low-coverage whole genome sequencing of cell-free DNA," Nature Communications, Nature, vol. 13(1), pages 1-12, December.
    5. Jacob Beck, 2023. "Quality aspects of annotated data," AStA Wirtschafts- und Sozialstatistisches Archiv, Springer;Deutsche Statistische Gesellschaft - German Statistical Society, vol. 17(3), pages 331-353, December.
    6. Sinha, Shruti & Sankar Rao, Chinta & Kumar, Abhishankar & Venkata Surya, Dadi & Basak, Tanmay, 2024. "Exploring and understanding the microwave-assisted pyrolysis of waste lignocellulose biomass using gradient boosting regression machine learning model," Renewable Energy, Elsevier, vol. 231(C).
    7. Ciaran Michael Kelly & Russell Lewis McLaughlin, 2024. "Comparison of machine learning methods for genomic prediction of selected Arabidopsis thaliana traits," PLOS ONE, Public Library of Science, vol. 19(8), pages 1-13, August.
    8. Reza Rezaee & Jamiu Ekundayo, 2022. "Permeability Prediction Using Machine Learning Methods for the CO2 Injectivity of the Precipice Sandstone in Surat Basin, Australia," Energies, MDPI, vol. 15(6), pages 1-15, March.
    9. Nica-Avram, Georgiana & Harvey, John & Smith, Gavin & Smith, Andrew & Goulding, James, 2021. "Identifying food insecurity in food sharing networks via machine learning," Journal of Business Research, Elsevier, vol. 131(C), pages 469-484.
    10. Kristof Lommers & Ouns El Harzli & Jack Kim, 2021. "Confronting Machine Learning With Financial Research," Papers 2103.00366, arXiv.org, revised Mar 2021.
    11. Carlo Dindorf & Eva Bartaguiz & Freya Gassmann & Michael Fröhlich, 2022. "Conceptual Structure and Current Trends in Artificial Intelligence, Machine Learning, and Deep Learning Research in Sports: A Bibliometric Review," IJERPH, MDPI, vol. 20(1), pages 1-23, December.
    12. Zhou, Huanyu & Qiu, Yingning & Feng, Yanhui & Liu, Jing, 2022. "Power prediction of wind turbine in the wake using hybrid physical process and machine learning models," Renewable Energy, Elsevier, vol. 198(C), pages 568-586.
    13. Bhattacharjee, Biplab & Kumar, Rajiv & Senthilkumar, Arunachalam, 2022. "Unidirectional and bidirectional LSTM models for edge weight predictions in dynamic cross-market equity networks," International Review of Financial Analysis, Elsevier, vol. 84(C).
    14. Muhammad Tanveer Islam & Sartaj Aziz Turja & Md Tawfiqul Islam & Md Mominur Rahman & Ahsan Habib, 2025. "Forecasting Tetouan energy demand employing shift approach in machine-learning: complementing econometric insights," Quality & Quantity: International Journal of Methodology, Springer, vol. 59(2), pages 1833-1860, April.
    15. Mahdi Goldani & Soraya Asadi Tirvan, 2024. "Sensitivity Assessing to Data Volume for forecasting: introducing similarity methods as a suitable one in Feature selection methods," Papers 2406.04390, arXiv.org.
    16. Alexis H. Villacis & Syed Badruddoza & Ashok K. Mishra, 2024. "A machine learning‐based exploration of resilience and food security," Applied Economic Perspectives and Policy, John Wiley & Sons, vol. 46(4), pages 1479-1505, December.
    17. Xiaofeng Xu & Zhaoyuan Chen & Shixiang Chen, 2023. "Enhancing economic competitiveness analysis through machine learning: Exploring complex urban features," PLOS ONE, Public Library of Science, vol. 18(11), pages 1-27, November.
    18. Qianru Qi & Rongjun Cheng & Hongxia Ge, 2022. "Short-Term Travel Demand Prediction of Online Ride-Hailing Based on Multi-Factor GRU Model," Sustainability, MDPI, vol. 14(7), pages 1-15, March.
    19. Shravankumar Shivappa Masalvad & Chidanand Patil & Akkaram Pravalika & Basavaraj Katageri & Purandara Bekal & Prashant Patil & Nagraj Hegde & Uttam Kumar Sahoo & Praveen Kumar Sakare, 2024. "Application of geospatial technology for the land use/land cover change assessment and future change predictions using CA Markov chain model," Environment, Development and Sustainability: A Multidisciplinary Approach to the Theory and Practice of Sustainable Development, Springer, vol. 26(10), pages 24817-24842, October.
    20. Qiaoyang Li & Guiming Chen, 2021. "Recognition of industrial machine parts based on transfer learning with convolutional neural network," PLOS ONE, Public Library of Science, vol. 16(1), pages 1-21, January.
