
Machine learning algorithm validation with a limited sample size

Author

Listed:
  • Andrius Vabalas
  • Emma Gowen
  • Ellen Poliakoff
  • Alexander J Casson

Abstract

Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to distinguish autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. We therefore investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes to bias considerably more than parameter tuning. In addition, we explored how data dimensionality, the size of the hyper-parameter space and the number of CV folds contribute to bias, and compared validation methods on discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on which validation method was used.
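The effect of selecting features on pooled training and testing data can be illustrated with a small simulation. The sketch below is not the authors' code or experimental setup: it assumes pure-noise Gaussian features, a simple mean-difference feature ranking and a nearest-centroid classifier, all chosen here only to keep the example dependency-free. Every function name and parameter value (`compare`, `cv_accuracy`, 40 samples, 500 features, 10 selected features, 5 folds) is hypothetical. Because the features carry no information about the labels, an unbiased validation scheme should report accuracy near chance (0.5).

```python
import random

def make_noise_data(n_samples, n_features, rng):
    """Pure-noise data: features are independent of the balanced labels."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

def class_mean(X, y, rows, j, label):
    """Mean of feature j over the rows that belong to the given class."""
    vals = [X[i][j] for i in rows if y[i] == label]
    return sum(vals) / len(vals)

def select_features(X, y, rows, k):
    """Return the k features with the largest class-mean difference on `rows`."""
    scores = [(abs(class_mean(X, y, rows, j, 1) - class_mean(X, y, rows, j, 0)), j)
              for j in range(len(X[0]))]
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def count_correct(X, y, train, test, feats):
    """Fit a nearest-centroid classifier on `train`; count correct `test` predictions."""
    centroids = {c: [class_mean(X, y, train, j, c) for j in feats] for c in (0, 1)}
    correct = 0
    for i in test:
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, centroids[c]))
                for c in (0, 1)}
        pred = 0 if dist[0] <= dist[1] else 1
        correct += int(pred == y[i])
    return correct

def cv_accuracy(X, y, n_folds, k, pooled_selection):
    """K-fold CV accuracy, selecting features on all data (leaky) or per fold."""
    n = len(X)
    # Stratified folds: even indices are class 0, odd indices are class 1.
    evens, odds = list(range(0, n, 2)), list(range(1, n, 2))
    folds = [evens[f::n_folds] + odds[f::n_folds] for f in range(n_folds)]
    pooled_feats = select_features(X, y, range(n), k) if pooled_selection else None
    correct = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        # The leaky variant reuses features chosen with the test rows included.
        feats = pooled_feats if pooled_selection else select_features(X, y, train, k)
        correct += count_correct(X, y, train, test, feats)
    return correct / n

def compare(n_repeats=10, n_samples=40, n_features=500, k=10, n_folds=5, seed=0):
    """Average both accuracy estimates over several independent noise datasets."""
    rng = random.Random(seed)
    leaky, clean = [], []
    for _ in range(n_repeats):
        X, y = make_noise_data(n_samples, n_features, rng)
        leaky.append(cv_accuracy(X, y, n_folds, k, pooled_selection=True))
        clean.append(cv_accuracy(X, y, n_folds, k, pooled_selection=False))
    return sum(leaky) / n_repeats, sum(clean) / n_repeats

leaky_acc, clean_acc = compare()
print(f"selection on pooled data:   {leaky_acc:.2f}")
print(f"selection inside each fold: {clean_acc:.2f}")
```

Under these assumptions, the pooled-selection estimate should come out well above chance even though the data are noise, while the per-fold estimate should stay close to 0.5. The only difference between the two runs is whether the held-out rows were visible when the features were ranked, which is the leakage the abstract describes.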

Suggested Citation

  • Andrius Vabalas & Emma Gowen & Ellen Poliakoff & Alexander J Casson, 2019. "Machine learning algorithm validation with a limited sample size," PLOS ONE, Public Library of Science, vol. 14(11), pages 1-20, November.
  • Handle: RePEc:plo:pone00:0224365
    DOI: 10.1371/journal.pone.0224365

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0224365
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0224365&type=printable
    Download Restriction: no


    Citations

    Cited by:

    1. Qiaoyang Li & Guiming Chen, 2021. "Recognition of industrial machine parts based on transfer learning with convolutional neural network," PLOS ONE, Public Library of Science, vol. 16(1), pages 1-21, January.
    2. Bhattacharjee, Biplab & Kumar, Rajiv & Senthilkumar, Arunachalam, 2022. "Unidirectional and bidirectional LSTM models for edge weight predictions in dynamic cross-market equity networks," International Review of Financial Analysis, Elsevier, vol. 84(C).
    3. Ephrem Habyarimana & Faheem S Baloch, 2021. "Machine learning models based on remote and proximal sensing as potential methods for in-season biomass yields prediction in commercial sorghum fields," PLOS ONE, Public Library of Science, vol. 16(3), pages 1-23, March.
    4. Giannakeas, Ilias N. & Mazaheri, Fatemeh & Bacarreza, Omar & Khodaei, Zahra Sharif & Aliabadi, Ferri M.H., 2023. "Probabilistic residual strength assessment of smart composite aircraft panels using guided waves," Reliability Engineering and System Safety, Elsevier, vol. 237(C).
    5. Min Yang & Baiyu Zhang & Yifu Chen & Xiaying Xin & Kenneth Lee & Bing Chen, 2021. "Impact of Microplastics on Oil Dispersion Efficiency in the Marine Environment," Sustainability, MDPI, vol. 13(24), pages 1-13, December.
    6. Twumasi, Clement & Twumasi, Juliet, 2022. "Machine learning algorithms for forecasting and backcasting blood demand data with missing values and outliers: A study of Tema General Hospital of Ghana," International Journal of Forecasting, Elsevier, vol. 38(3), pages 1258-1277.
    7. Reza Rezaee & Jamiu Ekundayo, 2022. "Permeability Prediction Using Machine Learning Methods for the CO2 Injectivity of the Precipice Sandstone in Surat Basin, Australia," Energies, MDPI, vol. 15(6), pages 1-15, March.
    8. Nica-Avram, Georgiana & Harvey, John & Smith, Gavin & Smith, Andrew & Goulding, James, 2021. "Identifying food insecurity in food sharing networks via machine learning," Journal of Business Research, Elsevier, vol. 131(C), pages 469-484.
    9. Leandro C. Hermida & E. Michael Gertz & Eytan Ruppin, 2022. "Predicting cancer prognosis and drug response from the tumor microbiome," Nature Communications, Nature, vol. 13(1), pages 1-15, December.
    10. Jonathan C. M. Wan & Dennis Stephens & Lingqi Luo & James R. White & Caitlin M. Stewart & Benoît Rousseau & Dana W. Y. Tsui & Luis A. Diaz, 2022. "Genome-wide mutational signatures in low-coverage whole genome sequencing of cell-free DNA," Nature Communications, Nature, vol. 13(1), pages 1-12, December.
    11. Michael D. Wang & Jie Lou & Dong Zhang & C. Simon Fan, 2022. "Measuring political and economic uncertainty: a supervised computational linguistic approach," SN Business & Economics, Springer, vol. 2(5), pages 1-17, May.
    12. Kristof Lommers & Ouns El Harzli & Jack Kim, 2021. "Confronting Machine Learning With Financial Research," Papers 2103.00366, arXiv.org, revised Mar 2021.
    13. Qianru Qi & Rongjun Cheng & Hongxia Ge, 2022. "Short-Term Travel Demand Prediction of Online Ride-Hailing Based on Multi-Factor GRU Model," Sustainability, MDPI, vol. 14(7), pages 1-15, March.
    14. Francisco Gatica-Neira & Mario Ramos-Maldonado, 2022. "Limits to the Productivity in Biobased Territorial SMEs," SAGE Open, vol. 12(2), pages 21582440221, May.
    15. Carlo Dindorf & Eva Bartaguiz & Freya Gassmann & Michael Fröhlich, 2022. "Conceptual Structure and Current Trends in Artificial Intelligence, Machine Learning, and Deep Learning Research in Sports: A Bibliometric Review," IJERPH, MDPI, vol. 20(1), pages 1-23, December.
    16. Zhou, Huanyu & Qiu, Yingning & Feng, Yanhui & Liu, Jing, 2022. "Power prediction of wind turbine in the wake using hybrid physical process and machine learning models," Renewable Energy, Elsevier, vol. 198(C), pages 568-586.
