Printed from https://ideas.repec.org/a/eee/ecomod/v483y2023ics030438002300145x.html

Correcting for the effects of class imbalance improves the performance of machine-learning based species distribution models

Authors
  • Benkendorf, Donald J.
  • Schwartz, Samuel D.
  • Cutler, D. Richard
  • Hawkins, Charles P.

Abstract

Numerous methods have been developed to combat the unwanted effects of imbalanced training data on the performance of machine-learning based predictive models. These methods attempt to balance model sensitivity and specificity. However, the effects of specific imbalance-correction methods on the performance of different machine-learning algorithms are not well understood for ecological data. In this study, we used four machine-learning algorithms (random forest, artificial neural network, gradient boosting, support vector machine) and five imbalance-correction methods (base algorithm = no correction, cutoff, up-sampling, down-sampling, weighting) to produce species distribution models for 15 freshwater macroinvertebrate genera that varied from 2.5 to 29.0% in prevalence. All imbalance-correction methods substantially improved average model performance (true skill statistic) over the base machine-learning algorithms, except when up-sampling was applied to random forest models. Choice of machine-learning algorithm had little effect on model performance, although gradient boosting performed better than other algorithms on the most imbalanced datasets. Our results suggest that the performance of species distribution models built with presence/absence data can generally be improved by correcting for imbalanced data.
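The true skill statistic and the four correction strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy labels and scores, and the candidate cutoffs below are all hypothetical, and the weighting shown is the common "balanced" heuristic (weight = n_samples / (n_classes * class_count)), which may differ from the scheme used in the study. TSS is computed as sensitivity + specificity - 1.

```python
import random


def true_skill_statistic(y_true, y_pred):
    """Return TSS = sensitivity + specificity - 1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # assumes at least one presence
    specificity = tn / (tn + fp)  # assumes at least one absence
    return sensitivity + specificity - 1


def best_cutoff(y_true, scores, candidates):
    """Cutoff correction: pick the candidate threshold with the highest TSS."""
    def tss_at(cutoff):
        return true_skill_statistic(y_true, [int(s >= cutoff) for s in scores])
    return max(candidates, key=tss_at)


def split_by_class_size(y):
    """Return (minority_indices, majority_indices) for 0/1 labels."""
    pos = [i for i, v in enumerate(y) if v == 1]
    neg = [i for i, v in enumerate(y) if v == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)


def down_sample(y, rng):
    """Down-sampling: keep the minority class, subsample the majority to match."""
    minority, majority = split_by_class_size(y)
    return sorted(minority + rng.sample(majority, len(minority)))


def up_sample(y, rng):
    """Up-sampling: resample the minority class with replacement to match."""
    minority, majority = split_by_class_size(y)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return sorted(minority + extra + majority)


def balanced_class_weights(y):
    """Weighting: give each class a weight inversely proportional to its size."""
    counts = {c: y.count(c) for c in set(y)}
    return {c: len(y) / (len(counts) * k) for c, k in counts.items()}


# Hypothetical toy data: 2 presences in 10 sites (20% prevalence).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.45, 0.30, 0.35, 0.10, 0.05, 0.20, 0.15, 0.10, 0.05, 0.25]

# Base algorithm: the default 0.5 cutoff predicts every site as an absence.
base_pred = [int(s >= 0.5) for s in scores]
print(true_skill_statistic(y_true, base_pred))  # 0.0

# Cutoff correction: a lower threshold recovers both presences.
cutoff = best_cutoff(y_true, scores, [0.1, 0.2, 0.3, 0.4, 0.5])
print(cutoff)  # 0.3
print(true_skill_statistic(y_true, [int(s >= cutoff) for s in scores]))  # 0.875

rng = random.Random(0)
print(len(down_sample(y_true, rng)))   # 4 row indices, 2 per class
print(len(up_sample(y_true, rng)))     # 16 row indices, 8 per class
print(balanced_class_weights(y_true))  # presence weight 2.5, absence weight 0.625
```

In this toy case the uncorrected model has high accuracy (8 of 10 sites correct) but a TSS of 0, because it never predicts a presence. That is the imbalance problem the abstract describes. The sampling helpers return row indices that would be used to build a resampled training set before fitting any of the four algorithms.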

Suggested Citation

  • Benkendorf, Donald J. & Schwartz, Samuel D. & Cutler, D. Richard & Hawkins, Charles P., 2023. "Correcting for the effects of class imbalance improves the performance of machine-learning based species distribution models," Ecological Modelling, Elsevier, vol. 483(C).
  • Handle: RePEc:eee:ecomod:v:483:y:2023:i:c:s030438002300145x
    DOI: 10.1016/j.ecolmodel.2023.110414

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S030438002300145X
    Download Restriction: Full text for ScienceDirect subscribers only



