Printed from https://ideas.repec.org/a/eee/ecomod/v483y2023ics030438002300145x.html

Correcting for the effects of class imbalance improves the performance of machine-learning based species distribution models

Author

Listed:
  • Benkendorf, Donald J.
  • Schwartz, Samuel D.
  • Cutler, D. Richard
  • Hawkins, Charles P.

Abstract

Numerous methods have been developed to combat the unwanted effects of imbalanced training data on the performance of machine-learning based predictive models. These methods attempt to balance model sensitivity and specificity. However, the effects of specific imbalance-correction methods on the performance of different machine-learning algorithms are not well understood for ecological data. In this study, we used four machine-learning algorithms (random forest, artificial neural network, gradient boosting, support vector machine) and five imbalance-correction methods (base algorithm = no correction, cutoff, up-sampling, down-sampling, weighting) to produce species distribution models for 15 freshwater macroinvertebrate genera that varied from 2.5 to 29.0% in prevalence. All imbalance-correction methods substantially improved average model performance (true skill statistic) over the base machine-learning algorithms, except when up-sampling was applied to random forest models. Choice of machine-learning algorithm had little effect on model performance, although gradient boosting performed better than other algorithms on the most imbalanced datasets. Our results suggest that the performance of species distribution models built with presence/absence data can generally be improved by correcting for imbalanced data.
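
The evaluation metric and the four correction methods named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy labels and probabilities (10% prevalence), and the inverse-frequency weighting formula are all assumptions made for illustration. The sketch uses only the Python standard library and assumes 0/1 presence/absence labels with both classes present.

```python
import random


def true_skill_statistic(y_true, y_pred):
    """TSS = sensitivity + specificity - 1 for 0/1 labels (both classes must occur)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn) + tn / (tn + fp) - 1.0


def apply_cutoff(probs, cutoff):
    """Cutoff correction: classify as present when probability >= cutoff."""
    return [1 if p >= cutoff else 0 for p in probs]


def _split_by_class(labels):
    """Return (minority_indices, majority_indices)."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)


def down_sample(labels, rng):
    """Indices to keep after randomly dropping majority rows to match the minority."""
    minority, majority = _split_by_class(labels)
    return sorted(minority + rng.sample(majority, len(minority)))


def up_sample(labels, rng):
    """Indices to keep after resampling minority rows (with replacement) to match the majority."""
    minority, majority = _split_by_class(labels)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return sorted(minority + majority + extra)


def balanced_class_weights(labels):
    """Assumed weighting scheme: n / (n_classes * class_count), so rare classes weigh more."""
    counts = {c: labels.count(c) for c in set(labels)}
    return {c: len(labels) / (len(counts) * k) for c, k in counts.items()}


# Hypothetical toy data: 2 presences, 18 absences, with made-up model probabilities.
labels = [1] * 2 + [0] * 18
probs = [0.30, 0.45] + [0.10] * 16 + [0.35] * 2

for cutoff in (0.5, 0.3):
    tss = true_skill_statistic(labels, apply_cutoff(probs, cutoff))
    print(f"cutoff={cutoff}: TSS={tss:.3f}")

rng = random.Random(0)
print("down-sampled size:", len(down_sample(labels, rng)))
print("up-sampled size:", len(up_sample(labels, rng)))
print("class weights:", balanced_class_weights(labels))
```

On this toy data the default 0.5 cutoff predicts every site as an absence (TSS of 0), while lowering the cutoff to 0.3 recovers both presences at the cost of two false positives (TSS of about 0.89). In practice a cutoff should be chosen on data held out from the final evaluation, and the resampled indices or class weights would be passed to whichever learning algorithm is being trained.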

Suggested Citation

  • Benkendorf, Donald J. & Schwartz, Samuel D. & Cutler, D. Richard & Hawkins, Charles P., 2023. "Correcting for the effects of class imbalance improves the performance of machine-learning based species distribution models," Ecological Modelling, Elsevier, vol. 483(C).
  • Handle: RePEc:eee:ecomod:v:483:y:2023:i:c:s030438002300145x
    DOI: 10.1016/j.ecolmodel.2023.110414

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S030438002300145X
    Download Restriction: Full text for ScienceDirect subscribers only

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Jyotirmoy Sarkar, 2018. "Will P-Value Triumph over Abuses and Attacks?," Biostatistics and Biometrics Open Access Journal, Juniper Publishers Inc., vol. 7(4), pages 66-71, July.
    2. Václavík, Tomáš & Meentemeyer, Ross K., 2009. "Invasive species distribution modeling (iSDM): Are absence data and dispersal constraints needed to predict actual distributions?," Ecological Modelling, Elsevier, vol. 220(23), pages 3248-3258.
    3. Wu, Jiang & Ou, Guiyan & Liu, Xiaohui & Dong, Ke, 2022. "How does academic education background affect top researchers’ performance? Evidence from the field of artificial intelligence," Journal of Informetrics, Elsevier, vol. 16(2).
    4. Segurado, Pedro & Gutiérrez-Cánovas, Cayetano & Ferreira, Teresa & Branco, Paulo, 2022. "Stressor gradient coverage affects interaction identification," Ecological Modelling, Elsevier, vol. 472(C).
    5. Wieland, Ralf & Kuhls, Katrin & Lentz, Hartmut H.K. & Conraths, Franz & Kampen, Helge & Werner, Doreen, 2021. "Combined climate and regional mosquito habitat model based on machine learning," Ecological Modelling, Elsevier, vol. 452(C).
    6. Gergely Ganics & Atsushi Inoue & Barbara Rossi, 2021. "Confidence Intervals for Bias and Size Distortion in IV and Local Projections-IV Models," Journal of Business & Economic Statistics, Taylor & Francis Journals, vol. 39(1), pages 307-324, January.
    7. Oliver Schilke & Sheen S. Levine & Olenka Kacperczyk & Lynne G. Zucker, 2019. "Call for Papers-Special Issue on Experiments in Organizational Theory," Organization Science, INFORMS, vol. 30(1), pages 232-234, February.
    8. Lopez, Belen & Rangel, Celia & Fernández, Manuel, 2022. "The impact of corporate social responsibility strategy on the management and governance axis for sustainable growth," Journal of Business Research, Elsevier, vol. 150(C), pages 690-698.
    9. Michaelides, Michael, 2021. "Large sample size bias in empirical finance," Finance Research Letters, Elsevier, vol. 41(C).
    10. Kelter, Riko, 2022. "Power analysis and type I and type II error rates of Bayesian nonparametric two-sample tests for location-shifts based on the Bayes factor under Cauchy priors," Computational Statistics & Data Analysis, Elsevier, vol. 165(C).
    11. Xian Jin Xie, 2019. "Research Reproducibility and p-value Threshold," Biomedical Journal of Scientific & Technical Research, Biomedical Research Network+, LLC, vol. 22(5), pages 16934-16936, November.
    12. Chatelain, Jean-Bernard & Ralf, Kirsten, 2021. "Inference on time-invariant variables using panel data: A pretest estimator," Economic Modelling, Elsevier, vol. 97(C), pages 157-166.
    13. Karmakar, Bisheswar & Pal, Sucharita & Gopikrishna, Konga & Tiwari, Onkar Nath & Halder, Gopinath, 2022. "Injection of superheated C1 and C3 alcohols in non-edible Pongamia pinnata oil for semi-continuous uncatalyzed biodiesel synthesis," Renewable Energy, Elsevier, vol. 185(C), pages 850-861.
    14. Maurizio Canavari & Andreas C. Drichoutis & Jayson L. Lusk & Rodolfo M. Nayga, Jr., 2018. "How to run an experimental auction: A review of recent advances," Working Papers 2018-5, Agricultural University of Athens, Department Of Agricultural Economics.
    15. Ben Moews & J. Michael Herrmann & Gbenga Ibikunle, 2018. "Lagged correlation-based deep learning for directional trend change prediction in financial time series," Papers 1811.11287, arXiv.org, revised Nov 2018.
    16. David Spiegelhalter, 2017. "Trust in numbers," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 180(4), pages 948-965, October.
    17. Eszter Czibor & David Jimenez‐Gomez & John A. List, 2019. "The Dozen Things Experimental Economists Should Do (More of)," Southern Economic Journal, John Wiley & Sons, vol. 86(2), pages 371-432, October.
    18. Haas Franz, 2016. "Reappraisal of Austrian Business Confidence Survey 2015 for Mainland China," Proceedings of FIKUSZ 2016, in: Regina Zsuzsánna Reicher (ed.),Proceedings of FIKUSZ '16, pages 57-64, Óbuda University, Keleti Faculty of Business and Management.
    19. Robert Rieg, 2018. "Tasks, interaction and role perception of management accountants: evidence from Germany," Journal of Management Control: Zeitschrift für Planung und Unternehmenssteuerung, Springer, vol. 29(2), pages 183-220, August.
    20. Bertoldi, Paolo & Mosconi, Rocco, 2020. "Do energy efficiency policies save energy? A new approach based on energy policy indicators (in the EU Member States)," Energy Policy, Elsevier, vol. 139(C).