Printed from https://ideas.repec.org/a/jss/jstsof/v061i01.html

evtree: Evolutionary Learning of Globally Optimal Classification and Regression Trees in R

Author

Listed:
  • Grubinger, Thomas
  • Zeileis, Achim
  • Pfeiffer, Karl-Peter

Abstract

Commonly used classification and regression tree methods like the CART algorithm are recursive partitioning methods that build the model in a forward stepwise search. Although this approach is known to be an efficient heuristic, the results of recursive tree methods are only locally optimal, as splits are chosen to maximize homogeneity at the next step only. An alternative way to search over the parameter space of trees is to use global optimization methods like evolutionary algorithms. This paper describes the evtree package, which implements an evolutionary algorithm for learning globally optimal classification and regression trees in R. Computationally intensive tasks are fully computed in C++ while the partykit package is leveraged for representing the resulting trees in R, providing unified infrastructure for summaries, visualizations, and predictions. evtree is compared to the open-source CART implementation rpart, conditional inference trees (ctree), and the open-source C4.5 implementation J48. A benchmark study of predictive accuracy and complexity is carried out in which evtree achieved at least similar and most of the time better results compared to rpart, ctree, and J48. Furthermore, the usefulness of evtree in practice is illustrated in a textbook customer classification task.
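The contrast the abstract draws between a forward stepwise search and a global search over whole trees can be made concrete with a toy sketch. It is written in Python purely for illustration and is not the evtree implementation. The XOR-style data, the fixed depth-2 tree encoding, the helper names (`tree_errors`, `greedy_tree`, `evolve`), and the choice to start the evolutionary search from the greedy tree are all assumptions made for this example.

```python
import random

# Toy XOR-style data (an assumption for illustration): the label is 1 when
# exactly one of the two coordinates exceeds 0.5.
GRID = [0.1, 0.3, 0.7, 0.9]
DATA = [((a, b), int((a > 0.5) != (b > 0.5))) for a in GRID for b in GRID]
THRESHOLDS = [0.2, 0.5, 0.8]  # midpoints between neighbouring grid values
SPLITS = [(f, t) for f in (0, 1) for t in THRESHOLDS]


def leaf_errors(rows):
    """Misclassifications when a leaf predicts its majority class."""
    ones = sum(y for _, y in rows)
    return min(ones, len(rows) - ones)


def partition(rows, split):
    """Send rows left if feature f is <= threshold t, otherwise right."""
    f, t = split
    left = [r for r in rows if r[0][f] <= t]
    right = [r for r in rows if r[0][f] > t]
    return left, right


def tree_errors(tree, rows=DATA):
    """Training errors of a depth-2 tree encoded as (root, left, right) splits."""
    root, left_split, right_split = tree
    left, right = partition(rows, root)
    return sum(leaf_errors(part)
               for side, split in ((left, left_split), (right, right_split))
               for part in partition(side, split))


def best_split(rows):
    """Greedy step: the single split with the fewest errors (first wins ties)."""
    return min(SPLITS,
               key=lambda s: sum(leaf_errors(p) for p in partition(rows, s)))


def greedy_tree(rows=DATA):
    """Forward stepwise search: fix the root first, then each child in turn."""
    root = best_split(rows)
    left, right = partition(rows, root)
    return (root, best_split(left), best_split(right))


def evolve(start, iterations=500, seed=1):
    """Mutation-only evolutionary search that scores the whole tree at once."""
    rng = random.Random(seed)
    best, best_err = start, tree_errors(start)
    for _ in range(iterations):
        child = list(best)
        child[rng.randrange(3)] = rng.choice(SPLITS)  # mutate one node
        child = tuple(child)
        err = tree_errors(child)
        if err <= best_err:  # elitist acceptance: never keep a worse tree
            best, best_err = child, err
    return best, best_err


greedy = greedy_tree()
greedy_err = tree_errors(greedy)
evolved, evolved_err = evolve(greedy)
print("greedy :", greedy, "errors:", greedy_err)
print("evolved:", evolved, "errors:", evolved_err)
```

On this data every root split looks equally uninformative when judged one step ahead, so the greedy search settles on an arbitrary root and cannot repair it later. The evolutionary search evaluates complete trees, so a single mutation of the root can be accepted once it pays off together with the existing child splits. Because acceptance is elitist and the search starts from the greedy tree, the evolved tree is never worse than the greedy one in this sketch. The real package uses a population of trees and a richer set of variation operators than shown here.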

Suggested Citation

  • Grubinger, Thomas & Zeileis, Achim & Pfeiffer, Karl-Peter, 2014. "evtree: Evolutionary Learning of Globally Optimal Classification and Regression Trees in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 61(i01).
  • Handle: RePEc:jss:jstsof:v:061:i01
    DOI: http://hdl.handle.net/10.18637/jss.v061.i01
    Download full text from publisher

    File URL: https://www.jstatsoft.org/index.php/jss/article/view/v061i01/v61i01.pdf
    Download Restriction: no

    File URL: https://www.jstatsoft.org/index.php/jss/article/downloadSuppFile/v061i01/evtree_1.0-0.tar.gz
    Download Restriction: no

    File URL: https://www.jstatsoft.org/index.php/jss/article/downloadSuppFile/v061i01/v61i01.R
    Download Restriction: no

    File URL: https://www.jstatsoft.org/index.php/jss/article/downloadSuppFile/v061i01/v61i01-benchmark.zip
    Download Restriction: no

    References listed on IDEAS

    1. Kurt Hornik & Christian Buchta & Achim Zeileis, 2009. "Open-source machine learning: R meets Weka," Computational Statistics, Springer, vol. 24(2), pages 225-232, May.
    2. Torsten Hothorn & Achim Zeileis, 2014. "partykit: A Modular Toolkit for Recursive Partytioning in R," Working Papers 2014-10, Faculty of Economics and Statistics, Universität Innsbruck.
    3. Scrucca, Luca, 2013. "GA: A Package for Genetic Algorithms in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 53(i04).
    4. Calcagno, Vincent & de Mazancourt, Claire, 2010. "glmulti: An R Package for Easy Automated Model Selection with (Generalized) Linear Models," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 34(i12).
    5. Karatzoglou, Alexandros & Smola, Alexandros & Hornik, Kurt & Zeileis, Achim, 2004. "kernlab - An S4 Package for Kernel Methods in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 11(i09).

    Citations

    Cited by:

    1. Susan Athey & Stefan Wager, 2021. "Policy Learning With Observational Data," Econometrica, Econometric Society, vol. 89(1), pages 133-161, January.
    2. Yves Staudt & Joël Wagner, 2021. "Assessing the Performance of Random Forests for Modeling Claim Severity in Collision Car Insurance," Risks, MDPI, vol. 9(3), pages 1-28, March.
    3. Max Tabord-Meehan, 2018. "Stratification Trees for Adaptive Randomization in Randomized Controlled Trials," Papers 1806.05127, arXiv.org, revised Jul 2022.
    4. Vrigazova Borislava, 2021. "The Proportion for Splitting Data into Training and Test Set for the Bootstrap in Classification Problems," Business Systems Research, Sciendo, vol. 12(1), pages 228-242, May.
    5. Islam, Towhidul & Meade, Nigel & Carson, Richard T. & Louviere, Jordan J. & Wang, Juan, 2022. "The usefulness of socio-demographic variables in predicting purchase decisions: Evidence from machine learning procedures," Journal of Business Research, Elsevier, vol. 151(C), pages 324-338.
    6. Emmanuel Jordy Menvouta & Jolien Ponnet & Robin Van Oirbeek & Tim Verdonck, 2022. "mCube: Multinomial Micro-level reserving Model," Papers 2212.00101, arXiv.org.
    7. Yagli, Gokhan Mert & Yang, Dazhi & Srinivasan, Dipti, 2019. "Automatic hourly solar forecasting using machine learning models," Renewable and Sustainable Energy Reviews, Elsevier, vol. 105(C), pages 487-498.
    8. Alvarez-Iglesias, Alberto & Hinde, John & Ferguson, John & Newell, John, 2017. "An alternative pruning based approach to unbiased recursive partitioning," Computational Statistics & Data Analysis, Elsevier, vol. 106(C), pages 90-102.
    9. Ronilo Ragodos & Tong Wang, 2022. "Disjunctive Rule Lists," INFORMS Journal on Computing, INFORMS, vol. 34(6), pages 3259-3276, November.
    10. Claudio Conversano & Elise Dusseldorp, 2017. "Modeling Threshold Interaction Effects Through the Logistic Classification Trunk," Journal of Classification, Springer;The Classification Society, vol. 34(3), pages 399-426, October.
    11. Federico Divina & Miguel García Torres & Francisco A. Goméz Vela & José Luis Vázquez Noguera, 2019. "A Comparative Study of Time Series Forecasting Methods for Short Term Electric Energy Consumption Prediction in Smart Buildings," Energies, MDPI, vol. 12(10), pages 1-23, May.
    12. Fernandez Martinez, Roberto & Lostado Lorza, Ruben & Santos Delgado, Ana Alexandra & Piedra, Nelson, 2021. "Use of classification trees and rule-based models to optimize the funding assignment to research projects: A case study of UTPL," Journal of Informetrics, Elsevier, vol. 15(1).
    13. Patrick Rehill & Nicholas Biddle, 2022. "Policy learning for many outcomes of interest: Combining optimal policy trees with multi-objective Bayesian optimisation," Papers 2212.06312, arXiv.org, revised Oct 2023.
    14. Federico Divina & Aude Gilson & Francisco Goméz-Vela & Miguel García Torres & José F. Torres, 2018. "Stacking Ensemble Learning for Short-Term Electricity Consumption Forecasting," Energies, MDPI, vol. 11(4), pages 1-31, April.
    15. Höppner, Sebastiaan & Stripling, Eugen & Baesens, Bart & Broucke, Seppe vanden & Verdonck, Tim, 2020. "Profit driven decision trees for churn prediction," European Journal of Operational Research, Elsevier, vol. 284(3), pages 920-933.
    16. Roberto Chiosa & Marco Savino Piscitelli & Alfonso Capozzoli, 2021. "A Data Analytics-Based Energy Information System (EIS) Tool to Perform Meter-Level Anomaly Detection and Diagnosis in Buildings," Energies, MDPI, vol. 14(1), pages 1-28, January.
    17. Anja Breuer & Yves Staudt, 2022. "Equalization Reserves for Reinsurance and Non-Life Undertakings in Switzerland," Risks, MDPI, vol. 10(3), pages 1-41, March.
    18. Davide Natalini & Giangiacomo Bravo & Aled Wynne Jones, 2019. "Global food security and food riots – an agent-based modelling approach," Food Security: The Science, Sociology and Economics of Food Production and Access to Food, Springer;The International Society for Plant Pathology, vol. 11(5), pages 1153-1173, October.
    19. Chi-Chang Chang & Tse-Hung Huang & Pei-Wei Shueng & Ssu-Han Chen & Chun-Chia Chen & Chi-Jie Lu & Yi-Ju Tseng, 2021. "Developing a Stacked Ensemble-Based Classification Scheme to Predict Second Primary Cancers in Head and Neck Cancer Survivors," IJERPH, MDPI, vol. 18(23), pages 1-10, November.
    20. Emilio Carrizosa & Cristina Molero-Río & Dolores Romero Morales, 2021. "Mathematical optimization in classification and regression trees," TOP: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 29(1), pages 5-33, April.
    21. Hajko, Vladimír, 2017. "The failure of Energy-Economy Nexus: A meta-analysis of 104 studies," Energy, Elsevier, vol. 125(C), pages 771-787.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Tsukioka, Yasutomo & Yanagi, Junya & Takada, Teruko, 2018. "Investor sentiment extracted from internet stock message boards and IPO puzzles," International Review of Economics & Finance, Elsevier, vol. 56(C), pages 205-217.
    2. Bergeaud, Antonin & Raimbault, Juste, 2020. "An empirical analysis of the spatial variability of fuel prices in the United States," Transportation Research Part A: Policy and Practice, Elsevier, vol. 132(C), pages 131-143.
    3. Bernard W T Coetzee & Kevin J Gaston & Steven L Chown, 2014. "Local Scale Comparisons of Biodiversity as a Test for Global Protected Area Ecological Performance: A Meta-Analysis," PLOS ONE, Public Library of Science, vol. 9(8), pages 1-11, August.
    4. Daniel J. Luckett & Eric B. Laber & Samer S. El‐Kamary & Cheng Fan & Ravi Jhaveri & Charles M. Perou & Fatma M. Shebl & Michael R. Kosorok, 2021. "Receiver operating characteristic curves and confidence bands for support vector machines," Biometrics, The International Biometric Society, vol. 77(4), pages 1422-1430, December.
    5. Souhila Ghanem & Raphaël Couturier & Pablo Gregori, 2021. "An Accurate and Easy to Interpret Binary Classifier Based on Association Rules Using Implication Intensity and Majority Vote," Mathematics, MDPI, vol. 9(12), pages 1-12, June.
    6. Grabisch, Michel & Kojadinovic, Ivan & Meyer, Patrick, 2008. "A review of methods for capacity identification in Choquet integral based multi-attribute utility theory: Applications of the Kappalab R package," European Journal of Operational Research, Elsevier, vol. 186(2), pages 766-785, April.
    7. Lazzari, Florencia & Mor, Gerard & Cipriano, Jordi & Solsona, Francesc & Chemisana, Daniel & Guericke, Daniela, 2023. "Optimizing planning and operation of renewable energy communities with genetic algorithms," Applied Energy, Elsevier, vol. 338(C).
    8. Bellotti, Anthony & Brigo, Damiano & Gambetti, Paolo & Vrins, Frédéric, 2021. "Forecasting recovery rates on non-performing loans with machine learning," International Journal of Forecasting, Elsevier, vol. 37(1), pages 428-444.
    9. Eduardo Correia & Rodrigo Calili & José Francisco Pessanha & Maria Fatima Almeida, 2023. "Definition of Regulatory Targets for Electricity Non-Technical Losses: Proposition of an Automatic Model-Selection Technique for Panel Data Regressions," Energies, MDPI, vol. 16(6), pages 1-22, March.
    10. Riza, Lala Septem & Bergmeir, Christoph & Herrera, Francisco & Benítez, José M., 2015. "frbs: Fuzzy Rule-Based Systems for Classification and Regression in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 65(i06).
    11. Karin Wolffhechel & Amanda C Hahn & Hanne Jarmer & Claire I Fisher & Benedict C Jones & Lisa M DeBruine, 2015. "Testing the Utility of a Data-Driven Approach for Assessing BMI from Face Images," PLOS ONE, Public Library of Science, vol. 10(10), pages 1-10, October.
    12. Olgun Aydin & Bartłomiej Igliński & Krzysztof Krukowski & Marek Siemiński, 2022. "Analyzing Wind Energy Potential Using Efficient Global Optimization: A Case Study for the City Gdańsk in Poland," Energies, MDPI, vol. 15(9), pages 1-22, April.
    13. Scrucca, Luca, 2013. "GA: A Package for Genetic Algorithms in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 53(i04).
    14. Castellares, Fredy & Patrício, Silvio C. & Lemonte, Artur J. & Queiroz, Bernardo L., 2020. "On closed-form expressions to Gompertz–Makeham life expectancy," Theoretical Population Biology, Elsevier, vol. 134(C), pages 53-60.
    15. Dirick, Lore & Claeskens, Gerda & Baesens, Bart, 2015. "An Akaike information criterion for multiple event mixture cure models," European Journal of Operational Research, Elsevier, vol. 241(2), pages 449-457.
    16. Huan Yu & Jun Yang & Yu Zhao, 2018. "Reliability of nonrepairable phased-mission systems with common bus performance sharing," Journal of Risk and Reliability, vol. 232(6), pages 647-660, December.
    17. Ji, Yonggang & Lin, Nan & Zhang, Baoxue, 2012. "Model selection in binary and tobit quantile regression using the Gibbs sampler," Computational Statistics & Data Analysis, Elsevier, vol. 56(4), pages 827-839.
    18. Muhammet Burak Kılıç & Yusuf Şahin & Melih Burak Koca, 2021. "Genetic algorithm approach with an adaptive search space based on EM algorithm in two-component mixture Weibull parameter estimation," Computational Statistics, Springer, vol. 36(2), pages 1219-1242, June.
    19. Andrea S Martinez-Vernon & James A Covington & Ramesh P Arasaradnam & Siavash Esfahani & Nicola O’Connell & Ioannis Kyrou & Richard S Savage, 2018. "An improved machine learning pipeline for urinary volatiles disease detection: Diagnosing diabetes," PLOS ONE, Public Library of Science, vol. 13(9), pages 1-20, September.
    20. Khamma, Thulasi Ram & Zhang, Yuming & Guerrier, Stéphane & Boubekri, Mohamed, 2020. "Generalized additive models: An efficient method for short-term energy prediction in office buildings," Energy, Elsevier, vol. 213(C).

    More about this item

    JEL classification:

    • C14 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General - - - Semiparametric and Nonparametric Methods: General
    • C45 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods: Special Topics - - - Neural Networks and Related Topics
    • C87 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - Econometric Software
