
Mathematical optimization in classification and regression trees

Authors

  • Emilio Carrizosa (Instituto de Matemáticas de la Universidad de Sevilla)
  • Cristina Molero-Río (Instituto de Matemáticas de la Universidad de Sevilla)
  • Dolores Romero Morales (Copenhagen Business School)

Abstract

Classification and regression trees, as well as their variants, are off-the-shelf methods in Machine Learning. In this paper, we review recent contributions within the Continuous Optimization and the Mixed-Integer Linear Optimization paradigms to develop novel formulations in this research area. We compare those in terms of the nature of the decision variables and the constraints required, as well as the optimization algorithms proposed. We illustrate how these powerful formulations enhance the flexibility of tree models, being better suited to incorporate desirable properties such as cost-sensitivity, explainability, and fairness, and to deal with complex data, such as functional data.

Suggested Citation

  • Emilio Carrizosa & Cristina Molero-Río & Dolores Romero Morales, 2021. "Mathematical optimization in classification and regression trees," TOP: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 29(1), pages 5-33, April.
  • Handle: RePEc:spr:topjnl:v:29:y:2021:i:1:d:10.1007_s11750-021-00594-1
    DOI: 10.1007/s11750-021-00594-1



    Citations

    Cited by:

    1. Blanquero, Rafael & Carrizosa, Emilio & Molero-Río, Cristina & Morales, Dolores Romero, 2022. "On sparse optimal regression trees," European Journal of Operational Research, Elsevier, vol. 299(3), pages 1045-1054.
    2. Carrizosa, Emilio & Kurishchenko, Kseniia & Marín, Alfredo & Romero Morales, Dolores, 2022. "Interpreting clusters via prototype optimization," Omega, Elsevier, vol. 107(C).
    3. Victor Blanco & Alberto Japón & Justo Puerto, 2022. "Robust optimal classification trees under noisy labels," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 16(1), pages 155-179, March.
    4. Benítez-Peña, Sandra & Carrizosa, Emilio & Guerrero, Vanesa & Jiménez-Gamero, M. Dolores & Martín-Barragán, Belén & Molero-Río, Cristina & Ramírez-Cobo, Pepa & Romero Morales, Dolores & Sillero-Denami, 2021. "On sparse ensemble methods: An application to short-term predictions of the evolution of COVID-19," European Journal of Operational Research, Elsevier, vol. 295(2), pages 648-663.
    5. Davila-Pena, Laura & García-Jurado, Ignacio & Casas-Méndez, Balbina, 2022. "Assessment of the influence of features on a classification problem: An application to COVID-19 patients," European Journal of Operational Research, Elsevier, vol. 299(2), pages 631-641.
    6. Dimitris Bertsimas & Cheol Woo Kim, 2023. "A Prescriptive Machine Learning Approach to Mixed-Integer Convex Optimization," INFORMS Journal on Computing, INFORMS, vol. 35(6), pages 1225-1241, November.
    7. Teddy Lazebnik & Tzach Fleischer & Amit Yaniv-Rosenfeld, 2023. "Benchmarking Biologically-Inspired Automatic Machine Learning for Economic Tasks," Sustainability, MDPI, vol. 15(14), pages 1-9, July.
    8. Emilio Carrizosa & Vanesa Guerrero & Dolores Romero Morales, 2023. "On mathematical optimization for clustering categories in contingency tables," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 17(2), pages 407-429, June.
    9. Astorino, Annabella & Avolio, Matteo & Fuduli, Antonio, 2022. "A maximum-margin multisphere approach for binary Multiple Instance Learning," European Journal of Operational Research, Elsevier, vol. 299(2), pages 642-652.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Blanquero, Rafael & Carrizosa, Emilio & Molero-Río, Cristina & Romero Morales, Dolores, 2020. "Sparsity in optimal randomized classification trees," European Journal of Operational Research, Elsevier, vol. 284(1), pages 255-272.
    2. Lotfi Boudabsa & Damir Filipović, 2022. "Ensemble learning for portfolio valuation and risk management," Papers 2204.05926, arXiv.org.
    3. Max Biggs & Rim Hariss & Georgia Perakis, 2023. "Constrained optimization of objective functions determined from random forests," Production and Operations Management, Production and Operations Management Society, vol. 32(2), pages 397-415, February.
    4. Jiaming Mao & Jingzhi Xu, 2020. "Ensemble Learning with Statistical and Structural Models," Papers 2006.05308, arXiv.org.
    5. Blanquero, Rafael & Carrizosa, Emilio & Molero-Río, Cristina & Morales, Dolores Romero, 2022. "On sparse optimal regression trees," European Journal of Operational Research, Elsevier, vol. 299(3), pages 1045-1054.
    6. Christophe Dutang & Quentin Guibert, 2021. "An explicit split point procedure in model-based trees allowing for a quick fitting of GLM trees and GLM forests," Post-Print hal-03448250, HAL.
    7. Zhexiao Lin & Fang Han, 2022. "On regression-adjusted imputation estimators of the average treatment effect," Papers 2212.05424, arXiv.org, revised Jan 2023.
    8. Bissan Ghaddar & Ignacio Gómez-Casares & Julio González-Díaz & Brais González-Rodríguez & Beatriz Pateiro-López & Sofía Rodríguez-Ballesteros, 2023. "Learning for Spatial Branching: An Algorithm Selection Approach," INFORMS Journal on Computing, INFORMS, vol. 35(5), pages 1024-1043, September.
    9. Patrick Krennmair & Timo Schmid, 2022. "Flexible domain prediction using mixed effects random forests," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 71(5), pages 1865-1894, November.
    10. Borup, Daniel & Christensen, Bent Jesper & Mühlbach, Nicolaj Søndergaard & Nielsen, Mikkel Slot, 2023. "Targeting predictors in random forest regression," International Journal of Forecasting, Elsevier, vol. 39(2), pages 841-868.
    11. Yiyi Huo & Yingying Fan & Fang Han, 2023. "On the adaptation of causal forests to manifold data," Papers 2311.16486, arXiv.org, revised Dec 2023.
    12. Escribano, Álvaro & Wang, Dandan, 2021. "Mixed random forest, cointegration, and forecasting gasoline prices," International Journal of Forecasting, Elsevier, vol. 37(4), pages 1442-1462.
    13. Yigit Aydede & Jan Ditzen, 2022. "Identifying the regional drivers of influenza-like illness in Nova Scotia with dominance analysis," Papers 2212.06684, arXiv.org.
    14. Yan, Ran & Wang, Shuaian & Du, Yuquan, 2020. "Development of a two-stage ship fuel consumption prediction and reduction model for a dry bulk ship," Transportation Research Part E: Logistics and Transportation Review, Elsevier, vol. 138(C).
    15. Doumpos, Michael & Zopounidis, Constantin, 2011. "Preference disaggregation and statistical learning for multicriteria decision support: A review," European Journal of Operational Research, Elsevier, vol. 209(3), pages 203-214, March.
    16. Daniel Boller & Michael Lechner & Gabriel Okasa, 2021. "The Effect of Sport in Online Dating: Evidence from Causal Machine Learning," Papers 2104.04601, arXiv.org.
    17. Gambella, Claudio & Ghaddar, Bissan & Naoum-Sawaya, Joe, 2021. "Optimization problems for machine learning: A survey," European Journal of Operational Research, Elsevier, vol. 290(3), pages 807-828.
    18. Yagli, Gokhan Mert & Yang, Dazhi & Srinivasan, Dipti, 2019. "Automatic hourly solar forecasting using machine learning models," Renewable and Sustainable Energy Reviews, Elsevier, vol. 105(C), pages 487-498.
    19. Carrizosa, Emilio & Kurishchenko, Kseniia & Marín, Alfredo & Romero Morales, Dolores, 2022. "Interpreting clusters via prototype optimization," Omega, Elsevier, vol. 107(C).
    20. Valente, Marica, 2023. "Policy evaluation of waste pricing programs using heterogeneous causal effect estimation," Journal of Environmental Economics and Management, Elsevier, vol. 117(C).
