
An alternative pruning based approach to unbiased recursive partitioning

Author

Listed:
  • Alvarez-Iglesias, Alberto
  • Hinde, John
  • Ferguson, John
  • Newell, John

Abstract

Tree-based methods are a non-parametric modelling strategy that can be used in combination with generalized linear models or Cox proportional hazards models, mostly at an exploratory stage. Their popularity is mainly due to the simplicity of the technique, along with the ease with which the resulting model can be interpreted. Variable selection bias from variables with many possible splits or missing values has been identified as one of the problems associated with tree-based methods. A number of unbiased recursive partitioning algorithms have been proposed that avoid this bias by using p-values in the splitting procedure of the algorithm. The final tree is obtained using direct stopping rules (pre-pruning strategy) or by growing a large tree first and pruning it afterwards (post-pruning). Some of the drawbacks of pre-pruned trees based on p-values in the presence of interaction effects and a large number of explanatory variables are discussed, and a simple alternative post-pruning solution is presented that allows the identification of such interactions. The proposed method includes a novel pruning algorithm that uses a false discovery rate (FDR) controlling procedure for the determination of splits corresponding to significant tests. The new approach is demonstrated with simulated and real-life examples.
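The FDR-controlling step the abstract mentions can be illustrated with the standard Benjamini–Hochberg procedure applied to the p-values attached to a tree's internal splits. This is only a minimal sketch of that general idea, not the authors' exact pruning algorithm; the node p-values below are hypothetical.

```python
def bh_significant(pvalues, alpha=0.05):
    """Return the indices of p-values declared significant by the
    Benjamini-Hochberg step-up procedure at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= (k/m) * alpha;
    # all hypotheses with rank <= k are then rejected.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k_max = rank
    return set(order[:k_max])

# Hypothetical split p-values for the internal nodes of a large tree:
node_pvals = [0.001, 0.008, 0.012, 0.041, 0.27, 0.62]
keep = bh_significant(node_pvals, alpha=0.05)
# Splits whose p-values survive the FDR step are retained;
# the subtrees below the remaining splits would be pruned away.
```

In a post-pruning setting of this kind, the procedure is run once over all split p-values of the fully grown tree, so a split that would fail a fixed per-node threshold can still be kept when many other splits are strongly significant.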

Suggested Citation

  • Alvarez-Iglesias, Alberto & Hinde, John & Ferguson, John & Newell, John, 2017. "An alternative pruning based approach to unbiased recursive partitioning," Computational Statistics & Data Analysis, Elsevier, vol. 106(C), pages 90-102.
  • Handle: RePEc:eee:csdana:v:106:y:2017:i:c:p:90-102
    DOI: 10.1016/j.csda.2016.08.011

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S016794731630192X
    Download Restriction: Full text for ScienceDirect subscribers only.

    File URL: https://libkey.io/10.1016/j.csda.2016.08.011?utm_source=ideas
    LibKey link: if access is restricted and your library uses this service, LibKey will redirect you to a version you can access through your library subscription.

    As access to this document is restricted, you may want to look for a different version of it.

    References listed on IDEAS

    1. Kim H. & Loh W.Y., 2001. "Classification Trees With Unbiased Multiway Splits," Journal of the American Statistical Association, American Statistical Association, vol. 96, pages 589-604, June.
    2. Grubinger, Thomas & Zeileis, Achim & Pfeiffer, Karl-Peter, 2014. "evtree: Evolutionary Learning of Globally Optimal Classification and Regression Trees in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 61(i01).
    3. Shih, Yu-Shan & Tsai, Hsin-Wen, 2004. "Variable selection bias in regression trees with constant fits," Computational Statistics & Data Analysis, Elsevier, vol. 45(3), pages 595-607, April.

    Citations

    Citations are extracted by the CitEc Project.


    Cited by:

    1. Yao Li & Wei Xu, 2025. "Causal Mediation Tree Model for Feature Identification on High-Dimensional Mediators," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 17(1), pages 151-173, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Strobl, Carolin & Boulesteix, Anne-Laure & Augustin, Thomas, 2007. "Unbiased split selection for classification trees based on the Gini Index," Computational Statistics & Data Analysis, Elsevier, vol. 52(1), pages 483-501, September.
    2. Emilio Carrizosa & Cristina Molero-Río & Dolores Romero Morales, 2021. "Mathematical optimization in classification and regression trees," TOP: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 29(1), pages 5-33, April.
    3. Yuan Xu & Huading Shi & Yang Fei & Chao Wang & Li Mo & Mi Shu, 2021. "Identification of Soil Heavy Metal Sources in a Large-Scale Area Affected by Industry," Sustainability, MDPI, vol. 13(2), pages 1-18, January.
    4. Emmanuel Jordy Menvouta & Jolien Ponnet & Robin Van Oirbeek & Tim Verdonck, 2022. "mCube: Multinomial Micro-level reserving Model," Papers 2212.00101, arXiv.org.
    5. Fernandez Martinez, Roberto & Lostado Lorza, Ruben & Santos Delgado, Ana Alexandra & Piedra, Nelson, 2021. "Use of classification trees and rule-based models to optimize the funding assignment to research projects: A case study of UTPL," Journal of Informetrics, Elsevier, vol. 15(1).
    6. Höppner, Sebastiaan & Stripling, Eugen & Baesens, Bart & Broucke, Seppe vanden & Verdonck, Tim, 2020. "Profit driven decision trees for churn prediction," European Journal of Operational Research, Elsevier, vol. 284(3), pages 920-933.
    7. Chi-Chang Chang & Tse-Hung Huang & Pei-Wei Shueng & Ssu-Han Chen & Chun-Chia Chen & Chi-Jie Lu & Yi-Ju Tseng, 2021. "Developing a Stacked Ensemble-Based Classification Scheme to Predict Second Primary Cancers in Head and Neck Cancer Survivors," IJERPH, MDPI, vol. 18(23), pages 1-10, November.
    8. Silke Janitza & Ender Celik & Anne-Laure Boulesteix, 2018. "A computationally fast variable importance test for random forests for high-dimensional data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(4), pages 885-915, December.
    9. Shih, Yu-Shan & Tsai, Hsin-Wen, 2004. "Variable selection bias in regression trees with constant fits," Computational Statistics & Data Analysis, Elsevier, vol. 45(3), pages 595-607, April.
    10. Gerhard Tutz & Moritz Berger, 2016. "Item-focussed Trees for the Identification of Items in Differential Item Functioning," Psychometrika, Springer;The Psychometric Society, vol. 81(3), pages 727-750, September.
    11. Christophe Dutang & Quentin Guibert, 2021. "An explicit split point procedure in model-based trees allowing for a quick fitting of GLM trees and GLM forests," Post-Print hal-03448250, HAL.
    12. Islam, Towhidul & Meade, Nigel & Carson, Richard T. & Louviere, Jordan J. & Wang, Juan, 2022. "The usefulness of socio-demographic variables in predicting purchase decisions: Evidence from machine learning procedures," Journal of Business Research, Elsevier, vol. 151(C), pages 324-338.
    13. Patrick Rehill, 2024. "Distilling interpretable causal trees from causal forests," Papers 2408.01023, arXiv.org.
    14. Gray, J. Brian & Fan, Guangzhe, 2008. "Classification tree analysis using TARGET," Computational Statistics & Data Analysis, Elsevier, vol. 52(3), pages 1362-1372, January.
    15. Noh, Hyun Gon & Song, Moon Sup & Park, Sung Hyun, 2004. "An unbiased method for constructing multilabel classification trees," Computational Statistics & Data Analysis, Elsevier, vol. 47(1), pages 149-164, August.
    16. Dimitris Bertsimas & Margrét V. Bjarnadóttir & Michael A. Kane & J. Christian Kryder & Rudra Pandey & Santosh Vempala & Grant Wang, 2008. "Algorithmic Prediction of Health-Care Costs," Operations Research, INFORMS, vol. 56(6), pages 1382-1392, December.
    17. Wei-Yin Loh, 2014. "Fifty Years of Classification and Regression Trees," International Statistical Review, International Statistical Institute, vol. 82(3), pages 329-348, December.
    18. Hajko, Vladimír, 2017. "The failure of Energy-Economy Nexus: A meta-analysis of 104 studies," Energy, Elsevier, vol. 125(C), pages 771-787.
    19. Susan Athey & Stefan Wager, 2021. "Policy Learning With Observational Data," Econometrica, Econometric Society, vol. 89(1), pages 133-161, January.
    20. Lee, Tzu-Haw & Shih, Yu-Shan, 2006. "Unbiased variable selection for classification trees with multivariate responses," Computational Statistics & Data Analysis, Elsevier, vol. 51(2), pages 659-667, November.

    More about this item


