Learning from high dimensional data based on weighted feature importance in decision tree ensembles

Author

Nayiri Galestian Pour
(University of Tehran)
Soudabeh Shemehsavar
(University of Tehran)

Abstract

Learning from high dimensional data has been utilized in various applications such as computational biology, image classification, and finance. Most classical machine learning algorithms fail to give accurate predictions in high dimensional settings due to the enormous feature space. In this article, we present a novel ensemble of classification trees based on weighted random subspaces that aims to adjust the distribution of selection probabilities. In the proposed algorithm, base classifiers are built on random feature subspaces, and the probability that influential features will be selected for the next subspace is updated through a weighting function that incorporates grouping information from previous classifiers. As an interpretation tool, we show that variable importance measures computed by the new method can identify influential features efficiently. We provide theoretical reasoning for the different elements of the proposed method, and we evaluate the usefulness of the new method based on simulation studies and real data analysis.
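The abstract does not give the weighting function or the tree-growing details, so the following is only a minimal sketch of the general idea of a weighted random subspace ensemble, not the authors' algorithm. It makes several assumptions that are not taken from the article: base classifiers are depth-one trees (stumps), the hypothetical update adds `rate * max(accuracy - 0.5, 0)` to the weight of the feature a stump splits on, accuracy is measured on the training data, and the grouping information mentioned in the abstract is not modelled. All function and parameter names (`sample_subspace`, `fit_stump`, `fit_ensemble`, `rate`) are illustrative.

```python
import random

def sample_subspace(weights, k, rng):
    """Draw k distinct feature indices with probability proportional to weight."""
    pool = list(range(len(weights)))
    w = list(weights)
    chosen = []
    for _ in range(k):
        idx = rng.choices(range(len(pool)), weights=w, k=1)[0]
        chosen.append(pool.pop(idx))
        w.pop(idx)
    return chosen

def fit_stump(X, y, features):
    """Depth-one tree: best (accuracy, feature, threshold, polarity) on the training data."""
    best = (-1.0, features[0], 0.0, 1)
    n = len(y)
    for j in features:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            # Number of rows classified correctly when "above threshold" means class 1.
            correct = sum((row[j] > thr) == (label == 1) for row, label in zip(X, y))
            for polarity, acc in ((1, correct / n), (0, 1.0 - correct / n)):
                if acc > best[0]:
                    best = (acc, j, thr, polarity)
    return best

def predict_stump(stump, row):
    _, j, thr, polarity = stump
    above = row[j] > thr
    return int(above) if polarity == 1 else int(not above)

def fit_ensemble(X, y, n_trees=50, subspace_size=3, rate=1.0, seed=0):
    """Grow stumps on weighted random subspaces, updating the weights after each one."""
    rng = random.Random(seed)
    weights = [1.0] * len(X[0])          # start from uniform selection weights
    stumps = []
    for _ in range(n_trees):
        features = sample_subspace(weights, subspace_size, rng)
        stump = fit_stump(X, y, features)
        stumps.append(stump)
        acc, j = stump[0], stump[1]
        # Hypothetical weighting function (an assumption, not from the article):
        # reward the split feature by how far the stump beats chance.
        weights[j] += rate * max(acc - 0.5, 0.0)
    total = sum(weights)
    return stumps, [w / total for w in weights]

def predict(stumps, row):
    """Majority vote over all stumps (ties go to class 1)."""
    votes = sum(predict_stump(s, row) for s in stumps)
    return int(2 * votes >= len(stumps))

# Synthetic illustration: 20 features, only feature 0 determines the label.
data_rng = random.Random(42)
X = [[data_rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(80)]
y = [int(row[0] > 0.0) for row in X]
stumps, probs = fit_ensemble(X, y, n_trees=60, subspace_size=4)
accuracy = sum(predict(stumps, r) == t for r, t in zip(X, y)) / len(y)
print("feature with largest selection probability:",
      max(range(len(probs)), key=probs.__getitem__))
print("training accuracy:", accuracy)
```

In this toy setting the normalized weights double as a crude variable importance measure: features that repeatedly yield accurate base classifiers are drawn more often in later subspaces. A real implementation would more plausibly use full classification trees and an out-of-sample accuracy estimate.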

Suggested Citation

Nayiri Galestian Pour & Soudabeh Shemehsavar, 2024. "Learning from high dimensional data based on weighted feature importance in decision tree ensembles," Computational Statistics, Springer, vol. 39(1), pages 313-342, February.

Handle: RePEc:spr:compst:v:39:y:2024:i:1:d:10.1007_s00180-023-01347-3
DOI: 10.1007/s00180-023-01347-3
