Ensemble with Divisive Bagging for Feature Selection in Big Data

Authors

Listed:
  • Yousung Park

    (Korea University)

  • Tae Yeon Kwon

    (Hankuk University of Foreign Studies)

Abstract

We introduce Ensemble with Divisive Bagging (EDB), a new feature selection method in linear models, to address the excessive selection of features in big data due to deflated p-values. Extensive simulations show that EDB derives parsimonious models without loss of predictive performance compared to lasso, ridge, elastic-net, LARS, and FS. We also show that EDB estimates feature importance in linear models more accurately compared to Random Forest, XGBoost, and CatBoost. Additionally, we apply EDB to feature selection in models for house prices and loan defaults. Our findings highlight the advantages of EDB: (1) effectively addressing deflated p-values and preventing the inclusion of extraneous features; (2) ensuring unbiased coefficient estimation; (3) adaptability to various models relying on p-value-based inferences; (4) construction of statistically explainable models with feature attribution and importance by preserving inferences based on a linear model and p-values; and (5) allowing application to linear economic models without altering the previous functional form of the model.
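
The deflated p-value problem named in the abstract can be illustrated with a small simulation. The sketch below is not EDB and makes no claim about how the authors implement it. It only shows, under assumed settings (a hypothetical true slope of 0.02, unit-variance Gaussian noise, and sample sizes of 500 and 500,000), how a practically negligible effect can become highly significant as the sample grows. The function names `slope_p_value` and `simulate` are illustrative, and the p-value uses a normal approximation to the t distribution.

```python
import math
import random


def slope_p_value(x, y):
    """Two-sided p-value for the slope of a simple OLS fit of y on x.

    Uses a normal approximation to the t distribution, which is
    reasonable for the sample sizes used in this sketch.
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    # Residual sum of squares of the fitted line.
    rss = sum(((yi - my) - beta * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = abs(beta / se)
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(z)).
    return math.erfc(z / math.sqrt(2))


def simulate(n, effect=0.02, seed=0):
    """Draw n observations with a tiny true slope (assumed value) and test it."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [effect * xi + rng.gauss(0.0, 1.0) for xi in x]
    return slope_p_value(x, y)


p_small = simulate(500)
p_large = simulate(500_000)
print(f"n = 500:     p = {p_small:.3g}")
print(f"n = 500,000: p = {p_large:.3g}")
```

With the effect size held fixed, the test statistic grows roughly in proportion to the square root of the sample size, so a feature that explains almost none of the variance in the response can still pass a conventional significance threshold in a large sample. This is the mechanism behind excessive feature selection from p-values in big data.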

Suggested Citation

  • Yousung Park & Tae Yeon Kwon, 2025. "Ensemble with Divisive Bagging for Feature Selection in Big Data," Computational Economics, Springer;Society for Computational Economics, vol. 66(2), pages 1321-1354, August.
  • Handle: RePEc:kap:compec:v:66:y:2025:i:2:d:10.1007_s10614-024-10741-y
    DOI: 10.1007/s10614-024-10741-y

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s10614-024-10741-y
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Elena Ivona DUMITRESCU & Sullivan HUE & Christophe HURLIN & Sessi TOKPAVI, 2020. "Machine Learning or Econometrics for Credit Scoring: Let’s Get the Best of Both Worlds," LEO Working Papers / DR LEO 2839, Orleans Economics Laboratory / Laboratoire d'Economie d'Orleans (LEO), University of Orleans.
    2. Chen, Ya & Tsionas, Mike G. & Zelenyuk, Valentin, 2021. "LASSO+DEA for small and big wide data," Omega, Elsevier, vol. 102(C).
    3. Ya Chen & Mike Tsionas & Valentin Zelenyuk, 2020. "LASSO DEA for small and big data," CEPA Working Papers Series WP092020, School of Economics, University of Queensland, Australia.
    4. Julien Chevallier & Dominique Guégan & Stéphane Goutte, 2021. "Is It Possible to Forecast the Price of Bitcoin?," Forecasting, MDPI, vol. 3(2), pages 1-44, May.
    5. Barzin, Samira & Avner, Paolo & Maruyama Rentschler, Jun Erik & O’Clery, Neave, 2022. "Where Are All the Jobs? A Machine Learning Approach for High Resolution Urban Employment Prediction in Developing Countries," Policy Research Working Paper Series 9979, The World Bank.
    6. Michael Lechner, 2023. "Causal Machine Learning and its use for public policy," Swiss Journal of Economics and Statistics, Springer;Swiss Society of Economics and Statistics, vol. 159(1), pages 1-15, December.
    7. Xing, Li-Min & Zhang, Yue-Jun, 2022. "Forecasting crude oil prices with shrinkage methods: Can nonconvex penalty and Huber loss help?," Energy Economics, Elsevier, vol. 110(C).
    8. James T. E. Chapman & Ajit Desai, 2023. "Macroeconomic Predictions Using Payments Data and Machine Learning," Forecasting, MDPI, vol. 5(4), pages 1-32, November.
    9. Herrera, Gabriel Paes & Constantino, Michel & Su, Jen-Je & Naranpanawa, Athula, 2023. "The use of ICTs and income distribution in Brazil: A machine learning explanation using SHAP values," Telecommunications Policy, Elsevier, vol. 47(8).
    10. Khan, Faridoon & Muhammadullah, Sara & Sharif, Arshian & Lee, Chien-Chiang, 2024. "The role of green energy stock market in forecasting China's crude oil market: An application of IIS approach and sparse regression models," Energy Economics, Elsevier, vol. 130(C).
    11. Matteo Bagnara, 2024. "Asset Pricing and Machine Learning: A critical review," Journal of Economic Surveys, Wiley Blackwell, vol. 38(1), pages 27-56, February.
    12. Adam N. Smith & Stephan Seiler & Ishant Aggarwal, 2023. "Optimal Price Targeting," Marketing Science, INFORMS, vol. 42(3), pages 476-499, May.
    13. Narayan, Seema & Smyth, Russell, 2015. "The financial econometrics of price discovery and predictability," International Review of Financial Analysis, Elsevier, vol. 42(C), pages 380-393.
    14. Hoang, Daniel & Wiegratz, Kevin, 2022. "Machine learning methods in finance: Recent applications and prospects," Working Paper Series in Economics 158, Karlsruhe Institute of Technology (KIT), Department of Economics and Management.
    15. Tang, Lu & Zhou, Ling & Song, Peter X.-K., 2020. "Distributed simultaneous inference in generalized linear models via confidence distribution," Journal of Multivariate Analysis, Elsevier, vol. 176(C).
    16. Baaken, Dominik & Hess, Sebastian, "undated". "Regionale Milchmengenprognose: Regressionsmodelle und Maschinelles Lernen im Vergleich," 61st Annual Conference, Berlin, Germany, September 22-24, 2021 317056, German Association of Agricultural Economists (GEWISOLA).
    17. Baaken, Dominik & Hess, Sebastian, 2021. "Forecasting Regional Milk Production Quantity: A Comparison of Regression Models and Machine Learning," 2021 Conference, August 17-31, 2021, Virtual 315117, International Association of Agricultural Economists.
    18. Gabriel Okasa, 2022. "Meta-Learners for Estimation of Causal Effects: Finite Sample Cross-Fit Performance," Papers 2201.12692, arXiv.org.
    19. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
    20. Margherita Giuzio, 2017. "Genetic algorithm versus classical methods in sparse index tracking," Decisions in Economics and Finance, Springer;Associazione per la Matematica, vol. 40(1), pages 243-256, November.

    More about this item

    JEL classification:

    • C55 - Mathematical and Quantitative Methods - - Econometric Modeling - - - Large Data Sets: Modeling and Analysis
    • C52 - Mathematical and Quantitative Methods - - Econometric Modeling - - - Model Evaluation, Validation, and Selection
    • C63 - Mathematical and Quantitative Methods - - Mathematical Methods; Programming Models; Mathematical and Simulation Modeling - - - Computational Techniques
    • C80 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - General
    • C15 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General - - - Statistical Simulation Methods: General
    • C51 - Mathematical and Quantitative Methods - - Econometric Modeling - - - Model Construction and Estimation
