
Bayesian Random Forest with Multiple Imputation by Chain Equations for High-Dimensional Missing Data: A Simulation Study

Author

Listed:
  • Oyebayo Ridwan Olaniran

    (Department of Statistics, Faculty of Physical Sciences, University of Ilorin, Ilorin 1515, Nigeria)

  • Ali Rashash R. Alzahrani

    (Mathematics Department, Faculty of Sciences, Umm Al-Qura University, Makkah 24382, Saudi Arabia)

Abstract

Missing data force a critical trade-off in scientific research: discarding incomplete observations risks significant information loss, while conventional imputation methods struggle to maintain accuracy in high-dimensional settings. Although approaches such as multiple imputation (MI) and random forest (RF) proximity-based imputation improve on naive deletion, they remain limited under complex missingness patterns and in sparse high-dimensional settings. To address these gaps, we propose a novel integration of Multiple Imputation by Chained Equations (MICE) with Bayesian Random Forest (BRF), leveraging MICE’s iterative flexibility and BRF’s probabilistic robustness to enhance imputation accuracy and downstream predictive performance. Our hybrid framework, BRF-MICE, combines the efficiency of MICE’s chained equations with BRF’s ability to quantify uncertainty through Bayesian tree ensembles, providing stable parameter estimates even under extreme missingness. We empirically validate this approach on synthetic datasets in which the missingness mechanism (missing completely at random, MCAR; missing at random, MAR; missing not at random, MNAR) and the dimensionality are controlled, contrasting it against established methods, including RF and Bayesian Additive Regression Trees (BART). The results demonstrate that BRF-MICE achieves superior performance in classification and regression tasks, with 15–20% lower error than RF and BART across the missingness conditions, while maintaining computational scalability. The method’s iterative Bayesian updates effectively propagate imputation uncertainty, reducing overconfidence in high-dimensional predictions, a key weakness of frequentist alternatives.
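
The three missingness mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' simulation design, which this record does not describe: the toy variables `x1` and `x2`, the logistic form of the drop probability, and the `rate` and `slope` parameters are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def missingness_masks(x1, x2, rate=0.3, slope=2.0, seed=0):
    """Return masks for x2 (True = missing) under MCAR, MAR and MNAR.

    Hypothetical helper: the logistic link and its parameters are
    illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    intercept = math.log(rate / (1.0 - rate))  # logit of the base rate
    # MCAR: every entry is dropped with the same probability.
    mcar = [rng.random() < rate for _ in x2]
    # MAR: the drop probability depends only on the fully observed x1.
    mar = [rng.random() < sigmoid(intercept + slope * a) for a in x1]
    # MNAR: the drop probability depends on the value of x2 itself.
    mnar = [rng.random() < sigmoid(intercept + slope * b) for b in x2]
    return {"MCAR": mcar, "MAR": mar, "MNAR": mnar}


def apply_mask(values, mask):
    """Replace masked entries with None to mark them as missing."""
    return [None if m else v for v, m in zip(values, mask)]


# Toy data: x2 is correlated with x1 (assumed for illustration only).
data_rng = random.Random(42)
x1 = [data_rng.gauss(0.0, 1.0) for _ in range(5000)]
x2 = [0.5 * a + data_rng.gauss(0.0, 1.0) for a in x1]

masks = missingness_masks(x1, x2)
for name, mask in masks.items():
    print(name, "share missing:", round(sum(mask) / len(mask), 3))
```

Under MCAR the observed values of `x2` remain a random subsample, whereas under MNAR the larger values are dropped more often, so the observed mean of `x2` is biased downward. In this toy setup the same bias appears under MAR because `x2` is correlated with `x1`, but there it can in principle be corrected by conditioning on the fully observed `x1`, which is what an imputation model does.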

Suggested Citation

  • Oyebayo Ridwan Olaniran & Ali Rashash R. Alzahrani, 2025. "Bayesian Random Forest with Multiple Imputation by Chain Equations for High-Dimensional Missing Data: A Simulation Study," Mathematics, MDPI, vol. 13(6), pages 1-32, March.
  • Handle: RePEc:gam:jmathe:v:13:y:2025:i:6:p:956-:d:1611773

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/13/6/956/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/13/6/956/
    Download Restriction: no
