
Evaluating variable selection methods for multivariable regression models: A simulation study protocol

Author

Listed:
  • Theresa Ullmann
  • Georg Heinze
  • Lorena Hafermann
  • Christine Schilhart-Wallisch
  • Daniela Dunkler
  • for TG2 of the STRATOS initiative

Abstract

Researchers often perform data-driven variable selection when modeling the associations between an outcome and multiple independent variables in regression analysis. Variable selection may improve the interpretability, parsimony and/or predictive accuracy of a model. Yet variable selection can also have negative consequences, such as false exclusion of important variables or inclusion of noise variables, biased estimation of regression coefficients, underestimated standard errors and invalid confidence intervals, as well as model instability. While the potential advantages and disadvantages of variable selection have been discussed in the literature for decades, few large-scale simulation studies have neutrally compared data-driven variable selection methods with respect to their consequences for the resulting models. We present the protocol for a simulation study that will evaluate different variable selection methods: forward selection, stepwise forward selection, backward elimination, augmented backward elimination, univariable selection, univariable selection followed by backward elimination, and penalized likelihood approaches (Lasso, relaxed Lasso, adaptive Lasso). These methods will be compared with respect to false inclusion and/or exclusion of variables, consequences on bias and variance of the estimated regression coefficients, the validity of the confidence intervals for the coefficients, the accuracy of the estimated variable importance ranking, and the predictive performance of the selected models. We consider both linear and logistic regression in a low-dimensional setting (20 independent variables with 10 true predictors and 10 noise variables). The simulation will be based on real-world data from the National Health and Nutrition Examination Survey (NHANES). Publishing this study protocol ahead of performing the simulation increases transparency and allows integrating the perspective of other experts into the study design.
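To make the evaluation criteria concrete, the following is a minimal sketch of one simulation repetition for a single method named in the abstract, univariable selection, in the linear regression setting with 10 true predictors and 10 noise variables. It is not the protocol's implementation. Several details are assumptions made only for illustration: predictors are drawn as independent standard normal variables (the protocol instead bases its data on NHANES), all true coefficients equal 0.5, the significance level is 0.05, and the t-test is replaced by a normal approximation. The function names `simulate`, `univariable_selection` and `selection_errors` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate(n, n_true=10, n_noise=10, beta=0.5, seed=1):
    """Draw n observations: the first n_true columns affect y, the rest are noise.

    Assumption: independent standard normal predictors and a common effect beta.
    """
    rng = random.Random(seed)
    p = n_true + n_noise
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [beta * sum(row[:n_true]) + rng.gauss(0, 1) for row in X]
    return X, y


def pearson(a, b):
    """Sample Pearson correlation between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def univariable_selection(X, y, alpha=0.05):
    """Keep each variable whose simple-regression slope is significant at alpha.

    The slope test in a simple linear regression is equivalent to testing the
    correlation; the normal quantile is used here as an approximation to the
    t quantile with n - 2 degrees of freedom.
    """
    n, p = len(y), len(X[0])
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    selected = []
    for j in range(p):
        r = pearson([row[j] for row in X], y)
        t = r * math.sqrt((n - 2) / (1 - r * r))
        if abs(t) > z_crit:
            selected.append(j)
    return selected


def selection_errors(selected, n_true, p):
    """Return (false inclusions, false exclusions) for a set of selected indices."""
    chosen = set(selected)
    false_excl = sum(1 for j in range(n_true) if j not in chosen)
    false_incl = sum(1 for j in range(n_true, p) if j in chosen)
    return false_incl, false_excl


if __name__ == "__main__":
    # Average the two error counts over a small number of repetitions.
    reps, totals = 50, [0, 0]
    for rep in range(reps):
        X, y = simulate(n=200, seed=rep)
        fi, fe = selection_errors(univariable_selection(X, y), 10, 20)
        totals[0] += fi
        totals[1] += fe
    print("mean false inclusions:", totals[0] / reps)
    print("mean false exclusions:", totals[1] / reps)
```

Averaging the two counts over many repetitions gives the kind of false inclusion and false exclusion summaries the abstract describes. The other methods and performance measures would need their own fitting and evaluation code.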

Suggested Citation

  • Theresa Ullmann & Georg Heinze & Lorena Hafermann & Christine Schilhart-Wallisch & Daniela Dunkler & for TG2 of the STRATOS initiative, 2024. "Evaluating variable selection methods for multivariable regression models: A simulation study protocol," PLOS ONE, Public Library of Science, vol. 19(8), pages 1-19, August.
  • Handle: RePEc:plo:pone00:0308543
    DOI: 10.1371/journal.pone.0308543

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0308543
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0308543&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Zou, Hui, 2006. "The Adaptive Lasso and Its Oracle Properties," Journal of the American Statistical Association, American Statistical Association, vol. 101, pages 1418-1429, December.
    2. Edwin Kipruto & Willi Sauerbrei, 2022. "Comparison of variable selection procedures and investigation of the role of shrinkage in linear regression-protocol of a simulation study in low-dimensional data," PLOS ONE, Public Library of Science, vol. 17(10), pages 1-11, October.
    3. Anne-Laure Boulesteix & Sabine Lauer & Manuel J A Eugster, 2013. "A Plea for Neutral Comparison Studies in Computational Sciences," PLOS ONE, Public Library of Science, vol. 8(4), pages 1-11, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
    2. Margherita Giuzio, 2017. "Genetic algorithm versus classical methods in sparse index tracking," Decisions in Economics and Finance, Springer;Associazione per la Matematica, vol. 40(1), pages 243-256, November.
    3. Xu, Yang & Zhao, Shishun & Hu, Tao & Sun, Jianguo, 2021. "Variable selection for generalized odds rate mixture cure models with interval-censored failure time data," Computational Statistics & Data Analysis, Elsevier, vol. 156(C).
    4. Emmanouil Androulakis & Christos Koukouvinos & Kalliopi Mylona & Filia Vonta, 2010. "A real survival analysis application via variable selection methods for Cox's proportional hazards model," Journal of Applied Statistics, Taylor & Francis Journals, vol. 37(8), pages 1399-1406.
    5. Li, Chunyu & Lou, Chenxin & Luo, Dan & Xing, Kai, 2021. "Chinese corporate distress prediction using LASSO: The role of earnings management," International Review of Financial Analysis, Elsevier, vol. 76(C).
    6. Ying Huang & Shibasish Dasgupta, 2019. "Likelihood-Based Methods for Assessing Principal Surrogate Endpoints in Vaccine Trials," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 11(3), pages 504-523, December.
    7. Sophie Brana & Dalila Chenaf-Nicet & Delphine Lahet, 2023. "Drivers of cross-border bank claims: The role of foreign-owned banks in emerging countries," Working Papers 2023.06, International Network for Economic Research - INFER.
    8. Mkhadri, Abdallah & Ouhourane, Mohamed, 2013. "An extended variable inclusion and shrinkage algorithm for correlated variables," Computational Statistics & Data Analysis, Elsevier, vol. 57(1), pages 631-644.
    9. Ni, Xiao & Zhang, Hao Helen & Zhang, Daowen, 2009. "Automatic model selection for partially linear models," Journal of Multivariate Analysis, Elsevier, vol. 100(9), pages 2100-2111, October.
    10. Avagyan, Vahe & Alonso Fernández, Andrés Modesto & Nogales, Francisco J., 2015. "D-trace Precision Matrix Estimation Using Adaptive Lasso Penalties," DES - Working Papers. Statistics and Econometrics. WS 21775, Universidad Carlos III de Madrid. Departamento de Estadística.
    11. Byron Botha & Rulof Burger & Kevin Kotzé & Neil Rankin & Daan Steenkamp, 2023. "Big data forecasting of South African inflation," Empirical Economics, Springer, vol. 65(1), pages 149-188, July.
    12. Yanlin Tang & Xinyuan Song & Zhongyi Zhu, 2015. "Variable selection via composite quantile regression with dependent errors," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 69(1), pages 1-20, February.
    13. Gustavo Peralta, 2016. "The Nature of Volatility Spillovers across the International Capital Markets," CNMV Working Papers CNMV Working Papers no. 6, CNMV- Spanish Securities Markets Commission - Research and Statistics Department.
    14. Bakalli, Gaetan & Guerrier, Stéphane & Scaillet, Olivier, 2023. "A penalized two-pass regression to predict stock returns with time-varying risk premia," Journal of Econometrics, Elsevier, vol. 237(2).
    15. Peng, Heng & Lu, Ying, 2012. "Model selection in linear mixed effect models," Journal of Multivariate Analysis, Elsevier, vol. 109(C), pages 109-129.
    16. Yize Zhao & Matthias Chung & Brent A. Johnson & Carlos S. Moreno & Qi Long, 2016. "Hierarchical Feature Selection Incorporating Known and Novel Biological Information: Identifying Genomic Features Related to Prostate Cancer Recurrence," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(516), pages 1427-1439, October.
    17. Chuliá, Helena & Garrón, Ignacio & Uribe, Jorge M., 2024. "Daily growth at risk: Financial or real drivers? The answer is not always the same," International Journal of Forecasting, Elsevier, vol. 40(2), pages 762-776.
    18. Michał Kos & Małgorzata Bogdan, 2020. "On the Asymptotic Properties of SLOPE," Sankhya A: The Indian Journal of Statistics, Springer;Indian Statistical Institute, vol. 82(2), pages 499-532, August.
    19. G. Aneiros & P. Vieu, 2016. "Sparse nonparametric model for regression with functional covariate," Journal of Nonparametric Statistics, Taylor & Francis Journals, vol. 28(4), pages 839-859, October.
    20. Philippe Goulet Coulombe & Maxime Leroux & Dalibor Stevanovic & Stéphane Surprenant, 2022. "How is machine learning useful for macroeconomic forecasting?," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 37(5), pages 920-964, August.
