
How to choose an approach to handling missing categorical data: (un)expected findings from a simulated statistical experiment

Authors

  • Svetlana Zhuchkova (HSE University)
  • Aleksei Rotmistrov (HSE University)

Abstract

The study compares three approaches to handling missing data in categorical variables: complete case analysis (CCA), multiple imputation (MI) based on random forest, and the missing-indicator method (MIM). Focusing on OLS regression, we describe how the choice of approach depends on the missingness mechanism, the proportion of missing data, and the model specification. The results of a simulated statistical experiment show that each approach may lead to either almost unbiased or dramatically biased estimates. The choice of approach should be based primarily on the missingness mechanism: one should choose CCA under MCAR (missing completely at random), MI under MAR (missing at random), and, again, CCA under MNAR (missing not at random). Although MIM also produces almost unbiased estimates under MCAR and MNAR, it leads to inefficient regression coefficients, that is, ones with inflated standard errors and, consequently, incorrect p-values.
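To make the compared approaches concrete, the following is a minimal Python sketch of how complete case analysis and the missing-indicator method differ in the OLS design matrix when a categorical predictor has missing values. It is not the authors' simulation design: the data-generating process, the coefficient values, the 30% missingness rate, and the helper names (`missing_mask`, `fit_cca`, `fit_mim`) are all assumptions made for illustration, and multiple imputation based on random forest is left out to keep the example dependent on NumPy only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process (an assumption, not the paper's design).
x = rng.integers(0, 3, size=n)                 # categorical predictor, levels 0/1/2
z = rng.normal(size=n) + 0.5 * x               # continuous covariate correlated with x
true_beta = np.array([1.0, 0.5, -0.5, 2.0])    # intercept, x==1, x==2, z
y = (true_beta[0] + true_beta[1] * (x == 1) + true_beta[2] * (x == 2)
     + true_beta[3] * z + rng.normal(size=n))


def missing_mask(mechanism, rate=0.3):
    """Return a boolean mask marking which values of x are treated as missing."""
    if mechanism == "MCAR":      # missingness independent of all data
        p = np.full(n, rate)
    elif mechanism == "MAR":     # missingness depends on the fully observed z
        p = 2 * rate / (1 + np.exp(-(z - z.mean())))
    elif mechanism == "MNAR":    # missingness depends on the value of x itself
        p = rate * np.array([0.5, 1.0, 1.5])[x]
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return rng.random(n) < p


def ols(X, y):
    """Ordinary least squares coefficients via a least-squares solve."""
    coef, *_ = np.linalg.lstsq(X.astype(float), y, rcond=None)
    return coef


def fit_cca(miss):
    """Complete case analysis: drop every row where x is missing."""
    keep = ~miss
    X = np.column_stack([np.ones(keep.sum()), x[keep] == 1, x[keep] == 2, z[keep]])
    return ols(X, y[keep])


def fit_mim(miss):
    """Missing-indicator method: keep all rows, add a dummy for 'x is missing'."""
    X = np.column_stack([np.ones(n), (x == 1) & ~miss, (x == 2) & ~miss, miss, z])
    coef = ols(X, y)
    return coef[[0, 1, 2, 4]]    # drop the missing-dummy coefficient for comparison


results = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    miss = missing_mask(mechanism)
    results[mechanism] = (fit_cca(miss), fit_mim(miss))
    print(mechanism, "share missing:", round(miss.mean(), 3))
    print("  CCA:", np.round(results[mechanism][0], 3))
    print("  MIM:", np.round(results[mechanism][1], 3))
```

Comparing the printed coefficients with `true_beta` gives a rough sense of each approach's bias in this one toy setting. A single draw says nothing about standard errors or p-values, which the abstract identifies as the weakness of MIM; assessing those would require repeating the simulation many times.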

Suggested Citation

  • Svetlana Zhuchkova & Aleksei Rotmistrov, 2022. "How to choose an approach to handling missing categorical data: (un)expected findings from a simulated statistical experiment," Quality & Quantity: International Journal of Methodology, Springer, vol. 56(1), pages 1-22, February.
  • Handle: RePEc:spr:qualqt:v:56:y:2022:i:1:d:10.1007_s11135-021-01114-w
    DOI: 10.1007/s11135-021-01114-w

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11135-021-01114-w
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    References listed on IDEAS

    1. Jin Chen & Don Hossler, 2017. "The Effects of Financial Aid on College Success of Two-Year Beginning Nontraditional Students," Research in Higher Education, Springer;Association for Institutional Research, vol. 58(1), pages 40-76, February.
    2. Dougherty, Christopher, 2016. "Introduction to Econometrics," OUP Catalogue, Oxford University Press, edition 5, number 9780199676828.
    3. Olanrewaju Akande & Fan Li & Jerome Reiter, 2017. "An Empirical Comparison of Multiple Imputation Methods for Categorical Data," The American Statistician, Taylor & Francis Journals, vol. 71(2), pages 162-170, April.
    4. Doove, L.L. & Van Buuren, S. & Dusseldorp, E., 2014. "Recursive partitioning for missing data imputation in the presence of interaction effects," Computational Statistics & Data Analysis, Elsevier, vol. 72(C), pages 92-104.
    5. Nevena Zhelyazkova & Gilbert Ritschard, 2018. "Parental Leave Take-Up of Fathers in Luxembourg," Population Research and Policy Review, Springer;Southern Demographic Association (SDA), vol. 37(5), pages 769-793, October.
    6. Michael Greenacre & Rafael Pardo, 2006. "Subset Correspondence Analysis," Sociological Methods & Research, vol. 35(2), pages 193-218, November.

    Citations



    Cited by:

    1. Levy, Becca R. & Pietrzak, Robert H. & Slade, Martin D., 2023. "Societal impact on older persons’ chronic pain: Roles of age stereotypes, age attribution, and age discrimination," Social Science & Medicine, Elsevier, vol. 323(C).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Humera Razzak & Christian Heumann, 2019. "Hybrid Multiple Imputation In A Large Scale Complex Survey," Statistics in Transition New Series, Polish Statistical Association, vol. 20(4), pages 33-58, December.
    2. Razzak Humera & Heumann Christian, 2019. "Hybrid Multiple Imputation In A Large Scale Complex Survey," Statistics in Transition New Series, Polish Statistical Association, vol. 20(4), pages 33-58, December.
    3. Kenneth G. Stewart, 2019. "Suits' Watermelon Model: The Missing Simultaneous Equations Empirical Application," Journal of Economics Teaching, Journal of Economics Teaching, vol. 4(2), pages 115-139, December.
    4. Blasius, Jörg & Eilers, Paul H.C. & Gower, John, 2009. "Better biplots," Computational Statistics & Data Analysis, Elsevier, vol. 53(8), pages 3145-3158, June.
    5. Zachary H. Seeskin, 2016. "Evaluating the Use of Commercial Data to Improve Survey Estimates of Property Taxes," CARRA Working Papers 2016-06, Center for Economic Studies, U.S. Census Bureau.
    6. Youngjoo Cho & Debashis Ghosh, 2021. "Quantile-Based Subgroup Identification for Randomized Clinical Trials," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 13(1), pages 90-128, April.
    7. Takane, Yoshio & Jung, Sunho, 2009. "Regularized nonsymmetric correspondence analysis," Computational Statistics & Data Analysis, Elsevier, vol. 53(8), pages 3159-3170, June.
    8. Joshua D. Angrist & Jörn-Steffen Pischke, 2017. "Undergraduate Econometrics Instruction: Through Our Classes, Darkly," Journal of Economic Perspectives, American Economic Association, vol. 31(2), pages 125-144, Spring.
    9. Tessmann, R. & Elbert, R., 2022. "Multi sided platforms in competitive B2B networks with varying governmental influence – a taxonomy of Port and Cargo Community System business models," Publications of Darmstadt Technical University, Institute for Business Studies (BWL) 132320, Darmstadt Technical University, Department of Business Administration, Economics and Law, Institute for Business Studies (BWL).
    10. A. R. Linero, 2017. "Bayesian nonparametric analysis of longitudinal studies in the presence of informative missingness," Biometrika, Biometrika Trust, vol. 104(2), pages 327-341.
    11. Xiaofei Ma & Qiuyan Zhong, 2016. "Missing value imputation method for disaster decision-making using K nearest neighbor," Journal of Applied Statistics, Taylor & Francis Journals, vol. 43(4), pages 767-781, March.
    12. Khalid Mehmood & Sajjad Ahmad & Tariq Mehmood & Muhammad Mohsin & Muhammad Ishfaq, 2022. "Does Laffer Curve Exist in Tax Structure of Pakistan? A Threshold Regression Analysis," Journal of Economic Impact, Science Impact Publishers, vol. 4(1), pages 145-149.
    13. Saccaro, Alice & França, Marco Túlio Aniceto, 2020. "Stop-out and drop-out: The behavior of the first year withdrawal of students of the Brazilian higher education receiving FIES funding," International Journal of Educational Development, Elsevier, vol. 77(C).
    14. Kubiv Stepan, 2019. "Approximations and forecasting quasi-stationary processes with sudden runs," Technology audit and production reserves, 4(48) 2019, Socionet;Technology audit and production reserves, vol. 4(4(48)), pages 37-39.
    15. Roth, Jonathan & Lim, Benjamin & Jain, Rishee K. & Grueneich, Dian, 2020. "Examining the feasibility of using open data to benchmark building energy usage in cities: A data science and policy perspective," Energy Policy, Elsevier, vol. 139(C).
    16. Hipp, Lena & Schlüter, Charlotte & Molina, Stefania, 2022. "The role of employers in reducing the implementation gap in leave policies," Discussion Papers, Junior Research Group Work and Care SP I 2022-502, WZB Berlin Social Science Center.
    17. Steven D. Silver, 2018. "Multivariate methodology for discriminating market segments in urban commuting," Public Transport, Springer, vol. 10(1), pages 63-89, May.
    18. Hayes, Timothy & McArdle, John J., 2017. "Should we impute or should we weight? Examining the performance of two CART-based techniques for addressing missing data in small sample research with nonnormal variables," Computational Statistics & Data Analysis, Elsevier, vol. 115(C), pages 35-52.
    19. Novkovska, Blagica & Dumicic, Ksenija, 2019. "Ordering Goods And Services Online In South East European Countries: Comparison By Cluster Analysis," UTMS Journal of Economics, University of Tourism and Management, Skopje, Macedonia, vol. 10(2), pages 163-173.
    20. Dominique J. Baker & William R. Doyle, 2017. "Impact of Community College Student Debt Levels on Credit Accumulation," The ANNALS of the American Academy of Political and Social Science, vol. 671(1), pages 132-153, May.
