
Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations

Author

Listed:
  • Nicholas Tierney
  • Dianne Cook

Abstract

Despite the large body of research on missing value distributions and imputation, there is comparatively little literature on how to make it easy to handle, explore, and impute missing values in data. This paper addresses this gap. The new methodology builds upon tidy data principles, with the goal of integrating missing value handling into data analysis workflows. New data structures are defined along with new functions (verbs) to perform common operations. Together these provide a cohesive framework for handling, exploring, and imputing missing values. These methods have been made available in the R package naniar.
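
As a rough illustration of what a tidy-style missing-data structure and verb could look like, the sketch below is written in plain Python rather than R and does not reproduce naniar's API. It assumes one possible design: a "shadow" table that mirrors the data and marks each cell as "NA" or "!NA", plus a per-variable summary with one row per variable. The function names (shadow_of, bind_shadow_cols, missing_summary), the "_NA" column suffix, and the toy records are all hypothetical choices made for this example.

```python
# Hypothetical sketch: None stands in for a missing value.

def shadow_of(rows):
    """Return a parallel table marking each cell as "NA" or "!NA"."""
    return [
        {f"{name}_NA": ("NA" if value is None else "!NA")
         for name, value in row.items()}
        for row in rows
    ]


def bind_shadow_cols(rows):
    """Place the shadow columns next to the original columns, row by row."""
    return [{**row, **marks} for row, marks in zip(rows, shadow_of(rows))]


def missing_summary(rows):
    """One record per variable: count and percentage of missing values,
    ordered from most to least missing."""
    if not rows:
        return []
    n_rows = len(rows)
    summary = []
    for name in rows[0]:
        n_miss = sum(1 for row in rows if row.get(name) is None)
        summary.append({
            "variable": name,
            "n_miss": n_miss,
            "pct_miss": 100.0 * n_miss / n_rows,
        })
    return sorted(summary, key=lambda rec: rec["n_miss"], reverse=True)


# Toy records invented for illustration only.
air = [
    {"ozone": 41.0, "solar": 190.0, "temp": 67},
    {"ozone": None, "solar": None, "temp": 72},
    {"ozone": 12.0, "solar": 149.0, "temp": 74},
    {"ozone": None, "solar": 313.0, "temp": 62},
]

for record in missing_summary(air):
    print(record)

with_shadow = bind_shadow_cols(air)
print(with_shadow[1])
```

Keeping the indicator columns next to the data means ordinary filtering and grouping can condition on missingness, for example comparing temp between rows where ozone_NA is "NA" and rows where it is "!NA".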

Suggested Citation

  • Nicholas Tierney & Dianne Cook, 2018. "Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations," Monash Econometrics and Business Statistics Working Papers 14/18, Monash University, Department of Econometrics and Business Statistics.
  • Handle: RePEc:msh:ebswps:2018-14

    Download full text from publisher

    File URL: https://www.monash.edu/business/ebs/research/publications/ebs/wp14-2018.pdf
    Download Restriction: no



    More about this item

    Keywords

    workflow; statistical computing; data science; data visualization; tidyverse; data pipeline

    JEL classification:

    • C10 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General - - - General
    • C14 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General - - - Semiparametric and Nonparametric Methods: General
    • C22 - Mathematical and Quantitative Methods - - Single Equation Models; Single Variables - - - Time-Series Models; Dynamic Quantile Regressions; Dynamic Treatment Effect Models; Diffusion Processes
