
Importance of missingness in baseline variables: A case study of the All of Us Research Program

Author

Listed:
  • Robert M Cronin
  • Xiaoke Feng
  • Lina Sulieman
  • Brandy Mapes
  • Shawn Garbett
  • Ashley Able
  • Ryan Hale
  • Mick P Couper
  • Heather Sansbury
  • Brian K Ahmedani
  • Qingxia Chen

Abstract

Objective: The All of Us Research Program collects data from multiple information sources, including health surveys, to build a national longitudinal research repository that researchers can use to advance precision medicine. Missing survey responses pose challenges to the validity of study conclusions. We describe missingness in the All of Us baseline surveys.

Study design and setting: We extracted survey responses submitted between May 31, 2017, and September 30, 2020. Missing percentages for groups historically underrepresented in biomedical research were compared with those for represented groups. Associations of missing percentages with age, health literacy score, and survey completion date were evaluated. We used negative binomial regression to evaluate the association of participant characteristics with the number of missed questions out of the total eligible questions for each participant.

Results: The analyzed dataset contained data for 334,183 participants who submitted at least one baseline survey. Almost all participants (97.0%) completed all baseline surveys, and only 541 (0.2%) skipped all questions in at least one of the baseline surveys. The median skip rate was 5.0% of the questions, with an interquartile range (IQR) of 2.5% to 7.9%. Historically underrepresented groups were associated with higher missingness (incidence rate ratio (IRR) [95% CI]: 1.26 [1.25, 1.27] for Black/African American compared to White). Missing percentages were similar by survey completion date, participant age, and health literacy score. Skipping specific questions was associated with higher missingness (IRRs [95% CI]: 1.39 [1.38, 1.40] for skipping income, 1.92 [1.89, 1.95] for skipping education, and 2.19 [2.09, 2.30] for skipping sexual and gender questions).

Conclusion: Surveys in the All of Us Research Program will form an essential component of the data researchers can use to perform their analyses. Missingness was low in All of Us baseline surveys, but group differences exist. Additional statistical methods and careful analysis of surveys could help mitigate challenges to the validity of conclusions.
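
To make the reported incidence rate ratios easier to read, the following is a minimal sketch of how an IRR and its 95% confidence interval relate to a log-scale regression coefficient. It assumes the usual Wald construction, IRR = exp(beta) with interval exp(beta ± 1.96·SE), which the abstract does not state explicitly. The function names are hypothetical, and the coefficient and standard error are backed out from the published, rounded figures rather than taken from the authors' model output.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def log_scale_from_irr(irr, lower, upper, z=Z_95):
    """Approximate the log-scale coefficient and standard error
    implied by a reported IRR and its confidence interval.
    Assumes a symmetric Wald interval on the log scale."""
    beta = math.log(irr)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


def irr_with_ci(beta, se, z=Z_95):
    """Convert a log-scale coefficient and standard error
    into an IRR with a Wald confidence interval."""
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


# Figures reported in the abstract: Black/African American vs. White.
beta, se = log_scale_from_irr(1.26, 1.25, 1.27)
irr, lower, upper = irr_with_ci(beta, se)

print(f"implied beta = {beta:.4f}, implied SE = {se:.4f}")
print(f"IRR = {irr:.2f} [{lower:.2f}, {upper:.2f}]")
```

Under this reading, an IRR of 1.26 corresponds to an expected rate of missed questions about 26% higher than in the reference group, holding the other model covariates fixed. Because the inputs are rounded to two decimals, the implied standard error is only approximate.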

Suggested Citation

  • Robert M Cronin & Xiaoke Feng & Lina Sulieman & Brandy Mapes & Shawn Garbett & Ashley Able & Ryan Hale & Mick P Couper & Heather Sansbury & Brian K Ahmedani & Qingxia Chen, 2023. "Importance of missingness in baseline variables: A case study of the All of Us Research Program," PLOS ONE, Public Library of Science, vol. 18(5), pages 1-11, May.
  • Handle: RePEc:plo:pone00:0285848
    DOI: 10.1371/journal.pone.0285848

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0285848
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0285848&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pone.0285848?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Ryo Kato & Takahiro Hoshino, 2020. "Semiparametric Bayesian multiple imputation for regression models with missing mixed continuous–discrete covariates," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 72(3), pages 803-825, June.
    2. Li Cai & Lijie Gu & Qihua Wang & Suojin Wang, 2021. "Simultaneous confidence bands for nonparametric regression with missing covariate data," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 73(6), pages 1249-1279, December.
    3. McDonough, Ian K. & Millimet, Daniel L., 2017. "Missing data, imputation, and endogeneity," Journal of Econometrics, Elsevier, vol. 199(2), pages 141-155.
    4. J. Andrew Royle, 2009. "Analysis of Capture–Recapture Models with Individual Covariates Using Data Augmentation," Biometrics, The International Biometric Society, vol. 65(1), pages 267-274, March.
    5. Xie Yanmei & Zhang Biao, 2017. "Empirical Likelihood in Nonignorable Covariate-Missing Data Problems," The International Journal of Biostatistics, De Gruyter, vol. 13(1), pages 1-20, May.
    6. Breunig, Christoph, 2015. "Testing missing at random using instrumental variables," SFB 649 Discussion Papers 2015-016, Humboldt University Berlin, Collaborative Research Center 649: Economic Risk.
    7. Jiang, Depeng & Zhao, Puying & Tang, Niansheng, 2016. "A propensity score adjustment method for regression models with nonignorable missing covariates," Computational Statistics & Data Analysis, Elsevier, vol. 94(C), pages 98-119.
    8. Lei Jin & Suojin Wang, 2010. "A Model Validation Procedure when Covariate Data are Missing at Random," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 37(3), pages 403-421, September.
    9. J. F. Lawless, 2018. "Two-phase outcome-dependent studies for failure times and testing for effects of expensive covariates," Lifetime Data Analysis: An International Journal Devoted to Statistical Methods and Applications for Time-to-Event Data, Springer, vol. 24(1), pages 28-44, January.
    10. Hui Yao & Sungduk Kim & Ming-Hui Chen & Joseph G. Ibrahim & Arvind K. Shah & Jianxin Lin, 2015. "Bayesian Inference for Multivariate Meta-Regression With a Partially Observed Within-Study Sample Covariance Matrix," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 110(510), pages 528-544, June.
    11. Yi Qian & Hui Xie, 2011. "No Customer Left Behind: A Distribution-Free Bayesian Approach to Accounting for Missing Xs in Marketing Models," Marketing Science, INFORMS, vol. 30(4), pages 717-736, July.
    12. Jiang, Wei & Josse, Julie & Lavielle, Marc, 2020. "Logistic regression with missing covariates—Parameter estimation, model selection and prediction within a joint-modeling framework," Computational Statistics & Data Analysis, Elsevier, vol. 145(C).
    13. Jared S. Murray & Jerome P. Reiter, 2016. "Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models With Local Dependence," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(516), pages 1466-1479, October.
    14. Han, Peisong, 2012. "A note on improving the efficiency of inverse probability weighted estimator using the augmentation term," Statistics & Probability Letters, Elsevier, vol. 82(12), pages 2221-2228.
    15. Baojiang Chen & Xiao-Hua Zhou, 2011. "Doubly Robust Estimates for Binary Longitudinal Data Analysis with Missing Response and Missing Covariates," Biometrics, The International Biometric Society, vol. 67(3), pages 830-842, September.
    16. Zhuoer Sun & Suojin Wang, 2019. "Semiparametric estimation in regression with missing covariates using single-index models," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 71(5), pages 1201-1232, October.
    17. Breunig, Christoph, 2017. "Testing missing at random using instrumental variables," SFB 649 Discussion Papers 2017-007, Humboldt University Berlin, Collaborative Research Center 649: Economic Risk.
    18. Chen, Xue-Dong & Fu, Ying-Zi, 2011. "Model selection for zero-inflated regression with missing covariates," Computational Statistics & Data Analysis, Elsevier, vol. 55(1), pages 765-773, January.
    19. Hongtu Zhu & Joseph G. Ibrahim & Xiaoyan Shi, 2009. "Diagnostic Measures for Generalized Linear Models with Missing Covariates," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 36(4), pages 686-712, December.
    20. Yang, Ying & Kang, Jian, 2010. "Joint analysis of mixed Poisson and continuous longitudinal data with nonignorable missing values," Computational Statistics & Data Analysis, Elsevier, vol. 54(1), pages 193-207, January.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.