
A cross‐validation statistical framework for asymmetric data integration

Authors
  • Lam Tran
  • Kevin He
  • Di Wang
  • Hui Jiang

Abstract

The proliferation of biobanks and large public clinical data sets enables their integration with a smaller amount of locally gathered data for the purposes of parameter estimation and model prediction. However, public data sets may be subject to context‐dependent confounders and the protocols behind their generation are often opaque; naively integrating all external data sets equally can bias estimates and lead to spurious conclusions. Weighted data integration is a potential solution, but current methods still require subjective specifications of weights and can become computationally intractable. Under the assumption that local data are generated from the set of unknown true parameters, we propose a novel weighted integration method based upon using the external data to minimize the local data leave‐one‐out cross validation (LOOCV) error. We demonstrate how the optimization of LOOCV errors for linear and Cox proportional hazards models can be rewritten as functions of external data set integration weights. Significant reductions in estimation error and prediction error are shown using simulation studies mimicking the heterogeneity of clinical data as well as a real‐world example using kidney transplant patients from the Scientific Registry of Transplant Recipients.
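The core idea for the linear model can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract says the LOOCV error is rewritten as a function of the integration weights, whereas the code below simply refits a weighted least-squares model for each left-out local observation and searches a small grid of weights. The function names, the simulated data, the grid, and the choice of giving local rows weight 1 are all assumptions made for illustration.

```python
import numpy as np

def weighted_fit(X_loc, y_loc, externals, weights):
    """Weighted least squares: local rows get weight 1 (an assumption),
    every row of external data set k gets weight weights[k]."""
    A = X_loc.T @ X_loc
    b = X_loc.T @ y_loc
    for (X_k, y_k), w in zip(externals, weights):
        A = A + w * (X_k.T @ X_k)
        b = b + w * (X_k.T @ y_k)
    return np.linalg.solve(A, b)

def local_loocv_error(X_loc, y_loc, externals, weights):
    """Mean squared LOOCV error on the local data only.
    External data are never held out; they only inform the fit."""
    n = X_loc.shape[0]
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta = weighted_fit(X_loc[keep], y_loc[keep], externals, weights)
        total += (y_loc[i] - X_loc[i] @ beta) ** 2
    return total / n

def simulate(seed=0):
    """Hypothetical data: a small local set, one external set sharing the
    local parameters, and one external set with shifted parameters."""
    rng = np.random.default_rng(seed)
    beta_true = np.array([1.0, -2.0, 0.5])

    def draw(n, beta):
        X = rng.normal(size=(n, 3))
        return X, X @ beta + rng.normal(size=n)

    local = draw(30, beta_true)
    good = draw(300, beta_true)
    biased = draw(300, beta_true + np.array([5.0, 5.0, -5.0]))
    return local, [good, biased]

(local_X, local_y), externals = simulate()

# Brute-force grid search over one weight per external data set.
grid = [0.0, 0.25, 0.5, 1.0]
scores = {
    (w1, w2): local_loocv_error(local_X, local_y, externals, (w1, w2))
    for w1 in grid
    for w2 in grid
}
best_weights = min(scores, key=scores.get)
print("weights minimizing local LOOCV error:", best_weights)
```

Because only local observations are held out, an external data set that disagrees with the local data raises the local LOOCV error as its weight grows, which is what lets the error act as a data-driven criterion for the weights. Refitting once per left-out point is affordable only for small local samples and coarse grids.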

Suggested Citation

  • Lam Tran & Kevin He & Di Wang & Hui Jiang, 2023. "A cross‐validation statistical framework for asymmetric data integration," Biometrics, The International Biometric Society, vol. 79(2), pages 1280-1292, June.
  • Handle: RePEc:bla:biomet:v:79:y:2023:i:2:p:1280-1292
    DOI: 10.1111/biom.13685

    Download full text from publisher

    File URL: https://doi.org/10.1111/biom.13685
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Huangdi Yi & Qingzhao Zhang & Cunjie Lin & Shuangge Ma, 2022. "Information‐incorporated Gaussian graphical model for gene expression data," Biometrics, The International Biometric Society, vol. 78(2), pages 512-523, June.
    2. Pei Wang & Shunjie Chen & Sijia Yang, 2022. "Recent Advances on Penalized Regression Models for Biological Data," Mathematics, MDPI, vol. 10(19), pages 1-24, October.
    3. Haibo Chu & Jiahua Wei & Yuan Jiang, 2021. "Middle- and Long-Term Streamflow Forecasting and Uncertainty Analysis Using Lasso-DBN-Bootstrap Model," Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), Springer;European Water Resources Association (EWRA), vol. 35(8), pages 2617-2632, June.
    4. Zhang Haixiang & Zheng Yinan & Zhang Zhou & Gao Tao & Joyce Brian & Zhang Wei & Hou Lifang & Liu Lei & Yoon Grace & Schwartz Joel & Vokonas Pantel & Colicino Elena & Baccarelli Andrea, 2017. "Regularized estimation in sparse high-dimensional multivariate regression, with application to a DNA methylation study," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 16(3), pages 159-171, August.
    5. Chen, Shunjie & Yang, Sijia & Wang, Pei & Xue, Liugen, 2023. "Two-stage penalized algorithms via integrating prior information improve gene selection from omics data," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 628(C).
    6. Lee, Juyong & Reiner, David M., 2023. "Determinants of public preferences on low-carbon energy sources: Evidence from the United Kingdom," Energy, Elsevier, vol. 284(C).
    7. Xu, Ganggang & Zhu, Huirong & Lee, J. Jack, 2020. "Borrowing strength and borrowing index for Bayesian hierarchical models," Computational Statistics & Data Analysis, Elsevier, vol. 144(C).
    8. Kristoffer Pons Bertelsen, 2022. "The Prior Adaptive Group Lasso and the Factor Zoo," CREATES Research Papers 2022-05, Department of Economics and Business Economics, Aarhus University.
