Author
Listed:
- Lulu Pan
- Qian Gao
- Kecheng Wei
- Yongfu Yu
- Guoyou Qin
- Tong Wang
Abstract
Transfer learning aims to integrate useful information from multi-source datasets to improve learning performance on target data. This can be applied effectively in genomics: when learning gene associations in a target tissue, data from other tissues can be integrated. However, heavy-tailed distributions and outliers are common in genomics data, which challenges the effectiveness of current transfer learning approaches. In this paper, we study the transfer learning problem under high-dimensional linear models with t-distributed error (Trans-PtLR), which aims to improve estimation and prediction for target data by borrowing information from useful source data while offering robustness to complex data with heavy tails and outliers. In the oracle case with known transferable source datasets, a transfer learning algorithm based on penalized maximum likelihood and the expectation-maximization algorithm is established. To avoid including non-informative sources, we propose selecting the transferable sources by cross-validation. Extensive simulation experiments as well as an application demonstrate that, when heavy tails and outliers are present, Trans-PtLR is more robust and achieves better estimation and prediction performance than transfer learning for the linear regression model with normally distributed errors.
Keywords: Data integration, Variable selection, T distribution, Expectation maximization algorithm, Genotype-Tissue Expression, Cross validation.
Author summary: Many genetic loci have been shown to be associated with the mechanisms underlying the onset of important diseases. Studying the expression of important genes therefore contributes to the diagnosis and treatment of diseases. However, limited target gene expression data poses challenges to studying gene regulation. How to effectively integrate gene expression data from multiple sources is a key issue that needs to be addressed.
In this study, we propose a robust transfer learning method aimed at improving estimation and prediction performance for target gene expression data by integrating information from multiple data sources. By introducing a high-dimensional linear regression model with t-distributed errors, our method addresses the shortcomings of previous transfer learning methods when faced with heavy-tailed distributions and outliers in genomics, providing robustness to complex data features. Extensive simulation experiments and an application demonstrate that our method exhibits better estimation and prediction performance when dealing with gene expression data with heavy-tailed distributions and outliers.
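To make the combination of penalized maximum likelihood and expectation-maximization mentioned in the abstract more concrete, the following is a minimal sketch for a single dataset only. It is not the authors' Trans-PtLR implementation and it omits the transfer step and the cross-validated source selection. It assumes a fixed degrees-of-freedom value `nu`, an L1 (lasso) penalty, and a simplified penalty scaling; the function names `soft_threshold`, `weighted_lasso`, and `em_t_lasso` are hypothetical and chosen for illustration. The E-step computes the usual t-model weights, which shrink toward zero for observations with large residuals, and the M-step solves a weighted lasso by coordinate descent.

```python
import random

def soft_threshold(z, lam):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def predict(X, beta):
    """Linear predictor x_i' beta for every row of X."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]

def weighted_lasso(X, y, w, lam, beta, n_sweeps=20):
    """Coordinate descent for (1/2n) * sum_i w_i (y_i - x_i'b)^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = list(beta)
    resid = [yi - fi for yi, fi in zip(y, predict(X, beta))]
    for _ in range(n_sweeps):
        for j in range(p):
            scale = sum(w[i] * X[i][j] ** 2 for i in range(n)) / n
            if scale == 0.0:
                continue
            # Weighted correlation of column j with the partial residual.
            rho = sum(w[i] * X[i][j] * (resid[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / scale
            for i in range(n):
                resid[i] += X[i][j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

def em_t_lasso(X, y, lam=0.05, nu=3.0, n_iter=20):
    """EM for linear regression with t-distributed errors and an L1 penalty.

    Assumptions of this sketch: nu is fixed, not estimated, and the penalty
    is applied to the weighted least-squares criterion without rescaling.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    mean_y = sum(y) / n
    sigma2 = sum((v - mean_y) ** 2 for v in y) / n
    w = [1.0] * n
    for _ in range(n_iter):
        resid = [yi - fi for yi, fi in zip(y, predict(X, beta))]
        # E-step: expected latent precision weight of each observation.
        w = [(nu + 1.0) / (nu + r * r / sigma2) for r in resid]
        # M-step: weighted lasso for beta, then weighted update of the scale.
        beta = weighted_lasso(X, y, w, lam, beta)
        resid = [yi - fi for yi, fi in zip(y, predict(X, beta))]
        sigma2 = max(sum(wi * r * r for wi, r in zip(w, resid)) / n, 1e-12)
    return beta, sigma2, w

# Synthetic illustration (made-up data): sparse signal plus one gross outlier.
rng = random.Random(0)
n, beta_true = 100, [2.0, -1.5, 0.0, 0.0, 1.0]
X = [[rng.gauss(0.0, 1.0) for _ in beta_true] for _ in range(n)]
y = [f + rng.gauss(0.0, 0.5) for f in predict(X, beta_true)]
y[0] += 50.0  # contaminate the first observation
beta_hat, sigma2_hat, weights = em_t_lasso(X, y)
```

In this synthetic example the contaminated first observation receives the smallest weight, so it has little influence on `beta_hat`. A fit under normal errors would correspond to holding all weights equal, which is why outliers affect it more strongly.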
Suggested Citation
Lulu Pan & Qian Gao & Kecheng Wei & Yongfu Yu & Guoyou Qin & Tong Wang, 2025.
"A robust transfer learning approach for high-dimensional linear regression to support integration of multi-source gene expression data,"
PLOS Computational Biology, Public Library of Science, vol. 21(1), pages 1-16, January.
Handle:
RePEc:plo:pcbi00:1012739
DOI: 10.1371/journal.pcbi.1012739