Explainable deep transfer learning model for disease risk prediction using high-dimensional genomic data

Authors

Long Liu
Qingyu Meng
Cherry Weng
Qing Lu
Tong Wang
Yalu Wen

Abstract

Building an accurate disease risk prediction model is an essential step towards precision medicine. While high-dimensional genomic data provide valuable resources for investigating disease risk, their high level of noise and the complex relationships between predictors and outcomes pose tremendous analytical challenges. Deep learning models are the state-of-the-art methods for many prediction tasks and offer a promising framework for the analysis of genomic data. However, deep learning models generally suffer from the curse of dimensionality and a lack of biological interpretability, both of which have greatly limited their applications. In this work, we have developed a deep neural network (DNN) based prediction modeling framework. We first propose a group-wise feature importance score for feature selection, with which genes harboring genetic variants with both linear and non-linear effects are efficiently detected. We then design an explainable transfer-learning based DNN method, which can directly incorporate information from feature selection and accurately capture complex predictive effects. The proposed DNN framework is biologically interpretable, as it is built on the selected predictive genes. It is also computationally efficient and can be applied to genome-wide data. Through extensive simulations and real data analyses, we have demonstrated that, compared with many existing methods, our proposed method can not only efficiently detect predictive features but also accurately predict disease risk.

Author summary: Accurate disease risk prediction is an essential step towards precision medicine. Deep learning models have achieved state-of-the-art performance for many prediction tasks. However, they generally suffer from the curse of dimensionality and a lack of biological interpretability, both of which have greatly limited their application to the prediction analysis of whole-genome sequencing data. We present here an explainable deep transfer learning model for the analysis of high-dimensional genomic data. Our proposed method can detect predictive genes that harbor genetic variants with both linear and non-linear effects via the proposed group-wise feature importance score. It can also efficiently and accurately model disease risk based on the detected predictive genes using the proposed transfer-learning based network architecture. Our proposed method is built at the gene level and is thus much more biologically interpretable. It is also computationally efficient and can be applied to whole-exome sequencing data that have millions of potential predictors. Through both simulation studies and the analysis of whole-exome data obtained from the Alzheimer’s Disease Neuroimaging Initiative, we have demonstrated that our method can efficiently detect predictive genes and has better prediction performance than many existing methods.
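
The abstract does not define the group-wise feature importance score, so the following is only a generic illustration of the idea of scoring features at the gene level, not the authors' method. It is a minimal sketch that swaps in a well-known technique, group permutation importance: all variants assigned to a gene are permuted together and the resulting increase in prediction error is taken as that gene's score. The function name, the toy genotype matrix, the gene-to-column mapping, and the stand-in `predict` function are all hypothetical.

```python
import numpy as np

def group_permutation_importance(predict, X, y, groups, n_repeats=5, seed=0):
    """Mean increase in MSE when all columns of a group are permuted jointly.

    Hypothetical helper for illustration; not the score defined in the paper.
    """
    rng = np.random.default_rng(seed)
    baseline = np.mean((y - predict(X)) ** 2)
    scores = {}
    for name, cols in groups.items():
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            perm = rng.permutation(X.shape[0])
            # Shuffle the rows of this gene's columns as one block, which
            # keeps within-gene structure but breaks the link to the outcome.
            X_perm[:, cols] = X[perm][:, cols]
            increases.append(np.mean((y - predict(X_perm)) ** 2) - baseline)
        scores[name] = float(np.mean(increases))
    return scores

# Toy data (assumed): 500 samples, 6 variants coded 0/1/2, grouped into 3 genes.
rng = np.random.default_rng(1)
n = 500
X = rng.integers(0, 3, size=(n, 6)).astype(float)
groups = {"gene_A": [0, 1], "gene_B": [2, 3], "gene_C": [4, 5]}

# Outcome depends on gene_A only, through a linear term and an interaction.
y = X[:, 0] + X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=n)

def predict(X):
    # Stand-in for a trained model; here it is simply the true signal.
    return X[:, 0] + X[:, 0] * X[:, 1]

scores = group_permutation_importance(predict, X, y, groups)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy setting only `gene_A` carries signal, so it should rank first, while the genes the stand-in model ignores receive a score of zero. Permuting a gene's variants as a block rather than one at a time lets a gene whose variants act only in combination still register as important.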

Suggested Citation

Long Liu & Qingyu Meng & Cherry Weng & Qing Lu & Tong Wang & Yalu Wen, 2022. "Explainable deep transfer learning model for disease risk prediction using high-dimensional genomic data," PLOS Computational Biology, Public Library of Science, vol. 18(7), pages 1-23, July.

Handle: RePEc:plo:pcbi00:1010328
DOI: 10.1371/journal.pcbi.1010328

Most related items

These are the items that most often cite the same works as this one and are cited by the same works as this one.

Debashis Ghosh & Michael S. Sabel, 2022. "A Weighted Sample Framework to Incorporate External Calculators for Risk Modeling," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 14(3), pages 363-379, December.
Aljoscha Benjamin Hwang & Guido Schuepfer & Mario Pietrini & Stefan Boes, 2021. "External validation of EPIC’s Risk of Unplanned Readmission model, the LACE+ index and SQLape as predictors of unplanned hospital readmissions: A monocentric, retrospective, diagnostic cohort study in Switzerland," PLOS ONE, Public Library of Science, vol. 16(11), pages 1-33, November.
Anna-Karin Ivert & Marie Torstensson Levander & Juan Merlo, 2013. "Adolescents' Utilisation of Psychiatric Care, Neighbourhoods and Neighbourhood Socioeconomic Deprivation: A Multilevel Analysis," PLOS ONE, Public Library of Science, vol. 8(11), pages 1-1, November.
Margaret Sullivan Pepe & Tianxi Cai & Gary Longton, 2006. "Combining Predictors for Classification Using the Area under the Receiver Operating Characteristic Curve," Biometrics, The International Biometric Society, vol. 62(1), pages 221-229, March.
Holly Janes & Margaret S. Pepe, 2008. "Matching in Studies of Classification Accuracy: Implications for Analysis, Efficiency, and Assessment of Incremental Value," Biometrics, The International Biometric Society, vol. 64(1), pages 1-9, March.
Carlos A Labarrere & John R Woods & James W Hardin & Beate R Jaeger & Marian Zembala & Mario C Deng & Ghassan S Kassab, 2014. "Early Inflammatory Markers Are Independent Predictors of Cardiac Allograft Vasculopathy in Heart-Transplant Recipients," PLOS ONE, Public Library of Science, vol. 9(12), pages 1-18, December.
Pia Kjær Kristensen & Raquel Perez-Vicente & George Leckie & Søren Paaske Johnsen & Juan Merlo, 2020. "Disentangling the contribution of hospitals and municipalities for understanding patient level differences in one-year mortality risk after hip-fracture: A cross-classified multilevel analysis in Sweden," PLOS ONE, Public Library of Science, vol. 15(6), pages 1-14, June.
Diego Tomassi & Liliana Forzani & Efstathia Bura & Ruth Pfeiffer, 2017. "Sufficient dimension reduction for censored predictors," Biometrics, The International Biometric Society, vol. 73(1), pages 220-231, March.
Osamu Komori, 2011. "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 63(5), pages 961-979, October.
Quang Bao Le & Boubaker Dhehibi, 2019. "A Typology-Based Approach for Assessing Qualities and Determinants of Adoption of Sustainable Water Use Technologies in Coping with Context Diversity: The Case of Mechanized Raised-Bed Technology in E," Sustainability, MDPI, vol. 11(19), pages 1-21, September.
Kenichi Hayashi & Shinto Eguchi, 2024. "A new integrated discrimination improvement index via odds," Statistical Papers, Springer, vol. 65(8), pages 4971-4990, October.
Tianle Chen & Yuanjia Wang & Huaihou Chen & Karen Marder & Donglin Zeng, 2014. "Targeted Local Support Vector Machine for Age-Dependent Classification," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 109(507), pages 1174-1187, September.
Hajime Uno & Tianxi Cai & Lu Tian & L. J. Wei, 2011. "Graphical Procedures for Evaluating Overall and Subject-Specific Incremental Values from New Predictors with Censored Event Time Data," Biometrics, The International Biometric Society, vol. 67(4), pages 1389-1396, December.
Anna Persmark & Maria Wemrell & Sofia Zettermark & George Leckie & S V Subramanian & Juan Merlo, 2019. "Precision public health: Mapping socioeconomic disparities in opioid dispensations at Swedish pharmacies by Multilevel Analysis of Individual Heterogeneity and Discriminatory Accuracy (MAIHDA)," PLOS ONE, Public Library of Science, vol. 14(8), pages 1-21, August.
Rebecca Yates Coley & Aaron J. Fisher & Mufaddal Mamawala & Herbert Ballentine Carter & Kenneth J. Pienta & Scott L. Zeger, 2017. "A Bayesian hierarchical model for prediction of latent health states from multiple data sources with application to active surveillance of prostate cancer," Biometrics, The International Biometric Society, vol. 73(2), pages 625-634, June.
Juan Merlo & Philippe Wagner & Nermin Ghith & George Leckie, 2016. "An Original Stepwise Multilevel Logistic Regression Analysis of Discriminatory Accuracy: The Case of Neighbourhoods and Health," PLOS ONE, Public Library of Science, vol. 11(4), pages 1-31, April.
Michael Lebenbaum & Osvaldo Espin-Garcia & Yi Li & Laura C Rosella, 2018. "Development and validation of a population based risk algorithm for obesity: The Obesity Population Risk Tool (OPoRT)," PLOS ONE, Public Library of Science, vol. 13(1), pages 1-11, January.
Michael King & Louise Marston & Igor Švab & Heidi-Ingrid Maaroos & Mirjam I Geerlings & Miguel Xavier & Vicente Benjamin & Francisco Torres-Gonzalez & Juan Angel Bellon-Saameno & Danica Rotar & Anu Al, 2011. "Development and Validation of a Risk Model for Prediction of Hazardous Alcohol Consumption in General Practice Attendees: The PredictAL Study," PLOS ONE, Public Library of Science, vol. 6(8), pages 1-10, August.
Haleh Yasrebi & Peter Sperisen & Viviane Praz & Philipp Bucher, 2009. "Can Survival Prediction Be Improved By Merging Gene Expression Data Sets?," PLOS ONE, Public Library of Science, vol. 4(10), pages 1-14, October.
Shai Mulinari & Sol Pia Juárez & Philippe Wagner & Juan Merlo, 2015. "Does Maternal Country of Birth Matter for Understanding Offspring’s Birthweight? A Multilevel Analysis of Individual Heterogeneity in Sweden," PLOS ONE, Public Library of Science, vol. 10(5), pages 1-19, May.
