Gene expression based survival prediction for cancer patients—A topic modeling approach

Author

Listed:
  • Luke Kumar
  • Russell Greiner

Abstract

Cancer is one of the leading causes of death worldwide. Many believe that genomic data will enable us to better predict the survival time of cancer patients, which will lead to better, more personalized treatment options and patient care. Because standard survival prediction models struggle with the high dimensionality of gene expression data, many projects apply dimensionality reduction techniques to overcome this hurdle. We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from high-dimensional gene expression data. In topic modeling, a document is represented as a mixture over a relatively small number of topics, where each topic corresponds to a distribution over words. Here, to accommodate the heterogeneity of a patient’s cancer, we represent each patient (≈ document) as a mixture over cancer-topics, where each cancer-topic is a mixture over gene expression values (≈ words). This required some extensions to the standard LDA model (e.g., to accommodate the real-valued expression values), leading to our novel discretized Latent Dirichlet Allocation (dLDA) procedure. After using dLDA to learn these cancer-topics, we express each patient as a distribution over a small number of cancer-topics, then use this low-dimensional “distribution vector” as input to a learning algorithm; here, we ran the recent survival prediction algorithm MTLR on this representation of the cancer dataset. We initially focus on the METABRIC dataset, which describes each of n = 1,981 breast cancer patients using r = 49,576 gene expression values from microarrays. Our results show that our approach (dLDA followed by MTLR) provides survival estimates that are more accurate than standard models, in terms of the standard Concordance measure.
We then validate this “dLDA+MTLR” approach by running it on the n = 883 Pan-kidney (KIPAN) dataset, over r = 15,529 gene expression values (here using the mRNAseq modality), and find that it again achieves excellent results. In both cases, we also show that the resulting model is calibrated, using the recent “D-calibrated” measure. These successes, in two different cancer types and expression modalities, demonstrate the generality and effectiveness of this approach. The dLDA+MTLR source code is available at https://github.com/nitsanluke/GE-LDA-Survival.
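
As a rough illustration of the two ends of the pipeline described in the abstract, the sketch below shows (1) one possible way to turn real-valued expression values into word-count "documents" that a standard LDA implementation could consume, and (2) a plain implementation of the concordance index used to score survival predictions. The discretization scheme here (z-score each gene, then emit hypothetical "up"/"down" words whose counts are the rounded, capped z-score magnitude) is an assumption made for illustration only; the abstract does not specify how dLDA discretizes values. The function names, the `n_levels` parameter, and the word labels are likewise hypothetical. Fitting the topic model and MTLR is not shown.

```python
from statistics import mean, pstdev


def discretize_patients(expr, n_levels=3):
    """Turn real-valued expression rows into word-count 'documents'.

    expr: list of patients, each a list of expression values (one per gene).
    Assumed scheme (not taken from the paper): gene j yields the words
    'g{j}_up' or 'g{j}_down', with a count equal to the rounded z-score
    magnitude, capped at n_levels. Genes near their mean emit no word.
    """
    n_genes = len(expr[0])
    cols = [[row[j] for row in expr] for j in range(n_genes)]
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant genes: avoid dividing by 0
    docs = []
    for row in expr:
        doc = {}
        for j, x in enumerate(row):
            z = (x - mus[j]) / sds[j]
            count = min(n_levels, int(round(abs(z))))
            if count > 0:
                direction = "up" if z > 0 else "down"
                doc[f"g{j}_{direction}"] = count
        docs.append(doc)
    return docs


def concordance_index(times, events, risks):
    """Concordance: fraction of comparable patient pairs ranked correctly.

    A pair (i, j) is comparable when patient i has the shorter time and
    patient i's event was observed (not censored). A higher risk score
    should correspond to a shorter survival time; risk ties count as 0.5.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")
```

In such a setup, the dictionaries returned by `discretize_patients` would be converted to a patient-by-word count matrix for topic modeling, and the risk scores passed to `concordance_index` would come from whatever survival model is trained on the resulting topic distributions.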

Suggested Citation

  • Luke Kumar & Russell Greiner, 2019. "Gene expression based survival prediction for cancer patients—A topic modeling approach," PLOS ONE, Public Library of Science, vol. 14(11), pages 1-30, November.
  • Handle: RePEc:plo:pone00:0224446
    DOI: 10.1371/journal.pone.0224446

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0224446
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0224446&type=printable
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Yanfeng Wang & Haohao Wang & Sanyi Li & Lidong Wang, 2022. "Survival Risk Prediction of Esophageal Cancer Based on the Kohonen Network Clustering Algorithm and Kernel Extreme Learning Machine," Mathematics, MDPI, vol. 10(9), pages 1-20, April.
    2. Stefanie Hieke & Axel Benner & Richard F Schlenk & Martin Schumacher & Lars Bullinger & Harald Binder, 2016. "Identifying Prognostic SNPs in Clinical Cohorts: Complementing Univariate Analyses by Resampling and Multivariable Modeling," PLOS ONE, Public Library of Science, vol. 11(5), pages 1-18, May.
    3. Yu Takagi & Hirokazu Matsuda & Yukio Taniguchi & Hiroaki Iwaisaki, 2014. "Predicting the Phenotypic Values of Physiological Traits Using SNP Genotype and Gene Expression Data in Mice," PLOS ONE, Public Library of Science, vol. 9(12), pages 1-17, December.
    4. Hapfelmeier, A. & Ulm, K., 2013. "A new variable selection approach using Random Forests," Computational Statistics & Data Analysis, Elsevier, vol. 60(C), pages 50-69.
    5. Ming Yi & Ruoqing Zhu & Robert M Stephens, 2018. "GradientScanSurv—An exhaustive association test method for gene expression data with censored survival outcome," PLOS ONE, Public Library of Science, vol. 13(12), pages 1-28, December.
    6. Armin Rauschenberger & Iuliana Ciocănea-Teodorescu & Marianne A. Jonker & Renée X. Menezes & Mark A. Wiel, 2020. "Sparse classification with paired covariates," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 14(3), pages 571-588, September.
    7. Farcomeni, Alessio & Nardi, Alessandra, 2010. "A two-component Weibull mixture to model early and late mortality in a Bayesian framework," Computational Statistics & Data Analysis, Elsevier, vol. 54(2), pages 416-428, February.
    8. Antoniadis, Anestis & Fryzlewicz, Piotr & Letué, Frédérique, 2010. "The Dantzig selector in Cox's proportional hazards model," LSE Research Online Documents on Economics 30992, London School of Economics and Political Science, LSE Library.
    9. Isabella Zwiener & Barbara Frisch & Harald Binder, 2014. "Transforming RNA-Seq Data to Improve the Performance of Prognostic Gene Signatures," PLOS ONE, Public Library of Science, vol. 9(1), pages 1-13, January.
    10. Zhao, Xiaobing & Zhou, Xian, 2014. "Sufficient dimension reduction on marginal regression for gaps of recurrent events," Journal of Multivariate Analysis, Elsevier, vol. 127(C), pages 56-71.
    11. Wei Zhang & Takayo Ota & Viji Shridhar & Jeremy Chien & Baolin Wu & Rui Kuang, 2013. "Network-based Survival Analysis Reveals Subnetwork Signatures for Predicting Outcomes of Ovarian Cancer Treatment," PLOS Computational Biology, Public Library of Science, vol. 9(3), pages 1-16, March.
    12. Julia Gilhodes & Florence Dalenc & Jocelyn Gal & Christophe Zemmour & Eve Leconte & Jean Marie Boher & Thomas Filleron, 2020. "Comparison of Variable Selection Methods for Time-to-Event Data in High-Dimensional Settings," Post-Print hal-02934793, HAL.
    13. Xiaolin Chen & Catherine Chunling Liu & Sheng Xu, 2021. "An efficient algorithm for joint feature screening in ultrahigh-dimensional Cox’s model," Computational Statistics, Springer, vol. 36(2), pages 885-910, June.
    14. Emura, Takeshi & Chen, Yi-Hau & Chen, Hsuan-Yu, 2012. "Survival prediction based on compound covariate under cox proportional hazard models," MPRA Paper 41149, University Library of Munich, Germany.
    15. Anestis Antoniadis & Piotr Fryzlewicz & Frédérique Letué, 2010. "The Dantzig Selector in Cox's Proportional Hazards Model," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 37(4), pages 531-552, December.
    16. Christine W Duarte & Christopher D Willey & Degui Zhi & Xiangqin Cui & Jacqueline J Harris & Laura Kelly Vaughan & Tapan Mehta & Raymond O McCubrey & Nikolai N Khodarev & Ralph R Weichselbaum & G Yanc, 2012. "Expression Signature of IFN/STAT1 Signaling Genes Predicts Poor Survival Outcome in Glioblastoma Multiforme in a Subtype-Specific Manner," PLOS ONE, Public Library of Science, vol. 7(1), pages 1-8, January.
