
Visualizing the structure of RNA-seq expression data using grade of membership models

Authors

  • Kushal K Dey
  • Chiaowen Joyce Hsiao
  • Matthew Stephens

Abstract

Grade of membership models, also known as “admixture models”, “topic models” or “Latent Dirichlet Allocation”, are a generalization of cluster models that allow each sample to have membership in multiple clusters. These models are widely used in population genetics to model admixed individuals who have ancestry from multiple “populations”, and in natural language processing to model documents having words from multiple “topics”. Here we illustrate the potential for these models to cluster samples of RNA-seq gene expression data, measured on either bulk samples or single cells. We also provide methods to help interpret the clusters, by identifying genes that are distinctively expressed in each cluster. By applying these methods to several example RNA-seq applications we demonstrate their utility in identifying and summarizing structure and heterogeneity. Applied to data from the GTEx project on 53 human tissues, the approach highlights similarities among biologically-related tissues and identifies distinctively-expressed genes that recapitulate known biology. Applied to single-cell expression data from mouse preimplantation embryos, the approach highlights both discrete and continuous variation through early embryonic development stages, and highlights genes involved in a variety of relevant processes, from germ cell development, through compaction and morula formation, to the formation of inner cell mass and trophoblast at the blastocyst stage. The methods are implemented in the Bioconductor package CountClust.

Author summary

The gene expression profile of a biological sample (from either single cells or pooled cells) results from a complex interplay of multiple related biological processes. Consequently, distal tissue samples, for example, may share a similar gene expression profile through some common underlying biological processes. Our goal here is to illustrate that grade of membership (GoM) models, an approach widely used in population genetics to cluster admixed individuals who have ancestry from multiple populations, provide an attractive approach for clustering biological samples of RNA sequencing data. The GoM model allows each biological sample to have partial memberships in multiple biologically-distinct clusters, in contrast to traditional clustering methods that partition samples into distinct subgroups. We also provide methods for identifying genes that are distinctively expressed in each cluster to help biologically interpret the results. Applied to a dataset of 53 human tissues, the GoM approach highlights similarities among biologically-related tissues and identifies distinctively-expressed genes that recapitulate known biology. Applied to gene expression data of single cells from mouse preimplantation embryos, the approach highlights both discrete and continuous variation through early embryonic development stages, and genes involved in a variety of relevant processes. Our study highlights the potential of GoM models for elucidating biological structure in RNA-seq gene expression data.
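
To make the idea of partial membership concrete, the following is a minimal sketch in Python, not the CountClust implementation. It assumes the common multinomial formulation of a grade of membership model, in which the read counts of each sample are drawn with gene proportions given by a membership-weighted mixture of cluster expression profiles, and fits it by plain maximum-likelihood EM with no priors. The function names (`fit_gom`, `log_likelihood`, `distinctive_genes`), the variable names `omega` and `theta`, the toy count matrix, and the Poisson-style divergence used to rank genes are all assumptions made for illustration. Every sample is assumed to have at least one read.

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero and log(0)


def fit_gom(counts, n_clusters, n_iter=200, seed=0):
    """Fit a grade of membership model by EM (illustrative sketch).

    counts: (samples x genes) array of non-negative read counts.
    Returns omega (samples x clusters) and theta (clusters x genes);
    each row of both matrices sums to 1.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    n_samples, n_genes = counts.shape
    omega = rng.dirichlet(np.ones(n_clusters), size=n_samples)
    theta = rng.dirichlet(np.ones(n_genes), size=n_clusters)
    for _ in range(n_iter):
        mix = omega @ theta + EPS   # expected gene proportions per sample
        ratio = counts / mix
        # Expected number of reads assigned to each (sample, cluster)
        # pair and to each (cluster, gene) pair under the current fit.
        omega_new = omega * (ratio @ theta.T)
        theta_new = theta * (omega.T @ ratio)
        omega = omega_new / (omega_new.sum(axis=1, keepdims=True) + EPS)
        theta = theta_new / (theta_new.sum(axis=1, keepdims=True) + EPS)
    return omega, theta


def log_likelihood(counts, omega, theta):
    """Multinomial log-likelihood, up to a constant that depends only on counts."""
    counts = np.asarray(counts, dtype=float)
    return float(np.sum(counts * np.log(omega @ theta + EPS)))


def distinctive_genes(theta, n_top=2):
    """Rank genes per cluster by their smallest divergence from any other cluster.

    The score is an assumed Poisson-style divergence between the relative
    expression of a gene in cluster k and in each other cluster l.
    """
    theta = np.asarray(theta, dtype=float) + EPS
    top = []
    for k in range(theta.shape[0]):
        others = [l for l in range(theta.shape[0]) if l != k]
        kl = theta[k] * np.log(theta[k] / theta[others]) + theta[others] - theta[k]
        score = kl.min(axis=0)      # worst-case separation from other clusters
        top.append(np.argsort(-score)[:n_top])
    return top


# Toy data: two samples dominated by genes 0-1, two by genes 2-3,
# and one sample that mixes both patterns.
counts = np.array([
    [90, 10, 0, 0],
    [80, 20, 0, 0],
    [0, 0, 30, 70],
    [0, 0, 40, 60],
    [45, 5, 15, 35],
])
omega, theta = fit_gom(counts, n_clusters=2)
print(np.round(omega, 2))           # one row of membership proportions per sample
print(distinctive_genes(theta))     # gene indices ranked per cluster
```

Each row of `omega` can be drawn as a stacked bar to visualize the membership proportions of a sample. EM of this kind converges only to a local optimum, so in practice one would try several seeds and compare the resulting log-likelihoods.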

Suggested Citation

  • Kushal K Dey & Chiaowen Joyce Hsiao & Matthew Stephens, 2017. "Visualizing the structure of RNA-seq expression data using grade of membership models," PLOS Genetics, Public Library of Science, vol. 13(3), pages 1-23, March.
  • Handle: RePEc:plo:pgen00:1006599
    DOI: 10.1371/journal.pgen.1006599

    Download full text from publisher

    File URL: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1006599
    Download Restriction: no

    File URL: https://journals.plos.org/plosgenetics/article/file?id=10.1371/journal.pgen.1006599&type=printable
    Download Restriction: no


    Citations



    Cited by:

    1. Brendan F. Miller & Feiyang Huang & Lyla Atta & Arpan Sahoo & Jean Fan, 2022. "Reference-free cell type deconvolution of multi-cellular pixel-resolution spatially resolved transcriptomics data," Nature Communications, Nature, vol. 13(1), pages 1-13, December.
    2. Chachrit Khunsriraksakul & Daniel McGuire & Renan Sauteraud & Fang Chen & Lina Yang & Lida Wang & Jordan Hughey & Scott Eckert & J. Dylan Weissenkampen & Ganesh Shenoy & Olivia Marx & Laura Carrel & B, 2022. "Integrating 3D genomic and epigenomic data to enhance target gene discovery and drug repurposing in transcriptome-wide association studies," Nature Communications, Nature, vol. 13(1), pages 1-15, December.
    3. Malarvizhi Gurusamy & Denise Tischner & Jingchen Shao & Stephan Klatt & Sven Zukunft & Remy Bonnavion & Stefan Günther & Kai Siebenbrodt & Roxane-Isabelle Kestner & Tanja Kuhlmann & Ingrid Fleming & S, 2021. "G-protein-coupled receptor P2Y10 facilitates chemokine-induced CD4 T cell migration through autocrine/paracrine mediators," Nature Communications, Nature, vol. 12(1), pages 1-16, December.
    4. Lucia Taraborrelli & Yasin Şenbabaoğlu & Lifen Wang & Junghyun Lim & Kerrigan Blake & Noelyn Kljavin & Sarah Gierke & Alexis Scherl & James Ziai & Erin McNamara & Mark Owyong & Shilpa Rao & Aslihan Ka, 2023. "Tumor-intrinsic expression of the autophagy gene Atg16l1 suppresses anti-tumor immunity in colorectal cancer," Nature Communications, Nature, vol. 14(1), pages 1-17, December.
    5. Seymour Picciotto & Nicholas DeVita & Chiaowen Joyce Hsiao & Christopher Honan & Sze-Wah Tse & Mychael Nguyen & Joseph D. Ferrari & Wei Zheng & Brian T. Wipke & Eric Huang, 2022. "Selective activation and expansion of regulatory T cells using lipid encapsulated mRNA encoding a long-acting IL-2 mutein," Nature Communications, Nature, vol. 13(1), pages 1-14, December.
    6. Xiaotian Wu & Hao Wu & Zhijin Wu, 2021. "Penalized Latent Dirichlet Allocation Model in Single-Cell RNA Sequencing," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 13(3), pages 543-562, December.
