
Matrix prior for data transfer between single cell data types in latent Dirichlet allocation

Author

Listed:
  • Alan Min
  • Timothy Durham
  • Louis Gevirtzman
  • William Stafford Noble

Abstract

Single cell ATAC-seq (scATAC-seq) enables the mapping of regulatory elements in fine-grained cell types. Despite this advance, analysis of the resulting data is challenging, and large scale scATAC-seq data are difficult to obtain and expensive to generate. This motivates a method to leverage information from previously generated large scale scATAC-seq or scRNA-seq data to guide our analysis of new scATAC-seq datasets. We analyze scATAC-seq data using latent Dirichlet allocation (LDA), a Bayesian algorithm that was developed to model text corpora, summarizing documents as mixtures of topics defined based on the words that distinguish the documents. When applied to scATAC-seq, LDA treats cells as documents and their accessible sites as words, identifying “topics” based on the cell type-specific accessible sites in those cells. Previous work used uniform symmetric priors in LDA, but we hypothesized that nonuniform matrix priors generated from LDA models trained on existing datasets may enable improved detection of cell types in new datasets, especially if they have relatively few cells. In this work, we test this hypothesis in scATAC-seq data from whole C. elegans nematodes and SHARE-seq data from mouse skin cells. We show that nonsymmetric matrix priors for LDA improve our ability to capture cell type information from small scATAC-seq datasets.

Author summary

Identifying cell types based on genomics information is an important task but can present challenges because genomics information can be high-dimensional and contain many zeros. Previous work has used latent Dirichlet allocation (LDA), a method that automatically identifies “topics” within a dataset, and has used these topics to better understand the cell types within a population. LDA has been applied to single cell ATAC-seq datasets, which provide information about open chromatin regions within individual cells. We focus on improving the LDA framework by enabling the incorporation of auxiliary forms of information. In particular, we present a method that uses data from large reference populations of cells to aid in the formation of topics for a smaller, target population of cells. We demonstrate first, through simulation, that our method can recover topics when the data follows the assumptions of our model. We then use a dataset of mouse skin cells and another with C. elegans cells to demonstrate that in a real data setting, our method improves the quality of topics recovered from the genomics data.
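
The idea summarized above (cells as documents, accessible sites as words, and a nonuniform matrix prior derived from a model trained on reference data) can be illustrated with a small sketch. This is not the authors' implementation: the collapsed Gibbs sampler, the function name `gibbs_lda`, the simulated counts, and the rule for building the prior (`strength * phi_ref + floor`) are all assumptions chosen for illustration, and the abstract does not specify how the prior is scaled.

```python
import numpy as np

def gibbs_lda(counts, n_topics, beta, alpha=0.1, n_iter=30, seed=0):
    """Collapsed Gibbs sampler for LDA on a (cells x regions) count matrix.

    beta is either a scalar (uniform symmetric prior) or an array of shape
    (n_topics, n_regions): a matrix prior on the topic-region distributions.
    Returns theta (cells x topics) and phi (topics x regions).
    """
    rng = np.random.default_rng(seed)
    n_cells, n_regions = counts.shape
    beta = np.broadcast_to(np.asarray(beta, dtype=float), (n_topics, n_regions))
    beta_total = beta.sum(axis=1)

    # Expand the count matrix into one (cell, region) pair per observed read.
    cell_idx, region_idx = np.nonzero(counts)
    reps = counts[cell_idx, region_idx]
    cells, regions = np.repeat(cell_idx, reps), np.repeat(region_idx, reps)

    # Random initial topic assignments and the matching count tables.
    z = rng.integers(n_topics, size=cells.size)
    cell_topic = np.zeros((n_cells, n_topics))
    topic_region = np.zeros((n_topics, n_regions))
    np.add.at(cell_topic, (cells, z), 1)
    np.add.at(topic_region, (z, regions), 1)
    topic_total = topic_region.sum(axis=1)

    for _ in range(n_iter):
        for i in range(cells.size):
            c, r, k = cells[i], regions[i], z[i]
            # Remove the current assignment of this read from the tables.
            cell_topic[c, k] -= 1
            topic_region[k, r] -= 1
            topic_total[k] -= 1
            # Conditional topic probabilities; the prior enters via beta[:, r].
            p = (cell_topic[c] + alpha) * (topic_region[:, r] + beta[:, r])
            p = p / (topic_total + beta_total)
            k = rng.choice(n_topics, p=p / p.sum())
            z[i] = k
            cell_topic[c, k] += 1
            topic_region[k, r] += 1
            topic_total[k] += 1

    theta = cell_topic + alpha
    theta = theta / theta.sum(axis=1, keepdims=True)
    phi = (topic_region + beta) / (topic_total + beta_total)[:, None]
    return theta, phi

# Simulated data (purely illustrative): a large reference, a small target.
sim_rng = np.random.default_rng(1)
n_topics, n_regions = 3, 20
true_phi = sim_rng.dirichlet(np.full(n_regions, 0.2), size=n_topics)

def simulate(n_cells, depth=20):
    theta = sim_rng.dirichlet(np.full(n_topics, 0.3), size=n_cells)
    return sim_rng.poisson(depth * theta @ true_phi)

reference = simulate(40)
target = simulate(8)

# Step 1: fit LDA with a uniform symmetric prior on the reference cells.
_, phi_ref = gibbs_lda(reference, n_topics, beta=0.1)

# Step 2: turn the reference topics into a matrix prior (assumed scaling).
strength, floor = 50.0, 0.01  # hypothetical hyperparameters
matrix_prior = strength * phi_ref + floor

# Step 3: fit LDA on the small target dataset using the matrix prior.
theta_target, phi_target = gibbs_lda(target, n_topics, beta=matrix_prior)
```

Because row k of the prior is tied to topic k of the reference model, topic k in the target fit is pulled toward the same accessible sites. With a scalar `beta` the sampler reduces to ordinary LDA with a uniform symmetric prior, which makes the two settings easy to compare on the same data.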

Suggested Citation

  • Alan Min & Timothy Durham & Louis Gevirtzman & William Stafford Noble, 2023. "Matrix prior for data transfer between single cell data types in latent Dirichlet allocation," PLOS Computational Biology, Public Library of Science, vol. 19(5), pages 1-19, May.
  • Handle: RePEc:plo:pcbi00:1011049
    DOI: 10.1371/journal.pcbi.1011049

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011049
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1011049&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pcbi.1011049?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Lucia Taraborrelli & Yasin Şenbabaoğlu & Lifen Wang & Junghyun Lim & Kerrigan Blake & Noelyn Kljavin & Sarah Gierke & Alexis Scherl & James Ziai & Erin McNamara & Mark Owyong & Shilpa Rao & Aslihan Ka, 2023. "Tumor-intrinsic expression of the autophagy gene Atg16l1 suppresses anti-tumor immunity in colorectal cancer," Nature Communications, Nature, vol. 14(1), pages 1-17, December.
    2. Seymour Picciotto & Nicholas DeVita & Chiaowen Joyce Hsiao & Christopher Honan & Sze-Wah Tse & Mychael Nguyen & Joseph D. Ferrari & Wei Zheng & Brian T. Wipke & Eric Huang, 2022. "Selective activation and expansion of regulatory T cells using lipid encapsulated mRNA encoding a long-acting IL-2 mutein," Nature Communications, Nature, vol. 13(1), pages 1-14, December.
    3. Malarvizhi Gurusamy & Denise Tischner & Jingchen Shao & Stephan Klatt & Sven Zukunft & Remy Bonnavion & Stefan Günther & Kai Siebenbrodt & Roxane-Isabelle Kestner & Tanja Kuhlmann & Ingrid Fleming & S, 2021. "G-protein-coupled receptor P2Y10 facilitates chemokine-induced CD4 T cell migration through autocrine/paracrine mediators," Nature Communications, Nature, vol. 12(1), pages 1-16, December.
    4. Xiaotian Wu & Hao Wu & Zhijin Wu, 2021. "Penalized Latent Dirichlet Allocation Model in Single-Cell RNA Sequencing," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 13(3), pages 543-562, December.
    5. Chachrit Khunsriraksakul & Daniel McGuire & Renan Sauteraud & Fang Chen & Lina Yang & Lida Wang & Jordan Hughey & Scott Eckert & J. Dylan Weissenkampen & Ganesh Shenoy & Olivia Marx & Laura Carrel & B, 2022. "Integrating 3D genomic and epigenomic data to enhance target gene discovery and drug repurposing in transcriptome-wide association studies," Nature Communications, Nature, vol. 13(1), pages 1-15, December.
    6. Brendan F. Miller & Feiyang Huang & Lyla Atta & Arpan Sahoo & Jean Fan, 2022. "Reference-free cell type deconvolution of multi-cellular pixel-resolution spatially resolved transcriptomics data," Nature Communications, Nature, vol. 13(1), pages 1-13, December.

