
A mixture of attention experts-embedded flow-based generative model to create synthetic cells in single-cell RNA-Seq datasets

Author

Listed:
  • Sultan Sevgi Turgut Ögme
  • Nizamettin Aydin
  • Zeyneb Kurt

Abstract

Single-cell RNA-seq (scRNAseq) analyses aim to map the cellular landscape of tissue sections, offer insights into rare cell types, and identify marker genes for annotating distinct cell types. ScRNAseq analyses are widely applied in cancer research to understand tumor heterogeneity, disease progression, and resistance to therapy. Processing single-cell data is challenging because of its high dimensionality, sparsity, and imbalanced class (cell-type) distributions. Accurate cell-type identification is highly dependent on preprocessing and quality control steps. To address these issues, generative models have been widely used in recent years. Frequently used techniques include Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Gaussian-based methods, and, more recently, Flow-based (FB) generative models. We developed a Masked Affine Autoregressive transform-embedded FB (MAF-FB) model. Then, to improve MAF-FB further, we incorporated a mixture of experts (MOE) of attention mechanisms on top of it, resulting in our proposed MOE-FB model. We conducted a comparative analysis of fundamental generative models to provide preliminary guidance for developing novel automated scRNAseq data analysis systems. We performed a large-scale analysis by combining four datasets derived from pancreatic tissue sections, and for further generalizability assessments we employed the Peripheral Blood Mononuclear Cells (PBMC68K and PBMC3K) and Human Cell Atlas Bone Marrow (HCA-BM10K) datasets. We utilized VAE, GAN, Gaussian Copula, and Automated Cell-Type-informed Introspective Variational Autoencoder (ACTIVA) models, and compared them against our two novel FB models, MAF-FB and MOE-FB, for scRNAseq synthesis. To evaluate the performance of the generative models, we used various discrepancy metrics and performed automated cell-type classification tasks. We also identified differentially expressed genes for each cell type, and inferred cell-cell interactions based on ligand-receptor bindings across distinct cell-type pairs. Among the generative models, the FB models, especially MOE-FB, consistently outperformed the others across all experimental setups, in both the discrepancy metrics computed against the baseline test set and the cell-type classification tasks (with an F1-score of 0.90, a precision of 0.89, and a recall of 0.92 for the integrated pancreatic datasets). MOE-FB produced more biologically relevant synthetic data, and the ligand-receptor-based cell-cell interactions inferred from the synthetic cells closely resembled those of the original data, achieving an RMSE of 0.65 against the corresponding pancreatic test set. These findings highlight the promise of FB models, especially MOE-FB, in scRNAseq analyses.

Author summary: Single-cell RNA sequencing (scRNA-seq) analyses focus on identifying distinct cell types and marker genes. Traditional methods face challenges with high dimensionality, sparsity, and sample size imbalances across cell types, which limit automated and unbiased cell-type identification. Generative AI models address these issues by generating synthetic cells for under-represented cell types, preserving biological and contextual relevance, and employing embedding mechanisms to reduce sparsity and dimensionality. We proposed a Flow-Based model with a Masked Affine Autoregressive transform (MAF-FB) for single-cell synthesis, and a new framework that extends vanilla MAF-FB with a mixture of experts of attention mechanisms (MOE-FB). We compared widely used generative models (Variational Autoencoders, GANs, Gaussian Copula, and ACTIVA) with the FB models using integrated pancreatic datasets and additional external datasets. Synthetic data quality was assessed via multiple discrepancy metrics, a cell-type classification task using a Random Forest model, and a ligand-receptor interaction inference task. The FB models, especially MOE-FB, showed the highest potential for creating scRNA-seq profiles that are similar to the original data and biologically accurate. We presented a guideline for automated cell-type identification systems by addressing gaps in single-cell analysis through the integration of widely used computational biology datasets and the implementation of generative models (including a vanilla FB model, MAF-FB, and a novel one, MOE-FB).
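
The masked affine autoregressive transform named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no architectural details, so the example assumes the standard affine autoregressive form z_i = (x_i - mu_i) * exp(-alpha_i), where mu_i and alpha_i depend only on x_1..x_(i-1). The `conditioner` function and the weight lists `W_MU` and `W_ALPHA` are hypothetical; a toy linear map stands in for the masked neural network, and the attention-based mixture of experts is not modelled.

```python
import math
import random

def conditioner(x_prev, w_mu, w_alpha):
    """Hypothetical stand-in for a masked network: computes the shift (mu)
    and a bounded log-scale (alpha) from the earlier dimensions only."""
    mu = sum(w * v for w, v in zip(w_mu, x_prev))
    alpha = math.tanh(sum(w * v for w, v in zip(w_alpha, x_prev)))
    return mu, alpha

def forward(x, w_mu_rows, w_alpha_rows):
    """Map data x to latent z and return (z, log|det dz/dx|)."""
    z, log_det = [], 0.0
    for i in range(len(x)):
        mu, alpha = conditioner(x[:i], w_mu_rows[i], w_alpha_rows[i])
        z.append((x[i] - mu) * math.exp(-alpha))
        log_det -= alpha  # triangular Jacobian: diagonal entries are exp(-alpha_i)
    return z, log_det

def inverse(z, w_mu_rows, w_alpha_rows):
    """Map latent z back to data space, one dimension at a time."""
    x = []
    for i in range(len(z)):
        mu, alpha = conditioner(x, w_mu_rows[i], w_alpha_rows[i])
        x.append(z[i] * math.exp(alpha) + mu)
    return x

def log_prob(x, w_mu_rows, w_alpha_rows):
    """Exact log-density of x under a standard-normal base distribution."""
    z, log_det = forward(x, w_mu_rows, w_alpha_rows)
    base = -0.5 * sum(v * v for v in z) - 0.5 * len(z) * math.log(2 * math.pi)
    return base + log_det

# Illustrative weights (assumed values): row i holds one weight per earlier dimension.
W_MU = [[], [0.3], [-0.2, 0.1]]
W_ALPHA = [[], [0.5], [0.1, -0.4]]

x = [0.5, -1.2, 2.0]                      # a toy 3-gene expression vector
z, log_det = forward(x, W_MU, W_ALPHA)
x_rec = inverse(z, W_MU, W_ALPHA)         # recovers x up to rounding error

rng = random.Random(0)
z_new = [rng.gauss(0.0, 1.0) for _ in range(3)]
synthetic = inverse(z_new, W_MU, W_ALPHA)  # a "synthetic cell" drawn from the flow
print(log_prob(x, W_MU, W_ALPHA), synthetic)
```

Because each output depends only on earlier inputs, the Jacobian is triangular and its log-determinant is the negative sum of the log-scales, which keeps the likelihood exact and cheap to evaluate; sampling inverts the transform one dimension at a time.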

Suggested Citation

  • Sultan Sevgi Turgut Ögme & Nizamettin Aydin & Zeyneb Kurt, 2025. "A mixture of attention experts-embedded flow-based generative model to create synthetic cells in single-cell RNA-Seq datasets," PLOS Computational Biology, Public Library of Science, vol. 21(10), pages 1-25, October.
  • Handle: RePEc:plo:pcbi00:1013525
    DOI: 10.1371/journal.pcbi.1013525

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1013525
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1013525&type=printable
    Download Restriction: no

