Author
Listed:
- Eliezyer Fermino de Oliveira
- Pranjal Garg
- Jens Hjerling-Leffler
- Renata Batista-Brito
- Lucas Sjulson
Abstract
High-dimensional data have become ubiquitous in the biological sciences, and it is often desirable to compare two datasets collected under different experimental conditions to extract low-dimensional patterns enriched in one condition. However, traditional dimensionality reduction techniques cannot accomplish this because they operate on only one dataset. Contrastive principal component analysis (cPCA) has been proposed to address this problem, but it has seen little adoption because it requires tuning a hyperparameter resulting in multiple solutions, with no way of knowing which is correct. Moreover, cPCA uses foreground and background conditions that are treated differently, making it ill-suited to compare two experimental conditions symmetrically. Here we describe the development of generalized contrastive PCA (gcPCA), a flexible hyperparameter-free approach that solves these problems. We first provide analyses explaining why cPCA requires a hyperparameter and how gcPCA avoids this requirement. We then describe an open-source gcPCA toolbox containing Python and MATLAB implementations of several variants of gcPCA tailored for different scenarios. Finally, we demonstrate the utility of gcPCA in analyzing diverse high-dimensional biological data, revealing unsupervised detection of hippocampal replay in neurophysiological recordings and heterogeneity of type II diabetes in single-cell RNA sequencing data. As a fast, robust, and easy-to-use comparison method, gcPCA provides a valuable resource facilitating the analysis of diverse high-dimensional datasets to gain new insights into complex biological phenomena.
Author summary
Technological advances in the biological sciences have led to the proliferation of large, complex datasets for which analysis is challenging. Analyses for these datasets rely heavily on dimensionality reduction techniques, which extract reduced-complexity representations of the data that are easier to analyze and interpret. However, these techniques typically operate on only one dataset, and many biological experiments involve comparing two datasets collected under different conditions. Contrastive principal components analysis (cPCA) was previously developed for this purpose, but it has limitations that have precluded its widespread adoption. Here we introduce generalized contrastive principal components analysis (gcPCA), a method that overcomes these limitations. We first explain the mathematical basis of gcPCA, then describe an open-source gcPCA toolbox with implementations in Python and MATLAB. Finally, we demonstrate the utility of gcPCA in analyzing diverse biological datasets, highlighting its versatility as a tool to compare experimental data collected under two different conditions.
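The contrast between a hyperparameter-dependent and a hyperparameter-free comparison can be illustrated with a small NumPy sketch. It is not the gcPCA toolbox's implementation: the function names are hypothetical, the data are synthetic, and the normalized objective used here, x^T (C_A - C_B) x / x^T (C_A + C_B) x, is an assumed symmetric form chosen for illustration. The cPCA part uses the standard formulation, the top eigenvector of C_A - alpha * C_B.

```python
import numpy as np

def covariance(X):
    """Sample covariance of a (samples x features) matrix."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / (X.shape[0] - 1)

def cpca_top_direction(A, B, alpha):
    """Top eigenvector of C_A - alpha * C_B; the answer depends on alpha."""
    evals, evecs = np.linalg.eigh(covariance(A) - alpha * covariance(B))
    return evecs[:, -1]  # eigh returns eigenvalues in ascending order

def normalized_contrast_direction(A, B):
    """Top direction of x^T (C_A - C_B) x / x^T (C_A + C_B) x.

    Assumed illustrative objective with no hyperparameter; C_A + C_B
    must be positive definite for the whitening step to be valid.
    """
    Ca, Cb = covariance(A), covariance(B)
    s, U = np.linalg.eigh(Ca + Cb)
    M = U @ np.diag(s ** -0.5) @ U.T        # (C_A + C_B)^(-1/2)
    evals, evecs = np.linalg.eigh(M @ (Ca - Cb) @ M)
    x = M @ evecs[:, -1]                    # map back to the original space
    return x / np.linalg.norm(x), evals[-1]

# Synthetic example: axis 0 has large variance shared by both conditions,
# axis 1 is enriched in condition A, axis 2 is identical in both.
rng = np.random.default_rng(0)
A = rng.normal(size=(20000, 3)) * np.sqrt([9.0, 4.0, 1.0])
B = rng.normal(size=(20000, 3)) * np.sqrt([9.0, 1.0, 1.0])

d_alpha0 = cpca_top_direction(A, B, alpha=0.0)   # ordinary PCA of A
d_alpha1 = cpca_top_direction(A, B, alpha=1.0)   # plain covariance difference
d_norm, value = normalized_contrast_direction(A, B)

print("cPCA, alpha=0:", np.round(np.abs(d_alpha0), 2))
print("cPCA, alpha=1:", np.round(np.abs(d_alpha1), 2))
print("normalized contrast:", np.round(np.abs(d_norm), 2), round(float(value), 2))
```

In this toy setting the cPCA direction changes with alpha, whereas the normalized objective returns a single direction together with a bounded score between -1 and 1 that indicates which condition the direction is enriched in.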
Suggested Citation
Eliezyer Fermino de Oliveira & Pranjal Garg & Jens Hjerling-Leffler & Renata Batista-Brito & Lucas Sjulson, 2025.
"Identifying patterns differing between high-dimensional datasets with generalized contrastive PCA,"
PLOS Computational Biology, Public Library of Science, vol. 21(2), pages 1-23, February.
Handle:
RePEc:plo:pcbi00:1012747
DOI: 10.1371/journal.pcbi.1012747