Author
Listed:
- Jason Bennett
- Mikhail Pomaznoy
- Akul Singhania
- Bjoern Peters
Abstract
Recent technological advances have made the gathering of comprehensive gene expression datasets a commodity. This has shifted the limiting step of transcriptomic studies from the accumulation of data to their analyses and interpretation. The main problem in analyzing transcriptomics data is that the number of independent samples is typically much lower ( 14,000). To address this, it would be desirable to reduce the gathered data’s dimensionality without losing information. Clustering genes into discrete modules is one of the most commonly used tools to accomplish this task. While there are multiple clustering approaches, there is a lack of informative metrics available to evaluate the resultant clusters’ biological quality. Here we present a metric that incorporates known ground truth gene sets to quantify gene clusters’ biological quality derived from standard clustering techniques. The GECO (Ground truth Evaluation of Clustering Outcomes) metric demonstrates that quantitative and repeatable scoring of gene clusters is not only possible but computationally lightweight and robust. Unlike current methods, it allows direct comparison between gene clusters generated by different clustering techniques. It also reveals that current cluster analysis techniques often underestimate the number of clusters that should be formed from a dataset, which leads to fewer clusters of lower quality. As a test case, we applied GECO combined with k-means clustering to derive an optimal set of co-expressed gene modules derived from PBMC, which we show to be superior to previously generated modules generated on whole-blood. Overall, GECO provides a rational metric to test and compare different clustering approaches to analyze high-dimensional transcriptomic data.Author summary: Next-generation sequencing has spurred the creation of many techniques that attempt to distill large datasets down to a more manageable size without losing valuable information, more simply referred to as dimensionality reduction. We have sought to contribute to this effort by focusing not directly on dimensionality reduction but on interpreting the results of the most common technique used for dimensionality reduction of sequencing data: gene clustering. While methods to generate gene clusters have been well explored, the evaluation of cluster quality has not, i.e., answering the question "Have we made biologically significant clusters?" We have developed a metric that can be used to answer this question. Our metric incorporates prior biological knowledge about the data to determine if the clustering process was optimal by looking at how genes are grouped in gene clusters and determine if they make sense biologically. Our metric can also be used to provide a discrete range of values that indicate how to generate clusters with the highest potential biological information content. This metric can be utilized by any -omics level study to generate study-specific gene clusters while reducing the time spent validating gene clusters and improving confidence in the resultant clusters.
Suggested Citation
Jason Bennett & Mikhail Pomaznoy & Akul Singhania & Bjoern Peters, 2021.
"A metric for evaluating biological information in gene sets and its application to identify co-expressed gene clusters in PBMC,"
PLOS Computational Biology, Public Library of Science, vol. 17(10), pages 1-18, October.
Handle:
RePEc:plo:pcbi00:1009459
DOI: 10.1371/journal.pcbi.1009459
Download full text from publisher
Corrections
All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:plo:pcbi00:1009459. See general information about how to correct material in RePEc.
If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.
We have no bibliographic references for this item. You can help adding them by using this form .
If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.
For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: ploscompbiol (email available below). General contact details of provider: https://journals.plos.org/ploscompbiol/ .
Please note that corrections may take a couple of weeks to filter through
the various RePEc services.