Author
Abstract
Dimensionality reduction tools like t-SNE and UMAP are widely used for high-dimensional data analysis. For instance, these tools are applied in biology to describe spiking patterns of neuronal populations or the genetic profiles of different cell types. Here, we show that when data include noise points that are randomly scattered within a high-dimensional space, a “scattering noise problem” occurs in the low-dimensional embedding where noise points overlap with the cluster points. We show that a simple transformation of the original distance matrix by computing a distance between neighbor distances alleviates this problem and identifies the noise points as a separate cluster. We apply this technique to high-dimensional neuronal spike sequences, as well as the representations of natural images by convolutional neural network units, and find an improvement in the constructed low-dimensional embedding. Thus, we present an improved dimensionality reduction technique for high-dimensional data containing noise points.Author summary: Biological datasets are often high-dimensional, e.g. the genetic profile of cells or the firing pattern of neural populations. Dimensionality reduction methods like t-SNE are commonly used to represent the high-dimensional data in a low-dimensional embedding space. The visualization helps us to identify the underlying clustering patterns and shed light on the information hidden within the data. We show that in situations where there exist scattering noise points, clustering patterns in the data tend to be heavily distorted. Here, we show that using a distance-of-distance (DoD) transformation of the dissimilarity matrix between data points, the influence of scattering noise is effectively removed. This neighborhood-based transformation is most effective when the dimensionality of the dataset is high. We show that this technique improves low-dimensional embedding for several high-dimensional datasets, such as the convolutional neural network representation of natural images or the neuronal population representation of visual stimuli.
Suggested Citation
Jinke Liu & Martin Vinck, 2022.
"Improved visualization of high-dimensional data using the distance-of-distance transformation,"
PLOS Computational Biology, Public Library of Science, vol. 18(12), pages 1-19, December.
Handle:
RePEc:plo:pcbi00:1010764
DOI: 10.1371/journal.pcbi.1010764
Download full text from publisher
Corrections
All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:plo:pcbi00:1010764. See general information about how to correct material in RePEc.
If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.
We have no bibliographic references for this item. You can help adding them by using this form .
If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.
For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: ploscompbiol (email available below). General contact details of provider: https://journals.plos.org/ploscompbiol/ .
Please note that corrections may take a couple of weeks to filter through
the various RePEc services.