Clustering techniques play an important role in analyzing high-dimensional data, which are common in high-throughput screening such as microarray and mass spectrometry studies. Effective use of the high dimensionality and of the available replications can help to increase clustering accuracy and stability. In this article a new partitioning algorithm with a robust distance measure is introduced to cluster variables in high dimensional low sample size (HDLSS) data that contain a large number of independent variables with a small number of replications per variable. The proposed clustering algorithm, PPCLUST, treats the data as arising from a mixture distribution and uses p-values from nonparametric rank tests of homogeneous distribution as a measure of similarity to separate the mixture components. PPCLUST is able to efficiently cluster a large number of variables in the presence of very few replications. Because it inherits the robustness of rank procedures, the new algorithm is robust to outliers and invariant to monotone transformations of the data. Numerical studies and an application to microarray gene expression data from a colorectal cancer study are discussed.
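The idea of using a rank-test p-value as a similarity measure can be illustrated with a small sketch. This is not the PPCLUST algorithm, whose test statistic and partitioning steps are not given in the abstract. It assumes an exact two-sample Wilcoxon rank-sum test as a stand-in for the "rank test of homogeneous distribution", and a simple greedy grouping rule. The names `midranks`, `rank_sum_pvalue`, `greedy_partition`, the threshold `alpha`, and the toy data are all hypothetical.

```python
from itertools import combinations


def midranks(values):
    """Return 1-based ranks of values, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_pvalue(x, y):
    """Exact two-sided p-value of the Wilcoxon rank-sum test (assumed stand-in)."""
    ranks = midranks(list(x) + list(y))
    n = len(x)
    expected = n * (len(ranks) + 1) / 2  # mean rank sum under homogeneity
    observed = abs(sum(ranks[:n]) - expected)
    count = total = 0
    # Enumerate every way of assigning n of the pooled ranks to the first sample.
    for subset in combinations(ranks, n):
        total += 1
        if abs(sum(subset) - expected) >= observed - 1e-12:
            count += 1
    return count / total


def greedy_partition(data, alpha=0.05):
    """Hypothetical grouping rule: join the first cluster whose seed variable
    is not distinguishable from the new variable at level alpha."""
    names = sorted(data, key=lambda name: sorted(data[name])[len(data[name]) // 2])
    clusters = []
    for name in names:
        for cluster in clusters:
            if rank_sum_pvalue(data[cluster[0]], data[name]) >= alpha:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Toy data: four variables with four replications each (made up for illustration).
data = {
    "g1": [1.0, 1.2, 0.9, 1.1],
    "g2": [1.05, 0.95, 1.15, 1.25],
    "g3": [5.0, 5.3, 4.8, 5.1],
    "g4": [5.2, 4.9, 5.4, 5.05],
}
print(greedy_partition(data))
```

On this toy data the sketch groups `g1` with `g2` and `g3` with `g4`. Because the p-value depends only on ranks, applying any strictly increasing transformation to all values leaves it unchanged, which mirrors the invariance property claimed in the abstract. With four replications per variable the smallest attainable two-sided p-value is 2/70, so the choice of `alpha` matters a great deal at such small sample sizes.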
Volume (Year): 53 (2009) Issue (Month): 12 (October) Pages: 3987-3998