A simple method for screening variables before clustering microarray data


  • Krzanowski, Wojtek J.
  • Hand, David J.


A simple and computationally fast procedure is proposed for screening a large number of variables prior to cluster analysis. Each variable is considered in turn, the sample is divided into the two groups that maximise the ratio of between-group to within-group sum of squares for that variable, and the achieved value of this ratio is tested to see if it is significantly greater than what would be expected when partitioning a sample from a single homogeneous population. Those variables that achieve significance are then used in the cluster analysis. It is suggested that significance levels be assessed using a Monte Carlo computational procedure; by assuming within-group normality an analytical approximation is derived, but caution in its use is advocated. Computational details are provided for both the partitioning and the testing. The procedure is applied to several microarray data sets, showing that it can often achieve good results both quickly and simply.
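The screening steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`best_split_ratio`, `null_ratios`, `screen_variables`) are hypothetical, the null reference distribution is taken to be a single normal population, and the search relies on the standard result that, for one variable, the two-group partition minimising the within-group sum of squares is contiguous in sorted order. Because the between-group and within-group sums of squares add up to a fixed total, maximising the between-group sum of squares also maximises the ratio.

```python
import random
from itertools import accumulate


def best_split_ratio(values):
    """Largest between/within sum-of-squares ratio over all two-group splits."""
    x = sorted(values)
    n = len(x)
    if n < 3:
        raise ValueError("need at least 3 observations")
    total = sum(x)
    mean = total / n
    tss = sum((v - mean) ** 2 for v in x)  # total sum of squares
    if tss == 0.0:
        return 0.0  # constant variable: no structure to find
    prefix = list(accumulate(x))
    best_between = 0.0
    for k in range(1, n):  # the k smallest values form group 1
        m1 = prefix[k - 1] / k
        m2 = (total - prefix[k - 1]) / (n - k)
        between = k * (m1 - mean) ** 2 + (n - k) * (m2 - mean) ** 2
        best_between = max(best_between, between)
    within = tss - best_between
    return float("inf") if within <= 0.0 else best_between / within


def null_ratios(n, n_sim=999, seed=0):
    """Monte Carlo reference values under an ASSUMED single normal population.

    The ratio is unchanged by shifting or rescaling the data, so standard
    normal draws suffice and one reference set serves every variable.
    """
    rng = random.Random(seed)
    return [
        best_split_ratio([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(n_sim)
    ]


def screen_variables(columns, alpha=0.05, n_sim=999, seed=0):
    """Return names of variables whose best-split ratio is significant.

    `columns` maps a variable name to its list of sample values; all lists
    are assumed to have the same length.
    """
    if not columns:
        return []
    n = len(next(iter(columns.values())))
    null = null_ratios(n, n_sim, seed)
    kept = []
    for name, values in columns.items():
        observed = best_split_ratio(values)
        # Monte Carlo p-value with the usual +1 correction.
        p = (1 + sum(r >= observed for r in null)) / (n_sim + 1)
        if p <= alpha:
            kept.append(name)
    return kept
```

Simulating the reference distribution once per sample size, rather than once per variable, keeps the cost low when thousands of genes are screened. No multiple-testing adjustment is applied in this sketch; whether one is appropriate depends on how the retained variables are used.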

Suggested Citation

  • Krzanowski, Wojtek J. & Hand, David J., 2009. "A simple method for screening variables before clustering microarray data," Computational Statistics & Data Analysis, Elsevier, vol. 53(7), pages 2747-2753, May.
  • Handle: RePEc:eee:csdana:v:53:y:2009:i:7:p:2747-2753


    References listed on IDEAS

    1. Lawrence Hubert & Phipps Arabie, 1985. "Comparing partitions," Journal of Classification, Springer;The Classification Society, vol. 2(1), pages 193-218, December.
    2. Sinae Kim & Mahlet G. Tadesse & Marina Vannucci, 2006. "Variable selection in clustering via Dirichlet process mixture models," Biometrika, Biometrika Trust, vol. 93(4), pages 877-893, December.
    3. Michael Brusco & J. Cradit, 2001. "A variable-selection heuristic for K-means clustering," Psychometrika, Springer;The Psychometric Society, vol. 66(2), pages 249-270, June.
    4. E. Fowlkes & R. Gnanadesikan & J. Kettenring, 1988. "Variable selection in clustering," Journal of Classification, Springer;The Classification Society, vol. 5(2), pages 205-228, September.
    5. Tadesse, Mahlet G. & Sha, Naijun & Vannucci, Marina, 2005. "Bayesian Variable Selection in Clustering High-Dimensional Data," Journal of the American Statistical Association, American Statistical Association, vol. 100, pages 602-617, June.



    Cited by:

    1. Pacheco, Joaquín & Casado, Silvia & Porras, Santiago, 2013. "Exact methods for variable selection in principal component analysis: Guide functions and pre-selection," Computational Statistics & Data Analysis, Elsevier, vol. 57(1), pages 95-111.
    2. Brusco, Michael J., 2014. "A comparison of simulated annealing algorithms for variable selection in principal component analysis and discriminant analysis," Computational Statistics & Data Analysis, Elsevier, vol. 77(C), pages 38-53.
