Printed from https://ideas.repec.org/p/cte/wsrepe/24522.html

Clustering Big Data by Extreme Kurtosis Projections

Authors

  • Prieto Fernández, Francisco Javier
  • Rendon Aguirre, Janeth Carolina
  • Peña Sánchez de Rivera, Daniel

Abstract

Clustering Big Data is an important problem because large samples of many variables are usually heterogeneous and include mixtures of several populations. Often only a few of a large set of variables are useful for clustering; working with all of them would be very inefficient and may make the clusters harder to identify. Searching for lower-dimensional spaces that retain all the relevant information about the clusters is therefore a sensible way to proceed in these situations. Peña and Prieto (2001) showed that the extreme kurtosis directions of projected data are optimal when the data have been generated by a mixture of two normal distributions. We generalize this result to mixtures with any number of components and show that the extreme kurtosis directions of the projected data are linear combinations of the discriminant directions that would be optimal if the centers of the mixture components were known. To separate the groups we want directions that split the data into two groups, each corresponding to different components of the mixture. We prove that these directions can be found from extreme kurtosis projections. This result suggests a new procedure for handling many groups that works by binary decisions: at each step it decides whether the data should be split into two groups or the splitting should stop. The decision is based on comparing a single distribution with a mixture of two distributions. The performance of the algorithm is analyzed through a simulation study.
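
The idea of one binary split can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names (whiten, kurtosis_along, min_kurtosis_direction), the projected gradient search with random restarts, the step size, and the simulated data are all assumptions made for illustration. The sketch covers only the balanced two-group case, where the extreme direction is the one that minimizes kurtosis, and it omits the stopping rule that compares a single distribution with a two-component mixture.

```python
import numpy as np

def whiten(X):
    """Center X and rescale it so that the sample covariance is the identity."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ vecs / np.sqrt(vals)

def kurtosis_along(Z, d):
    """Kurtosis coefficient of the centered data Z projected on direction d."""
    z = Z @ d
    return np.mean(z ** 4) / np.mean(z ** 2) ** 2

def min_kurtosis_direction(Z, n_starts=20, n_iter=300, step=0.1, seed=0):
    """Search the unit sphere for a direction of minimum projected kurtosis.

    Z must be whitened, so the projected variance is the same for every unit
    vector and minimizing kurtosis reduces to minimizing mean((Z d)^4).
    Illustrative optimizer (an assumption): projected gradient descent from
    several random starting directions, keeping the best result.
    """
    rng = np.random.default_rng(seed)
    best_d, best_k = None, np.inf
    for _start in range(n_starts):
        d = rng.standard_normal(Z.shape[1])
        d /= np.linalg.norm(d)
        for _it in range(n_iter):
            z = Z @ d
            grad = 4.0 * (Z * (z ** 3)[:, None]).mean(axis=0)
            grad -= (grad @ d) * d          # keep only the tangent component
            d = d - step * grad
            d /= np.linalg.norm(d)          # project back onto the sphere
        k = kurtosis_along(Z, d)
        if k < best_k:
            best_d, best_k = d, k
    return best_d, best_k

# Simulated example (assumed setup): two balanced normal groups in 4 dimensions
# that differ only in the mean of the first variable.
rng = np.random.default_rng(1)
n, p = 1000, 4
labels = np.repeat([0, 1], n)
X = rng.standard_normal((2 * n, p))
X[:, 0] += np.where(labels == 0, -3.0, 3.0)

Z = whiten(X)
d, k = min_kurtosis_direction(Z)

# One binary split: cut the projection at its mean (zero after centering).
split = (Z @ d > 0).astype(int)
agreement = max(np.mean(split == labels), np.mean(split != labels))
print(f"minimum projected kurtosis: {k:.2f}, agreement with true groups: {agreement:.3f}")
```

A normal sample has projected kurtosis close to 3 in every direction, so a value well below 3 along the returned direction signals a bimodal projection. A recursive procedure in the spirit of the abstract would apply the same step again to each of the two resulting groups until a stopping rule rejects further splits.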

Suggested Citation

  • Prieto Fernández, Francisco Javier & Rendon Aguirre, Janeth Carolina & Peña Sánchez de Rivera, Daniel, 2017. "Clustering Big Data by Extreme Kurtosis Projections," DES - Working Papers. Statistics and Econometrics. WS 24522, Universidad Carlos III de Madrid. Departamento de Estadística.
  • Handle: RePEc:cte:wsrepe:24522

    Download full text from publisher

    File URL: https://e-archivo.uc3m.es/bitstream/handle/10016/24522/ws1704.pdf?sequence=1
    Download Restriction: no

    More about this item

    Keywords

    Mixture models

    NEP fields

    This paper has been announced in the following NEP Reports:

    Statistics
