
On multivariate binary data clustering and feature weighting

Author

Listed:
  • Bouguila, Nizar

Abstract

This paper presents an approach that partitions data sets of unlabeled binary vectors without a priori information about the number of clusters or the saliency of the features. The unsupervised binary feature selection problem is approached using finite mixture models of multivariate Bernoulli distributions. Using stochastic complexity, the proposed model simultaneously determines the number of clusters in a given data set of binary vectors and the saliency of the features used. We present several applications involving real data, document classification, and image categorization to show the merits of the proposed approach.
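To make the modeling idea in the abstract concrete, here is a minimal sketch of a plain finite mixture of multivariate Bernoulli distributions, assuming features are conditionally independent within each cluster. It shows only how a binary vector is softly assigned to clusters. It does not implement the paper's stochastic-complexity criterion or its feature-saliency weights, and every name and parameter value below is a hypothetical chosen for illustration.

```python
import math

def bernoulli_log_pmf(x, theta):
    """Log-probability of binary vector x under independent Bernoulli
    parameters theta (each theta value must lie strictly between 0 and 1)."""
    return sum(
        math.log(t) if xi == 1 else math.log(1.0 - t)
        for xi, t in zip(x, theta)
    )

def responsibilities(x, weights, thetas):
    """Posterior probability of each mixture component given x."""
    log_joint = [
        math.log(w) + bernoulli_log_pmf(x, th)
        for w, th in zip(weights, thetas)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_joint)
    unnorm = [math.exp(v - m) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical two-cluster model over three binary features.
weights = [0.5, 0.5]
thetas = [[0.9, 0.9, 0.1], [0.1, 0.1, 0.9]]
post = responsibilities([1, 1, 0], weights, thetas)
```

With these illustrative parameters the vector [1, 1, 0] is assigned almost entirely to the first cluster. Choosing the number of clusters and down-weighting uninformative features, which is the paper's actual contribution, would sit on top of a model like this.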

Suggested Citation

  • Bouguila, Nizar, 2010. "On multivariate binary data clustering and feature weighting," Computational Statistics & Data Analysis, Elsevier, vol. 54(1), pages 120-134, January.
  • Handle: RePEc:eee:csdana:v:54:y:2010:i:1:p:120-134

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167-9473(09)00261-8
    Download Restriction: Full text for ScienceDirect subscribers only.



