Printed from https://ideas.repec.org/a/spr/advdac/v12y2018i3d10.1007_s11634-016-0274-6.html

Mutual information, phi-squared and model-based co-clustering for contingency tables

Author

Listed:
  • Gérard Govaert

    (U.M.R. C.N.R.S.)

  • Mohamed Nadif

    (University of Paris Descartes)

Abstract

Many of the datasets encountered in statistics are two-dimensional in nature and can be represented by a matrix. Classical clustering procedures seek to construct separately an optimal partition of rows or, sometimes, of columns. In contrast, co-clustering methods cluster the rows and the columns simultaneously and organize the data into homogeneous blocks (after suitable permutations). Methods of this kind have practical importance in a wide variety of applications such as document clustering, where data are typically organized in two-way contingency tables. Our goal is to offer coherent frameworks for understanding some existing criteria and algorithms for co-clustering contingency tables, and to propose new ones. We look at two different frameworks for the problem of co-clustering. The first involves minimizing an objective function based on measures of association and in particular on phi-squared and mutual information. The second uses a model-based co-clustering approach, and we consider two models: the block model and the latent block model. We establish connections between different approaches, criteria and algorithms, and we highlight a number of implicit assumptions in some commonly used algorithms. Our contribution is illustrated by numerical experiments on simulated and real-case datasets that show the relevance of the presented methods in the document clustering field.
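
As an illustration of the first framework, the sketch below computes phi-squared and mutual information for a small contingency table, then recomputes them after the rows and columns have been merged into blocks. It is a minimal sketch and not the authors' algorithm: the function names (`phi_squared`, `mutual_information`, `aggregate`), the toy table and the two candidate partitions are assumptions made for this example. Merging rows or columns cannot increase either measure, so one way to read an association-based criterion is as a search for the partition pair whose aggregated table loses the least association.

```python
import math

def normalize(table):
    """Turn a table of counts into joint relative frequencies p_ij."""
    total = sum(sum(row) for row in table)
    return [[x / total for x in row] for row in table]

def margins(p):
    """Row and column marginal frequencies of a joint frequency table."""
    return [sum(row) for row in p], [sum(col) for col in zip(*p)]

def phi_squared(table):
    """Phi-squared: sum over cells of (p_ij - p_i. p_.j)^2 / (p_i. p_.j)."""
    p = normalize(table)
    r, c = margins(p)
    return sum((p[i][j] - r[i] * c[j]) ** 2 / (r[i] * c[j])
               for i in range(len(r)) for j in range(len(c))
               if r[i] * c[j] > 0)

def mutual_information(table):
    """Mutual information: sum over cells of p_ij * log(p_ij / (p_i. p_.j))."""
    p = normalize(table)
    r, c = margins(p)
    return sum(p[i][j] * math.log(p[i][j] / (r[i] * c[j]))
               for i in range(len(r)) for j in range(len(c))
               if p[i][j] > 0)

def aggregate(table, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Sum the counts of each (row cluster, column cluster) block."""
    blocks = [[0] * n_col_clusters for _ in range(n_row_clusters)]
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            blocks[row_labels[i]][col_labels[j]] += x
    return blocks

# Hypothetical 4 x 4 table with two clean diagonal blocks.
table = [[10, 10, 0, 0],
         [10, 10, 0, 0],
         [0, 0, 10, 10],
         [0, 0, 10, 10]]

# A partition pair that matches the block structure, and one that does not.
good = aggregate(table, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
bad = aggregate(table, [0, 1, 0, 1], [0, 0, 1, 1], 2, 2)

for name, t in [("original", table), ("good blocks", good), ("bad blocks", bad)]:
    print(name, phi_squared(t), mutual_information(t))
```

In this toy case the well-matched partition keeps all of the association of the original table, while the mismatched one produces a uniform block table and loses it entirely.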

Suggested Citation

  • Gérard Govaert & Mohamed Nadif, 2018. "Mutual information, phi-squared and model-based co-clustering for contingency tables," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(3), pages 455-488, September.
  • Handle: RePEc:spr:advdac:v:12:y:2018:i:3:d:10.1007_s11634-016-0274-6
    DOI: 10.1007/s11634-016-0274-6

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11634-016-0274-6
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s11634-016-0274-6?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    References listed on IDEAS

    1. Rocci, Roberto & Vichi, Maurizio, 2008. "Two-mode multi-partitioning," Computational Statistics & Data Analysis, Elsevier, vol. 52(4), pages 1984-2003, January.
    2. Hans-Hermann Bock, 2004. "Convexity-based clustering criteria: theory, algorithms, and applications in statistics," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 12(3), pages 293-317, February.
    3. Govaert, Gérard & Nadif, Mohamed, 2008. "Block clustering with Bernoulli mixture models: Comparison of different approaches," Computational Statistics & Data Analysis, Elsevier, vol. 52(6), pages 3233-3245, February.
    4. Michael Windham, 1987. "Parameter modification for clustering criteria," Journal of Classification, Springer;The Classification Society, vol. 4(2), pages 191-214, September.
    5. Hathaway, Richard J., 1986. "Another interpretation of the EM algorithm for mixture distributions," Statistics & Probability Letters, Elsevier, vol. 4(2), pages 53-56, March.
    6. Celeux, Gilles & Govaert, Gerard, 1992. "A classification EM algorithm for clustering and two stochastic versions," Computational Statistics & Data Analysis, Elsevier, vol. 14(3), pages 315-332, October.
    7. Diane Duffy & Adolfo Quiroz, 1991. "A permutation-based algorithm for block clustering," Journal of Classification, Springer;The Classification Society, vol. 8(1), pages 65-91, January.
    8. Govaert, Gerard & Nadif, Mohamed, 2007. "Clustering of contingency table and mixture model," European Journal of Operational Research, Elsevier, vol. 183(3), pages 1055-1066, December.
    9. Scott Deerwester & Susan T. Dumais & George W. Furnas & Thomas K. Landauer & Richard Harshman, 1990. "Indexing by latent semantic analysis," Journal of the American Society for Information Science, Association for Information Science & Technology, vol. 41(6), pages 391-407, September.
    10. Michael Greenacre, 1988. "Clustering the rows and columns of a contingency table," Journal of Classification, Springer;The Classification Society, vol. 5(1), pages 39-51, March.

    Citations

    Cited by:

    1. Selosse, Margot & Jacques, Julien & Biernacki, Christophe, 2020. "Model-based co-clustering for mixed type data," Computational Statistics & Data Analysis, Elsevier, vol. 144(C).
    2. Lazhar Labiod & Mohamed Nadif, 2021. "Efficient regularized spectral data embedding," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 15(1), pages 99-119, March.
    3. Paul Riverain & Simon Fossier & Mohamed Nadif, 2023. "Poisson degree corrected dynamic stochastic block model," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 17(1), pages 135-162, March.
    4. Emilio Carrizosa & Vanesa Guerrero & Dolores Romero Morales, 2023. "On mathematical optimization for clustering categories in contingency tables," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 17(2), pages 407-429, June.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Haedo, Christian & Mouchart, Michel, 2019. "Two-mode clustering through profiles of regions and sectors," LIDAM Discussion Papers ISBA 2019014, Université catholique de Louvain, Institute of Statistics, Biostatistics and Actuarial Sciences (ISBA).
    2. Haedo, Christian & Mouchart, Michel, 2018. "Automatic biclustering of regions and sectors," LIDAM Discussion Papers ISBA 2018026, Université catholique de Louvain, Institute of Statistics, Biostatistics and Actuarial Sciences (ISBA).
    3. Bergé, Laurent R. & Bouveyron, Charles & Corneli, Marco & Latouche, Pierre, 2019. "The latent topic block model for the co-clustering of textual interaction data," Computational Statistics & Data Analysis, Elsevier, vol. 137(C), pages 247-270.
    4. Christian Haedo & Michel Mouchart, 2022. "Two-mode clustering through profiles of regions and sectors," Empirical Economics, Springer, vol. 63(4), pages 1971-1996, October.
    5. Haedo, Christian & Mouchart, Michel, 2016. "Automatic biclustering of regions and sectors," LIDAM Discussion Papers ISBA 2016042, Université catholique de Louvain, Institute of Statistics, Biostatistics and Actuarial Sciences (ISBA).
    6. Govaert, Gérard & Nadif, Mohamed, 2008. "Block clustering with Bernoulli mixture models: Comparison of different approaches," Computational Statistics & Data Analysis, Elsevier, vol. 52(6), pages 3233-3245, February.
    7. Bhatia, Parmeet Singh & Iovleff, Serge & Govaert, Gérard, 2017. "blockcluster: An R Package for Model-Based Co-Clustering," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 76(i09).
    8. Govaert, Gerard & Nadif, Mohamed, 2007. "Clustering of contingency table and mixture model," European Journal of Operational Research, Elsevier, vol. 183(3), pages 1055-1066, December.
    9. Carlo Cavicchia & Maurizio Vichi & Giorgia Zaccaria, 2022. "Gaussian mixture model with an extended ultrametric covariance structure," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 16(2), pages 399-427, June.
    10. Emilio Carrizosa & Vanesa Guerrero & Dolores Romero Morales, 2023. "On mathematical optimization for clustering categories in contingency tables," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 17(2), pages 407-429, June.
    11. Francesca Martella & Maurizio Vichi, 2012. "Clustering microarray data using model-based double K -means," Journal of Applied Statistics, Taylor & Francis Journals, vol. 39(9), pages 1853-1869, April.
    12. Enrico Carlini & Fabio Rapallo, 2011. "A class of statistical models to weaken independence in two-way contingency tables," Metrika: International Journal for Theoretical and Applied Statistics, Springer, vol. 73(1), pages 1-22, January.
    13. Hu, Tianming & Sung, Sam Yuan, 2006. "A hybrid EM approach to spatial clustering," Computational Statistics & Data Analysis, Elsevier, vol. 50(5), pages 1188-1205, March.
    14. Michael Salter-Townshend & Thomas Murphy, 2014. "Mixtures of biased sentiment analysers," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 8(1), pages 85-103, March.
    15. Blazquez-Soriano, Amparo & Ramos-Sandoval, Rosmery, 2022. "Information transfer as a tool to improve the resilience of farmers against the effects of climate change: The case of the Peruvian National Agrarian Innovation System," Agricultural Systems, Elsevier, vol. 200(C).
    16. Irina Wedel & Michael Palk & Stefan Voß, 2022. "A Bilingual Comparison of Sentiment and Topics for a Product Event on Twitter," Information Systems Frontiers, Springer, vol. 24(5), pages 1635-1646, October.
    17. Adrian O’Hagan & Arthur White, 2019. "Improved model-based clustering performance using Bayesian initialization averaging," Computational Statistics, Springer, vol. 34(1), pages 201-231, March.
    18. François Bavaud, 2009. "Aggregation invariance in general clustering approaches," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 3(3), pages 205-225, December.
    19. Mohammed Salem Binwahlan, 2023. "Polynomial Networks Model for Arabic Text Summarization," International Journal of Research and Scientific Innovation, International Journal of Research and Scientific Innovation (IJRSI), vol. 10(2), pages 74-84, February.
    20. Curci, Ylenia & Mongeau Ospina, Christian A., 2016. "Investigating biofuels through network analysis," Energy Policy, Elsevier, vol. 97(C), pages 60-72.
