Printed from https://ideas.repec.org/a/spr/advdac/v13y2019i1d10.1007_s11634-018-0325-2.html

Unifying data units and models in (co-)clustering

Author

Listed:
  • Christophe Biernacki

    (Inria and CNRS)

  • Alexandre Lourme

    (University of Bordeaux)

Abstract

Statisticians are already aware that any task (exploration, prediction) involving a modeling process depends heavily on the measurement units of the data, to the extent that it should be impossible to provide a statistical outcome without specifying the pair (unit, model). In this work, this general principle is formalized with a particular focus on model-based clustering and co-clustering in the case of possibly mixed data types (continuous and/or categorical and/or count features), and this opportunity is used to revisit what the related data units are. Such a formalization allows us to raise three important points: (i) the pair (unit, model) is not identifiable, so that different unit/model interpretations of the same whole modeling process are always possible; (ii) combining different “classical” units with different “classical” models offers an opportunity for a cheap, wide and meaningful expansion of the family of modeling processes defined by the pair (unit, model); (iii) if necessary, this pair, up to the non-identifiability property, can be selected by any traditional model selection criterion. Some experiments on real data sets illustrate in detail the practical benefits arising from these three points.
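
As a rough illustration of points (i) and (iii), the sketch below compares two (unit, model) choices on the same positive data: a Gaussian model on the raw unit and a Gaussian model on the log unit. It is not the authors' method or code. It is a one-cluster simplification written under the assumption that a change of unit y = log(x) is accounted for by adding the log-Jacobian term to the likelihood, so that both choices are scored on the raw-data scale and can be ranked by BIC. The function names (gaussian_loglik, bic, compare_units) and the simulated data are hypothetical.

```python
import numpy as np

def gaussian_loglik(y):
    """Maximised Gaussian log-likelihood of a 1-D sample (MLE mean and variance)."""
    n = y.size
    var = y.var()  # MLE variance (ddof=0)
    return -0.5 * n * (np.log(2.0 * np.pi * var) + 1.0)

def bic(loglik, n_params, n):
    """BIC in the 'smaller is better' convention."""
    return -2.0 * loglik + n_params * np.log(n)

def compare_units(x):
    """Score (raw unit, Gaussian) and (log unit, Gaussian) on the raw-data scale."""
    n = x.size
    ll_raw = gaussian_loglik(x)
    # Change of variable y = log(x): the density of x gains the Jacobian factor 1/x,
    # so (log unit, Gaussian) is the same modeling process as (raw unit, log-normal).
    ll_log = gaussian_loglik(np.log(x)) - np.log(x).sum()
    # Both candidates have two free parameters (mean and variance).
    return {"raw": bic(ll_raw, 2, n), "log": bic(ll_log, 2, n)}

# Simulated positive data (an assumption made for illustration only).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=1.0, sigma=0.8, size=500)

scores = compare_units(x)
best = min(scores, key=scores.get)
print(scores, "selected unit:", best)
```

Because the Jacobian term is included, the BIC difference between the two candidates does not change if the raw data are rescaled by a constant (for example, metres to millimetres), which is one way to see that the comparison concerns the whole modeling process rather than the unit or the model taken alone.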

Suggested Citation

  • Christophe Biernacki & Alexandre Lourme, 2019. "Unifying data units and models in (co-)clustering," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(1), pages 7-31, March.
  • Handle: RePEc:spr:advdac:v:13:y:2019:i:1:d:10.1007_s11634-018-0325-2
    DOI: 10.1007/s11634-018-0325-2

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11634-018-0325-2
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    References listed on IDEAS

    1. Damien McParland & Isobel Claire Gormley, 2016. "Model based clustering for mixed data: clustMD," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 10(2), pages 155-169, June.
    2. Hilbe,Joseph M., 2014. "Modeling Count Data," Cambridge Books, Cambridge University Press, number 9781107611252.
    3. Biernacki, Christophe & Jacques, Julien, 2013. "A generative model for rank data based on insertion sort algorithm," Computational Statistics & Data Analysis, Elsevier, vol. 58(C), pages 162-176.
    4. Moustaki, Irini & Papageorgiou, Ioulia, 2005. "Latent class models for mixed variables with applications in Archaeometry," Computational Statistics & Data Analysis, Elsevier, vol. 48(3), pages 659-675, March.
    5. Atkinson, A.C. & Riani, M., 2007. "Exploratory tools for clustering multivariate data," Computational Statistics & Data Analysis, Elsevier, vol. 52(1), pages 272-285, September.
    6. McLachlan, G. J. & Peel, D. & Bean, R. W., 2003. "Modelling high-dimensional data by mixtures of factor analyzers," Computational Statistics & Data Analysis, Elsevier, vol. 41(3-4), pages 379-388, January.
    7. Prates, Marcos Oliveira & Lachos, Victor Hugo & Barbosa Cabral, Celso Rômulo, 2013. "mixsmsn: Fitting Finite Mixture of Scale Mixture of Skew-Normal Distributions," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 54(i12).
    8. Lebret, Rémi & Iovleff, Serge & Langrognet, Florent & Biernacki, Christophe & Celeux, Gilles & Govaert, Gérard, 2015. "Rmixmod: The R Package of the Model-Based Unsupervised, Supervised, and Semi-Supervised Classification Mixmod Library," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 67(i06).
    9. Tadesse, Mahlet G. & Sha, Naijun & Vannucci, Marina, 2005. "Bayesian Variable Selection in Clustering High-Dimensional Data," Journal of the American Statistical Association, American Statistical Association, vol. 100, pages 602-617, June.
    10. Raftery, Adrian E. & Dean, Nema, 2006. "Variable Selection for Model-Based Clustering," Journal of the American Statistical Association, American Statistical Association, vol. 101, pages 168-178, March.

    Citations

    Cited by:

    1. Selosse, Margot & Jacques, Julien & Biernacki, Christophe, 2020. "Model-based co-clustering for mixed type data," Computational Statistics & Data Analysis, Elsevier, vol. 144(C).
    2. Sanjeena Subedi & Paul D. McNicholas, 2021. "A Variational Approximations-DIC Rubric for Parameter Estimation and Mixture Model Selection Within a Family Setting," Journal of Classification, Springer;The Classification Society, vol. 38(1), pages 89-108, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Gilles Celeux & Cathy Maugis-Rabusseau & Mohammed Sedki, 2019. "Variable selection in model-based clustering and discriminant analysis with a regularization approach," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(1), pages 259-278, March.
    2. Christophe Biernacki & Matthieu Marbac & Vincent Vandewalle, 2021. "Gaussian-Based Visualization of Gaussian and Non-Gaussian-Based Clustering," Journal of Classification, Springer;The Classification Society, vol. 38(1), pages 129-157, April.
    3. Bouveyron, Charles & Brunet-Saumard, Camille, 2014. "Model-based clustering of high-dimensional data: A review," Computational Statistics & Data Analysis, Elsevier, vol. 71(C), pages 52-78.
    4. Jian Guo & Elizaveta Levina & George Michailidis & Ji Zhu, 2010. "Pairwise Variable Selection for High-Dimensional Model-Based Clustering," Biometrics, The International Biometric Society, vol. 66(3), pages 793-804, September.
    5. Cathy Maugis & Gilles Celeux & Marie-Laure Martin-Magniette, 2009. "Variable Selection for Clustering with Gaussian Mixture Models," Biometrics, The International Biometric Society, vol. 65(3), pages 701-709, September.
    6. Alessandro Casa & Andrea Cappozzo & Michael Fop, 2022. "Group-Wise Shrinkage Estimation in Penalized Model-Based Clustering," Journal of Classification, Springer;The Classification Society, vol. 39(3), pages 648-674, November.
    7. Semhar Michael & Volodymyr Melnykov, 2016. "An effective strategy for initializing the EM algorithm in finite mixture models," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 10(4), pages 563-583, December.
    8. repec:jss:jstsof:18:i06 is not listed on IDEAS
    9. Montanari, Angela & Viroli, Cinzia, 2011. "Maximum likelihood estimation of mixtures of factor analyzers," Computational Statistics & Data Analysis, Elsevier, vol. 55(9), pages 2712-2723, September.
    10. Zhang, Q. & Ip, E.H., 2014. "Variable assessment in latent class models," Computational Statistics & Data Analysis, Elsevier, vol. 77(C), pages 146-156.
    11. Germán Caruso & Walter Sosa-Escudero & Marcela Svarc, 2015. "Deprivation and the Dimensionality of Welfare: A Variable-Selection Cluster-Analysis Approach," Review of Income and Wealth, International Association for Research in Income and Wealth, vol. 61(4), pages 702-722, December.
    12. Bouveyron, C. & Girard, S. & Schmid, C., 2007. "High-dimensional data clustering," Computational Statistics & Data Analysis, Elsevier, vol. 52(1), pages 502-519, September.
    13. Maugis, C. & Celeux, G. & Martin-Magniette, M.-L., 2009. "Variable selection in model-based clustering: A general variable role modeling," Computational Statistics & Data Analysis, Elsevier, vol. 53(11), pages 3872-3882, September.
    14. Mantas Svazas & Valentinas Navickas & Yuriy Bilan & Joanna Nakonieczny & Jana Spankova, 2021. "Biomass Clusterization from a Regional Perspective: The Case of Lithuania," Energies, MDPI, vol. 14(21), pages 1-15, October.
    15. McNicholas, P.D. & Murphy, T.B. & McDaid, A.F. & Frost, D., 2010. "Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models," Computational Statistics & Data Analysis, Elsevier, vol. 54(3), pages 711-723, March.
    16. Chen, Jiahua & Tan, Xianming, 2009. "Inference for multivariate normal mixtures," Journal of Multivariate Analysis, Elsevier, vol. 100(7), pages 1367-1383, August.
    17. Charles Bouveyron & Camille Brunet-Saumard, 2014. "Discriminative variable selection for clustering with the sparse Fisher-EM algorithm," Computational Statistics, Springer, vol. 29(3), pages 489-513, June.
    18. Morris, Katherine & Punzo, Antonio & McNicholas, Paul D. & Browne, Ryan P., 2019. "Asymmetric clusters and outliers: Mixtures of multivariate contaminated shifted asymmetric Laplace distributions," Computational Statistics & Data Analysis, Elsevier, vol. 132(C), pages 145-166.
    19. Sahin, Özge & Czado, Claudia, 2022. "Vine copula mixture models and clustering for non-Gaussian data," Econometrics and Statistics, Elsevier, vol. 22(C), pages 136-158.
    20. Pełka Marcin, 2019. "Analysis of Happiness in EU Countries Using the Multi-Model Classification based on Models of Symbolic Data," Econometrics. Advances in Applied Data Analysis, Sciendo, vol. 23(3), pages 15-25, September.
    21. Andrews, Jeffrey L. & McNicholas, Paul D. & Subedi, Sanjeena, 2011. "Model-based classification via mixtures of multivariate t-distributions," Computational Statistics & Data Analysis, Elsevier, vol. 55(1), pages 520-529, January.
