Clustering Approaches for Mixed‐Type Data: A Comparative Study

Author

Listed:

Badih Ghattas
(AMSE - Aix-Marseille Sciences Economiques - EHESS - École des hautes études en sciences sociales - AMU - Aix Marseille Université - ECM - École Centrale de Marseille - CNRS - Centre National de la Recherche Scientifique)
Alvaro Sanchez San-Benito
(Airbus Helicopters - Aeroport International de Marseille-Provence)

Registered:

Badih Ghattas

Abstract

Clustering is widely used in unsupervised learning to find homogeneous groups of observations within a dataset. However, clustering mixed-type data remains a challenge, as few existing approaches are suited to this task. This study reviews the state of the art in these approaches and compares them using various simulation models. The compared methods include the distance-based approaches k-prototypes, PDQ, and convex k-means, and the probabilistic methods KAy-means for MIxed LArge data (KAMILA), the mixture of Bayesian networks (MBNs), and the latent class model (LCM). The aim is to provide insights into the behavior of the different methods across a wide range of scenarios by varying experimental factors such as the number of clusters, cluster overlap, sample size, dimension, proportion of continuous variables in the dataset, and the clusters' distribution. The degree of cluster overlap, the proportion of continuous variables in the dataset, and the sample size have a significant impact on the observed performance. When strong interactions exist between variables alongside an explicit dependence on cluster membership, none of the evaluated methods demonstrated satisfactory performance. In our experiments, KAMILA, LCM, and k-prototypes exhibited the best performance with respect to the adjusted Rand index (ARI). All the methods are available in R.
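
The abstract ranks methods by the adjusted Rand index, which measures agreement between two partitions and corrects it for chance. The following is a minimal Python sketch of the standard pair-counting formula, offered only as an illustration: the function name adjusted_rand_index and the toy label vectors are assumptions made here, and this is not the authors' evaluation code (the study itself reports that its methods are available in R).

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two partitions (pair-counting form).

    Hypothetical helper written for illustration; not taken from the paper.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two observations: the partitions trivially agree.
        return 1.0

    # Contingency table cells and its row and column margins.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of pairs placed together in both partitions, and in each one.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2         # upper bound on the index

    if max_index == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy labels (assumed for illustration only).
truth = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]   # same partition, different label names
crossed = [0, 1, 0, 1]     # splits every true cluster in half

ari_same = adjusted_rand_index(truth, relabeled)
ari_crossed = adjusted_rand_index(truth, crossed)
```

An ARI of 1 indicates identical partitions regardless of how the clusters are labeled, values near 0 indicate agreement no better than chance, and negative values indicate agreement worse than chance.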

Suggested Citation

Badih Ghattas & Alvaro Sanchez San-Benito, 2025. "Clustering Approaches for Mixed‐Type Data: A Comparative Study," Post-Print hal-05069567, HAL.

Handle: RePEc:hal:journl:hal-05069567
DOI: 10.1155/jpas/2242100
Note: View the original document on HAL open archive server: https://hal.science/hal-05069567v1

Other versions of this item:

Badih Ghattas & Alvaro Sanchez San-Benito, 2025. "Clustering Approaches for Mixed-Type Data: A Comparative Study," Journal of Probability and Statistics, Hindawi, vol. 2025, pages 1-14, February.
Badih Ghattas & Alvaro Sanchez San-Benito, 2025. "Clustering Approaches for Mixed‐Type Data: A Comparative Study," Journal of Probability and Statistics, John Wiley & Sons, vol. 2025(1).

Most related items

These are the items that most often cite the same works as this one and are cited by the same works as this one.

Francesco Dotto & Alessio Farcomeni & Luis Angel García-Escudero & Agustín Mayo-Iscar, 2017. "A fuzzy approach to robust regression clustering," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 11(4), pages 691-710, December.
Elvira Pelle & Roberta Pappadà, 2021. "A clustering procedure for mixed-type data to explore ego network typologies: an application to elderly people living alone in Italy," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 30(5), pages 1507-1533, December.
Renato Cordeiro Amorim, 2016. "A Survey on Feature Weighting Based K-Means Algorithms," Journal of Classification, Springer;The Classification Society, vol. 33(2), pages 210-242, July.
Roberto Mari & Roberto Rocci & Stefano Antonio Gattone, 2020. "Scale-constrained approaches for maximum likelihood estimation and model selection of clusterwise linear regression models," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 29(1), pages 49-78, March.
Douglas Steinley & Michael Brusco, 2008. "Selection of Variables in Cluster Analysis: An Empirical Comparison of Eight Procedures," Psychometrika, Springer;The Psychometric Society, vol. 73(1), pages 125-144, March.
Andrea Cerasa, 2016. "Combining homogeneous groups of preclassified observations with application to international trade," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 70(3), pages 229-259, August.
Volodymyr Melnykov & Xuwen Zhu, 2019. "An extension of the K-means algorithm to clustering skewed data," Computational Statistics, Springer, vol. 34(1), pages 373-394, March.
Zaheer Ahmed & Alberto Cassese & Gerard Breukelen & Jan Schepers, 2023. "E-ReMI: Extended Maximal Interaction Two-mode Clustering," Journal of Classification, Springer;The Classification Society, vol. 40(2), pages 298-331, July.
Rocci, Roberto & Vichi, Maurizio, 2008. "Two-mode multi-partitioning," Computational Statistics & Data Analysis, Elsevier, vol. 52(4), pages 1984-2003, January.
Sharon M. McNicholas & Paul D. McNicholas & Daniel A. Ashlock, 2021. "An Evolutionary Algorithm with Crossover and Mutation for Model-Based Clustering," Journal of Classification, Springer;The Classification Society, vol. 38(2), pages 264-279, July.
Pietro Coretto & Christian Hennig, 2016. "Robust Improper Maximum Likelihood: Tuning, Computation, and a Comparison With Other Methods for Robust Gaussian Clustering," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(516), pages 1648-1659, October.
Roberto Mari & Salvatore Ingrassia & Antonio Punzo, 2023. "Local and Overall Deviance R-Squared Measures for Mixtures of Generalized Linear Models," Journal of Classification, Springer;The Classification Society, vol. 40(2), pages 233-266, July.
Marino, Maria Francesca & Pandolfi, Silvia, 2022. "Hybrid maximum likelihood inference for stochastic block models," Computational Statistics & Data Analysis, Elsevier, vol. 171(C).
Aghiles Salah & Mohamed Nadif, 2019. "Directional co-clustering," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(3), pages 591-620, September.
Promskaia, Iuliia & O'Hagan, Adrian & Fop, Michael, 2025. "A Dirichlet stochastic block model for composition-weighted networks," Computational Statistics & Data Analysis, Elsevier, vol. 211(C).
Utkarsh J. Dang & Antonio Punzo & Paul D. McNicholas & Salvatore Ingrassia & Ryan P. Browne, 2017. "Multivariate Response and Parsimony for Gaussian Cluster-Weighted Models," Journal of Classification, Springer;The Classification Society, vol. 34(1), pages 4-34, April.
Efthymios Costa & Ioanna Papatsouma & Angelos Markos, 2023. "Benchmarking distance-based partitioning methods for mixed-type data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 17(3), pages 701-724, September.
Rabea Aschenbruck & Gero Szepannek & Adalbert F. X. Wilhelm, 2023. "Imputation Strategies for Clustering Mixed-Type Data with Missing Values," Journal of Classification, Springer;The Classification Society, vol. 40(1), pages 2-24, April.
Jonathon J. O’Brien & Michael T. Lawson & Devin K. Schweppe & Bahjat F. Qaqish, 2020. "Suboptimal Comparison of Partitions," Journal of Classification, Springer;The Classification Society, vol. 37(2), pages 435-461, July.
Shuchismita Sarkar & Volodymyr Melnykov & Rong Zheng, 2020. "Gaussian mixture modeling and model-based clustering under measurement inconsistency," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 14(2), pages 379-413, June.

More about this item

NEP fields

This paper has been announced in the following NEP Reports:

NEP-ECM-2025-07-21 (Econometrics)
