IDEAS home Printed from https://ideas.repec.org/a/spr/metron/v77y2019i3d10.1007_s40300-019-00162-5.html
   My bibliography  Save this article

Comparing clusterings using combination of the kappa statistic and entropy-based measure

Author

Listed:
  • Evženie Uglickich

    (The Czech Academy of Sciences, Institute of Information Theory and Automation)

  • Ivan Nagy

    (The Czech Academy of Sciences, Institute of Information Theory and Automation
    Czech Technical University)

  • Dominika Vlčková

    (Czech Technical University)

Abstract

The paper focuses on a problem of comparing clusterings with the same number of clusters obtained as a result of using different clustering algorithms. It proposes a method of the evaluation of the agreement of clusterings based on the combination of the Cohen’s kappa statistic and the normalized mutual information. The main contributions of the proposed approach are: (i) the reliable use in practice in the case of a small fixed number of clusters, (ii) the suitability to comparing clusterings with a higher number of clusters in contrast with the original statistics, (iii) the independence on size of the data set and shape of clusters. Results of the experimental validation of the proposed statistic using both simulations and real data sets as well as the comparison with the theoretical counterparts are demonstrated.

Suggested Citation

  • Evženie Uglickich & Ivan Nagy & Dominika Vlčková, 2019. "Comparing clusterings using combination of the kappa statistic and entropy-based measure," METRON, Springer;Sapienza Università di Roma, vol. 77(3), pages 253-270, December.
  • Handle: RePEc:spr:metron:v:77:y:2019:i:3:d:10.1007_s40300-019-00162-5
    DOI: 10.1007/s40300-019-00162-5
    as

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s40300-019-00162-5
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s40300-019-00162-5?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item
    ---><---

    As the access to this document is restricted, you may want to search for a different version of it.

    References listed on IDEAS

    as
    1. Schütz, Thomas & Schraven, Markus Hans & Fuchs, Marcus & Remmen, Peter & Müller, Dirk, 2018. "Comparison of clustering algorithms for the selection of typical demand days for energy system synthesis," Renewable Energy, Elsevier, vol. 129(PA), pages 570-582.
    2. Utkarsh J. Dang & Antonio Punzo & Paul D. McNicholas & Salvatore Ingrassia & Ryan P. Browne, 2017. "Multivariate Response and Parsimony for Gaussian Cluster-Weighted Models," Journal of Classification, Springer;The Classification Society, vol. 34(1), pages 4-34, April.
    3. Lawrence Hubert & Phipps Arabie, 1985. "Comparing partitions," Journal of Classification, Springer;The Classification Society, vol. 2(1), pages 193-218, December.
    4. Salvatore Ingrassia & Antonio Punzo & Giorgio Vittadini & Simona Minotti, 2015. "The Generalized Linear Mixed Cluster-Weighted Model," Journal of Classification, Springer;The Classification Society, vol. 32(1), pages 85-113, April.
    Full references (including those not matched with items on IDEAS)

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Diani, Cecilia & Galimberti, Giuliano & Soffritti, Gabriele, 2022. "Multivariate cluster-weighted models based on seemingly unrelated linear regression," Computational Statistics & Data Analysis, Elsevier, vol. 171(C).
    2. Salvatore D. Tomarchio & Paul D. McNicholas & Antonio Punzo, 2021. "Matrix Normal Cluster-Weighted Models," Journal of Classification, Springer;The Classification Society, vol. 38(3), pages 556-575, October.
    3. Michael P. B. Gallaugher & Salvatore D. Tomarchio & Paul D. McNicholas & Antonio Punzo, 2022. "Multivariate cluster weighted models using skewed distributions," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 16(1), pages 93-124, March.
    4. Keefe Murphy & Thomas Brendan Murphy, 2020. "Gaussian parsimonious clustering models with covariates and a noise component," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 14(2), pages 293-325, June.
    5. Sangkon Oh & Byungtae Seo, 2023. "Merging Components in Linear Gaussian Cluster-Weighted Models," Journal of Classification, Springer;The Classification Society, vol. 40(1), pages 25-51, April.
    6. Benjamin Auder & Elisabeth Gassiat & Mor Absa Loum, 2021. "Least squares moment identification of binary regression mixture models," Metrika: International Journal for Theoretical and Applied Statistics, Springer, vol. 84(4), pages 561-593, May.
    7. Salvatore Ingrassia & Antonio Punzo, 2020. "Cluster Validation for Mixtures of Regressions via the Total Sum of Squares Decomposition," Journal of Classification, Springer;The Classification Society, vol. 37(2), pages 526-547, July.
    8. Morris, Katherine & Punzo, Antonio & McNicholas, Paul D. & Browne, Ryan P., 2019. "Asymmetric clusters and outliers: Mixtures of multivariate contaminated shifted asymmetric Laplace distributions," Computational Statistics & Data Analysis, Elsevier, vol. 132(C), pages 145-166.
    9. Kim, Nam-Hwui & Browne, Ryan P., 2021. "In the pursuit of sparseness: A new rank-preserving penalty for a finite mixture of factor analyzers," Computational Statistics & Data Analysis, Elsevier, vol. 160(C).
    10. Počuča, Nikola & Jevtić, Petar & McNicholas, Paul D. & Miljkovic, Tatjana, 2020. "Modeling frequency and severity of claims with the zero-inflated generalized cluster-weighted models," Insurance: Mathematics and Economics, Elsevier, vol. 94(C), pages 79-93.
    11. Gabriele Soffritti, 2021. "Estimating the Covariance Matrix of the Maximum Likelihood Estimator Under Linear Cluster-Weighted Models," Journal of Classification, Springer;The Classification Society, vol. 38(3), pages 594-625, October.
    12. Gabriele Perrone & Gabriele Soffritti, 2023. "Seemingly unrelated clusterwise linear regression for contaminated data," Statistical Papers, Springer, vol. 64(3), pages 883-921, June.
    13. Yang, Yu-Chen & Lin, Tsung-I & Castro, Luis M. & Wang, Wan-Lun, 2020. "Extending finite mixtures of t linear mixed-effects models with concomitant covariates," Computational Statistics & Data Analysis, Elsevier, vol. 148(C).
    14. Roberto Mari & Salvatore Ingrassia & Antonio Punzo, 2023. "Local and Overall Deviance R-Squared Measures for Mixtures of Generalized Linear Models," Journal of Classification, Springer;The Classification Society, vol. 40(2), pages 233-266, July.
    15. Michael P. B. Gallaugher & Paul D. McNicholas, 2019. "On Fractionally-Supervised Classification: Weight Selection and Extension to the Multivariate t-Distribution," Journal of Classification, Springer;The Classification Society, vol. 36(2), pages 232-265, July.
    16. Utkarsh J. Dang & Antonio Punzo & Paul D. McNicholas & Salvatore Ingrassia & Ryan P. Browne, 2017. "Multivariate Response and Parsimony for Gaussian Cluster-Weighted Models," Journal of Classification, Springer;The Classification Society, vol. 34(1), pages 4-34, April.
    17. Antonio Punzo & Paul. D. McNicholas, 2017. "Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model," Journal of Classification, Springer;The Classification Society, vol. 34(2), pages 249-293, July.
    18. Angelo Mazza & Antonio Punzo, 2020. "Mixtures of multivariate contaminated normal regression models," Statistical Papers, Springer, vol. 61(2), pages 787-822, April.
    19. Sanjeena Subedi & Paul D. McNicholas, 2021. "A Variational Approximations-DIC Rubric for Parameter Estimation and Mixture Model Selection Within a Family Setting," Journal of Classification, Springer;The Classification Society, vol. 38(1), pages 89-108, April.
    20. Naderi, Mehrdad & Mirfarah, Elham & Wang, Wan-Lun & Lin, Tsung-I, 2023. "Robust mixture regression modeling based on the normal mean-variance mixture distributions," Computational Statistics & Data Analysis, Elsevier, vol. 180(C).

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:spr:metron:v:77:y:2019:i:3:d:10.1007_s40300-019-00162-5. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Sonal Shukla or Springer Nature Abstracting and Indexing (email available below). General contact details of provider: http://www.springer.com .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.