
SMLSOM: The shrinking maximum likelihood self-organizing map

Author

Listed:
  • Motegi, Ryosuke
  • Seki, Yoichi

Abstract

Determining the number of clusters in a dataset is a fundamental issue in data clustering. Many methods have been proposed that treat the selection of the number of clusters as a model selection problem. This paper proposes an efficient algorithm that automatically selects a suitable number of clusters within a probability distribution model framework. The algorithm has two components. First, a generalization of Kohonen's self-organizing map (SOM) is introduced. In Kohonen's SOM, clusters are modeled as mean vectors; in the generalized SOM, each cluster is modeled as a probability distribution and constructed from samples classified according to likelihood. Second, a method for dynamically updating the SOM structure is introduced. In Kohonen's SOM, each cluster is tied to a node of a fixed two-dimensional lattice, and learning uses neighborhood relations between nodes based on Euclidean distance. The extended SOM defines a graph with clusters as vertices and neighborhood relations as links, and updates the graph structure by cutting weak links and deleting unnecessary vertices. The weakness of a link is measured by the Kullback–Leibler divergence, and the redundancy of a vertex is measured by the minimum description length. These extensions make it efficient to determine the appropriate number of clusters. Compared with existing methods, the proposed method is computationally efficient and accurately selects the number of clusters.
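
The abstract names two quantitative criteria for pruning the cluster graph: link weakness measured by the Kullback–Leibler divergence and vertex redundancy measured by the minimum description length. The sketch below illustrates what such criteria could look like under the assumption of Gaussian cluster models; the function names, the symmetrized KL form, and the two-part MDL score are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of KL-based link weakness and an MDL-style deletion score,
# assuming Gaussian cluster models. Illustrative only; not the paper's code.
import numpy as np

def gaussian_kl(mu1, cov1, mu2, cov2):
    """KL divergence KL(N1 || N2) between two multivariate Gaussians."""
    d = mu1.shape[0]
    cov2_inv = np.linalg.inv(cov2)
    diff = mu2 - mu1
    return 0.5 * (
        np.trace(cov2_inv @ cov1)
        + diff @ cov2_inv @ diff
        - d
        + np.log(np.linalg.det(cov2) / np.linalg.det(cov1))
    )

def link_weakness(cluster_a, cluster_b):
    """Symmetrized KL divergence taken here as the 'weakness' of a link:
    the larger the divergence, the weaker the neighborhood relation."""
    return 0.5 * (
        gaussian_kl(cluster_a["mu"], cluster_a["cov"],
                    cluster_b["mu"], cluster_b["cov"])
        + gaussian_kl(cluster_b["mu"], cluster_b["cov"],
                      cluster_a["mu"], cluster_a["cov"])
    )

def mdl_score(X, clusters, labels):
    """Two-part MDL-style score: model description cost plus negative
    log-likelihood. Lower is better; a vertex (cluster) deletion would be
    accepted if it lowers this score."""
    n, d = X.shape
    k = len(clusters)
    params_per_cluster = d + d * (d + 1) / 2        # mean + covariance
    model_cost = 0.5 * k * params_per_cluster * np.log(n)
    nll = 0.0
    for j, c in enumerate(clusters):
        Xj = X[labels == j]
        if len(Xj) == 0:
            continue
        diff = Xj - c["mu"]
        cov_inv = np.linalg.inv(c["cov"])
        nll += 0.5 * np.sum(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
        nll += 0.5 * len(Xj) * (d * np.log(2 * np.pi)
                                + np.log(np.linalg.det(c["cov"])))
    return model_cost + nll
```

In a shrinking loop of this kind, links whose weakness exceeds a threshold would be cut and a vertex deletion accepted only if the MDL-style score of the reduced model decreases; the exact thresholding and coding of model cost follow the paper rather than this sketch.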

Suggested Citation

  • Motegi, Ryosuke & Seki, Yoichi, 2023. "SMLSOM: The shrinking maximum likelihood self-organizing map," Computational Statistics & Data Analysis, Elsevier, vol. 182(C).
  • Handle: RePEc:eee:csdana:v:182:y:2023:i:c:s0167947323000257
    DOI: 10.1016/j.csda.2023.107714

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167947323000257
    Download Restriction: Full text for ScienceDirect subscribers only.

    File URL: https://libkey.io/10.1016/j.csda.2023.107714?utm_source=ideas
    LibKey link: If access is restricted and your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item.

    As access to this document is restricted, you may want to search for a different version of it.

    References listed on IDEAS

    1. Hansen M. H & Yu B., 2001. "Model Selection and the Principle of Minimum Description Length," Journal of the American Statistical Association, American Statistical Association, vol. 96, pages 746-774, June.
    2. Sylvia. Richardson & Peter J. Green, 1997. "On Bayesian Analysis of Mixtures with an Unknown Number of Components (with discussion)," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 59(4), pages 731-792.
    3. Melnykov, Volodymyr & Chen, Wei-Chen & Maitra, Ranjan, 2012. "MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 51(i12).
    4. P. M. Hartigan, 1985. "Computation of the Dip Statistic to Test for Unimodality," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 34(3), pages 320-325, November.
    5. Lawrence Hubert & Phipps Arabie, 1985. "Comparing partitions," Journal of Classification, Springer;The Classification Society, vol. 2(1), pages 193-218, December.
    6. Fraley, Chris & Raftery, Adrian, 2007. "Model-based Methods of Classification: Using the mclust Software in Chemometrics," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 18(i06).
    7. Hofmeyr, David P., 2020. "Degrees of freedom and model selection for k-means clustering," Computational Statistics & Data Analysis, Elsevier, vol. 149(C).
    8. Corsini, Noemi & Viroli, Cinzia, 2022. "Dealing with overdispersion in multivariate count data," Computational Statistics & Data Analysis, Elsevier, vol. 170(C).
    Full references (including those not matched with items on IDEAS)

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Xuwen Zhu & Volodymyr Melnykov, 2015. "Probabilistic assessment of model-based clustering," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 9(4), pages 395-422, December.
    2. Zhu, Xuwen & Melnykov, Volodymyr, 2018. "Manly transformation in finite mixture modeling," Computational Statistics & Data Analysis, Elsevier, vol. 121(C), pages 190-208.
    3. Wang, Ketong & Porter, Michael D., 2018. "Optimal Bayesian clustering using non-negative matrix factorization," Computational Statistics & Data Analysis, Elsevier, vol. 128(C), pages 395-411.
    4. Im, Yunju & Tan, Aixin, 2021. "Bayesian subgroup analysis in regression using mixture models," Computational Statistics & Data Analysis, Elsevier, vol. 162(C).
    5. J. Fernando Vera & Rodrigo Macías, 2021. "On the Behaviour of K-Means Clustering of a Dissimilarity Matrix by Means of Full Multidimensional Scaling," Psychometrika, Springer;The Psychometric Society, vol. 86(2), pages 489-513, June.
    6. Yuan Fang & Dimitris Karlis & Sanjeena Subedi, 2022. "Infinite Mixtures of Multivariate Normal-Inverse Gaussian Distributions for Clustering of Skewed Data," Journal of Classification, Springer;The Classification Society, vol. 39(3), pages 510-552, November.
    7. Sylvia Frühwirth-Schnatter & Gertraud Malsiner-Walli, 2019. "From here to infinity: sparse finite versus Dirichlet process mixtures in model-based clustering," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(1), pages 33-64, March.
    8. Andrea Cerasa, 2016. "Combining homogeneous groups of preclassified observations with application to international trade," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 70(3), pages 229-259, August.
    9. Abby Flynt & Nema Dean & Rebecca Nugent, 2019. "sARI: a soft agreement measure for class partitions incorporating assignment probabilities," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(1), pages 303-323, March.
    10. Dolnicar, Sara & Grün, Bettina & Leisch, Friedrich, 2016. "Increasing sample size compensates for data problems in segmentation studies," Journal of Business Research, Elsevier, vol. 69(2), pages 992-999.
    11. Lin, Tsung-I & McLachlan, Geoffrey J. & Lee, Sharon X., 2016. "Extending mixtures of factor models using the restricted multivariate skew-normal distribution," Journal of Multivariate Analysis, Elsevier, vol. 143(C), pages 398-413.
    12. Melnykov, Volodymyr, 2013. "On the distribution of posterior probabilities in finite mixture models with application in clustering," Journal of Multivariate Analysis, Elsevier, vol. 122(C), pages 175-189.
    13. Melnykov, Volodymyr, 2016. "ClickClust: An R Package for Model-Based Clustering of Categorical Sequences," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 74(i09).
    14. Melnykov, Igor & Melnykov, Volodymyr, 2014. "On K-means algorithm with the use of Mahalanobis distances," Statistics & Probability Letters, Elsevier, vol. 84(C), pages 88-95.
    15. Efthymios Costa & Ioanna Papatsouma & Angelos Markos, 2023. "Benchmarking distance-based partitioning methods for mixed-type data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 17(3), pages 701-724, September.
    16. Marco Berrettini & Giuliano Galimberti & Saverio Ranciati, 2023. "Semiparametric finite mixture of regression models with Bayesian P-splines," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 17(3), pages 745-775, September.
    17. Siow Hoo Leong & Seng Huat Ong, 2017. "Similarity measure and domain adaptation in multiple mixture model clustering: An application to image processing," PLOS ONE, Public Library of Science, vol. 12(7), pages 1-30, July.
    18. Nalan Basturk & Lennart Hoogerheide & Herman K. van Dijk, 2021. "Bayes estimates of multimodal density features using DNA and Economic Data," Tinbergen Institute Discussion Papers 21-017/III, Tinbergen Institute.
    19. Sanjeena Subedi & Paul D. McNicholas, 2021. "A Variational Approximations-DIC Rubric for Parameter Estimation and Mixture Model Selection Within a Family Setting," Journal of Classification, Springer;The Classification Society, vol. 38(1), pages 89-108, April.
    20. Olsen, Jerome & Kasper, Matthias & Kogler, Christoph & Muehlbacher, Stephan & Kirchler, Erich, 2019. "Mental accounting of income tax and value added tax among self-employed business owners," Journal of Economic Psychology, Elsevier, vol. 70(C), pages 125-139.

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:eee:csdana:v:182:y:2023:i:c:s0167947323000257. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows you to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form .

    If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Catherine Liu (email available below). General contact details of provider: http://www.elsevier.com/locate/csda.

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.