
k-means clustering for persistent homology

Author

Listed:
  • Yueqi Cao

    (Imperial College London, South Kensington Campus)

  • Prudence Leung

    (Imperial College London, South Kensington Campus)

  • Anthea Monod

    (Imperial College London, South Kensington Campus)

Abstract

Persistent homology is a methodology central to topological data analysis that extracts and summarizes the topological features within a dataset as a persistence diagram. It has recently gained much popularity from its myriad successful applications to many domains; however, its algebraic construction induces a metric space of persistence diagrams with a highly complex geometry. In this paper, we prove convergence of the k-means clustering algorithm on persistence diagram space and establish theoretical properties of the solution to the optimization problem in the Karush–Kuhn–Tucker framework. Additionally, we perform numerical experiments on both simulated and real data using various representations of persistent homology, including embeddings of persistence diagrams as well as the diagrams themselves and their generalizations as persistence measures. We find that k-means clustering performed directly on persistence diagrams and measures outperforms clustering on their vectorized representations.
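
To make the vectorized setting mentioned in the abstract concrete, the following is a minimal sketch of k-means (Lloyd's algorithm) applied to a simple embedding of persistence diagrams. It is not the authors' method: the embedding (the largest lifetimes of each diagram, zero-padded), the farthest-point initialization, the toy diagrams, and all function names are assumptions chosen for illustration. Clustering directly on diagram space, as studied in the paper, would replace the Euclidean distance and the coordinate-wise mean with a diagram metric and a corresponding mean, which this sketch does not attempt.

```python
def persistence_vector(diagram, length):
    """Embed a diagram [(birth, death), ...] as its `length` largest
    lifetimes (death - birth), zero-padded. Illustrative embedding only."""
    lifetimes = sorted((d - b for b, d in diagram), reverse=True)[:length]
    return lifetimes + [0.0] * (length - len(lifetimes))


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(vectors):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def farthest_point_init(points, k):
    """Deterministic initialization: start from the first point, then
    repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm: alternate nearest-centroid assignment and
    centroid update until the assignments stop changing."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the iteration has converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = mean(members)
    return labels, centroids


# Toy diagrams (made up): the first three contain one long-lived feature,
# the last three contain only short-lived features.
diagrams = [
    [(0.0, 2.0), (0.1, 0.3)],
    [(0.0, 2.1), (0.2, 0.3)],
    [(0.1, 1.9)],
    [(0.0, 0.2), (0.1, 0.3)],
    [(0.0, 0.3)],
    [(0.2, 0.4), (0.1, 0.2)],
]
points = [persistence_vector(d, length=2) for d in diagrams]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The deterministic initialization is used here only so that the toy run is reproducible; k-means in general converges to a local optimum that depends on the starting centroids.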

Suggested Citation

  • Yueqi Cao & Prudence Leung & Anthea Monod, 2025. "k-means clustering for persistent homology," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 19(1), pages 95-119, March.
  • Handle: RePEc:spr:advdac:v:19:y:2025:i:1:d:10.1007_s11634-023-00578-y
    DOI: 10.1007/s11634-023-00578-y

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11634-023-00578-y
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Zhao, Yingrui & Hu, Songhua & Zhang, Ming, 2024. "Evaluating equitable Transit-Oriented development (TOD) via the Node-Place-People model," Transportation Research Part A: Policy and Practice, Elsevier, vol. 185(C).
    2. Boztug, Yasemin & Reutterer, Thomas, 2008. "A combined approach for segment-specific market basket analysis," European Journal of Operational Research, Elsevier, vol. 187(1), pages 294-312, May.
    3. Michael Brusco & Hans-Friedrich Köhn, 2009. "Exemplar-Based Clustering via Simulated Annealing," Psychometrika, Springer;The Psychometric Society, vol. 74(3), pages 457-475, September.
    4. Douglas Steinley, 2007. "Validating Clusters with the Lower Bound for Sum-of-Squares Error," Psychometrika, Springer;The Psychometric Society, vol. 72(1), pages 93-106, March.
    5. Boztuğ, Yasemin & Reutterer, Thomas, 2006. "A combined approach for segment-specific analysis of market basket data," SFB 649 Discussion Papers 2006-006, Humboldt University Berlin, Collaborative Research Center 649: Economic Risk.
    6. Sara Dolnicar & Friedrich Leisch, 2010. "Evaluation of structure and reproducibility of cluster solutions using the bootstrap," Marketing Letters, Springer, vol. 21(1), pages 83-101, March.
    7. Wu, Han-Ming & Tien, Yin-Jing & Chen, Chun-houh, 2010. "GAP: A graphical environment for matrix visualization and cluster analysis," Computational Statistics & Data Analysis, Elsevier, vol. 54(3), pages 767-778, March.
    8. José E. Chacón, 2021. "Explicit Agreement Extremes for a 2 × 2 Table with Given Marginals," Journal of Classification, Springer;The Classification Society, vol. 38(2), pages 257-263, July.
    9. Roberto Rocci & Stefano Antonio Gattone & Roberto Di Mari, 2018. "A data driven equivariant approach to constrained Gaussian mixture modeling," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(2), pages 235-260, June.
    10. Redivo, Edoardo & Nguyen, Hien D. & Gupta, Mayetri, 2020. "Bayesian clustering of skewed and multimodal data using geometric skewed normal distributions," Computational Statistics & Data Analysis, Elsevier, vol. 152(C).
    11. Becken, Susanne & Stantic, Bela & Chen, Jinyan & Connolly, Rod M., 2022. "Twitter conversations reveal issue salience of aviation in the broader context of climate change," Journal of Air Transport Management, Elsevier, vol. 98(C).
    12. Orietta Nicolis & Jean Paul Maidana & Fabian Contreras & Danilo Leal, 2024. "Analyzing the Impact of COVID-19 on Economic Sustainability: A Clustering Approach," Sustainability, MDPI, vol. 16(4), pages 1-30, February.
    13. Zhu, Xuwen & Melnykov, Volodymyr, 2018. "Manly transformation in finite mixture modeling," Computational Statistics & Data Analysis, Elsevier, vol. 121(C), pages 190-208.
    14. Amiri, Babak & Karimianghadim, Ramin, 2024. "A novel text clustering model based on topic modelling and social network analysis," Chaos, Solitons & Fractals, Elsevier, vol. 181(C).
    15. Li, Pai-Ling & Chiou, Jeng-Min, 2011. "Identifying cluster number for subspace projected functional data clustering," Computational Statistics & Data Analysis, Elsevier, vol. 55(6), pages 2090-2103, June.
    16. A van Giessen & K G M Moons & G A de Wit & W M M Verschuren & J M A Boer & H Koffijberg, 2015. "Tailoring the Implementation of New Biomarkers Based on Their Added Predictive Value in Subgroups of Individuals," PLOS ONE, Public Library of Science, vol. 10(1), pages 1-14, January.
    17. Yaeji Lim & Hee-Seok Oh & Ying Kuen Cheung, 2019. "Multiscale Clustering for Functional Data," Journal of Classification, Springer;The Classification Society, vol. 36(2), pages 368-391, July.
    18. Stefano Tonellato & Andrea Pastore, 2013. "On the comparison of model-based clustering solutions," Working Papers 2013:05, Department of Economics, University of Venice "Ca' Foscari".
    19. Elvira Pelle & Roberta Pappadà, 2021. "A clustering procedure for mixed-type data to explore ego network typologies: an application to elderly people living alone in Italy," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 30(5), pages 1507-1533, December.
    20. Renato Cordeiro Amorim, 2016. "A Survey on Feature Weighting Based K-Means Algorithms," Journal of Classification, Springer;The Classification Society, vol. 33(2), pages 210-242, July.
