Statistically validated hierarchical clustering: Nested partitions in hierarchical trees

Authors
  • Christian Bongiorno

    (MICS - Mathématiques et Informatique pour la Complexité et les Systèmes - CentraleSupélec - Université Paris-Saclay)

  • Salvatore Miccichè

    (DiFC - Dipartimento di Fisica e Chimica [Palermo] - Università degli studi di Palermo - University of Palermo)

  • Rosario N Mantegna

    (DiFC - Dipartimento di Fisica e Chimica [Palermo] - Università degli studi di Palermo - University of Palermo, CSHV - Complexity Science Hub Vienna, UCL-CS - Department of Computer science [University College of London] - UCL - University College of London [London])

Abstract

We develop a fast, scalable greedy algorithm that detects a nested partition in a dendrogram obtained from hierarchical clustering of a multivariate series. The algorithm assigns a p-value to each clade observed in the hierarchical tree. Each p-value is obtained by computing a number of bootstrap replicas of the dissimilarity matrix and performing a statistical test on the difference between the dissimilarity associated with a given clade and the dissimilarity associated with its parent node. We demonstrate the efficacy of the algorithm on a set of benchmarks generated with a hierarchical factor model, and we compare its results with those of Pvclust, a widely used algorithm that takes a global approach and was originally motivated by phylogenetic studies. Our numerical experiments focus on the role of multiple hypothesis test correction and on the robustness of the algorithms to inaccuracies and errors in the datasets. We also apply our algorithm to a reference empirical dataset. We verify that our algorithm is much faster than Pvclust and scales better both in the number of elements and in the number of records of the investigated multivariate set. It therefore provides a hierarchically nested partition in much less time than currently widely used algorithms, making statistically validated cluster analysis feasible for very large systems.
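
The bootstrap test described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The correlation-based dissimilarity, the average-linkage rule, the one-sided comparison of a clade with its parent, the Bonferroni correction, and every function name (`dissimilarity`, `average_linkage`, `clade_p_values`, `two_factor_benchmark`) are assumptions made for illustration; the abstract does not specify these details.

```python
import math
import random

def dissimilarity(series):
    """Assumed dissimilarity: d_ij = sqrt(2 * (1 - rho_ij)), rho = Pearson correlation."""
    n, t = len(series), len(series[0])
    unit = []
    for s in series:
        mean = sum(s) / t
        dev = [x - mean for x in s]
        norm = math.sqrt(sum(d * d for d in dev))
        unit.append([d / norm for d in dev])
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            rho = sum(a * b for a, b in zip(unit[i], unit[j]))
            dist[i][j] = dist[j][i] = math.sqrt(max(0.0, 2.0 * (1.0 - rho)))
    return dist

def between(dist, left, right):
    """Mean dissimilarity between two leaf sets (the average-linkage merge height)."""
    return sum(dist[i][j] for i in left for j in right) / (len(left) * len(right))

def average_linkage(dist):
    """Naive agglomerative clustering; each merge is a (left_leaves, right_leaves) pair."""
    active = [(i,) for i in range(len(dist))]
    merges = []
    while len(active) > 1:
        a, b = min(
            ((x, y) for k, x in enumerate(active) for y in active[k + 1:]),
            key=lambda pair: between(dist, *pair),
        )
        active = [c for c in active if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges

def clade_p_values(series, n_boot=200, seed=0):
    """For each non-root clade, the fraction of bootstrap replicas in which the
    clade's dissimilarity is not smaller than that of its parent clade."""
    rng = random.Random(seed)
    t = len(series[0])
    merges = average_linkage(dissimilarity(series))
    pairs = []
    for child in merges[:-1]:  # the last merge is the root and has no parent
        members = set(child[0] + child[1])
        parent = next(m for m in merges if members in (set(m[0]), set(m[1])))
        pairs.append((child, parent))
    counts = [0] * len(pairs)
    for _ in range(n_boot):
        cols = rng.choices(range(t), k=t)  # resample records with replacement
        replica = dissimilarity([[s[c] for c in cols] for s in series])
        for k, (child, parent) in enumerate(pairs):
            if between(replica, *child) >= between(replica, *parent):
                counts[k] += 1
    return {frozenset(c[0] + c[1]): counts[k] / n_boot for k, (c, _) in enumerate(pairs)}

def two_factor_benchmark(t=200, noise=0.3, seed=1):
    """Toy benchmark (hypothetical): two groups of three series, one factor per group."""
    rng = random.Random(seed)
    factors = [[rng.gauss(0, 1) for _ in range(t)] for _ in range(2)]
    return [[f + noise * rng.gauss(0, 1) for f in factors[g]] for g in (0, 0, 0, 1, 1, 1)]

p_values = clade_p_values(two_factor_benchmark())
threshold = 0.05 / len(p_values)  # Bonferroni correction over all tested clades
validated = {clade for clade, p in p_values.items() if p < threshold}
```

In this toy setting the two planted groups should pass the corrected threshold, because their internal dissimilarity stays far below that of the root in every replica. Clades nested inside a group carry no planted structure, so their p-values depend on sampling noise.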

Suggested Citation

  • Christian Bongiorno & Salvatore Miccichè & Rosario N Mantegna, 2022. "Statistically validated hierarchical clustering: Nested partitions in hierarchical trees," Post-Print hal-02157744, HAL.
  • Handle: RePEc:hal:journl:hal-02157744
    DOI: 10.1016/j.physa.2022.126933
    Note: View the original document on HAL open archive server: https://hal.science/hal-02157744

    Download full text from publisher

    File URL: https://hal.science/hal-02157744/document
    Download Restriction: no
