
ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time

Authors

  • Yunpeng Cai
  • Wei Zheng
  • Jin Yao
  • Yujie Yang
  • Volker Mai
  • Qi Mao
  • Yijun Sun

Abstract

The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, hierarchical clustering of extremely large sequence datasets is currently computationally expensive because of its quadratic time and space complexity. In this paper, we develop a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity while maintaining a clustering accuracy comparable to that of the standard method. The basic idea is to organize sequences into a pseudo-metric based partitioning tree for sub-linear time searching of nearest neighbors, and then to use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the Human Microbiome Project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence datasets. Our experiment demonstrated that, with the power of parallel computing, it is now computationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at http://www.acsu.buffalo.edu/~yijunsun/lab/ESPRIT-Forest.html.
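
The multiple-pair merging idea mentioned in the abstract can be illustrated with a small sketch. This is not the ESPRIT-Forest implementation: it assumes equal-length sequences, uses the fraction of mismatched positions as the distance, uses average linkage, and replaces the partitioning tree with a brute-force nearest-neighbor scan. The merging rule shown (merge every pair of clusters that are each other's nearest neighbor) is a common stand-in and may differ from the criterion in the paper. All function names are hypothetical.

```python
def mismatch_fraction(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage(c1, c2, seqs):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(mismatch_fraction(seqs[i], seqs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster(seqs, threshold):
    """Merge all mutual-nearest-neighbor cluster pairs per round.

    Assumption: brute-force neighbor search stands in for the tree-based
    search described in the abstract, so this sketch is NOT subquadratic.
    """
    clusters = [[i] for i in range(len(seqs))]
    while len(clusters) > 1:
        # Nearest neighbor of every cluster as (distance, index);
        # ties are broken in favor of the lower index.
        nearest = [
            min(
                (average_linkage(clusters[a], clusters[b], seqs), b)
                for b in range(len(clusters))
                if b != a
            )
            for a in range(len(clusters))
        ]
        # Keep pairs that are mutual nearest neighbors within the threshold.
        pairs = [
            (a, b)
            for a, (d, b) in enumerate(nearest)
            if a < b and nearest[b][1] == a and d <= threshold
        ]
        if not pairs:
            break
        # The pairs are disjoint, so each merge is independent of the others;
        # a threaded version could hand them to separate workers.
        merged = set()
        new_clusters = []
        for a, b in pairs:
            new_clusters.append(sorted(clusters[a] + clusters[b]))
            merged.update((a, b))
        new_clusters.extend(c for i, c in enumerate(clusters) if i not in merged)
        clusters = new_clusters
    return sorted(clusters)


example = ["AAAA", "AAAT", "GGGG", "GGGC", "CCCC"]
result = cluster(example, threshold=0.3)
```

In this toy input, the two close pairs are merged in the same round rather than one after the other, which is the property that makes the merging step amenable to multiple threads.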

Suggested Citation

  • Yunpeng Cai & Wei Zheng & Jin Yao & Yujie Yang & Volker Mai & Qi Mao & Yijun Sun, 2017. "ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time," PLOS Computational Biology, Public Library of Science, vol. 13(4), pages 1-16, April.
  • Handle: RePEc:plo:pcbi00:1005518
    DOI: 10.1371/journal.pcbi.1005518

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005518
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1005518&type=printable
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. He-Li Sun & Yuan Feng & Qinge Zhang & Jia-Xin Li & Yue-Ying Wang & Zhaohui Su & Teris Cheung & Todd Jackson & Sha Sha & Yu-Tao Xiang, 2022. "The Microbiome–Gut–Brain Axis and Dementia: A Bibliometric Analysis," IJERPH, MDPI, vol. 19(24), pages 1-14, December.
    2. C C Lyman & G R Holyoak & K Meinkoth & X Wieneke & K A Chillemi & U DeSilva, 2019. "Canine endometrial and vaginal microbiomes reveal distinct and complex ecosystems," PLOS ONE, Public Library of Science, vol. 14(1), pages 1-17, January.
    3. Julien Tap & Franck Lejzerowicz & Aurélie Cotillard & Matthieu Pichaud & Daniel McDonald & Se Jin Song & Rob Knight & Patrick Veiga & Muriel Derrien, 2023. "Global branches and local states of the human gut microbiome define associations with environmental and intrinsic factors," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    4. Bo-Young Hong & Michel V Furtado Araujo & Linda D Strausbaugh & Evimaria Terzi & Effie Ioannidou & Patricia I Diaz, 2015. "Microbiome Profiles in Periodontitis in Relation to Host and Disease Characteristics," PLOS ONE, Public Library of Science, vol. 10(5), pages 1-14, May.
    5. Rajita Menon & Vivek Ramanan & Kirill S Korolev, 2018. "Interactions between species introduce spurious associations in microbiome studies," PLOS Computational Biology, Public Library of Science, vol. 14(1), pages 1-20, January.
    6. Sean M Gibbons & Sean M Kearney & Chris S Smillie & Eric J Alm, 2017. "Two dynamic regimes in the human gut microbiome," PLOS Computational Biology, Public Library of Science, vol. 13(2), pages 1-20, February.
    7. Doris Vandeputte & Lindsey Commer & Raul Y. Tito & Gunter Kathagen & João Sabino & Séverine Vermeire & Karoline Faust & Jeroen Raes, 2021. "Temporal variability in quantitative human gut microbiome profiles and implications for clinical research," Nature Communications, Nature, vol. 12(1), pages 1-13, December.

