Authors
- John Sundh
- Emma Granqvist
- Ela Iwaszkiewicz-Eggebrecht
- Lokeshwaran Manoharan
- Laura J A van Dijk
- Robert Goodsell
- Nerivania N Godeiro
- Bruno C Bellini
- Johanna Orsholm
- Piotr Łukasik
- Andreia Miraldo
- Tomas Roslin
- Ayco J M Tack
- Anders F Andersson
- Fredrik Ronquist
Abstract
Deep metabarcoding offers an efficient and reproducible approach to biodiversity monitoring, but noisy data and incomplete reference databases challenge accurate diversity estimation and taxonomic annotation. Here, we introduce a novel algorithm, NEEAT, for removing spurious operational taxonomic units (OTUs) originating from nuclear-embedded mitochondrial DNA sequences (NUMTs) or sequencing errors. It integrates ‘echo’ signals across samples with the identification of unusual evolutionary patterns among similar DNA sequences. We also extensively benchmark current tools for chimera removal, taxonomic annotation and OTU clustering of deep metabarcoding data. The best performing tools/parameter settings are integrated into HAPP, a high-accuracy pipeline for processing deep metabarcoding data. Tests using CO1 data from BOLD and large-scale metabarcoding data on insects demonstrate that HAPP significantly outperforms existing methods, while enabling efficient analysis of extensive datasets by parallelizing computations across taxonomic groups.
Author summary: Charting and monitoring biodiversity is essential for understanding and protecting ecosystems, but it has been difficult to collect data cost-efficiently at scale. An approach that potentially solves this problem is metabarcoding, a method that can be applied to DNA from environmental samples to identify many species at once. Unfortunately, it may produce misleading results due to noise in the data. A particularly challenging problem when analysing data from mitochondrial DNA, such as the CO1 gene often used for analysing insect biodiversity, is the existence of nuclear-encoded copies of the gene that can severely inflate diversity estimates. We created an algorithm called NEEAT that helps remove such misleading signals by combining information from multiple samples and spotting unusual patterns of genetic change.
We also tested many existing tools for other steps of data processing, and combined NEEAT with the best tools in creating a new, high-accuracy analysis pipeline we call HAPP. Using both simulated and real-world insect data, we show that our approach is not only more accurate than current methods but also efficient at handling large datasets. Our work aims to make biodiversity studies more precise and scalable, supporting better conservation and environmental decision-making.
Suggested Citation
John Sundh & Emma Granqvist & Ela Iwaszkiewicz-Eggebrecht & Lokeshwaran Manoharan & Laura J A van Dijk & Robert Goodsell & Nerivania N Godeiro & Bruno C Bellini & Johanna Orsholm & Piotr Łukasik & And, 2025.
"HAPP: High-accuracy pipeline for processing deep metabarcoding data,"
PLOS Computational Biology, Public Library of Science, vol. 21(11), pages 1-23, November.
Handle:
RePEc:plo:pcbi00:1013558
DOI: 10.1371/journal.pcbi.1013558