Author
Listed:
- Abhinav Dutta
- David Pellow
- Ron Shamir
Abstract
Motivation: Sequencing long reads presents novel challenges to mapping. One such challenge is low sequence similarity between the reads and the reference, due to high sequencing error and mutation rates. This occurs, e.g., in a cancer tumor, or due to differences between strains of viruses or bacteria. A key idea in mapping algorithms is to sketch sequences with their minimizers. Recently, syncmers were introduced as an alternative sketching method that is more robust to mutations and sequencing errors. Results: We introduce parameterized syncmer schemes (PSS), a generalization of syncmers, and provide a theoretical analysis for multi-parameter schemes. By combining PSS with downsampling or minimizers we can achieve any desired compression and window guarantee. We implemented the use of PSS in the popular minimap2 and Winnowmap2 mappers. In tests on simulated and real long-read data from a variety of genomes, the PSS-based algorithms, with scheme parameters selected on the basis of our theoretical analysis, reduced unmapped reads by 20-60% at high compression while usually using less memory. The advantage was more pronounced at low sequence identity. At sequence identity of 75% and medium compression, PSS-minimap had only 37% as many unmapped reads, and 8% fewer of the reads that did map were incorrectly mapped. Even at lower compression and error rates, PSS-based mapping mapped more reads than the original minimizer-based mappers as well as mappers using the original syncmer schemes. We conclude that using PSS can improve mapping of long reads in a wide range of settings. Author summary: Popular long-read mappers use minimizers, the minimal hashed k-mers from overlapping windows, as alignment seeds. Recent work showed that syncmers, which select a fixed set of k-mers as seeds, are more likely to be conserved under errors or mutations than minimizers, making them potentially useful for mapping error-prone long reads. We introduce a framework for creating syncmers, that we call parameterized syncmer schemes, which generalize those introduced so far, and provide a theoretical analysis of their properties. We implemented parameterized syncmer schemes in the minimap2 and Winnowmap2 long-read mappers. Using parameters selected on the basis of our theoretical analysis we demonstrate improved mapping performance, with fewer unmapped and incorrectly mapped reads on a variety of simulated and real datasets. The improvements are consistent across a broad range of compression rates and sequence identities, with the most significant improvements for lower sequence identity (high error or mutation rates) and high compression.
Suggested Citation
Abhinav Dutta & David Pellow & Ron Shamir, 2022.
"Parameterized syncmer schemes improve long-read mapping,"
PLOS Computational Biology, Public Library of Science, vol. 18(10), pages 1-19, October.
Handle:
RePEc:plo:pcbi00:1010638
DOI: 10.1371/journal.pcbi.1010638
Download full text from publisher
Corrections
All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:plo:pcbi00:1010638. See general information about how to correct material in RePEc.
If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.
We have no bibliographic references for this item. You can help adding them by using this form .
If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.
For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: ploscompbiol (email available below). General contact details of provider: https://journals.plos.org/ploscompbiol/ .
Please note that corrections may take a couple of weeks to filter through
the various RePEc services.