Printed from https://ideas.repec.org/a/plo/pcbi00/1005777.html

Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing

Author

Listed:
  • Yaron Orenstein
  • David Pellow
  • Guillaume Marçais
  • Ron Shamir
  • Carl Kingsford

Abstract

With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers k and L > k, we say that a set of k-mers is a universal hitting set (UHS) if every possible L-long sequence must contain a k-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of k and L and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.

Author summary

High-throughput sequencing data has been accumulating at an extreme pace. The need to efficiently analyze and process it has become a critical challenge of the field. Many of the data structures and algorithms for this task rely on k-mer sets (DNA words of length k) to represent the sequences in a dataset. The runtime and memory usage of these highly depend on the size of the k-mer sets used. Thus, a minimum-size k-mer hitting set, namely, a set of k-mers that hit (have non-empty overlap with) all sequences, is desirable. In this work, we create universal k-mer hitting sets that hit any L-long sequence. We present several heuristic approaches for constructing such small sets; the approaches vary in the trade-off between the size of the produced set and runtime and memory usage. We show the benefit in practice of using the produced universal k-mer hitting sets compared to minimizers and randomly created hitting sets on the human genome.
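
To make the definition of a universal hitting set concrete, the following is a minimal sketch of a brute-force check that a given set of k-mers hits every L-long sequence. It is not the DOCKS heuristic and is not taken from the DOCKS software; the function names (kmers_of, first_unhit_sequence, is_universal_hitting_set) and the alphabet parameter are assumptions made for illustration. Because it enumerates all sequences of length L, it is only practical for very small k and L.

```python
from itertools import product


def kmers_of(seq, k):
    """Return the set of all k-long substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def first_unhit_sequence(kmer_set, k, L, alphabet="ACGT"):
    """Return an L-long sequence that contains no k-mer from kmer_set.

    Returns None when no such sequence exists, i.e. when kmer_set is a
    universal hitting set for (k, L). Hypothetical helper, brute force only.
    """
    if L <= k:
        raise ValueError("the definition assumes L > k")
    kmer_set = set(kmer_set)
    # Enumerate every possible L-long sequence over the alphabet.
    for letters in product(alphabet, repeat=L):
        seq = "".join(letters)
        if kmers_of(seq, k).isdisjoint(kmer_set):
            return seq  # witness: this sequence is not hit
    return None


def is_universal_hitting_set(kmer_set, k, L, alphabet="ACGT"):
    """True if every L-long sequence contains at least one k-mer from kmer_set."""
    return first_unhit_sequence(kmer_set, k, L, alphabet) is None


if __name__ == "__main__":
    # Toy two-letter alphabet so the example can be checked by hand.
    print(is_universal_hitting_set({"AA", "AB", "BB"}, 2, 3, alphabet="AB"))  # True
    print(first_unhit_sequence({"AA", "BB"}, 2, 3, alphabet="AB"))  # ABA
```

In the toy example, three of the four 2-mers over {A, B} are enough to hit every 3-long sequence, whereas {AA, BB} misses ABA. This mirrors, at a tiny scale, the goal of finding a UHS that is much smaller than the full set of k-mers.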

Suggested Citation

  • Yaron Orenstein & David Pellow & Guillaume Marçais & Ron Shamir & Carl Kingsford, 2017. "Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing," PLOS Computational Biology, Public Library of Science, vol. 13(10), pages 1-15, October.
  • Handle: RePEc:plo:pcbi00:1005777
    DOI: 10.1371/journal.pcbi.1005777

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005777
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1005777&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. V. Chvatal, 1979. "A Greedy Heuristic for the Set-Covering Problem," Mathematics of Operations Research, INFORMS, vol. 4(3), pages 233-235, August.

