IDEAS home Printed from https://ideas.repec.org/a/gam/jmathe/v9y2021i13p1515-d584261.html
   My bibliography  Save this article

Solving the Longest Common Subsequence Problem Concerning Non-Uniform Distributions of Letters in Input Strings

Author

Listed:
  • Bojan Nikolic

    (Faculty of Natural Science and Mathematics, University of Banja Luka, 78 000 Banja Luka, Bosnia and Herzegovina
    These authors contributed equally to this work.)

  • Aleksandar Kartelj

    (Faculty of Mathematics, University of Belgrade, 105104 Belgrade, Serbia
    These authors contributed equally to this work.)

  • Marko Djukanovic

    (Faculty of Natural Science and Mathematics, University of Banja Luka, 78 000 Banja Luka, Bosnia and Herzegovina
    Institute of Logic and Computation, Faculty of Informatics, TU Wien, 1040 Vienna, Austria
    These authors contributed equally to this work.)

  • Milana Grbic

    (Faculty of Natural Science and Mathematics, University of Banja Luka, 78 000 Banja Luka, Bosnia and Herzegovina
    These authors contributed equally to this work.)

  • Christian Blum

    (Artificial Intelligence Research Institute (IIIA-CSIC), Campus UAB, 08193 Bellaterra, Spain
    These authors contributed equally to this work.)

  • Günther Raidl

    (Institute of Logic and Computation, Faculty of Informatics, TU Wien, 1040 Vienna, Austria
    These authors contributed equally to this work.)

Abstract

The longest common subsequence (LCS) problem is a prominent NP –hard optimization problem where, given an arbitrary set of input strings, the aim is to find a longest subsequence, which is common to all input strings. This problem has a variety of applications in bioinformatics, molecular biology and file plagiarism checking, among others. All previous approaches from the literature are dedicated to solving LCS instances sampled from uniform or near-to-uniform probability distributions of letters in the input strings. In this paper, we introduce an approach that is able to effectively deal with more general cases, where the occurrence of letters in the input strings follows a non-uniform distribution such as a multinomial distribution. The proposed approach makes use of a time-restricted beam search, guided by a novel heuristic named Gmpsum . This heuristic combines two complementary scoring functions in the form of a convex combination. Furthermore, apart from the close-to-uniform benchmark sets from the related literature, we introduce three new benchmark sets that differ in terms of their statistical properties. One of these sets concerns a case study in the context of text analysis. We provide a comprehensive empirical evaluation in two distinctive settings: (1) short-time execution with fixed beam size in order to evaluate the guidance abilities of the compared search heuristics; and (2) long-time executions with fixed target duration times in order to obtain high-quality solutions. In both settings, the newly proposed approach performs comparably to state-of-the-art techniques in the context of close-to-uniform instances and outperforms state-of-the-art approaches for non-uniform instances.

Suggested Citation

  • Bojan Nikolic & Aleksandar Kartelj & Marko Djukanovic & Milana Grbic & Christian Blum & Günther Raidl, 2021. "Solving the Longest Common Subsequence Problem Concerning Non-Uniform Distributions of Letters in Input Strings," Mathematics, MDPI, vol. 9(13), pages 1-25, June.
  • Handle: RePEc:gam:jmathe:v:9:y:2021:i:13:p:1515-:d:584261
    as

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/9/13/1515/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/9/13/1515/
    Download Restriction: no
    ---><---

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:gam:jmathe:v:9:y:2021:i:13:p:1515-:d:584261. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    We have no bibliographic references for this item. You can help adding them by using this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: MDPI Indexing Manager (email available below). General contact details of provider: https://www.mdpi.com .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.