IDEAS home Printed from https://ideas.repec.org/a/scn/025686/16374355.html
   My bibliography  Save this article

Annotated suffix tree as a way of text representation for information retrieval in text collections

Author

Listed:
  • FROLOV D.S.

    (National Research University Higher School of Economics)

Abstract

A method for information retrieval based on annotated suffix trees (AST) is presented. The method is based on a string-to-document relevance score calculated using AST as well as fragment reverse indexing for improving performance. We developed a search engine based on the method. This engine is compared with some other popular text aggregating techniques: probabilistic latent semantic indexing (PLSA) and latent Dirichlet allocation (LDA). We used real data for computation experiments: an online store’s xml-catalogs and collections of web pages (both in Russian) and a real user’s queries from the Yandex. Wordstat service. As quality metrics, we used point quality estimations and graphical representations. Our AST-based method generally leads to results that are similar to those obtained by the other methods. However, in the case of inaccurate queries, AST-based results are superior. The speed of the AST-based method is slightly worse than the speed of the PLSA/LDA-based methods. Due to the observed correlation between the average query performing time and the string lengths at the AST construction phase, one can improve the performance of the algorithm by dividing the texts into smaller fragments at the preprocessing stage. However, the quality of search may suffer if the fragments are too short. Therefore, the applicability of annotated suffix tree techniques for text retrieval problems is demonstrated. Moreover, the AST-based method has significant advantages in the case of fuzzy search.

Suggested Citation

  • Frolov D.S., 2015. "Annotated suffix tree as a way of text representation for information retrieval in text collections," Бизнес-информатика, CyberLeninka;Федеральное государственное автономное образовательное учреждение высшего образования «Национальный исследовательский университет «Высшая школа экономики», issue 4 (34), pages 63-70.
  • Handle: RePEc:scn:025686:16374355
    as

    Download full text from publisher

    File URL: http://cyberleninka.ru/article/n/annotated-suffix-tree-as-a-way-of-text-representation-for-information-retrieval-in-text-collections
    Download Restriction: no
    ---><---

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:scn:025686:16374355. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    We have no bibliographic references for this item. You can help adding them by using this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: CyberLeninka (email available below). General contact details of provider: http://cyberleninka.ru/ .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.