
Topic modeling for mediated access to very large document collections

Author

Listed:
  • Gheorghe Muresan
  • David J. Harper

Abstract

Clear and precise queries are a necessity when searching very large document collections, especially when query‐based retrieval is the only means of exploration. We propose system‐mediated information access as a solution for users' well‐documented inability to formulate good queries. Our approach is based on two main assumptions: first, on the ability of document clustering to reveal the topical, semantic structure of a problem domain represented by a specialized “source collection,” and, second, on the capacity of statistical language models to convey content. Taking the role of the human mediator or intermediary searcher, a mediation system interacts with the user and supports her exploration of a relatively small source collection, chosen to be representative of the problem domain. Based on the user's selection of relevant “exemplary” documents and clusters from this source collection, the system builds a language model of her information need. This model is subsequently used to derive “mediated queries,” which are expected to convey the user's information need precisely and comprehensively, and which the user can submit to search any large and heterogeneous “target collections.” We present results of experiments that simulated various mediation strategies and compared the effect of a variety of parameters, such as the similarity measure, the weighting scheme, and the clustering method, on mediation effectiveness. The results provide both upper bounds on the performance that real end users can potentially reach and a comparison of the effectiveness of these strategies. The experimental evidence suggests that information retrieval mediated through a clustered specialized collection has the potential to improve effectiveness significantly.
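
The abstract does not say how the language model or the mediated queries are computed, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a maximum-likelihood unigram model for both the exemplary documents and the source collection, and it scores each term by its contribution to the KL divergence between the two models. The function names (tokenize, unigram_model, mediated_query), the parameter k, and the toy documents are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def unigram_model(docs):
    """Maximum-likelihood unigram model: term -> relative frequency."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def mediated_query(exemplars, source_collection, k=5):
    """Return the k terms that best separate the exemplars from the collection.

    Assumption (not from the paper): a term's score is its contribution to
    KL(topic || collection), i.e. p_topic * log(p_topic / p_collection).
    """
    topic = unigram_model(exemplars)
    background = unigram_model(source_collection)
    scores = {}
    for term, p_topic in topic.items():
        p_bg = background.get(term)
        if p_bg is None:
            # Exemplars are assumed to come from the source collection;
            # terms unseen there are skipped rather than smoothed.
            continue
        scores[term] = p_topic * math.log(p_topic / p_bg)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Toy source collection: two documents on clustering, two on cooking.
source = [
    "document clustering reveals topical structure",
    "clustering groups similar documents by topic",
    "bake the bread in a hot oven",
    "slice the bread and serve with soup",
]
# The user marks the first two documents as exemplary.
print(" ".join(mediated_query(source[:2], source, k=3)))
```

In this sketch the returned terms would be submitted as a query to a separate target collection. A real system would probably also need smoothing, stop-word handling, and the similarity and weighting choices that the experiments compare.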

Suggested Citation

  • Gheorghe Muresan & David J. Harper, 2004. "Topic modeling for mediated access to very large document collections," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 55(10), pages 892-910, August.
  • Handle: RePEc:bla:jamist:v:55:y:2004:i:10:p:892-910
    DOI: 10.1002/asi.20034

    Download full text from publisher

    File URL: https://doi.org/10.1002/asi.20034
    Download Restriction: no

