
Information Retrieval from Unstructured Web Text Document Based on Automatic Learning of the Threshold

Authors
  • Fethi Fkih

    (MARS Research Unit, Faculty of Sciences of Monastir, University of Monastir, Monastir, Tunisia)

  • Mohamed Nazih Omri

    (MARS Research Unit, Faculty of Sciences of Monastir, University of Monastir, Monastir, Tunisia)

Abstract

A collocation is a sequence of lexical tokens that habitually co-occur. This type of information is widely used in applications such as information retrieval, document indexing, machine translation, and lexicography. Many techniques have therefore been developed for automatically retrieving collocations from textual documents. These techniques use statistical measures based on joint frequency counts to quantify the strength of association between the tokens of a candidate collocation. Relevant and irrelevant collocations are then separated using a threshold fixed a priori. This discrimination threshold is generally estimated manually by a domain expert, and this supervised estimation is an additional cost that reduces system performance. In this paper, the authors propose a new technique for automatically learning the threshold in order to retrieve information from web text documents. The technique is based mainly on the usual performance evaluation measures (such as ROC and Precision-Recall curves). The results show that a statistical threshold can be estimated automatically, independently of the corpus being processed.
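The general idea in the abstract (score candidate pairs with a joint-frequency measure, then choose the cut-off from evaluation measures rather than by hand) can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract names neither the association measure nor the selection criterion, so pointwise mutual information (PMI) and F1 maximisation along the precision-recall trade-off are assumptions made here for illustration. The function names `pmi_scores` and `learn_threshold` and the labelled set `relevant` are likewise hypothetical.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Score each adjacent token pair with pointwise mutual information.

    PMI is an assumed stand-in for the joint-frequency measure; the
    abstract does not say which measure the paper uses.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), joint in bigrams.items():
        p_joint = joint / n_bi
        p1 = unigrams[w1] / n_uni
        p2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_joint / (p1 * p2))
    return scores


def learn_threshold(scores, relevant):
    """Return the (threshold, F1) pair that maximises F1.

    `relevant` is a hypothetical set of pairs labelled as true
    collocations. Maximising F1 is an assumed criterion, not one
    stated in the abstract.
    """
    best_t, best_f1 = None, -1.0
    # Every distinct score is a candidate cut-off.
    for t in sorted(set(scores.values())):
        predicted = {pair for pair, s in scores.items() if s >= t}
        tp = len(predicted & relevant)
        if tp == 0:
            continue  # F1 is zero or undefined; skip this cut-off
        precision = tp / len(predicted)
        recall = tp / len(relevant)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

A pair is kept as a collocation when its score is at least the learned threshold. Replacing the F1 criterion with one read off a ROC curve (for example, the point maximising true-positive rate minus false-positive rate) would only change the body of the loop in `learn_threshold`.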

Suggested Citation

  • Fethi Fkih & Mohamed Nazih Omri, 2012. "Information Retrieval from Unstructured Web Text Document Based on Automatic Learning of the Threshold," International Journal of Information Retrieval Research (IJIRR), IGI Global, vol. 2(4), pages 12-30, October.
  • Handle: RePEc:igg:jirr00:v:2:y:2012:i:4:p:12-30

    Download full text from publisher

    File URL: http://services.igi-global.com/resolvedoi/resolve.aspx?doi=10.4018/ijirr.2012100102
    Download Restriction: no