
Automatic crosslingual thesaurus generated from the Hong Kong SAR Police Department Web corpus for crime analysis

Author

Listed:
  • Kar Wing Li
  • Christopher C. Yang

Abstract

For the sake of national security, very large volumes of data and information are generated and gathered daily. Much of this data and information is written in different languages, stored in different locations, and may be seemingly unconnected. Crosslingual semantic interoperability is a major challenge in generating an overview of this disparate data and information so that it can be analyzed, shared, searched, and summarized. The recent terrorist attacks and the tragic events of September 11, 2001 have prompted increased attention to national security and criminal analysis. Many Asian countries and cities, such as Japan, Taiwan, and Singapore, have been advised that they may become the next targets of terrorist attacks. Semantic interoperability has been a focus in digital library research. Traditional information retrieval (IR) approaches normally require a document to share some common keywords with the query. Generating associations between related terms in the two term spaces of users and documents is therefore an important issue, and the problem can be viewed as the creation of a thesaurus. In addition, terrorists and criminals may communicate through letters, e‐mails, and faxes in languages other than English. Translation ambiguity significantly exacerbates the retrieval problem, which thus expands to one of crosslingual semantic interoperability. In this paper, we focus on the English/Chinese crosslingual semantic interoperability problem. However, the developed techniques are not limited to English and Chinese and can be applied to many other languages. English and Chinese are popular languages in the Asian region, and much information about national security or crime is communicated in them. An efficient, automatically generated thesaurus linking these languages is important for crosslingual information retrieval between English and Chinese.
To facilitate crosslingual information retrieval, a corpus‐based approach uses term co‐occurrence statistics in parallel or comparable corpora to construct a statistical translation model that crosses the language boundary. In this paper, a text‐based approach to aligning English/Chinese Hong Kong Police press release documents from the Web is first presented. We also introduce an algorithmic approach to generating a robust knowledge base from statistical correlation analysis of the semantics (knowledge) embedded in the bilingual press release corpus. The research output consisted of a thesaurus‐like, semantic network knowledge base, which can aid in semantics‐based crosslingual information management and retrieval.
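
As a rough illustration of the corpus-based idea described in the abstract, the sketch below scores English/Chinese term pairs by how often they co-occur across aligned document pairs. It is not the authors' algorithm: the abstract does not specify the association measure, so the Dice coefficient is used here as an assumption, and the function names (`dice_associations`, `top_translations`) and the toy corpus are hypothetical. Term extraction and Chinese word segmentation are assumed to have been done already.

```python
from collections import Counter
from itertools import product


def dice_associations(aligned_pairs):
    """Score English/Chinese term pairs by document-level co-occurrence.

    aligned_pairs: iterable of (english_terms, chinese_terms), one tuple per
    aligned document pair, each side an iterable of pre-extracted terms.
    Returns a dict {(en_term, zh_term): dice_score}.
    """
    en_df = Counter()     # aligned pairs whose English side contains the term
    zh_df = Counter()     # aligned pairs whose Chinese side contains the term
    joint_df = Counter()  # aligned pairs containing both terms
    for en_terms, zh_terms in aligned_pairs:
        en_set, zh_set = set(en_terms), set(zh_terms)
        en_df.update(en_set)
        zh_df.update(zh_set)
        joint_df.update(product(en_set, zh_set))
    # Dice coefficient (assumed measure): 2 * |A and B| / (|A| + |B|)
    return {
        (en, zh): 2 * n / (en_df[en] + zh_df[zh])
        for (en, zh), n in joint_df.items()
    }


def top_translations(scores, en_term, k=3):
    """Return the k Chinese terms most strongly associated with en_term."""
    ranked = sorted(
        ((zh, s) for (en, zh), s in scores.items() if en == en_term),
        key=lambda item: (-item[1], item[0]),
    )
    return ranked[:k]


# Toy aligned corpus, invented purely for illustration.
toy_corpus = [
    (["police", "robbery"], ["警方", "劫案"]),
    (["police", "arrest"], ["警方", "拘捕"]),
    (["robbery", "arrest"], ["劫案", "拘捕"]),
]
scores = dice_associations(toy_corpus)
print(top_translations(scores, "police", k=1))
```

In this toy corpus "police" and "警方" appear in exactly the same aligned pairs, so they receive the highest score. A thesaurus-like network could then be formed by keeping only pairs above a chosen threshold, which is again an assumption rather than something stated in the abstract.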

Suggested Citation

  • Kar Wing Li & Christopher C. Yang, 2005. "Automatic crosslingual thesaurus generated from the Hong Kong SAR Police Department Web corpus for crime analysis," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 56(3), pages 272-282, February.
  • Handle: RePEc:bla:jamist:v:56:y:2005:i:3:p:272-282
    DOI: 10.1002/asi.20118

    Download full text from publisher

    File URL: https://doi.org/10.1002/asi.20118
    Download Restriction: no

