
CLEU ‐ A Cross‐language English‐Urdu corpus and benchmark for text reuse experiments

Authors

  • Iqra Muneer
  • Muhammad Sharjeel
  • Muntaha Iqbal
  • Rao Muhammad Adeel Nawab
  • Paul Rayson

Abstract

Text reuse is becoming a serious issue in many fields, and research shows that it is much harder to detect when it occurs across languages. The recent rise in multi‐lingual content on the Web has increased cross‐language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large‐scale gold standard evaluation resources built on real cases. To overcome this problem, we propose a cross‐language sentence/passage level text reuse corpus for the English‐Urdu language pair. In the Cross‐Language English‐Urdu Corpus (CLEU), the source text is in English and the derived text is in Urdu. It contains a total of 3,235 sentence/passage pairs manually tagged into three categories: near copy, paraphrased copy, and independently written. As a second contribution, we evaluate the Translation plus Mono‐lingual Analysis method in three sets of experiments on the proposed dataset to highlight its usefulness. Evaluation results (F1 = 0.732 for binary classification, F1 = 0.552 for ternary classification) indicate that real cases of cross‐language text reuse are harder to detect, especially when the two languages have unrelated scripts. The corpus is a useful benchmark resource for the future development and assessment of cross‐language text reuse detection systems for the English‐Urdu language pair.
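The Translation plus Mono‐lingual Analysis idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the translation direction (Urdu to English), the word n‐gram containment measure, the thresholds 0.8 and 0.4, the function names, and the merging of the two reuse categories into one "reused" label for the binary task are all assumptions made for illustration. The `translate` argument is a hypothetical stand‐in for any machine translation step.

```python
from typing import Callable, Set, Tuple


def word_ngrams(text: str, n: int = 1) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(source: str, derived: str, n: int = 1) -> float:
    """Share of the derived text's n-grams that also occur in the source."""
    src, der = word_ngrams(source, n), word_ngrams(derived, n)
    return len(src & der) / len(der) if der else 0.0


def classify_pair(
    source_en: str,
    derived_ur: str,
    translate: Callable[[str], str],
    near: float = 0.8,   # assumed threshold, not taken from the paper
    para: float = 0.4,   # assumed threshold, not taken from the paper
) -> str:
    """Translate the derived text, then compare it with the source monolingually."""
    translated = translate(derived_ur)           # step 1: translation
    score = containment(source_en, translated)   # step 2: mono-lingual analysis
    if score >= near:
        return "near copy"
    if score >= para:
        return "paraphrased copy"
    return "independently written"


def to_binary(label: str) -> str:
    """Collapse the three categories into two (assumed mapping)."""
    return label if label == "independently written" else "reused"


# Toy data: the keys are placeholders standing in for Urdu passages, and the
# values play the role of their English machine translations.
TOY_TRANSLATIONS = {
    "UR_A": "the minister opened the new school today",
    "UR_B": "the minister inaugurated a school building today",
    "UR_C": "rain is expected across the region tomorrow",
}


def toy_translate(text: str) -> str:
    return TOY_TRANSLATIONS[text]


SOURCE = "the minister opened the new school today"

for key in TOY_TRANSLATIONS:
    label = classify_pair(SOURCE, key, toy_translate)
    print(key, "->", label, "/", to_binary(label))
```

In practice the thresholds would be tuned on labelled pairs, and a real translation system would replace the lookup table; the sketch only shows how a cross‐language comparison reduces to a monolingual one after translation.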

Suggested Citation

  • Iqra Muneer & Muhammad Sharjeel & Muntaha Iqbal & Rao Muhammad Adeel Nawab & Paul Rayson, 2019. "CLEU ‐ A Cross‐language English‐Urdu corpus and benchmark for text reuse experiments," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 70(7), pages 729-741, July.
  • Handle: RePEc:bla:jinfst:v:70:y:2019:i:7:p:729-741
    DOI: 10.1002/asi.24074

    Download full text from publisher

    File URL: https://doi.org/10.1002/asi.24074
    Download Restriction: no

