IDEAS home Printed from https://ideas.repec.org/a/plo/pone00/0164383.html
   My bibliography  Save this article

Analysis of Web Spam for Non-English Content: Toward More Effective Language-Based Classifiers

Author

Listed:
  • Mansour Alsaleh
  • Abdulrahman Alarifi

Abstract

Web spammers aim to obtain higher ranks for their web pages by including spam contents that deceive search engines in order to include their pages in search results even when they are not related to the search terms. Search engines continue to develop new web spam detection mechanisms, but spammers also aim to improve their tools to evade detection. In this study, we first explore the effect of the page language on spam detection features and we demonstrate how the best set of detection features varies according to the page language. We also study the performance of Google Penguin, a newly developed anti-web spamming technique for their search engine. Using spam pages in Arabic as a case study, we show that unlike similar English pages, Google anti-spamming techniques are ineffective against a high proportion of Arabic spam pages. We then explore multiple detection features for spam pages to identify an appropriate set of features that yields a high detection accuracy compared with the integrated Google Penguin technique. In order to build and evaluate our classifier, as well as to help researchers to conduct consistent measurement studies, we collected and manually labeled a corpus of Arabic web pages, including both benign and spam pages. Furthermore, we developed a browser plug-in that utilizes our classifier to warn users about spam pages after clicking on a URL and by filtering out search engine results. Using Google Penguin as a benchmark, we provide an illustrative example to show that language-based web spam classifiers are more effective for capturing spam contents.

Suggested Citation

  • Mansour Alsaleh & Abdulrahman Alarifi, 2016. "Analysis of Web Spam for Non-English Content: Toward More Effective Language-Based Classifiers," PLOS ONE, Public Library of Science, vol. 11(11), pages 1-25, November.
  • Handle: RePEc:plo:pone00:0164383
    DOI: 10.1371/journal.pone.0164383
    as

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0164383
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0164383&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pone.0164383?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item
    ---><---

    More about this item

    Statistics

    Access and download statistics

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:plo:pone00:0164383. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    We have no bibliographic references for this item. You can help adding them by using this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: plosone (email available below). General contact details of provider: https://journals.plos.org/plosone/ .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.