Printed from https://ideas.repec.org/a/bit/bsrysr/v10y2019i1p74-87n6.html

Albanian Text Classification: Bag of Words Model and Word Analogies

Authors

Listed:
  • Kadriu Arbana

    (SEE University, Tetovo, Macedonia)

  • Abazi Lejla

    (SEE University, Tetovo, Macedonia)

  • Abazi Hyrije

    (SEE University, Tetovo, Macedonia)

Abstract

Background: Text classification is an important task in information retrieval. Its objective is to assign new text documents to a set of predefined classes using supervised algorithms.

Objectives: We focus on text classification of Albanian news articles using two approaches.

Methods/Approach: In the first approach, the words in a collection are treated as independent components, and each is assigned a corresponding vector in the vector space. Here we used nine classifiers from the scikit-learn package, training them on part of the news articles (80%) and testing accuracy on the remaining articles. In the second approach, words are treated according to their semantic and syntactic similarities, on the assumption that a word is formed by character n-grams. For this we used fastText, a hierarchical classifier that considers local word order as well as sub-word information. We measured the accuracy of each classifier separately and also analyzed training and testing time.

Results: Our results show that the bag of words model outperforms fastText when classification is tested on a dataset that is not large. FastText performs better when classifying multi-label text.

Conclusions: News articles can be used to create a benchmark for testing classification algorithms for Albanian texts. The best results are achieved with a bag of words model, with an accuracy of 94%.
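The two text representations described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it uses only the Python standard library rather than scikit-learn or fastText, and the sample documents, the function names, the n-gram length of 3, and the '<' and '>' word-boundary markers are all assumptions chosen for illustration. It shows a count-based bag of words vector, the character n-grams of a single word, and an 80/20 split of a collection.

```python
from collections import Counter

# Hypothetical toy documents (not taken from the paper's dataset).
docs = ["sport lajme sport", "politika lajme"]


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(document, vocabulary):
    """Return a count vector; word order is discarded, only frequencies remain."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # tokens outside the vocabulary are ignored
            vector[vocabulary[tok]] = n
    return vector


def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in assumed boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def split_80_20(items):
    """Split a list into the first 80% (training) and the last 20% (testing)."""
    cut = len(items) * 4 // 5
    return items[:cut], items[cut:]


vocab = build_vocabulary(docs)
first_vector = bag_of_words(docs[0], vocab)
ngrams = char_ngrams("lajme")
train, test = split_80_20(list(range(10)))
```

In the bag of words vector each word is an independent dimension, so two words that share a stem have nothing in common. With character n-grams, words that share substrings also share features, which is the sub-word information the abstract refers to.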

Suggested Citation

  • Kadriu Arbana & Abazi Lejla & Abazi Hyrije, 2019. "Albanian Text Classification: Bag of Words Model and Word Analogies," Business Systems Research, Sciendo, vol. 10(1), pages 74-87, April.
  • Handle: RePEc:bit:bsrysr:v:10:y:2019:i:1:p:74-87:n:6

    Download full text from publisher

    File URL: https://www.degruyter.com/view/j/bsrj.2019.10.issue-1/bsrj-2019-0006/bsrj-2019-0006.xml?format=INT
    Download Restriction: no
