
Arabic Document Classification: Performance Investigation of Preprocessing and Representation Techniques

Authors
  • Abdullah Y. Muaad
  • Hanumanthappa Jayappa Davanagere
  • D.S. Guru
  • J.V. Bibal Benifa
  • Channabasava Chola
  • Hussain AlSalman
  • Abdu H. Gumaei
  • Mugahed A. Al-antari
  • Dost Muhammad Khan

Abstract

With the growing volume of online social posts, review comments, and digital documents, Arabic text classification (ATC) has become essential for many natural language processing (NLP) applications, especially during the coronavirus pandemic. Variation in the meaning of the same Arabic word can directly affect the performance of any AI-based framework. This work investigates how preprocessing and representation techniques affect the effectiveness of machine learning (ML) algorithms, measured across several classification techniques. ATC is influenced by several factors: stemming during preprocessing, the feature extraction and selection method, the nature of the dataset, and the classification algorithm. To improve overall classification performance, preprocessing is mainly used to reduce each Arabic word to its root and to lower the dimensionality of the representation. Feature extraction and selection play a crucial role in representing Arabic text meaningfully and in improving classification accuracy. The selected classifiers are run with various feature selection algorithms. Results are compared across Multinomial Naive Bayes (MNB), Bernoulli Naive Bayes (BNB), Stochastic Gradient Descent (SGD), Support Vector Classifier (SVC), Logistic Regression (LR), and Linear SVC. All classifiers are evaluated on five balanced and unbalanced benchmark datasets: BBC Arabic corpus, CNN Arabic corpus, Open-Source Arabic corpus (OSAc), ArCovidVac, and AlKhaleej. The results show that classification performance depends strongly on the preprocessing technique, the representation method, the classification technique, and the nature of the dataset. On the benchmark datasets considered, Linear SVC outperformed the other classifiers overall when prominent features were selected.
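
As a rough illustration of the kind of pipeline the abstract describes (preprocessing that shortens words, a bag-of-words representation, then a classifier), the following is a minimal sketch in plain Python. It is not the authors' implementation: the normalization rules, the `TinyMultinomialNB` class, and the four toy training sentences are all assumptions made for illustration, and the abstract does not say which stemmer or feature selection method the paper uses.

```python
import math
import re
from collections import Counter, defaultdict

# Arabic diacritics (U+064B..U+0652) and tatweel (U+0640).
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Alef with madda / hamza above / hamza below, unified to bare alef.
_ALEF_FORMS = re.compile("[\u0622\u0623\u0625]")


def normalize(token):
    """Hypothetical light normalization, far simpler than a real root stemmer."""
    token = _DIACRITICS.sub("", token)
    token = _ALEF_FORMS.sub("\u0627", token)
    # Drop a leading definite article (alef + lam) if enough letters remain.
    if token.startswith("\u0627\u0644") and len(token) > 3:
        token = token[2:]
    return token


def tokenize(text):
    """Whitespace tokenization followed by normalization."""
    return [normalize(t) for t in text.split()]


class TinyMultinomialNB:
    """Toy multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(tokenize(doc))
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / n_docs)  # log prior
            for term in tokenize(doc):
                if term in self.vocab:  # ignore unseen terms
                    score += math.log((counts[term] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy corpus, not drawn from any of the benchmark datasets.
train_docs = [
    "الفريق فاز في المباراة",
    "اللاعب سجل هدفا في المباراة",
    "اللقاح يحمي من الفيروس",
    "الفيروس ينتشر بسرعة",
]
train_labels = ["sports", "sports", "health", "health"]

clf = TinyMultinomialNB().fit(train_docs, train_labels)
print(clf.predict("فاز الفريق"))
print(clf.predict("لقاح الفيروس"))
```

Stripping the article and unifying alef forms maps several surface forms to one vocabulary entry, which is a small-scale version of the dimensionality reduction the abstract attributes to preprocessing. A realistic experiment would replace these pieces with a proper Arabic stemmer, a weighted representation, and a library classifier.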

Suggested Citation

  • Abdullah Y. Muaad & Hanumanthappa Jayappa Davanagere & D.S. Guru & J.V. Bibal Benifa & Channabasava Chola & Hussain AlSalman & Abdu H. Gumaei & Mugahed A. Al-antari & Dost Muhammad Khan, 2022. "Arabic Document Classification: Performance Investigation of Preprocessing and Representation Techniques," Mathematical Problems in Engineering, Hindawi, vol. 2022, pages 1-16, April.
  • Handle: RePEc:hin:jnlmpe:3720358
    DOI: 10.1155/2022/3720358

    Download full text from publisher

    File URL: http://downloads.hindawi.com/journals/mpe/2022/3720358.pdf
    Download Restriction: no

    File URL: http://downloads.hindawi.com/journals/mpe/2022/3720358.xml
    Download Restriction: no

