
LDA filter: A Latent Dirichlet Allocation preprocess method for Weka

Author

Listed:
  • P Celard
  • A Seara Vieira
  • E L Iglesias
  • L Borrajo

Abstract

This work presents an alternative method for representing documents based on LDA (Latent Dirichlet Allocation) and examines how it affects classification algorithms in comparison with common text representations. LDA assumes that each document deals with a set of predefined topics, each of which is a distribution over the entire vocabulary. Our main objective is to use the probability of a document belonging to each topic to implement a new text representation model. The proposed technique is deployed as a new filter that extends the Weka software. To demonstrate its performance, the filter is tested with different classifiers, such as a Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Naive Bayes, on different document corpora (OHSUMED, Reuters-21578, 20Newsgroup, Yahoo! Answers, YELP Polarity, and TREC Genomics 2015). It is then compared with the Bag of Words (BoW) representation technique. Results suggest that applying the proposed filter achieves accuracy similar to BoW while greatly improving classification processing times.
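
The representation described in the abstract can be illustrated with a small sketch. It is not the authors' Weka filter: it is a minimal pure-Python collapsed Gibbs sampler for LDA, written under assumptions of our own. The function names (`bow_vector`, `lda_topic_vectors`), the hyperparameters `alpha` and `beta`, the iteration count, and the toy corpus are all hypothetical. It shows only the shape of the two representations: BoW yields one value per vocabulary term, whereas the LDA-based representation yields one probability per topic.

```python
import random
from collections import Counter


def bow_vector(tokens, vocab):
    """Bag of Words: one count per vocabulary term."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def lda_topic_vectors(docs, num_topics, iterations=100,
                      alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative, assumed settings).

    Returns the vocabulary and, per document, a vector holding the
    estimated probability of that document belonging to each topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # counts per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # counts per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed document-topic proportions; each vector sums to 1.
    vectors = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        vectors.append([(doc_topic[d][t] + alpha) / denom
                        for t in range(num_topics)])
    return vocab, vectors


# Toy corpus (made up for illustration), already tokenized.
docs = [
    "gene protein cell gene dna".split(),
    "protein dna cell sequence".split(),
    "market stock price trade".split(),
    "stock trade market bank price".split(),
]
vocab, topic_vectors = lda_topic_vectors(docs, num_topics=2)
bow_vectors = [bow_vector(doc, vocab) for doc in docs]
print(len(bow_vectors[0]), "BoW features vs", len(topic_vectors[0]), "topic features")
```

In this sketch the classifier input shrinks from the vocabulary size to the number of topics. A shorter feature vector of this kind is one plausible reason for the faster classification reported in the abstract, although the abstract itself does not state the cause.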

Suggested Citation

  • P Celard & A Seara Vieira & E L Iglesias & L Borrajo, 2020. "LDA filter: A Latent Dirichlet Allocation preprocess method for Weka," PLOS ONE, Public Library of Science, vol. 15(11), pages 1-14, November.
  • Handle: RePEc:plo:pone00:0241701
    DOI: 10.1371/journal.pone.0241701

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0241701
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0241701&type=printable
    Download Restriction: no


    Citations

    Cited by:

    1. Chih-Chou Chiu & Chung-Min Wu & Te-Nien Chien & Ling-Jing Kao & Chengcheng Li & Chuan-Mei Chu, 2023. "Integrating Structured and Unstructured EHR Data for Predicting Mortality by Machine Learning and Latent Dirichlet Allocation Method," IJERPH, MDPI, vol. 20(5), pages 1-22, February.
    2. Chenyu Zhang & Jiayue Jiang & Hong Jin & Tinggui Chen, 2021. "The Impact of COVID-19 on Consumers’ Psychological Behavior Based on Data Mining for Online User Comments in the Catering Industry in China," IJERPH, MDPI, vol. 18(8), pages 1-19, April.
