A New Variables Selection and Dimensionality Reduction Technique Coupled with SIMCA Method for the Classification of Text Documents

Author

Listed:
  • Ahmed Abdelfattah Saleh

    (University of Brasilia, Brazil)

  • Li Weigang

    (University of Brasilia, Brazil)

Abstract

Classification of text documents is of significant importance in data mining and machine learning. However, representing documents as vectors for classification yields highly sparse data with an immense number of variables. This calls for an efficient variable selection and dimensionality reduction technique that preserves the model’s selectivity, accuracy and robustness with fewer variables. This paper introduces a new coefficient, the Variables Strength Coefficient (VSC), which permits retaining variables with strong modeling and discriminatory powers. A variable whose VSC exceeds a predefined threshold is considered strong both in modeling the data and in discriminating between classes and is retained, while weaker variables are discarded. This straightforward technique maximizes the differences between classes while preserving the modeling power of the variables. The paper also proposes applying a classification technique widely used in chemical analysis: the supervised learning algorithm SIMCA. The soft and independent nature of SIMCA allows multi-labeling of text documents as well as the later inclusion of new classes without affecting the model already created. VSC-SIMCA was applied to the ‘CNAE-9’ data set, and the results were compared with classification and dimensionality reduction work on the same data set in the literature. VSC-SIMCA outperforms the other techniques both in the amount of dimensionality reduction and in classification performance. The improved classification precision, achieved with substantially fewer variables, demonstrates the contribution of the proposed approach.
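
The two ideas in the abstract, threshold-based variable selection and soft, independent class modeling, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the VSC formula, so `select_by_threshold` simply filters on scores supplied by the caller. `ClassModel`, `make_class`, `labels`, the synthetic data and the fixed residual multiplier `tol` are all assumptions made for illustration; classical SIMCA usually applies an F-test to residual variances rather than a fixed multiplier.

```python
import numpy as np

def select_by_threshold(scores, threshold):
    """Keep the indices of variables whose score exceeds the threshold.

    The scores stand in for VSC values; how VSC is computed is not
    stated in the abstract, so it is not reproduced here.
    """
    return np.flatnonzero(np.asarray(scores) > threshold)

class ClassModel:
    """Independent PCA model for one class, in the spirit of SIMCA."""

    def __init__(self, X, n_components, tol=3.0):
        self.mean = X.mean(axis=0)
        centered = X - self.mean
        # Rows of vt are the principal directions of this class alone.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        self.loadings = vt[:n_components].T
        resid = centered - centered @ self.loadings @ self.loadings.T
        # Root-mean-square residual of the training samples.
        self.s0 = np.sqrt((resid ** 2).sum(axis=1).mean())
        self.tol = tol  # assumed fixed multiplier, not an F-test

    def accepts(self, x):
        """True if the sample's residual distance is within tolerance."""
        xc = np.asarray(x, dtype=float) - self.mean
        resid = xc - self.loadings @ (self.loadings.T @ xc)
        return bool(np.linalg.norm(resid) <= self.tol * self.s0 + 1e-12)

def make_class(rng, axis, n=40, dim=4):
    """Synthetic class: small noise everywhere, large spread on one axis."""
    X = rng.normal(0.0, 0.1, (n, dim))
    X[:, axis] += rng.normal(0.0, 5.0, n)
    return X

rng = np.random.default_rng(0)
models = {
    "a": ClassModel(make_class(rng, 0), n_components=1),
    "b": ClassModel(make_class(rng, 1), n_components=1),
}

def labels(x):
    """Soft assignment: a sample may match several classes or none."""
    return [name for name, m in models.items() if m.accepts(x)]

kept = select_by_threshold([0.1, 0.9, 0.5, 0.7], threshold=0.5)
only_a = labels([4.0, 0.0, 0.0, 0.0])       # lies along class "a" only
both = labels([0.0, 0.0, 0.0, 0.0])         # near both class subspaces
none_before = labels([0.0, 0.0, 9.0, 0.0])  # fits no existing class

# A new class is added later; existing models are not refitted.
models["c"] = ClassModel(make_class(rng, 2), n_components=1)
c_after = labels([0.0, 0.0, 9.0, 0.0])
```

Because each class has its own model and its own acceptance test, a document can receive several labels or none, and adding class "c" leaves the models for "a" and "b" untouched.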

Download full text from publisher

File URL: http://www.toknowpress.net/ISBN/978-961-6914-13-0/papers/ML15-119.pdf
File Function: full text
Download Restriction: no

File URL: http://www.toknowpress.net/ISBN/978-961-6914-13-0/MakeLearn2015.pdf
File Function: Conference Programme
Download Restriction: no

Corrections

All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:tkp:mklp15:583-591. See general information about how to correct material in RePEc.

For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Maks Jezovnik. General contact details of provider: http://www.toknowpress.net/proceedings/978-961-6914-13-0/ .

Please note that corrections may take a couple of weeks to filter through the various RePEc services.

IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.