
Assessing the Performance of Compression Based Clustering for Text Mining

Author

Listed:
  • Alexandra CERNIAN

    (“Politehnica” University of Bucharest)

  • Dorin CARSTOIU

    (“Politehnica” University of Bucharest)

  • Adriana OLTEANU

    (“Politehnica” University of Bucharest)

  • Valentin SGARCIU

    (“Politehnica” University of Bucharest)

Abstract

The nature of the human brain is to find patterns in whatever surrounds us. Thus, we all develop models of our personal universe. By extension, modelling the universe has been a constant preoccupation of philosophers. Clustering is one of the most useful tools in the data mining process for discovering groups and identifying patterns in the underlying data. This paper addresses the compression-based clustering approach and focuses on validating this method in the context of text mining. The idea is supported by evidence that compression algorithms provide a good evaluation of informational content. In this context, we developed an integrated clustering platform, called EasyClustering, which incorporates 3 compressors, 4 distance metrics and 3 clustering algorithms. The experimental validation presented in this paper focuses on clustering text documents based on their informational content.
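
The abstract does not name the compressors or distance metrics that EasyClustering uses. As a rough illustration of the general idea only, the sketch below assumes zlib as the compressor and the Normalized Compression Distance (NCD) as the metric; both are assumptions for this example, and the function names are hypothetical rather than taken from the paper.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Lower values suggest that the
    two inputs share more informational content.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(docs: list[bytes]) -> list[list[float]]:
    """Pairwise NCD values; a clustering algorithm could take this as input."""
    return [[ncd(a, b) for b in docs] for a in docs]


if __name__ == "__main__":
    # Illustrative documents, not data from the paper.
    doc_a = b"the quick brown fox jumps over the lazy dog " * 20
    doc_c = b"stock prices fell sharply as investors weighed inflation data " * 20
    for row in distance_matrix([doc_a, doc_c]):
        print([round(value, 3) for value in row])
```

A distance matrix of this kind can be passed to any clustering algorithm that accepts precomputed distances. Real compressors make NCD only an approximation, so values for very short documents tend to be noisy.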

Suggested Citation

  • Alexandra CERNIAN & Dorin CARSTOIU & Adriana OLTEANU & Valentin SGARCIU, 2016. "Assessing the Performance of Compression Based Clustering for Text Mining," ECONOMIC COMPUTATION AND ECONOMIC CYBERNETICS STUDIES AND RESEARCH, Faculty of Economic Cybernetics, Statistics and Informatics, vol. 50(2), pages 197-210.
  • Handle: RePEc:cys:ecocyb:v:50:y:2016:i:2:p:197-210

    Download full text from publisher

    File URL: ftp://www.eadr.ro/RePEc/cys/ecocyb_pdf/ecocyb2_2016p197-210.pdf
    Download Restriction: no

    References listed on IDEAS

    1. Lian Duan & Lida Xu & Ying Liu & Jun Lee, 2009. "Cluster-based outlier detection," Annals of Operations Research, Springer, vol. 168(1), pages 151-168, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Philippe Bernard & Najat El Mekkaoui De Freitas & Bertrand B. Maillet, 2022. "A financial fraud detection indicator for investors: an IDeA," Annals of Operations Research, Springer, vol. 313(2), pages 809-832, June.
    2. Marek Śmieja & Magdalena Wiercioch, 2017. "Constrained clustering with a complex cluster structure," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 11(3), pages 493-518, September.
    3. Andrea Flori & Simone Giansante & Claudia Girardone & Fabio Pammolli, 2021. "Banks’ business strategies on the edge of distress," Annals of Operations Research, Springer, vol. 299(1), pages 481-530, April.
    4. Aielli, Gian Piero & Caporin, Massimiliano, 2013. "Fast clustering of GARCH processes via Gaussian mixture models," Mathematics and Computers in Simulation (MATCOM), Elsevier, vol. 94(C), pages 205-222.
    5. Ali Tosyali & Jinho Kim & Jeongsub Choi & Yunyi Kang & Myong K. Jeong, 2020. "New node anomaly detection algorithm based on nonnegative matrix factorization for directed citation networks," Annals of Operations Research, Springer, vol. 288(1), pages 457-474, May.
    6. Hae-Sang Park & Jeonghwa Lee & Chi-Hyuck Jun, 2014. "Clustering noise-included data by controlling decision errors," Annals of Operations Research, Springer, vol. 216(1), pages 129-144, May.
    7. Rakin Abrar & Showmitra Kumar Sarkar & Kashfia Tasnim Nishtha & Swapan Talukdar & Shahfahad & Atiqur Rahman & Abu Reza Md Towfiqul Islam & Amir Mosavi, 2022. "Assessing the Spatial Mapping of Heat Vulnerability under Urban Heat Island (UHI) Effect in the Dhaka Metropolitan Area," Sustainability, MDPI, vol. 14(9), pages 1-24, April.
    8. Behnam Tavakkol & Myong K. Jeong & Susan L. Albin, 2021. "Validity indices for clusters of uncertain data objects," Annals of Operations Research, Springer, vol. 303(1), pages 321-357, August.
    9. Shouhui Pan & Li Wang & Kaiyi Wang & Zhuming Bi & Siqing Shan & Bo Xu, 2014. "A Knowledge Engineering Framework for Identifying Key Impact Factors from Safety‐Related Accident Cases," Systems Research and Behavioral Science, Wiley Blackwell, vol. 31(3), pages 383-397, May.
    10. Farnè, Matteo & Vouldis, Angelos T., 2018. "A methodology for automised outlier detection in high-dimensional datasets: an application to euro area banks' supervisory data," Working Paper Series 2171, European Central Bank.

    More about this item

    Keywords

    clustering; compression; text mining; EasyClustering; FScore

    JEL classification:

    • O30 - Economic Development, Innovation, Technological Change, and Growth - - Innovation; Research and Development; Technological Change; Intellectual Property Rights - - - General
