
Cluster-based sparse topical coding for topic mining and document clustering

Authors

  • Parvin Ahmadi (Sharif University of Technology)
  • Iman Gholampour (Sharif University of Technology)
  • Mahmoud Tabandeh (Sharif University of Technology)

Abstract

In this paper, we introduce Cluster-based Sparse Topical Coding, a document clustering method based on Sparse Topical Coding. Topic modeling can improve the clustering of text documents by describing each document with a bag-of-words model and projecting it into a topic space. The latent semantic descriptions derived by the topic model can then be used as features in a clustering process. In our proposed method, document clustering and topic modeling are integrated in a unified framework to maximize performance. This framework combines Sparse Topical Coding, which is responsible for topic mining, with K-means, which discovers the latent clusters in the document collection. Experimental results on widely used datasets show that our proposed method significantly outperforms both traditional clustering methods and other topic-model-based clustering methods. Our method improves clustering accuracy by 4% to 39% and normalized mutual information by 2% to more than 44%.
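
The abstract reports its results as clustering accuracy and normalized mutual information (NMI). As a rough illustration of how these two scores are commonly computed, and not the authors' evaluation code, the sketch below scores a set of cluster assignments against ground-truth class labels. The function names are hypothetical. Two choices are assumptions of this sketch that the abstract does not specify: the best cluster-to-class mapping is found by brute force over permutations (a stand-in for the Hungarian method, practical only for a handful of clusters), and NMI is normalized by the geometric mean of the two entropies.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Share of documents labeled correctly under the best one-to-one
    mapping from cluster ids to class labels."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    # overlap[(cluster, cls)] = number of documents in both.
    overlap = Counter(zip(cluster_ids, true_labels))
    # Assumption: brute-force search replaces the Hungarian method here.
    best = max(
        sum(overlap[(c, k)] for c, k in zip(clusters, assigned))
        for assigned in permutations(classes, len(clusters))
    )
    return best / len(true_labels)


def normalized_mutual_information(true_labels, cluster_ids):
    """Mutual information between labels and clusters, divided by the
    geometric mean of their entropies (an assumed normalization)."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_ids))
    n_true = Counter(true_labels)
    n_clus = Counter(cluster_ids)
    mi = sum(
        (n_ij / n) * math.log(n_ij * n / (n_true[t] * n_clus[c]))
        for (t, c), n_ij in joint.items()
    )
    h_true = -sum((v / n) * math.log(v / n) for v in n_true.values())
    h_clus = -sum((v / n) * math.log(v / n) for v in n_clus.values())
    if h_true == 0 or h_clus == 0:
        # Degenerate case: a single class and/or a single cluster.
        return 1.0 if h_true == h_clus else 0.0
    return mi / math.sqrt(h_true * h_clus)


if __name__ == "__main__":
    truth = ["sports", "sports", "politics", "politics"]
    found = [0, 0, 0, 1]  # one "politics" document lands in the wrong cluster
    print(clustering_accuracy(truth, found))
    print(normalized_mutual_information(truth, found))
```

Both scores are invariant to how the cluster ids are numbered, which is why accuracy needs the label-matching step before counting correct documents.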

Suggested Citation

  • Parvin Ahmadi & Iman Gholampour & Mahmoud Tabandeh, 2018. "Cluster-based sparse topical coding for topic mining and document clustering," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(3), pages 537-558, September.
  • Handle: RePEc:spr:advdac:v:12:y:2018:i:3:d:10.1007_s11634-017-0280-3
    DOI: 10.1007/s11634-017-0280-3

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11634-017-0280-3
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    References listed on IDEAS

    1. Jean-Charles Lamirel, 2012. "A new approach for automatizing the analysis of research topics dynamics: application to optoelectronics research," Scientometrics, Springer;Akadémiai Kiadó, vol. 93(1), pages 151-166, October.
    2. Teh, Yee Whye & Jordan, Michael I. & Beal, Matthew J. & Blei, David M., 2006. "Hierarchical Dirichlet Processes," Journal of the American Statistical Association, American Statistical Association, vol. 101, pages 1566-1581, December.
    3. H. W. Kuhn, 1955. "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, John Wiley & Sons, vol. 2(1‐2), pages 83-97, March.

    Citations


    Cited by:

    1. Karina Gibert & Yaroslav Hernandez-Potiomkin, 2023. "A Unified Formal Framework for Factorial and Probabilistic Topic Modelling," Mathematics, MDPI, vol. 11(20), pages 1-27, October.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Redivo, Edoardo & Nguyen, Hien D. & Gupta, Mayetri, 2020. "Bayesian clustering of skewed and multimodal data using geometric skewed normal distributions," Computational Statistics & Data Analysis, Elsevier, vol. 152(C).
    2. András Frank, 2005. "On Kuhn's Hungarian Method—A tribute from Hungary," Naval Research Logistics (NRL), John Wiley & Sons, vol. 52(1), pages 2-5, February.
    3. Amit Kumar & Anila Gupta, 2013. "Mehar’s methods for fuzzy assignment problems with restrictions," Fuzzy Information and Engineering, Springer, vol. 5(1), pages 27-44, March.
    4. Jeffrey L. Furman & Florenta Teodoridis, 2020. "Automation, Research Technology, and Researchers’ Trajectories: Evidence from Computer Science and Electrical Engineering," Organization Science, INFORMS, vol. 31(2), pages 330-354, March.
    5. Shu-Ping Shi & Yong Song, 2012. "Identifying Speculative Bubbles with an Infinite Hidden Markov Model," Working Paper series 26_12, Rimini Centre for Economic Analysis.
    6. Jin, Xin & Maheu, John M. & Yang, Qiao, 2022. "Infinite Markov pooling of predictive distributions," Journal of Econometrics, Elsevier, vol. 228(2), pages 302-321.
    7. Chenchen Ma & Jing Ouyang & Gongjun Xu, 2023. "Learning Latent and Hierarchical Structures in Cognitive Diagnosis Models," Psychometrika, Springer;The Psychometric Society, vol. 88(1), pages 175-207, March.
    8. Gustaf Bellstam & Sanjai Bhagat & J. Anthony Cookson, 2021. "A Text-Based Analysis of Corporate Innovation," Management Science, INFORMS, vol. 67(7), pages 4004-4031, July.
    9. Tran Hoang Hai, 2020. "Estimation of volatility causality in structural autoregressions with heteroskedasticity using independent component analysis," Statistical Papers, Springer, vol. 61(1), pages 1-16, February.
    10. Caplin, Andrew & Leahy, John, 2020. "Comparative statics in markets for indivisible goods," Journal of Mathematical Economics, Elsevier, vol. 90(C), pages 80-94.
    11. Biró, Péter & Gudmundsson, Jens, 2021. "Complexity of finding Pareto-efficient allocations of highest welfare," European Journal of Operational Research, Elsevier, vol. 291(2), pages 614-628.
    12. Hassan Akell & Farkhondeh-Alsadat Sajadi & Iraj Kazemi, 2023. "Construction of Jointly Distributed Random Samples Drawn from the Beta Two-Parameter Process," Methodology and Computing in Applied Probability, Springer, vol. 25(3), pages 1-12, September.
    13. Péter Biró & Flip Klijn & Xenia Klimentova & Ana Viana, 2021. "Shapley-Scarf Housing Markets: Respecting Improvement, Integer Programming, and Kidney Exchange," Working Papers 1235, Barcelona School of Economics.
    14. Michal Brylinski, 2014. "eMatchSite: Sequence Order-Independent Structure Alignments of Ligand Binding Pockets in Protein Models," PLOS Computational Biology, Public Library of Science, vol. 10(9), pages 1-15, September.
    15. Hongxia Yang & Aurelie Lozano, 2015. "Multi-relational learning via hierarchical nonparametric Bayesian collective matrix factorization," Journal of Applied Statistics, Taylor & Francis Journals, vol. 42(5), pages 1133-1147, May.
    16. Bauwens, Luc & Dufays, Arnaud & Rombouts, Jeroen V.K., 2014. "Marginal likelihood for Markov-switching and change-point GARCH models," Journal of Econometrics, Elsevier, vol. 178(P3), pages 508-522.
    17. Robert M. Dorazio & Bhramar Mukherjee & Li Zhang & Malay Ghosh & Howard L. Jelks & Frank Jordan, 2008. "Modeling Unobserved Sources of Heterogeneity in Animal Abundance Using a Dirichlet Process Prior," Biometrics, The International Biometric Society, vol. 64(2), pages 635-644, June.
    18. Chiwei Yan & Helin Zhu & Nikita Korolko & Dawn Woodard, 2020. "Dynamic pricing and matching in ride‐hailing platforms," Naval Research Logistics (NRL), John Wiley & Sons, vol. 67(8), pages 705-724, December.
    19. Fanrong Xie & Anuj Sharma & Zuoan Li, 2022. "An alternate approach to solve two-level priority based assignment problem," Computational Optimization and Applications, Springer, vol. 81(2), pages 613-656, March.
    20. Jeong Hwan Kook & Michele Guindani & Linlin Zhang & Marina Vannucci, 2019. "NPBayes-fMRI: Non-parametric Bayesian General Linear Models for Single- and Multi-Subject fMRI Data," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 11(1), pages 3-21, April.
