Authors
Listed:
- Arun Kumar Yadav (NIT Hamirpur (HP))
- Tushar Gupta (NIT Hamirpur (HP))
- Mohit Kumar (NIT Hamirpur (HP))
- Divakar Yadav (SOCIS, IGNOU)
Abstract
Topic modeling is a popular machine learning technique in natural language processing for identifying themes within unstructured text. One of the most prominent methods for this purpose is Latent Dirichlet Allocation (LDA), which can automatically uncover topics from large text corpora. However, LDA alone may not always provide the best results. Using Bidirectional Encoder Representations from Transformers (BERT) embeddings in topic modeling significantly enhances the quality and coherence of discovered topics by leveraging deep contextual representations of words. Clustering is another powerful unsupervised machine learning technique frequently used for topic modeling and information extraction from unstructured text. This study introduces a hybrid approach that combines LDA with BERT for enhanced topic modeling, incorporating clustering over dimensionality-reduced representations. To manage the complexity and computational load of clustering high-dimensional feature vectors, Uniform Manifold Approximation and Projection (UMAP) is utilized for dimensionality reduction. Experiments conducted on benchmark datasets, specifically Reuters-21578 and 20newsgroups, demonstrate the effectiveness of this cluster-informed topic modeling framework. The empirical results suggest that integrating clustering with BERT-LDA for topic modeling can be highly effective, as clustering on dimensionality-reduced representations helps derive more cohesive topics. The study evaluates coherence scores of the BERT-LDA model on the 20newsgroups and Reuters datasets. For the 20newsgroups dataset, BERT-LDA shows a significant improvement in coherence scores: nearly 59% for 10 topics, 42% for 20 topics, 11% for 50 topics, and 16% for 98 topics. Similarly, for the Reuters dataset, coherence scores improved by about 85% for 10 topics, 63% for 20 topics, 43% for 50 topics, and 41% for 98 topics. These results highlight how BERT-LDA enhances topic coherence compared to traditional models.
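The pipeline described in the abstract (LDA topic vectors combined with BERT embeddings, UMAP dimensionality reduction, then clustering) can be sketched in Python. This is a minimal illustration only, assuming gensim for LDA and coherence scoring, sentence-transformers for the BERT embeddings, umap-learn for UMAP, and scikit-learn K-means for the clustering step; the embedding model name, the weighting factor gamma, and the choice of K-means are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a BERT-LDA hybrid topic modeling pipeline.
# Library choices and parameters are assumptions for illustration.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel
from sentence_transformers import SentenceTransformer
from umap import UMAP
from sklearn.cluster import KMeans


def hybrid_bert_lda(docs, num_topics=10, gamma=15.0):
    tokenized = [d.lower().split() for d in docs]

    # 1) LDA: per-document topic probability vectors.
    dictionary = Dictionary(tokenized)
    corpus = [dictionary.doc2bow(t) for t in tokenized]
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)
    lda_vecs = np.zeros((len(docs), num_topics))
    for i, bow in enumerate(corpus):
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            lda_vecs[i, topic_id] = prob

    # 2) BERT: contextual document embeddings (model name is an assumption).
    bert_vecs = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

    # 3) Concatenate the two views; gamma balances the LDA and BERT scales.
    combined = np.hstack([lda_vecs * gamma, bert_vecs])

    # 4) UMAP reduces dimensionality before clustering, as described in the abstract.
    reduced = UMAP(n_components=5, random_state=42).fit_transform(combined)

    # 5) Cluster the reduced vectors into topic groups (K-means is an assumption).
    labels = KMeans(n_clusters=num_topics, n_init=10, random_state=42).fit_predict(reduced)

    # Coherence (c_v) of the underlying LDA topics.
    coherence = CoherenceModel(model=lda, texts=tokenized, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    return labels, coherence
```

The coherence reported here is the c_v score of the underlying LDA topics; the improvements reported in the paper come from the authors' own BERT-LDA configuration and are not guaranteed to be reproduced by this sketch.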
Suggested Citation
Arun Kumar Yadav & Tushar Gupta & Mohit Kumar & Divakar Yadav, 2025.
"A Hybrid Model Integrating LDA, BERT, and Clustering for Enhanced Topic Modeling,"
Quality & Quantity: International Journal of Methodology, Springer, vol. 59(3), pages 2381-2408, June.
Handle: RePEc:spr:qualqt:v:59:y:2025:i:3:d:10.1007_s11135-025-02077-y
DOI: 10.1007/s11135-025-02077-y