Authors
Listed:
- Arun Kumar Yadav (NIT Hamirpur (HP))
- Tushar Gupta (NIT Hamirpur (HP))
- Mohit Kumar (NIT Hamirpur (HP))
- Divakar Yadav (SOCIS, IGNOU)
Abstract
Topic modeling is a popular machine learning technique in natural language processing for identifying themes within unstructured text. One of the most prominent methods for this purpose is Latent Dirichlet Allocation (LDA), which can automatically uncover topics from large text corpora. However, LDA alone may not always provide the best results. Using Bidirectional Encoder Representations from Transformers (BERT) embeddings in topic modeling significantly enhances the quality and coherence of discovered topics by leveraging deep contextual representations of words. Clustering is another powerful unsupervised machine learning technique frequently used for topic modeling and information extraction from unstructured text. This study introduces a hybrid approach that combines LDA with BERT for enhanced topic modeling, incorporating clustering over dimensionality-reduced representations. To manage the complexity and computational load of clustering high-dimensional feature vectors, Uniform Manifold Approximation and Projection (UMAP) is utilized for dimensionality reduction. Experiments conducted on benchmark datasets, specifically Reuters-21578 and 20newsgroups, demonstrate the effectiveness of this cluster-informed topic modeling framework. The empirical results suggest that integrating clustering with BERT-LDA for topic modeling can be highly effective, as clustering on dimensionality-reduced representations helps derive more cohesive topics. The study evaluates coherence scores of the BERT-LDA model on the 20newsgroups and Reuters datasets. For the 20newsgroups dataset, BERT-LDA shows a significant improvement in coherence scores: nearly 59% for 10 topics, 42% for 20 topics, 11% for 50 topics, and 16% for 98 topics. Similarly, for the Reuters dataset, coherence scores improved by about 85% for 10 topics, 63% for 20 topics, 43% for 50 topics, and 41% for 98 topics. These results highlight how BERT-LDA enhances topic coherence compared to traditional models.
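The pipeline described in the abstract (LDA topic vectors combined with BERT embeddings, UMAP dimensionality reduction, then clustering) can be sketched in Python. This is a minimal illustration only, assuming gensim for LDA and coherence scoring, sentence-transformers for the BERT embeddings, umap-learn for UMAP, and scikit-learn K-means for the clustering step; the embedding model name, the weighting factor gamma, and the choice of K-means are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a BERT-LDA hybrid topic modeling pipeline.
# Library choices and parameters are assumptions for illustration.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel
from sentence_transformers import SentenceTransformer
from umap import UMAP
from sklearn.cluster import KMeans


def hybrid_bert_lda(docs, num_topics=10, gamma=15.0):
    tokenized = [d.lower().split() for d in docs]

    # 1) LDA: per-document topic probability vectors.
    dictionary = Dictionary(tokenized)
    corpus = [dictionary.doc2bow(t) for t in tokenized]
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)
    lda_vecs = np.zeros((len(docs), num_topics))
    for i, bow in enumerate(corpus):
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            lda_vecs[i, topic_id] = prob

    # 2) BERT: contextual document embeddings (model name is an assumption).
    bert_vecs = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

    # 3) Concatenate the two views; gamma balances the LDA and BERT scales.
    combined = np.hstack([lda_vecs * gamma, bert_vecs])

    # 4) UMAP reduces dimensionality before clustering, as described in the abstract.
    reduced = UMAP(n_components=5, random_state=42).fit_transform(combined)

    # 5) Cluster the reduced vectors into topic groups (K-means is an assumption).
    labels = KMeans(n_clusters=num_topics, n_init=10, random_state=42).fit_predict(reduced)

    # Coherence (c_v) of the underlying LDA topics.
    coherence = CoherenceModel(model=lda, texts=tokenized, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    return labels, coherence
```

The coherence reported here is the c_v score of the underlying LDA topics; the improvements reported in the paper come from the authors' own BERT-LDA configuration and are not guaranteed to be reproduced by this sketch.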
Suggested Citation
Arun Kumar Yadav & Tushar Gupta & Mohit Kumar & Divakar Yadav, 2025.
"A Hybrid Model Integrating LDA, BERT, and Clustering for Enhanced Topic Modeling,"
Quality & Quantity: International Journal of Methodology, Springer, vol. 59(3), pages 2381-2408, June.
Handle: RePEc:spr:qualqt:v:59:y:2025:i:3:d:10.1007_s11135-025-02077-y
DOI: 10.1007/s11135-025-02077-y