
Addressing topic modelling via reduced latent space clustering

Author

Listed:
  • Lorenzo Schiavon

    (Ca’ Foscari University of Venice, San Giobbe)

Abstract

In the social sciences, topic modelling is attracting growing attention for its ability to automatically uncover the underlying themes within large corpora of textual data. This process typically involves two key phases: (i) identifying the words associated with language concepts, and (ii) clustering documents that share similar word distributions. In this study, motivated by the growing interest in automatic categorisation of policy documents and regulations, we leverage recent advancements in Bayesian factor models to develop a novel topic modelling approach. This enables us to represent the high-dimensional space defined by all possible observed words through a small set of latent variables, and simultaneously to cluster the documents based on their distributions over these latent constructs. Here, groups and underlying constructs are interpreted as document topics and language concepts, respectively, and the number of dimensions need not be specified in advance. We demonstrate the effectiveness of our approach using synthetic data and compare it with existing methods in the literature. Applying our approach to a corpus of Italian public health plans reveals intriguing patterns in the semantic structures used in ageing policies and in document topic similarities.
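
To make the two phases concrete, the following is a minimal sketch on synthetic data, not the paper's method. It swaps the Bayesian factor model for a plain truncated SVD and the model-based clustering for k-means with a farthest-point start, and it fixes the number of factors and clusters by hand, whereas the abstract states that the number of dimensions is not required in advance. The function name `latent_cluster`, its parameters, and the toy vocabulary of six words are assumptions made purely for illustration.

```python
import numpy as np


def latent_cluster(counts, n_factors=2, n_clusters=2, n_iter=50):
    """Project documents onto a few latent factors, then cluster them there.

    Illustrative stand-in only: truncated SVD plus k-means, not the
    Bayesian factor model described in the abstract.
    """
    # Row-normalise counts to word frequencies, then centre each word column.
    freqs = counts / counts.sum(axis=1, keepdims=True)
    centred = freqs - freqs.mean(axis=0)

    # Truncated SVD: rows of `loadings` weight the words (a rough analogue of
    # "language concepts"); `scores` are document coordinates in latent space.
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_factors] * s[:n_factors]
    loadings = vt[:n_factors]

    # Deterministic farthest-point initialisation of the cluster centres.
    centres = [scores[0]]
    for _ in range(1, n_clusters):
        d = np.min([((scores - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(scores[d.argmax()])
    centres = np.array(centres)

    # Lloyd's k-means iterations on the latent scores.
    for _ in range(n_iter):
        dists = ((scores[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centres[k] = scores[labels == k].mean(axis=0)
    return labels, loadings


# Hypothetical corpus: 20 documents of 200 words over a 6-word vocabulary.
# Topic A favours words 0-2, topic B favours words 3-5.
rng = np.random.default_rng(0)
topic_a = np.array([0.30, 0.30, 0.30, 0.04, 0.03, 0.03])
topic_b = topic_a[::-1].copy()
counts = np.vstack([
    rng.multinomial(200, topic_a, size=10),
    rng.multinomial(200, topic_b, size=10),
])

labels, loadings = latent_cluster(counts)
print(labels)    # cluster index assigned to each document
print(loadings)  # word weights of each latent factor
```

In this toy setting the documents generated from the same topic should end up in the same cluster, and the first factor should contrast the two groups of words. A clustering obtained this way could be compared with a known partition using an agreement measure such as the adjusted Rand index.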

Suggested Citation

  • Lorenzo Schiavon, 2025. "Addressing topic modelling via reduced latent space clustering," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 34(1), pages 1-20, March.
  • Handle: RePEc:spr:stmapp:v:34:y:2025:i:1:d:10.1007_s10260-025-00779-z
    DOI: 10.1007/s10260-025-00779-z

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s10260-025-00779-z
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Simon Beyeler & Sylvia Kaufmann, 2021. "Reduced‐form factor augmented VAR—Exploiting sparsity to include meaningful factors," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 36(7), pages 989-1012, November.
    2. Dimitris Korobilis & Kenichi Shimizu, 2022. "Bayesian Approaches to Shrinkage and Sparse Estimation," Foundations and Trends® in Econometrics, now publishers, vol. 11(4), pages 230-354, June.
    3. Sylvia Frühwirth-Schnatter & Darjus Hosszejni & Hedibert Freitas Lopes, 2023. "When It Counts—Econometric Identification of the Basic Factor Model Based on GLT Structures," Econometrics, MDPI, vol. 11(4), pages 1-30, November.
    4. Kaufmann, Sylvia & Schumacher, Christian, 2019. "Bayesian estimation of sparse dynamic factor models with order-independent and ex-post mode identification," Journal of Econometrics, Elsevier, vol. 210(1), pages 116-134.
    5. Sylvia Fruhwirth-Schnatter, 2023. "Generalized Cumulative Shrinkage Process Priors with Applications to Sparse Bayesian Factor Analysis," Papers 2303.00473, arXiv.org.
    6. Adrian Quintero & Emmanuel Lesaffre & Geert Verbeke, 2024. "Bayesian Exploratory Factor Analysis via Gibbs Sampling," Journal of Educational and Behavioral Statistics, vol. 49(1), pages 121-142, February.
    7. Nolan, Tui H. & Richardson, Sylvia & Ruffieux, Hélène, 2025. "Efficient Bayesian functional principal component analysis of irregularly-observed multivariate curves," Computational Statistics & Data Analysis, Elsevier, vol. 203(C).
    8. L Schiavon & A Canale & D B Dunson, 2022. "Generalized infinite factorization models [A latent factor linear mixed model for high-dimensional longitudinal data analysis]," Biometrika, Biometrika Trust, vol. 109(3), pages 817-835.
    9. Mohsen Maleki & Darren Wraith, 2019. "Mixtures of multivariate restricted skew-normal factor analyzer models in a Bayesian framework," Computational Statistics, Springer, vol. 34(3), pages 1039-1053, September.
    10. Darjus Hosszejni & Sylvia Fruhwirth-Schnatter, 2022. "Cover It Up! Bipartite Graphs Uncover Identifiability in Sparse Factor Analysis," Papers 2211.00671, arXiv.org, revised Feb 2025.
    11. Hauzenberger, Niko & Huber, Florian & Klieber, Karin & Marcellino, Massimiliano, 2025. "Bayesian neural networks for macroeconomic analysis," Journal of Econometrics, Elsevier, vol. 249(PC).
    12. Gregor Kastner & Sylvia Fruhwirth-Schnatter & Hedibert Freitas Lopes, 2016. "Efficient Bayesian Inference for Multivariate Factor Stochastic Volatility Models," Papers 1602.08154, arXiv.org, revised Jul 2017.
    13. Alessandro Casa & Andrea Cappozzo & Michael Fop, 2022. "Group-Wise Shrinkage Estimation in Penalized Model-Based Clustering," Journal of Classification, Springer;The Classification Society, vol. 39(3), pages 648-674, November.
    14. Matthew W. Wheeler, 2019. "Bayesian additive adaptive basis tensor product models for modeling high dimensional surfaces: an application to high‐throughput toxicity testing," Biometrics, The International Biometric Society, vol. 75(1), pages 193-201, March.
    15. Joshua C. C. Chan, 2024. "BVARs and stochastic volatility," Chapters, in: Michael P. Clements & Ana Beatriz Galvão (ed.), Handbook of Research Methods and Applications in Macroeconomic Forecasting, chapter 3, pages 43-67, Edward Elgar Publishing.
    16. Marco, Nicholas & Şentürk, Damla & Jeste, Shafali & DiStefano, Charlotte C. & Dickinson, Abigail & Telesca, Donatello, 2024. "Flexible regularized estimation in high-dimensional mixed membership models," Computational Statistics & Data Analysis, Elsevier, vol. 194(C).
    17. Florian Huber & Gary Koop, 2023. "Subspace shrinkage in conjugate Bayesian vector autoregressions," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 38(4), pages 556-576, June.
    18. Sylvia Frühwirth-Schnatter & Gertraud Malsiner-Walli, 2019. "From here to infinity: sparse finite versus Dirichlet process mixtures in model-based clustering," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(1), pages 33-64, March.
    19. Samorodnitsky, Sarah & Wendt, Chris H. & Lock, Eric F., 2024. "Bayesian simultaneous factorization and prediction using multi-omic data," Computational Statistics & Data Analysis, Elsevier, vol. 197(C).
    20. Simon Beyeler & Sylvia Kaufmann, 2016. "Factor augmented VAR revisited - A sparse dynamic factor model approach," Working Papers 16.08, Swiss National Bank, Study Center Gerzensee.
