Printed from https://ideas.repec.org/p/arx/papers/2503.02741.html

Seeded Poisson Factorization: Leveraging domain knowledge to fit topic models

Author

Listed:
  • Bernd Prostmaier
  • Jan Vávra
  • Bettina Grün
  • Paul Hofmarcher

Abstract

Topic models are widely used for discovering latent thematic structures in large text corpora, yet traditional unsupervised methods often struggle to align with predefined conceptual domains. This paper introduces Seeded Poisson Factorization (SPF), a novel approach that extends the Poisson Factorization framework by incorporating domain knowledge through seed words. SPF enables a more interpretable and structured topic discovery by modifying the prior distribution of topic-specific term intensities, assigning higher initial rates to predefined seed words. The model is estimated using variational inference with stochastic gradient optimization, ensuring scalability to large datasets. We apply SPF to an Amazon customer feedback dataset, leveraging predefined product categories as guiding structures. Our evaluation demonstrates that SPF achieves superior classification performance compared to alternative guided topic models, particularly in terms of computational efficiency and predictive performance. Furthermore, robustness checks highlight SPF's ability to adaptively balance domain knowledge and data-driven topic discovery, even in cases of imperfect seed word selection. These results establish SPF as a powerful and scalable alternative for integrating expert knowledge into topic modeling, enhancing both interpretability and efficiency in real-world applications.
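
The seeding idea in the abstract (a prior on topic-specific term intensities that gives seed words a higher rate) can be illustrated with a small sketch. This is not the authors' implementation: the additive boost, the Gamma parameterization, and every name and value below (`seeded_prior_means`, `sample_intensities`, `base_mean`, `seed_boost`, `shape`, the toy vocabulary and seed lists) are assumptions made for illustration only.

```python
import random

def seeded_prior_means(vocab, seed_words, base_mean=0.1, seed_boost=1.0):
    """Build a topics-by-terms table of prior mean intensities.

    Hypothetical scheme: every term starts at base_mean, and a term that is
    a seed word for a topic gets seed_boost added in that topic's row.
    """
    index = {word: i for i, word in enumerate(vocab)}
    means = []
    for topic, words in seed_words.items():
        row = [base_mean] * len(vocab)
        for word in words:
            if word in index:  # seed words absent from the vocabulary are skipped
                row[index[word]] += seed_boost
        means.append(row)
    return means

def sample_intensities(prior_means, shape=0.3, rng=None):
    """Draw Gamma-distributed intensities whose expectation is the prior mean."""
    rng = rng or random.Random(0)
    # random.gammavariate(alpha, beta) has mean alpha * beta,
    # so a scale of mean / shape yields the requested expectation.
    return [[rng.gammavariate(shape, m / shape) for m in row] for row in prior_means]

# Toy vocabulary and seed lists (assumed, not taken from the paper).
vocab = ["battery", "charger", "novel", "author", "great"]
seeds = {"electronics": ["battery", "charger"], "books": ["novel", "author"]}

means = seeded_prior_means(vocab, seeds)
beta = sample_intensities(means)
```

In this sketch the seed words only shift the prior expectation; a fitted model would still be free to move the intensities away from it, which is one way to read the abstract's point about balancing domain knowledge against the data.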

Suggested Citation

  • Bernd Prostmaier & Jan Vávra & Bettina Grün & Paul Hofmarcher, 2025. "Seeded Poisson Factorization: Leveraging domain knowledge to fit topic models," Papers 2503.02741, arXiv.org.
  • Handle: RePEc:arx:papers:2503.02741

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2503.02741
    File Function: Latest version
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Laura Battaglia & Timothy M. Christensen & Stephen Hansen & Szymon Sacher, 2024. "Inference for regression with variables generated from unstructured data," CeMMAP working papers 10/24, Institute for Fiscal Studies.
    2. Szymon Sacher & Laura Battaglia & Stephen Hansen, 2021. "Hamiltonian Monte Carlo for Regression with High-Dimensional Categorical Data," Papers 2107.08112, arXiv.org, revised Feb 2024.
    3. Laura Battaglia & Timothy Christensen & Stephen Hansen & Szymon Sacher, 2024. "Inference for Regression with Variables Generated by AI or Machine Learning," Papers 2402.15585, arXiv.org, revised Apr 2025.
    4. Kübler, Raoul V. & Manke, Kai & Pauwels, Koen, 2025. "I like, I share, I vote: Mapping the dynamic system of political marketing," Journal of Business Research, Elsevier, vol. 186(C).
    5. Zhang, Han, 2021. "How Using Machine Learning Classification as a Variable in Regression Leads to Attenuation Bias and What to Do About It," SocArXiv 453jk, Center for Open Science.
    6. Andreas Rehs, 2020. "A structural topic model approach to scientific reorientation of economics and chemistry after German reunification," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(2), pages 1229-1251, November.
    7. Xieling Chen & Juan Chen & Gary Cheng & Tao Gong, 2020. "Topics and trends in artificial intelligence assisted human brain research," PLOS ONE, Public Library of Science, vol. 15(4), pages 1-27, April.
    8. Ebadi, Ashkan & Tremblay, Stéphane & Goutte, Cyril & Schiffauerova, Andrea, 2020. "Application of machine learning techniques to assess the trends and alignment of the funded research output," Journal of Informetrics, Elsevier, vol. 14(2).
    9. Sandra Wankmüller, 2023. "A comparison of approaches for imbalanced classification problems in the context of retrieving relevant documents for an analysis," Journal of Computational Social Science, Springer, vol. 6(1), pages 91-163, April.
    10. Marc Burri & Daniel Kaufmann, 2020. "A daily fever curve for the Swiss economy," Swiss Journal of Economics and Statistics, Springer;Swiss Society of Economics and Statistics, vol. 156(1), pages 1-11, December.
    11. Everett, Jeff & Shiraz Rahaman, Abu & Neu, Dean & Saxton, Gregory, 2024. "Letters to the editor, institutional experimentation, and the public accounting professional," CRITICAL PERSPECTIVES ON ACCOUNTING, Elsevier, vol. 99(C).
    12. Minchul Lee & Min Song, 2020. "Incorporating citation impact into analysis of research trends," Scientometrics, Springer;Akadémiai Kiadó, vol. 124(2), pages 1191-1224, August.
    13. Shen Liu & Hongyan Liu, 2021. "Tagging Items Automatically Based on Both Content Information and Browsing Behaviors," INFORMS Journal on Computing, INFORMS, vol. 33(3), pages 882-897, July.
    14. Benoit Aubert & Jane Li & Markus Luczak-Roesch & Thierry Warin, 2021. "La détermination des agendas de discussion par les médias sociaux," CIRANO Project Reports 2021rp-12, CIRANO.
    15. Grajzl, Peter & Murrell, Peter, 2021. "A machine-learning history of English caselaw and legal ideas prior to the Industrial Revolution I: generating and interpreting the estimates," Journal of Institutional Economics, Cambridge University Press, vol. 17(1), pages 1-19, February.
    16. Loaiza-Maya, Rubén & Smith, Michael Stanley & Nott, David J. & Danaher, Peter J., 2022. "Fast and accurate variational inference for models with many latent variables," Journal of Econometrics, Elsevier, vol. 230(2), pages 339-362.
    17. Leif Anders Thorsrud, 2016. "Nowcasting using news topics Big Data versus big bank," Working Papers No 6/2016, Centre for Applied Macro- and Petroleum economics (CAMP), BI Norwegian Business School.
    18. Dehler-Holland, Joris & Schumacher, Kira & Fichtner, Wolf, 2021. "Topic Modeling Uncovers Shifts in Media Framing of the German Renewable Energy Act," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 2(1).
    19. Luo, Nanyu & Ji, Feng & Han, Yuting & He, Jinbo & Zhang, Xiaoya, 2024. "Fitting item response theory models using deep learning computational frameworks," OSF Preprints tjxab, Center for Open Science.
    20. Marcel Fratzscher & Tobias Heidland & Lukas Menkhoff & Lucio Sarno & Maik Schmeling, 2023. "Foreign Exchange Intervention: A New Database," IMF Economic Review, Palgrave Macmillan;International Monetary Fund, vol. 71(4), pages 852-884, December.
