
Improving and Evaluating Topic Models and Other Models of Text

Authors

Listed:
  • Edoardo M. Airoldi
  • Jonathan M. Bischof

Abstract

An ongoing challenge in the analysis of document collections is how to summarize content in terms of a set of inferred themes that can be interpreted substantively in terms of topics. The current practice of parameterizing the themes in terms of most frequent words limits interpretability by ignoring the differential use of words across topics. Here, we show that words that are both frequent and exclusive to a theme are more effective at characterizing topical content, and we propose a regularization scheme that leads to better estimates of these quantities. We consider a supervised setting where professional editors have annotated documents to topic categories, organized into a tree, in which leaf-nodes correspond to more specific topics. Each document is annotated to multiple categories, at different levels of the tree. We introduce a hierarchical Poisson convolution model to analyze these annotated documents. A parallelized Hamiltonian Monte Carlo sampler allows the inference to scale to millions of documents. The model leverages the structure among categories defined by professional editors to infer a clear semantic description for each topic in terms of words that are both frequent and exclusive. In this supervised setting, we validate the efficacy of word frequency and exclusivity at characterizing topical content on two very large collections of documents, from Reuters and the New York Times. In an unsupervised setting, we then consider a simplified version of the model that shares the same regularization scheme with the previous model. We carry out a large randomized experiment on Amazon Mechanical Turk to demonstrate that topic summaries based on frequency and exclusivity, estimated using the proposed regularization scheme, are more interpretable than currently established frequency-based summaries, and that the proposed model produces more efficient estimates of exclusivity than the currently established models.
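The idea of ranking words by both frequency and exclusivity can be illustrated with a small sketch. The abstract does not give a formula, so everything below is an assumption made for illustration: exclusivity is taken as a word's rate in one topic divided by its summed rate across all topics, and the two quantities are combined as a weighted harmonic mean of their within-topic empirical CDF ranks. The function names `ecdf_rank` and `frex`, the weight `w`, and the toy rates are hypothetical. The sketch does not implement the hierarchical Poisson convolution model or the regularization scheme.

```python
import numpy as np

def ecdf_rank(x):
    """Empirical CDF value of each entry within its own row (rank / row length)."""
    # Double argsort yields 0-based ranks; ties are broken by position.
    ranks = np.argsort(np.argsort(x, axis=1), axis=1)
    return (ranks + 1) / x.shape[1]

def frex(rates, w=0.7):
    """Score words by frequency and exclusivity (illustrative assumption).

    rates: (K topics, V words) array of positive word rates per topic.
    w: weight on exclusivity; 1 - w is the weight on frequency.
    """
    rates = np.asarray(rates, dtype=float)
    # Share of each word's total rate that falls in each topic.
    exclusivity = rates / rates.sum(axis=0, keepdims=True)
    f = ecdf_rank(rates)
    e = ecdf_rank(exclusivity)
    # Weighted harmonic mean: high only when both ranks are high.
    return 1.0 / (w / e + (1.0 - w) / f)

# Toy data (invented): rows are topics "finance" and "sports".
vocab = ["the", "market", "goal", "bank"]
rates = np.array([
    [50.0, 20.0, 1.0, 15.0],
    [52.0, 1.0, 20.0, 2.0],
])

scores = frex(rates, w=0.7)
top_by_frequency = [vocab[i] for i in rates.argmax(axis=1)]
top_by_frex = [vocab[i] for i in scores.argmax(axis=1)]
print("most frequent word per topic:", top_by_frequency)
print("top combined-score word per topic:", top_by_frex)
```

In this toy example the most frequent word is the same uninformative word in both topics, while the combined score picks a different, topic-specific word for each, which is the contrast the abstract draws between frequency-based summaries and summaries based on frequency and exclusivity.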

Suggested Citation

  • Edoardo M. Airoldi & Jonathan M. Bischof, 2016. "Improving and Evaluating Topic Models and Other Models of Text," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(516), pages 1381-1403, October.
  • Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1381-1403
    DOI: 10.1080/01621459.2015.1051182

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1080/01621459.2015.1051182
    Download Restriction: Access to full text is restricted to subscribers.


    Citations

    Cited by:

    1. Puklavec, Žiga & Kogler, Christoph & Stavrova, Olga & Zeelenberg, Marcel, 2023. "What we tweet about when we tweet about taxes: A topic modelling approach," Journal of Economic Behavior & Organization, Elsevier, vol. 212(C), pages 1242-1254.
    2. Federico Maria Ferrara & Jörg S Haas & Andrew Peterson & Thomas Sattler, 2022. "Exports vs. Investment: How Public Discourse Shapes Support for External Imbalances," Post-Print hal-02569351, HAL.
    3. Dehler-Holland, Joris & Okoh, Marvin & Keles, Dogan, 2022. "Assessing technology legitimacy with topic models and sentiment analysis – The case of wind power in Germany," Technological Forecasting and Social Change, Elsevier, vol. 175(C).
    4. Dehler-Holland, Joris & Schumacher, Kira & Fichtner, Wolf, 2021. "Topic Modeling Uncovers Shifts in Media Framing of the German Renewable Energy Act," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 2(1).
    5. Berk Wheelock, Lauren & Pachamanova, Dessislava A., 2022. "Acceptable set topic modeling," European Journal of Operational Research, Elsevier, vol. 299(2), pages 653-673.
    6. Tamakloe, Reuben & Park, Dongjoo, 2023. "Discovering latent topics and trends in autonomous vehicle-related research: A structural topic modelling approach," Transport Policy, Elsevier, vol. 139(C), pages 1-20.
    7. Ingrid Ott & Simone Vannuccini, 2023. "Invention in Times of Global Challenges: A Text-Based Study of Remote Sensing and Global Public Goods," Economies, MDPI, vol. 11(8), pages 1-24, August.
    8. Valerio Astuti & Marta Crispino & Marco Langiulli & Juri Marcucci, 2022. "Textual analysis of a Twitter corpus during the COVID-19 pandemics," Questioni di Economia e Finanza (Occasional Papers) 692, Bank of Italy, Economic Research and International Relations Area.
    9. Ristolainen, Kim & Roukka, Tomi & Nyberg, Henri, 2024. "A thousand words tell more than just numbers: Financial crises and historical headlines," Journal of Financial Stability, Elsevier, vol. 70(C).
    10. Federico M. Ferrara & Donato Masciandaro & Manuela Moschella & Davide Romelli, 2021. "Political Voice on Monetary Policy: Evidence from the Parliamentary Hearings of the European Central Bank," BAFFI CAREFIN Working Papers 21159, BAFFI CAREFIN, Centre for Applied Research on International Markets Banking Finance and Regulation, Universita' Bocconi, Milano, Italy.
    11. Ferrara, Federico M. & Masciandaro, Donato & Moschella, Manuela & Romelli, Davide, 2022. "Political voice on monetary policy: Evidence from the parliamentary hearings of the European Central Bank," European Journal of Political Economy, Elsevier, vol. 74(C).
    12. Larsen, Vegard H. & Thorsrud, Leif A., 2019. "The value of news for economic developments," Journal of Econometrics, Elsevier, vol. 210(1), pages 203-218.
    13. Mengbing Li & Daniel E. Park & Maliha Aziz & Cindy M. Liu & Lance B. Price & Zhenke Wu, 2023. "Integrating sample similarities into latent class analysis: a tree‐structured shrinkage approach," Biometrics, The International Biometric Society, vol. 79(1), pages 264-279, March.
    14. Justyna Klejdysz & Robin L. Lumsdaine, 2023. "Shifts in ECB Communication: A Textual Analysis of the Press Conference," International Journal of Central Banking, International Journal of Central Banking, vol. 19(2), pages 473-542, June.
    15. Carlos Mendez & Fernando Mendez & Vasiliki Triga & Juan Miguel Carrascosa, 2020. "EU Cohesion Policy under the Media Spotlight: Exploring Territorial and Temporal Patterns in News Coverage and Tone," Journal of Common Market Studies, Wiley Blackwell, vol. 58(4), pages 1034-1055, July.
    16. Choi, Hyunhong & Woo, JongRoul, 2022. "Investigating emerging hydrogen technology topics and comparing national level technological focus: Patent analysis using a structural topic model," Applied Energy, Elsevier, vol. 313(C).
    17. Nobuyuki Hanaki & Ali I. Ozkes, 2023. "Strategic environment effect and communication," Experimental Economics, Springer;Economic Science Association, vol. 26(3), pages 588-621, July.
    18. Erzurumlu, S. Sinan & Pachamanova, Dessislava, 2020. "Topic modeling and technology forecasting for assessing the commercial viability of healthcare innovations," Technological Forecasting and Social Change, Elsevier, vol. 156(C).
    19. Federico Maria Ferrara & Jörg Haas & Andrew Peterson & Thomas Sattler, 2020. "Exports vs. Investment: How Public Discourse Shapes Support for External Imbalances," Working Papers hal-02569351, HAL.
    20. Camilla Salvatore & Silvia Biffignandi & Annamaria Bianchi, 2022. "Corporate Social Responsibility Activities Through Twitter: From Topic Model Analysis to Indexes Measuring Communication Characteristics," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 164(3), pages 1217-1248, December.
    21. Jef Vlegels & Stijn Daenekindt, 2021. "Combining topic models with bipartite blockmodelling to uncover the multifaceted nature of social capital," PLOS ONE, Public Library of Science, vol. 16(6), pages 1-15, June.
    22. Salvatore Pirri & Valentina Lorenzoni & Gianni Andreozzi & Marta Mosca & Giuseppe Turchetti, 2020. "Topic Modeling and User Network Analysis on Twitter during World Lupus Awareness Day," IJERPH, MDPI, vol. 17(15), pages 1-18, July.
