
Topic modeling revisited: New evidence on algorithm performance and quality metrics

Authors

  • Matthias Rüdiger
  • David Antons
  • Amol M Joshi
  • Torsten-Oliver Salge

Abstract

Topic modeling is a popular technique for exploring large document collections. It has proven useful for this task, but its application poses a number of challenges. First, the comparison of available algorithms is anything but simple, as researchers use many different datasets and criteria for their evaluation. A second challenge is the choice of a suitable metric for evaluating the calculated results. The metrics used so far provide a mixed picture, making it difficult to verify the accuracy of topic modeling outputs. Altogether, the choice of an appropriate algorithm and the evaluation of the results remain unresolved issues. Although many studies have reported promising performance by various topic models, prior research has not yet systematically investigated the validity of the outcomes in a comprehensive manner, that is, using more than a small number of the available algorithms and metrics. Consequently, our study has two main objectives. First, we compare all commonly used, non-application-specific topic modeling algorithms and assess their relative performance. The comparison is made against a known clustering and thus enables an unbiased evaluation of results. Our findings show a clear ranking of the algorithms in terms of accuracy. Second, we analyze the relationship between existing metrics and the known clustering, and thus objectively determine under what conditions these algorithms may be utilized effectively. In this way, we enable readers to gain a deeper understanding of the performance of topic modeling techniques and the interplay of performance and evaluation metrics.
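
The idea of scoring a topic model against a known clustering can be illustrated with a short sketch. The abstract does not say which agreement measure the authors use, so the adjusted Rand index below is an assumption chosen for illustration, as are the helper names and the toy document-topic weights. Each document is assigned to its highest-weight topic, and that assignment is compared with the known labels.

```python
from collections import Counter
from math import comb


def dominant_topics(doc_topic_weights):
    """Assign each document to the index of its highest-weight topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic_weights]


def adjusted_rand_index(known, predicted):
    """Chance-corrected agreement between two labelings (1.0 = same partition)."""
    if len(known) != len(predicted):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(known), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two documents: nothing to disagree on
    # Count document pairs placed together in both labelings and in each one alone.
    together_both = sum(comb(c, 2) for c in Counter(zip(known, predicted)).values())
    together_known = sum(comb(c, 2) for c in Counter(known).values())
    together_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = together_known * together_pred / total_pairs
    max_index = (together_known + together_pred) / 2
    if max_index == expected:
        return 1.0  # both labelings are all-in-one or all-singletons
    return (together_both - expected) / (max_index - expected)


# Hypothetical data: six documents, two known clusters, two fitted topics.
known_clusters = [0, 0, 0, 1, 1, 1]
doc_topic = [
    [0.80, 0.20],
    [0.70, 0.30],
    [0.60, 0.40],
    [0.10, 0.90],
    [0.30, 0.70],
    [0.55, 0.45],  # this document lands in the "wrong" topic
]
score = adjusted_rand_index(known_clusters, dominant_topics(doc_topic))
print(f"adjusted Rand index: {score:.3f}")
```

Because the index depends only on which documents are grouped together, it is unaffected by how the topics happen to be numbered, which is convenient when comparing several algorithms on the same labeled corpus.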

Suggested Citation

  • Matthias Rüdiger & David Antons & Amol M Joshi & Torsten-Oliver Salge, 2022. "Topic modeling revisited: New evidence on algorithm performance and quality metrics," PLOS ONE, Public Library of Science, vol. 17(4), pages 1-25, April.
  • Handle: RePEc:plo:pone00:0266325
    DOI: 10.1371/journal.pone.0266325

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0266325
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0266325&type=printable
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Pastwa, Anna M. & Shrestha, Prabal & Thewissen, James & Torsin, Wouter, 2021. "Unpacking the black box of ICO white papers: a topic modeling approach," LIDAM Discussion Papers LFIN 2021018, Université catholique de Louvain, Louvain Finance (LFIN).
    2. Maksym Polyakov & Morteza Chalak & Md. Sayed Iftekhar & Ram Pandit & Sorada Tapsuwan & Fan Zhang & Chunbo Ma, 2018. "Authorship, Collaboration, Topics, and Research Gaps in Environmental and Resource Economics 1991–2015," Environmental & Resource Economics, Springer;European Association of Environmental and Resource Economists, vol. 71(1), pages 217-239, September.
    3. Bastian Schaefermeier & Gerd Stumme & Tom Hanika, 2021. "Topic space trajectories," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(7), pages 5759-5795, July.
    4. van Loon, Austin, 2022. "Three Families of Automated Text Analysis," SocArXiv htnej, Center for Open Science.
    5. Christina Bannier & Thomas Pauls & Andreas Walter, 2019. "Content analysis of business communication: introducing a German dictionary," Journal of Business Economics, Springer, vol. 89(1), pages 79-123, February.
    6. Florence Ertel & Simon Donig & Markus Eckl & Sebastian Gassner & Daniel Göler & Malte Rehbein, 2024. "Using web archives for an explorative study of the web presence of German parties during the European election 2019," Quality & Quantity: International Journal of Methodology, Springer, vol. 58(1), pages 603-625, February.
    7. Pratima (Tima) Bansal & Jury Gualandris & Nahyun Kim, 2020. "Theorizing Supply Chains with Qualitative Big Data and Topic Modeling," Journal of Supply Chain Management, Institute for Supply Management, vol. 56(2), pages 7-18, April.
    8. Triss Ashton & Nicholas Evangelopoulos & Victor Prybutok, 2014. "Extending monitoring methods to textual data: a research agenda," Quality & Quantity: International Journal of Methodology, Springer, vol. 48(4), pages 2277-2294, July.
    9. Matthew Gentzkow & Bryan T. Kelly & Matt Taddy, 2017. "Text as Data," NBER Working Papers 23276, National Bureau of Economic Research, Inc.
    10. Mohammad Al Haj Eid & Ala' Omar Dandis & Virginia Cathro & Mathew Parackal, 2024. "Exploring public voice on social media: Twitter Users' views on the circular economy," Sustainable Development, John Wiley & Sons, Ltd., vol. 32(6), pages 6360-6373, December.
    11. Javier De la Hoz-M & Mª José Fernández-Gómez & Susana Mendes, 2021. "LDAShiny: An R Package for Exploratory Review of Scientific Literature Based on a Bayesian Probabilistic Model and Machine Learning Tools," Mathematics, MDPI, vol. 9(14), pages 1-21, July.
    12. Giovanna Maria Dora Dore, 2023. "A Natural Language Processing Analysis of Newspapers Coverage of Hong Kong Protests Between 1998 and 2020," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 169(1), pages 143-166, September.
    13. Thewissen, James & Shrestha, Prabal & Torsin, Wouter & Pastwa, Anna M., 2022. "Unpacking the black box of ICO white papers: A topic modeling approach," Journal of Corporate Finance, Elsevier, vol. 75(C).
    14. Manfred Stede & Yannic Bracke & Luka Borec & Neele Charlotte Kinkel & Maria Skeppstedt, 2023. "Framing climate change in Nature and Science editorials: applications of supervised and unsupervised text categorization," Journal of Computational Social Science, Springer, vol. 6(2), pages 485-513, October.
    15. Jaeho Choi & Anoop Menon & Haris Tabakovic, 2021. "Using machine learning to revisit the diversification–performance relationship," Strategic Management Journal, Wiley Blackwell, vol. 42(9), pages 1632-1661, September.
    16. Anna Calissano & Simone Vantini & Marika Arena, 2020. "Monitoring rare categories in sentiment and opinion analysis: a Milan mega event on Twitter platform," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 29(4), pages 787-812, December.
    17. Hung, Shih-Chang & Chang, Shu-Chen, 2023. "Framing the virus: The political, economic, biomedical and social understandings of the COVID-19 in Taiwan," Technological Forecasting and Social Change, Elsevier, vol. 188(C).
    18. Lu, Jinfeng & Dimov, Dimo, 2023. "A system dynamics modelling of entrepreneurship and growth within firms," Journal of Business Venturing, Elsevier, vol. 38(3).
    19. Irina Wedel & Michael Palk & Stefan Voß, 2022. "A Bilingual Comparison of Sentiment and Topics for a Product Event on Twitter," Information Systems Frontiers, Springer, vol. 24(5), pages 1635-1646, October.
    20. Bernhardt, Lea & Dewenter, Ralf & Thomas, Tobias, 2023. "Measuring partisan media bias in US newscasts from 2001 to 2012," European Journal of Political Economy, Elsevier, vol. 78(C).
