Robustness, replicability and scalability in topic modelling

Authors

  • Ballester, Omar
  • Penner, Orion

Abstract

Approaches for estimating the similarity between individual publications are an area of long-standing interest in the scientometrics and informetrics communities. Traditional techniques have generally relied on references and other metadata, while text mining approaches based on title and abstract text have appeared more frequently in recent years. In principle, topic models have great potential in this domain. In practice, however, they are often difficult to employ successfully and are notoriously inconsistent as the latent space dimension grows. In this manuscript we identify three properties that all usable topic models should have: robustness, descriptive power and reflection of reality. We develop a novel method for evaluating the robustness of topic models and suggest a metric to assess and benchmark descriptive power as the number of topics scales. Employing that procedure, we find that the neural-network-based paragraph embedding approach seems capable of providing statistically robust estimates of the document–document similarities, even for topic spaces far larger than what is usually considered prudent for the most common topic model approaches.
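
To make the notion of robust document–document similarities more concrete, the following is a minimal sketch and not the authors' procedure, which this record does not describe. It assumes that each training run of a model yields one vector per document, and it treats two runs as being in agreement when their pairwise cosine similarities are highly correlated. The function names (`pairwise_similarities`, `run_agreement`) and the toy vectors are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(doc_vectors):
    """Cosine similarity for every unordered document pair, in a fixed order."""
    return [cosine(doc_vectors[i], doc_vectors[j])
            for i, j in combinations(range(len(doc_vectors)), 2)]


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def run_agreement(run_a, run_b):
    """Hypothetical robustness score: correlation between the pairwise
    document similarities produced by two independent runs."""
    return pearson(pairwise_similarities(run_a), pairwise_similarities(run_b))


# Toy document vectors (assumed data, four documents in two dimensions).
run_a = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-0.6, 0.8]]

# A second run whose latent axes are rotated by 90 degrees: the coordinates
# differ, but every document-document similarity is preserved.
run_b = [[-y, x] for x, y in run_a]

# A third run that places the same documents differently relative to each other.
run_c = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [0.6, -0.8]]

print(run_agreement(run_a, run_b))  # close to 1: similarities agree
print(run_agreement(run_a, run_c))  # clearly lower: similarities disagree
```

Comparing similarities rather than raw coordinates is a deliberate choice in this sketch: latent dimensions from separate runs are generally not aligned with each other, so only quantities that do not depend on the choice of axes, such as pairwise similarities, can be compared directly.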

Suggested Citation

  • Ballester, Omar & Penner, Orion, 2022. "Robustness, replicability and scalability in topic modelling," Journal of Informetrics, Elsevier, vol. 16(1).
  • Handle: RePEc:eee:infome:v:16:y:2022:i:1:s175115772100095x
    DOI: 10.1016/j.joi.2021.101224

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S175115772100095X
    Download Restriction: Full text for ScienceDirect subscribers only

    Citations

    Cited by:

    1. Armenia, Stefano & Franco, Eduardo & Iandolo, Francesca & Maielli, Giuliano & Vito, Pietro, 2024. "Zooming in and out the landscape: Artificial intelligence and system dynamics in business and management," Technological Forecasting and Social Change, Elsevier, vol. 200(C).
    2. Kolomoyets, Yuliya & Dickinger, Astrid, 2023. "Understanding value perceptions and propositions: A machine learning approach," Journal of Business Research, Elsevier, vol. 154(C).
    3. Qianqian Xie & Ludo Waltman, 2025. "A comparison of citation-based clustering and topic modeling for science mapping," Scientometrics, Springer;Akadémiai Kiadó, vol. 130(5), pages 2497-2522, May.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Sjögårde, Peter & Ahlgren, Per, 2018. "Granularity of algorithmically constructed publication-level classifications of research publications: Identification of topics," Journal of Informetrics, Elsevier, vol. 12(1), pages 133-152.
    2. Paul Donner, 2021. "Validation of the Astro dataset clustering solutions with external data," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(2), pages 1619-1645, February.
    3. Carlos Olmeda-Gómez & Carlos Romá-Mateo & Maria-Antonia Ovalle-Perandones, 2019. "Overview of trends in global epigenetic research (2009–2017)," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(3), pages 1545-1574, June.
    4. Lin Zhang & Beibei Sun & Fei Shu & Ying Huang, 2022. "Comparing paper level classifications across different methods and systems: an investigation of Nature publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(12), pages 7633-7651, December.
    5. Lu Huang & Yijie Cai & Erdong Zhao & Shengting Zhang & Yue Shu & Jiao Fan, 2022. "Measuring the interdisciplinarity of Information and Library Science interactions using citation analysis and semantic analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(11), pages 6733-6761, November.
    6. Samira Ranaei & Arho Suominen & Alan Porter & Stephen Carley, 2020. "Evaluating technological emergence using text analytics: two case technologies and three approaches," Scientometrics, Springer;Akadémiai Kiadó, vol. 122(1), pages 215-247, January.
    7. Rob Koopman & Shenghui Wang & Andrea Scharnhorst, 2017. "Contextualization of topics: browsing through the universe of bibliographic information," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1119-1139, May.
    8. Qianqian Xie & Ludo Waltman, 2025. "A comparison of citation-based clustering and topic modeling for science mapping," Scientometrics, Springer;Akadémiai Kiadó, vol. 130(5), pages 2497-2522, May.
    9. Leah G. Nichols, 2014. "A topic model approach to measuring interdisciplinarity at the National Science Foundation," Scientometrics, Springer;Akadémiai Kiadó, vol. 100(3), pages 741-754, September.
    10. Andrea Bonaccorsi & Nicola Melluso & Francesco Alessandro Massucci, 2022. "Exploring the antecedents of interdisciplinarity at the European Research Council: a topic modeling approach," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(12), pages 6961-6991, December.
    11. Michael Rennings & Philipp Baaden & Carolin Block & Marcus John & Stefanie Bröring, 2024. "Assessing emerging sustainability-oriented technologies: the case of precision agriculture," Scientometrics, Springer;Akadémiai Kiadó, vol. 129(6), pages 2969-2998, June.
    12. Yun, Jinhyuk, 2022. "Generalization of bibliographic coupling and co-citation using the node split network," Journal of Informetrics, Elsevier, vol. 16(2).
    13. Yeow Chong Goh & Xin Qing Cai & Walter Theseira & Giovanni Ko & Khiam Aik Khor, 2020. "Evaluating human versus machine learning performance in classifying research abstracts," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(2), pages 1197-1212, November.
    14. Giulio Giacomo Cantone, 2024. "How to measure interdisciplinary research? A systemic design for the model of measurement," Scientometrics, Springer;Akadémiai Kiadó, vol. 129(8), pages 4937-4982, August.
    15. Peter Sjögårde & Fereshteh Didegah, 2022. "The association between topic growth and citation impact of research publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(4), pages 1903-1921, April.
    16. Karine Bastos Leal & Luís Eduardo de Souza Robaina & André de Souza De Lima, 2022. "Coastal impacts of storm surges on a changing climate: a global bibliometric analysis," Natural Hazards: Journal of the International Society for the Prevention and Mitigation of Natural Hazards, Springer;International Society for the Prevention and Mitigation of Natural Hazards, vol. 114(2), pages 1455-1476, November.
    17. Yang, Siluo & Han, Ruizhen & Wolfram, Dietmar & Zhao, Yuehua, 2016. "Visualizing the intellectual structure of information science (2006–2015): Introducing author keyword coupling analysis," Journal of Informetrics, Elsevier, vol. 10(1), pages 132-150.
    18. Yuehua Zhao & Jin Zhang & Min Wu, 2019. "Finding Users’ Voice on Social Media: An Investigation of Online Support Groups for Autism-Affected Users on Facebook," IJERPH, MDPI, vol. 16(23), pages 1-13, November.
    19. Ying Huang & Wolfgang Glänzel & Lin Zhang, 2021. "Tracing the development of mapping knowledge domains," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(7), pages 6201-6224, July.
    20. Nees Jan Eck & Ludo Waltman, 2017. "Citation-based clustering of publications using CitNetExplorer and VOSviewer," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1053-1070, May.
