Printed from https://ideas.repec.org/a/eee/infome/v16y2022i1s175115772100095x.html

Robustness, replicability and scalability in topic modelling

Author

Listed:
  • Ballester, Omar
  • Penner, Orion

Abstract

Approaches for estimating the similarity between individual publications are an area of long-standing interest in the scientometrics and informetrics communities. Traditional techniques have generally relied on references and other metadata, while text mining approaches based on title and abstract text have appeared more frequently in recent years. In principle, topic models have great potential in this domain. But, in practice, they are often difficult to employ successfully, and are notoriously inconsistent as the latent space dimension grows. In this manuscript we identify three properties all usable topic models should have: robustness, descriptive power and reflection of reality. We develop a novel method for evaluating the robustness of topic models and suggest a metric to assess and benchmark descriptive power as the number of topics scales. Employing that procedure, we find that the neural-network-based paragraph embedding approach seems capable of providing statistically robust estimates of document–document similarities, even for topic spaces far larger than what is usually considered prudent for the most common topic model approaches.
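The robustness idea sketched in the abstract, comparing document–document similarity estimates across independently trained model runs, can be illustrated with a minimal toy example. This is not the authors' actual procedure; it assumes each run has already produced a dense embedding vector per document, and it uses cosine similarity and a Pearson correlation between the two runs' pairwise-similarity lists as a stand-in robustness score.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_similarities(vectors):
    """Upper-triangle document-document similarities for one model run."""
    n = len(vectors)
    return [cosine(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]

def robustness(run_a, run_b):
    """Pearson correlation between the pairwise-similarity lists of two
    independently trained runs; values near 1 suggest the runs agree on
    which documents are similar to which."""
    sa, sb = pairwise_similarities(run_a), pairwise_similarities(run_b)
    mean_a, mean_b = sum(sa) / len(sa), sum(sb) / len(sb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(sa, sb))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in sa))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in sb))
    return cov / (sd_a * sd_b)

# Toy data: three documents embedded by two hypothetical runs whose
# geometry agrees (documents 0 and 1 close, document 2 apart in both).
run_a = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
run_b = [[0.8, 0.0], [0.9, 0.1], [0.1, 0.9]]
print(round(robustness(run_a, run_b), 3))
```

In practice the embeddings would come from a paragraph-vector model trained on title and abstract text, and the comparison would cover many runs and many latent-space sizes; the correlation-of-similarities idea carries over unchanged.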

Suggested Citation

  • Ballester, Omar & Penner, Orion, 2022. "Robustness, replicability and scalability in topic modelling," Journal of Informetrics, Elsevier, vol. 16(1).
  • Handle: RePEc:eee:infome:v:16:y:2022:i:1:s175115772100095x
    DOI: 10.1016/j.joi.2021.101224

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S175115772100095X
    Download Restriction: Full text for ScienceDirect subscribers only

    File URL: https://libkey.io/10.1016/j.joi.2021.101224?utm_source=ideas
    LibKey link: if access is restricted and your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    As the access to this document is restricted, you may want to search for a different version of it.

    References listed on IDEAS

    1. Daniel D. Lee & H. Sebastian Seung, 1999. "Learning the parts of objects by non-negative matrix factorization," Nature, Nature, vol. 401(6755), pages 788-791, October.
    2. Robert R. Braam & Henk F. Moed & Anthony F. J. van Raan, 1991. "Mapping of science by combined co‐citation and word analysis. I. Structural aspects," Journal of the American Society for Information Science, Association for Information Science & Technology, vol. 42(4), pages 233-251, May.
    3. David Lenz & Peter Winker, 2020. "Measuring the diffusion of innovations with paragraph vector topic models," PLOS ONE, Public Library of Science, vol. 15(1), pages 1-18, January.
    4. Richard Klavans & Kevin W. Boyack, 2017. "Which Type of Citation Analysis Generates the Most Accurate Taxonomy of Scientific and Technical Knowledge?," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 68(4), pages 984-998, April.
    5. Yan, Erjia & Ding, Ying & Milojević, Staša & Sugimoto, Cassidy R., 2012. "Topics in dynamic research communities: An exploratory study for the field of information retrieval," Journal of Informetrics, Elsevier, vol. 6(1), pages 140-153.
    6. Jochen Gläser & Wolfgang Glänzel & Andrea Scharnhorst, 2017. "Same data—different results? Towards a comparative approach to the identification of thematic structures in science," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 981-998, May.
    7. Ayoubi, Charles & Barbosu, Sandra & Pezzoni, Michele & Visentin, Fabiana, 2020. "What matters in funding: The value of research coherence and alignment in evaluators' decisions," MERIT Working Papers 2020-010, United Nations University - Maastricht Economic and Social Research Institute on Innovation and Technology (MERIT).
    8. Theresa Velden & Kevin W. Boyack & Jochen Gläser & Rob Koopman & Andrea Scharnhorst & Shenghui Wang, 2017. "Comparison of topic extraction approaches and their results," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1169-1221, May.
    9. Kun Lu & Dietmar Wolfram, 2012. "Measuring author research relatedness: A comparison of word-based, topic-based, and author cocitation approaches," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 63(10), pages 1973-1986, October.
    10. Robert R. Braam & Henk F. Moed & Anthony F. J. van Raan, 1991. "Mapping of science by combined co‐citation and word analysis. II: Dynamical aspects," Journal of the American Society for Information Science, Association for Information Science & Technology, vol. 42(4), pages 252-266, May.
    11. Wagner, Caroline S. & Roessner, J. David & Bobb, Kamau & Klein, Julie Thompson & Boyack, Kevin W. & Keyton, Joann & Rafols, Ismael & Börner, Katy, 2011. "Approaches to understanding and measuring interdisciplinary scientific research (IDR): A review of the literature," Journal of Informetrics, Elsevier, vol. 5(1), pages 14-26.
    12. Kevin W Boyack & David Newman & Russell J Duhon & Richard Klavans & Michael Patek & Joseph R Biberstine & Bob Schijvenaars & André Skupin & Nianli Ma & Katy Börner, 2011. "Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches," PLOS ONE, Public Library of Science, vol. 6(3), pages 1-11, March.
    Full references (including those not matched with items on IDEAS)

    Citations

    Citations are extracted by the CitEc Project, subscribe to its RSS feed for this item.


    Cited by:

    1. Kolomoyets, Yuliya & Dickinger, Astrid, 2023. "Understanding value perceptions and propositions: A machine learning approach," Journal of Business Research, Elsevier, vol. 154(C).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Paul Donner, 2021. "Validation of the Astro dataset clustering solutions with external data," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(2), pages 1619-1645, February.
    2. Sjögårde, Peter & Ahlgren, Per, 2018. "Granularity of algorithmically constructed publication-level classifications of research publications: Identification of topics," Journal of Informetrics, Elsevier, vol. 12(1), pages 133-152.
    3. Carlos Olmeda-Gómez & Carlos Romá-Mateo & Maria-Antonia Ovalle-Perandones, 2019. "Overview of trends in global epigenetic research (2009–2017)," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(3), pages 1545-1574, June.
    4. Lin Zhang & Beibei Sun & Fei Shu & Ying Huang, 2022. "Comparing paper level classifications across different methods and systems: an investigation of Nature publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(12), pages 7633-7651, December.
    5. Lu Huang & Yijie Cai & Erdong Zhao & Shengting Zhang & Yue Shu & Jiao Fan, 2022. "Measuring the interdisciplinarity of Information and Library Science interactions using citation analysis and semantic analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(11), pages 6733-6761, November.
    6. Rob Koopman & Shenghui Wang & Andrea Scharnhorst, 2017. "Contextualization of topics: browsing through the universe of bibliographic information," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1119-1139, May.
    7. Andrea Bonaccorsi & Nicola Melluso & Francesco Alessandro Massucci, 2022. "Exploring the antecedents of interdisciplinarity at the European Research Council: a topic modeling approach," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(12), pages 6961-6991, December.
    8. Samira Ranaei & Arho Suominen & Alan Porter & Stephen Carley, 2020. "Evaluating technological emergence using text analytics: two case technologies and three approaches," Scientometrics, Springer;Akadémiai Kiadó, vol. 122(1), pages 215-247, January.
    9. Leah G. Nichols, 2014. "A topic model approach to measuring interdisciplinarity at the National Science Foundation," Scientometrics, Springer;Akadémiai Kiadó, vol. 100(3), pages 741-754, September.
    10. Peter Sjögårde & Fereshteh Didegah, 2022. "The association between topic growth and citation impact of research publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(4), pages 1903-1921, April.
    11. Yang, Siluo & Han, Ruizhen & Wolfram, Dietmar & Zhao, Yuehua, 2016. "Visualizing the intellectual structure of information science (2006–2015): Introducing author keyword coupling analysis," Journal of Informetrics, Elsevier, vol. 10(1), pages 132-150.
    12. Ying Huang & Wolfgang Glänzel & Lin Zhang, 2021. "Tracing the development of mapping knowledge domains," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(7), pages 6201-6224, July.
    13. Bettencourt, Luís M.A. & Kaiser, David I. & Kaur, Jasleen, 2009. "Scientific discovery and topological transitions in collaboration networks," Journal of Informetrics, Elsevier, vol. 3(3), pages 210-221.
    14. Shenghui Wang & Rob Koopman, 2017. "Clustering articles based on semantic similarity," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1017-1031, May.
    15. Yun, Jinhyuk & Ahn, Sejung & Lee, June Young, 2020. "Return to basics: Clustering of scientific literature using structural information," Journal of Informetrics, Elsevier, vol. 14(4).
    16. Matthias Held & Grit Laudel & Jochen Gläser, 2021. "Challenges to the validity of topic reconstruction," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(5), pages 4511-4536, May.
    17. Maruša Premru & Matej Černe & Saša Batistič, 2022. "The Road to the Future: A Multi-Technique Bibliometric Review and Development Projections of the Leader–Member Exchange (LMX) Research," SAGE Open, vol. 12(2), pages 21582440221, May.
    18. Xie, Qing & Zhang, Xinyuan & Song, Min, 2021. "A network embedding-based scholar assessment indicator considering four facets: Research topic, author credit allocation, field-normalized journal impact, and published time," Journal of Informetrics, Elsevier, vol. 15(4).
    19. Juste Raimbault, 2019. "Exploration of an interdisciplinary scientific landscape," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(2), pages 617-641, May.
    20. Alfonso Ávila-Robinson & Cristian Mejia & Shintaro Sengoku, 2021. "Are bibliometric measures consistent with scientists’ perceptions? The case of interdisciplinarity in research," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(9), pages 7477-7502, September.

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:eee:infome:v:16:y:2022:i:1:s175115772100095x. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows you to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form.

    If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Catherine Liu (email available below). General contact details of provider: http://www.elsevier.com/locate/joi.

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.