Validation of scientific topic models using graph analysis and corpus metadata

Authors

  • Manuel A. Vázquez (Universidad Carlos III de Madrid)
  • Jorge Pereira-Delgado (Universidad Carlos III de Madrid)
  • Jesús Cid-Sueiro (Universidad Carlos III de Madrid)
  • Jerónimo Arenas-García (Universidad Carlos III de Madrid)

Abstract

Probabilistic topic modeling algorithms like Latent Dirichlet Allocation (LDA) have become powerful tools for the analysis of large collections of documents (such as papers, projects, or funding applications) in science, technology and innovation (STI) policy design and monitoring. However, selecting an appropriate and stable topic model for a specific application (by adjusting the hyperparameters of the algorithm) is not a trivial problem. Common validation metrics like coherence or perplexity, which focus on the quality of topics, are not a good fit for applications where the quality of the document similarity relations inferred from the topic model is especially relevant. Relying on graph analysis techniques, we present a new methodology for hyperparameter selection that is specifically oriented toward optimizing the similarity metrics derived from the topic model. To this end, we propose two graph metrics: the first measures the variability of the similarity graphs that result from different runs of the algorithm for a fixed value of the hyperparameters, while the second measures the alignment between the graph derived from the LDA model and another graph obtained from metadata available for the corresponding corpus. Through experiments on various STI-related corpora, we show that the proposed metrics provide relevant indicators for selecting the number of topics and building persistent topic models that are consistent with the metadata. Their use, which can be extended to topic models other than LDA, could facilitate the systematic adoption of these techniques in STI policy analysis and design.
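
The two kinds of graph metric described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions used in the paper, so everything below is an assumption made for illustration: the similarity measure (a Bhattacharyya coefficient between document-topic vectors), the variability score (mean per-edge standard deviation across runs), the alignment score (Pearson correlation of edge weights), and all function names and toy data are hypothetical.

```python
import numpy as np


def similarity_graph(theta):
    """Dense document-similarity matrix from a (docs x topics) matrix.

    Assumption: similarity is the Bhattacharyya coefficient,
    sum_k sqrt(theta[i, k] * theta[j, k]); the paper may use another measure.
    """
    root = np.sqrt(theta)
    return root @ root.T


def upper_edges(sim):
    """Edge weights of an undirected graph: strict upper triangle, flattened."""
    i, j = np.triu_indices(sim.shape[0], k=1)
    return sim[i, j]


def run_variability(thetas):
    """Hypothetical variability score: mean per-edge std across runs."""
    edges = np.stack([upper_edges(similarity_graph(t)) for t in thetas])
    return float(edges.std(axis=0).mean())


def metadata_alignment(theta, meta_sim):
    """Hypothetical alignment score: Pearson correlation of edge weights
    between the topic-model graph and a metadata-derived graph."""
    model_edges = upper_edges(similarity_graph(theta))
    meta_edges = upper_edges(meta_sim)
    return float(np.corrcoef(model_edges, meta_edges)[0, 1])


# Toy corpus: six documents, two metadata categories (invented data).
rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 1, 1, 1])
meta_sim = (labels[:, None] == labels[None, :]).astype(float)


def toy_run(noise):
    """Simulate one run: a two-topic document-topic matrix plus run noise."""
    base = np.where(labels[:, None] == np.arange(2)[None, :], 0.9, 0.1)
    noisy = base + noise * rng.random(base.shape)
    return noisy / noisy.sum(axis=1, keepdims=True)


stable_runs = [toy_run(0.01) for _ in range(5)]
unstable_runs = [toy_run(0.5) for _ in range(5)]

print("variability (stable):  ", run_variability(stable_runs))
print("variability (unstable):", run_variability(unstable_runs))
print("alignment with metadata:", metadata_alignment(stable_runs[0], meta_sim))
```

In a selection loop of this kind, one would compute both scores for each candidate number of topics and prefer settings with low variability and high alignment; how the two are weighed against each other is left open here.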

Suggested Citation

  • Manuel A. Vázquez & Jorge Pereira-Delgado & Jesús Cid-Sueiro & Jerónimo Arenas-García, 2022. "Validation of scientific topic models using graph analysis and corpus metadata," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(9), pages 5441-5458, September.
  • Handle: RePEc:spr:scient:v:127:y:2022:i:9:d:10.1007_s11192-022-04318-5
    DOI: 10.1007/s11192-022-04318-5

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Tingcan Ma & Ruinan Li & Guiyan Ou & Mingliang Yue, 2018. "Topic based research competitiveness evaluation," Scientometrics, Springer;Akadémiai Kiadó, vol. 117(2), pages 789-803, November.
    2. Gao, Qiang & Liang, Zhentao & Wang, Ping & Hou, Jingrui & Chen, Xiuxiu & Liu, Manman, 2021. "Potential index: Revealing the future impact of research topics based on current knowledge networks," Journal of Informetrics, Elsevier, vol. 15(3).
    3. Peter Sjögårde & Fereshteh Didegah, 2022. "The association between topic growth and citation impact of research publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(4), pages 1903-1921, April.
    4. Zhentao Liang & Jin Mao & Kun Lu & Gang Li, 2021. "Finding citations for PubMed: a large-scale comparison between five freely available bibliographic data sources," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(12), pages 9519-9542, December.
    5. Lu, Kun & Yang, Guancan & Wang, Xue, 2022. "Topics emerged in the biomedical field and their characteristics," Technological Forecasting and Social Change, Elsevier, vol. 174(C).
    6. Xu, Shuo & Hao, Liyuan & Yang, Guancan & Lu, Kun & An, Xin, 2021. "A topic models based framework for detecting and forecasting emerging technologies," Technological Forecasting and Social Change, Elsevier, vol. 162(C).
    7. Amber Geurts & Ralph Gutknecht & Philine Warnke & Arjen Goetheer & Elna Schirrmeister & Babette Bakker & Svetlana Meissner, 2022. "New perspectives for data‐supported foresight: The hybrid AI‐expert approach," Futures & Foresight Science, John Wiley & Sons, vol. 4(1), March.
    8. Zhang, Yi & Wu, Mengjia & Miao, Wen & Huang, Lu & Lu, Jie, 2021. "Bi-layer network analytics: A methodology for characterizing emerging general-purpose technologies," Journal of Informetrics, Elsevier, vol. 15(4).
    9. Suominen, Arho & Peng, Haoshu & Ranaei, Samira, 2019. "Examining the dynamics of an emerging research network using the case of triboelectric nanogenerators," Technological Forecasting and Social Change, Elsevier, vol. 146(C), pages 820-830.
    10. Puccetti, Giovanni & Giordano, Vito & Spada, Irene & Chiarello, Filippo & Fantoni, Gualtiero, 2023. "Technology identification from patent texts: A novel named entity recognition method," Technological Forecasting and Social Change, Elsevier, vol. 186(PB).
    11. Li, Munan & Porter, Alan L. & Suominen, Arho & Burmaoglu, Serhat & Carley, Stephen, 2021. "An exploratory perspective to measure the emergence degree for a specific technology based on the philosophy of swarm intelligence," Technological Forecasting and Social Change, Elsevier, vol. 166(C).
    12. Yi Zhang & Xiaojing Cai & Caroline V. Fry & Mengjia Wu & Caroline S. Wagner, 2021. "Topic evolution, disruption and resilience in early COVID-19 research," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(5), pages 4225-4253, May.
    13. Xu, Shuo & Hao, Liyuan & An, Xin & Yang, Guancan & Wang, Feifei, 2019. "Emerging research topics detection with multiple machine learning models," Journal of Informetrics, Elsevier, vol. 13(4).
    14. Woo, Seokkyun & Youtie, Jan & Ott, Ingrid & Scheu, Fenja, 2021. "Understanding the long-term emergence of autonomous vehicles technologies," Technological Forecasting and Social Change, Elsevier, vol. 170(C).
    15. Samira Ranaei & Arho Suominen & Alan Porter & Stephen Carley, 2020. "Evaluating technological emergence using text analytics: two case technologies and three approaches," Scientometrics, Springer;Akadémiai Kiadó, vol. 122(1), pages 215-247, January.
    16. Pertti Vakkari & Yu-Wei Chang & Kalervo Järvelin, 2022. "Largest contribution to LIS by external disciplines as measured by the characteristics of research articles," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(8), pages 4499-4522, August.
    17. Moehrle, Martin G. & Caferoglu, Hüseyin, 2019. "Technological speciation as a source for emerging technologies. Using semantic patent analysis for the case of camera technology," Technological Forecasting and Social Change, Elsevier, vol. 146(C), pages 776-784.
    18. Peter Sjögårde & Per Ahlgren & Ludo Waltman, 2021. "Algorithmic labeling in hierarchical classifications of publications: Evaluation of bibliographic fields and term weighting approaches," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 72(7), pages 853-869, July.
    19. Natalya Ivanova & Ekaterina Zolotova, 2023. "Landolt Indicator Values in Modern Research: A Review," Sustainability, MDPI, vol. 15(12), pages 1-22, June.
    20. Zhang, Yi & Huang, Ying & Porter, Alan L. & Zhang, Guangquan & Lu, Jie, 2019. "Discovering and forecasting interactions in big data research: A learning-enhanced bibliometric study," Technological Forecasting and Social Change, Elsevier, vol. 146(C), pages 795-807.
