IDEAS home Printed from https://ideas.repec.org/a/spr/scient/v111y2017i2d10.1007_s11192-017-2298-x.html
   My bibliography  Save this article

Clustering articles based on semantic similarity

Author

Listed:
  • Shenghui Wang

    (OCLC Research)

  • Rob Koopman

    (OCLC Research)

Abstract

Document clustering is generally the first step for topic identification. Since many clustering methods operate on the similarities between documents, it is important to build representations of these documents which keep their semantics as much as possible and are also suitable for efficient similarity calculation. As we describe in Koopman et al. (Proceedings of ISSI 2015 Istanbul: 15th International Society of Scientometrics and Informetrics Conference, Istanbul, Turkey, 29 June to 3 July, 2015. Bogaziçi University Printhouse. http://www.issi2015.org/files/downloads/all-papers/1042.pdf , 2015), the metadata of articles in the Astro dataset contribute to a semantic matrix, which uses a vector space to capture the semantics of entities derived from these articles and consequently supports the contextual exploration of these entities in LittleAriadne. However, this semantic matrix does not allow to calculate similarities between articles directly. In this paper, we will describe in detail how we build a semantic representation for an article from the entities that are associated with it. Base on such semantic representations of articles, we apply two standard clustering methods, K-Means and the Louvain community detection algorithm, which leads to our two clustering solutions labelled as OCLC-31 (standing for K-Means) and OCLC-Louvain (standing for Louvain). In this paper, we will give the implementation details and a basic comparison with other clustering solutions that are reported in this special issue.

Suggested Citation

  • Shenghui Wang & Rob Koopman, 2017. "Clustering articles based on semantic similarity," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1017-1031, May.
  • Handle: RePEc:spr:scient:v:111:y:2017:i:2:d:10.1007_s11192-017-2298-x
    DOI: 10.1007/s11192-017-2298-x
    as

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-017-2298-x
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s11192-017-2298-x?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item
    ---><---

    As the access to this document is restricted, you may want to search for a different version of it.

    References listed on IDEAS

    as
    1. Zhang, Lin & Liu, Xinhai & Janssens, Frizo & Liang, Liming & Glänzel, Wolfgang, 2010. "Subject clustering analysis based on ISI category classification," Journal of Informetrics, Elsevier, vol. 4(2), pages 185-193.
    2. Wolfgang Glänzel & Bart Thijs, 2017. "Using hybrid methods and ‘core documents’ for the representation of clusters and topics: the astronomy dataset," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1071-1087, May.
    3. Rob Koopman & Shenghui Wang & Andrea Scharnhorst, 2017. "Contextualization of topics: browsing through the universe of bibliographic information," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1119-1139, May.
    4. Jochen Gläser & Wolfgang Glänzel & Andrea Scharnhorst, 2017. "Same data—different results? Towards a comparative approach to the identification of thematic structures in science," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 981-998, May.
    5. Theresa Velden & Kevin W. Boyack & Jochen Gläser & Rob Koopman & Andrea Scharnhorst & Shenghui Wang, 2017. "Comparison of topic extraction approaches and their results," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1169-1221, May.
    6. Kevin W. Boyack & Henry Small & Richard Klavans, 2013. "Improving the accuracy of co-citation clustering using full text," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 64(9), pages 1759-1767, September.
    7. Kevin W. Boyack & Richard Klavans & Katy Börner, 2005. "Mapping the backbone of science," Scientometrics, Springer;Akadémiai Kiadó, vol. 64(3), pages 351-374, August.
    8. Henry Small, 1973. "Co‐citation in the scientific literature: A new measure of the relationship between two documents," Journal of the American Society for Information Science, Association for Information Science & Technology, vol. 24(4), pages 265-269, July.
    Full references (including those not matched with items on IDEAS)

    Citations

    Citations are extracted by the CitEc Project, subscribe to its RSS feed for this item.
    as


    Cited by:

    1. Rob Koopman & Shenghui Wang & Andrea Scharnhorst, 2017. "Contextualization of topics: browsing through the universe of bibliographic information," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1119-1139, May.
    2. Jochen Gläser & Wolfgang Glänzel & Andrea Scharnhorst, 2017. "Same data—different results? Towards a comparative approach to the identification of thematic structures in science," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 981-998, May.
    3. Kamal Sanguri & Atanu Bhuyan & Sabyasachi Patra, 2020. "A semantic similarity adjusted document co-citation analysis: a case of tourism supply chain," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(1), pages 233-269, October.
    4. Takahiro Kawamura & Katsutaro Watanabe & Naoya Matsumoto & Shusaku Egami & Mari Jibu, 2018. "Funding map using paragraph embedding based on semantic diversity," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 941-958, August.
    5. Yuan Zhou & Heng Lin & Yufei Liu & Wei Ding, 2019. "A novel method to identify emerging technologies using a semi-supervised topic clustering model: a case of 3D printing industry," Scientometrics, Springer;Akadémiai Kiadó, vol. 120(1), pages 167-185, July.
    6. Rob Koopman & Shenghui Wang, 2017. "Mutual information based labelling and comparing clusters," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1157-1167, May.
    7. Bianca Reichert & Adriano Mendon a Souza, 2022. "Can the Heston Model Forecast Energy Generation? A Systematic Literature Review," International Journal of Energy Economics and Policy, Econjournals, vol. 12(1), pages 289-295.
    8. Mohammed Azmi Al-Betar & Ammar Kamal Abasi & Ghazi Al-Naymat & Kamran Arshad & Sharif Naser Makhadmeh, 2023. "Optimization of scientific publications clustering with ensemble approach for topic extraction," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(5), pages 2819-2877, May.
    9. Qian, Yue & Liu, Yu & Sheng, Quan Z., 2020. "Understanding hierarchical structural evolution in a scientific discipline: A case study of artificial intelligence," Journal of Informetrics, Elsevier, vol. 14(3).
    10. Theresa Velden & Kevin W. Boyack & Jochen Gläser & Rob Koopman & Andrea Scharnhorst & Shenghui Wang, 2017. "Comparison of topic extraction approaches and their results," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1169-1221, May.
    11. Christian Weismayer & Ilona Pezenka, 2017. "Identifying emerging research fields: a longitudinal latent semantic keyword analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 113(3), pages 1757-1785, December.
    12. Hric, Darko & Kaski, Kimmo & Kivelä, Mikko, 2018. "Stochastic block model reveals maps of citation patterns and their evolution in time," Journal of Informetrics, Elsevier, vol. 12(3), pages 757-783.
    13. Bart Thijs, 2020. "Using neural-network based paragraph embeddings for the calculation of within and between document similarities," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(2), pages 835-849, November.
    14. Paul Donner, 2021. "Validation of the Astro dataset clustering solutions with external data," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(2), pages 1619-1645, February.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Sjögårde, Peter & Ahlgren, Per, 2018. "Granularity of algorithmically constructed publication-level classifications of research publications: Identification of topics," Journal of Informetrics, Elsevier, vol. 12(1), pages 133-152.
    2. Hric, Darko & Kaski, Kimmo & Kivelä, Mikko, 2018. "Stochastic block model reveals maps of citation patterns and their evolution in time," Journal of Informetrics, Elsevier, vol. 12(3), pages 757-783.
    3. Jochen Gläser & Wolfgang Glänzel & Andrea Scharnhorst, 2017. "Same data—different results? Towards a comparative approach to the identification of thematic structures in science," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 981-998, May.
    4. Takahiro Kawamura & Katsutaro Watanabe & Naoya Matsumoto & Shusaku Egami & Mari Jibu, 2018. "Funding map using paragraph embedding based on semantic diversity," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 941-958, August.
    5. Paul Donner, 2021. "Validation of the Astro dataset clustering solutions with external data," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(2), pages 1619-1645, February.
    6. Ying Huang & Wolfgang Glänzel & Lin Zhang, 2021. "Tracing the development of mapping knowledge domains," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(7), pages 6201-6224, July.
    7. Theresa Velden & Kevin W. Boyack & Jochen Gläser & Rob Koopman & Andrea Scharnhorst & Shenghui Wang, 2017. "Comparison of topic extraction approaches and their results," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1169-1221, May.
    8. Yun, Jinhyuk, 2022. "Generalization of bibliographic coupling and co-citation using the node split network," Journal of Informetrics, Elsevier, vol. 16(2).
    9. Christian Weismayer & Ilona Pezenka, 2017. "Identifying emerging research fields: a longitudinal latent semantic keyword analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 113(3), pages 1757-1785, December.
    10. Carlos Olmeda-Gómez & Carlos Romá-Mateo & Maria-Antonia Ovalle-Perandones, 2019. "Overview of trends in global epigenetic research (2009–2017)," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(3), pages 1545-1574, June.
    11. Shuo Xu & Junwan Liu & Dongsheng Zhai & Xin An & Zheng Wang & Hongshen Pang, 2018. "Overlapping thematic structures extraction with mixed-membership stochastic blockmodel," Scientometrics, Springer;Akadémiai Kiadó, vol. 117(1), pages 61-84, October.
    12. Andreas Bjurström & Merritt Polk, 2011. "Climate change and interdisciplinarity: a co-citation analysis of IPCC Third Assessment Report," Scientometrics, Springer;Akadémiai Kiadó, vol. 87(3), pages 525-550, June.
    13. Rey-Long Liu, 2017. "A new bibliographic coupling measure with descriptive capability," Scientometrics, Springer;Akadémiai Kiadó, vol. 110(2), pages 915-935, February.
    14. Jianhua Hou, 2017. "Exploration into the evolution and historical roots of citation analysis by referenced publication year spectroscopy," Scientometrics, Springer;Akadémiai Kiadó, vol. 110(3), pages 1437-1452, March.
    15. Ludo Waltman & Nees Jan Eck, 2012. "A new methodology for constructing a publication-level classification system of science," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 63(12), pages 2378-2392, December.
    16. William B. Gartner & Per Davidsson & Shaker A. Zahra, 2006. "Are you Talking to Me? The Nature of Community in Entrepreneurship Scholarship," Entrepreneurship Theory and Practice, , vol. 30(3), pages 321-331, May.
    17. Lin Zhang & Beibei Sun & Fei Shu & Ying Huang, 2022. "Comparing paper level classifications across different methods and systems: an investigation of Nature publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(12), pages 7633-7651, December.
    18. Chen, Xiaoyan & Liu, Yisheng, 2020. "Visualization analysis of high-speed railway research based on CiteSpace," Transport Policy, Elsevier, vol. 85(C), pages 1-17.
    19. Kim, Ha Jin & Jeong, Yoo Kyung & Song, Min, 2016. "Content- and proximity-based author co-citation analysis using citation sentences," Journal of Informetrics, Elsevier, vol. 10(4), pages 954-966.
    20. Nieminen, Paavo & Pölönen, Ilkka & Sipola, Tuomo, 2013. "Research literature clustering using diffusion maps," Journal of Informetrics, Elsevier, vol. 7(4), pages 874-886.

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:spr:scient:v:111:y:2017:i:2:d:10.1007_s11192-017-2298-x. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Sonal Shukla or Springer Nature Abstracting and Indexing (email available below). General contact details of provider: http://www.springer.com .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.