Printed from https://ideas.repec.org/a/eee/infome/v3y2009i1p49-63.html

Document–document similarity approaches and science mapping: Experimental comparison of five approaches

Author

Listed:
  • Ahlgren, Per
  • Colliander, Cristian

Abstract

This paper treats document–document similarity approaches in the context of science mapping. Five approaches, involving nine methods, are compared experimentally. We compare text-based approaches, the citation-based bibliographic coupling approach, and approaches that combine text-based approaches and bibliographic coupling. Forty-three articles, published in the journal Information Retrieval, are used as test documents. We investigate how well the approaches agree with a ground truth subject classification of the test documents, when the complete linkage method is used, and under two types of similarities, first-order and second-order. The results show that it is possible to achieve a very good approximation of the classification by means of automatic grouping of articles. One text-only method and one combination method, under second-order similarities in both cases, give rise to cluster solutions that to a large extent agree with the classification.
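
As a rough illustration of the pipeline the abstract outlines (a document–document similarity matrix, second-order similarities derived from it, complete linkage clustering, and comparison with a reference classification), a minimal Python sketch follows. Everything in it is an assumption made for illustration: the toy reference lists, the choice of cosine as the similarity measure, the function names, and the use of the adjusted Rand index (Hubert & Arabie, 1985, listed among the references) as the agreement measure. The abstract does not state which measures or parameters the paper itself uses.

```python
from itertools import combinations
from math import comb, sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def first_order(ref_lists):
    """Bibliographic coupling: cosine over binary cited-reference vectors."""
    vocab = sorted(set().union(*ref_lists))
    vecs = [[1 if r in refs else 0 for r in vocab] for refs in ref_lists]
    return [[cosine(u, v) for v in vecs] for u in vecs]


def second_order(sim):
    """Second-order similarity: compare documents by their similarity profiles (rows)."""
    return [[cosine(r, s) for s in sim] for r in sim]


def complete_linkage(sim, k):
    """Agglomerative clustering down to k clusters.

    Cluster-to-cluster similarity is the similarity of the least similar
    pair of members; the two most similar clusters are merged at each step.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > k:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: min(sim[i][j] for i in clusters[p[0]] for j in clusters[p[1]]),
        )
        clusters[a] += clusters.pop(b)  # a < b, so index a is unaffected by the pop
    labels = [0] * len(sim)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


def adjusted_rand(x, y):
    """Adjusted Rand index between two labelings of the same items."""
    def pairs(counts):
        return sum(comb(c, 2) for c in counts)

    cont = {}
    for a, b in zip(x, y):
        cont[(a, b)] = cont.get((a, b), 0) + 1
    sum_ij = pairs(cont.values())
    sum_a = pairs([x.count(l) for l in set(x)])
    sum_b = pairs([y.count(l) for l in set(y)])
    expected = sum_a * sum_b / comb(len(x), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Hypothetical toy data: four documents, their cited references, and a
# made-up ground truth classification into two subject classes.
refs = [{"r1", "r2", "r3"}, {"r2", "r3", "r4"}, {"r7", "r8"}, {"r8", "r9"}]
truth = [0, 0, 1, 1]

sim1 = first_order(refs)
sim2 = second_order(sim1)
labels1 = complete_linkage(sim1, k=2)
labels2 = complete_linkage(sim2, k=2)
ari1 = adjusted_rand(labels1, truth)
ari2 = adjusted_rand(labels2, truth)
print(labels1, ari1)
print(labels2, ari2)
```

A text-based variant of the same sketch would replace the cited-reference sets with term vectors, and a combined approach would merge the two similarity matrices before clustering; how the paper weights or combines them is not given in the abstract.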

Suggested Citation

  • Ahlgren, Per & Colliander, Cristian, 2009. "Document–document similarity approaches and science mapping: Experimental comparison of five approaches," Journal of Informetrics, Elsevier, vol. 3(1), pages 49-63.
  • Handle: RePEc:eee:infome:v:3:y:2009:i:1:p:49-63
    DOI: 10.1016/j.joi.2008.11.003

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S1751157708000680
    Download Restriction: Full text for ScienceDirect subscribers only


    References listed on IDEAS

    1. M. M. Kessler, 1965. "Comparison of the results of bibliographic coupling and analytic subject indexing," American Documentation, Wiley Blackwell, vol. 16(3), pages 223-233, July.
    2. Per Ahlgren & Bo Jarneving, 2008. "Bibliographic coupling, common abstract stems and clustering: A comparison of two document-document similarity approaches in the context of science mapping," Scientometrics, Springer;Akadémiai Kiadó, vol. 76(2), pages 273-290, August.
    3. Per Ahlgren & Bo Jarneving & Ronald Rousseau, 2003. "Requirements for a cocitation similarity measure, with special reference to Pearson's correlation coefficient," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 54(6), pages 550-560, April.
    4. Bénédicte Vidaillet & V. d'Estaintot & P. Abécassis, 2005. "Introduction," Post-Print hal-00287137, HAL.
    5. M. M. Kessler, 1963. "Bibliographic coupling between scientific papers," American Documentation, Wiley Blackwell, vol. 14(1), pages 10-25, January.
    6. Lawrence Hubert & Phipps Arabie, 1985. "Comparing partitions," Journal of Classification, Springer;The Classification Society, vol. 2(1), pages 193-218, December.
    7. H. P. F. Peters & R. R. Braam & A. F. J. van Raan, 1995. "Cognitive resemblance and citation relations in chemical engineering publications," Journal of the American Society for Information Science, Association for Information Science & Technology, vol. 46(1), pages 9-21, January.
    8. Scott Deerwester & Susan T. Dumais & George W. Furnas & Thomas K. Landauer & Richard Harshman, 1990. "Indexing by latent semantic analysis," Journal of the American Society for Information Science, Association for Information Science & Technology, vol. 41(6), pages 391-407, September.

    Citations



    Cited by:

    1. Takahiro Kawamura & Katsutaro Watanabe & Naoya Matsumoto & Shusaku Egami & Mari Jibu, 2018. "Funding map using paragraph embedding based on semantic diversity," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 941-958, August.
    2. Hai-Yun Xu & Zeng-Hui Yue & Chao Wang & Kun Dong & Hong-Shen Pang & Zhengbiao Han, 2017. "Multi-source data fusion study in scientometrics," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 773-792, May.
    3. Bart Thijs & Edgar Schiebel & Wolfgang Glänzel, 2013. "Do second-order similarities provide added-value in a hybrid approach?," Scientometrics, Springer;Akadémiai Kiadó, vol. 96(3), pages 667-677, September.
    4. Shuo Xu & Junwan Liu & Dongsheng Zhai & Xin An & Zheng Wang & Hongshen Pang, 2018. "Overlapping thematic structures extraction with mixed-membership stochastic blockmodel," Scientometrics, Springer;Akadémiai Kiadó, vol. 117(1), pages 61-84, October.
    5. Yang, Hyeonchae & Jung, Woo-Sung, 2016. "Structural efficiency to manipulate public research institution networks," Technological Forecasting and Social Change, Elsevier, vol. 110(C), pages 21-32.
    6. Zhang, Yi & Shang, Lining & Huang, Lu & Porter, Alan L. & Zhang, Guangquan & Lu, Jie & Zhu, Donghua, 2016. "A hybrid similarity measure method for patent portfolio analysis," Journal of Informetrics, Elsevier, vol. 10(4), pages 1108-1130.
    7. Sjögårde, Peter & Ahlgren, Per, 2018. "Granularity of algorithmically constructed publication-level classifications of research publications: Identification of topics," Journal of Informetrics, Elsevier, vol. 12(1), pages 133-152.
    8. Wang, Qi & Sandström, Ulf, 2014. "Defining the Role of Cognitive Distance in the Peer Review Process: Explorative Study of a Grant Scheme in Infection Biology," INDEK Working Paper Series 2014/10, Royal Institute of Technology, Department of Industrial Economics and Management.
    9. Cristian Colliander & Per Ahlgren, 2012. "Experimental comparison of first and second-order similarities in a scientometric context," Scientometrics, Springer;Akadémiai Kiadó, vol. 90(2), pages 675-685, February.
    10. Chen, Lixin, 2017. "Do patent citations indicate knowledge linkage? The evidence from text similarities between patents and their citations," Journal of Informetrics, Elsevier, vol. 11(1), pages 63-79.
    11. Bart Thijs, 2020. "Using neural-network based paragraph embeddings for the calculation of within and between document similarities," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(2), pages 835-849, November.
    12. Debarshi Kumar Sanyal & Plaban Kumar Bhowmick & Partha Pratim Das & Samiran Chattopadhyay & T. Y. S. S. Santosh, 2019. "Enhancing access to scholarly publications with surrogate resources," Scientometrics, Springer;Akadémiai Kiadó, vol. 121(2), pages 1129-1164, November.
    13. Jing Zhang & Xiaomin Liu & Lili Wu, 2016. "The study of subject-classification based on journal coupling and expert subject-classification system," Scientometrics, Springer;Akadémiai Kiadó, vol. 107(3), pages 1149-1170, June.
    14. Bart Thijs & Lin Zhang & Wolfgang Glänzel, 2015. "Bibliographic coupling and hierarchical clustering for the validation and improvement of subject-classification schemes," Scientometrics, Springer;Akadémiai Kiadó, vol. 105(3), pages 1453-1467, December.
    15. Fabian Meyer-Brötz & Edgar Schiebel & Leo Brecht, 2017. "Experimental evaluation of parameter settings in calculation of hybrid similarities: effects of first- and second-order similarity, edge cutting, and weighting factors," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(3), pages 1307-1325, June.
    16. Ludo Waltman & Nees Jan van Eck, 2012. "A new methodology for constructing a publication-level classification system of science," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 63(12), pages 2378-2392, December.
    17. Wolfgang Glänzel & Bart Thijs, 2017. "Using hybrid methods and ‘core documents’ for the representation of clusters and topics: the astronomy dataset," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1071-1087, May.
    18. Yuen-Hsien Tseng & Ming-Yueh Tsay, 2013. "Journal clustering of library and information science for subfield delineation using the bibliometric analysis toolkit: CATAR," Scientometrics, Springer;Akadémiai Kiadó, vol. 95(2), pages 503-528, May.
    19. Dejian Yu & Wanru Wang & Shuai Zhang & Wenyu Zhang & Rongyu Liu, 2017. "Hybrid self-optimized clustering model based on citation links and textual features to detect research topics," PLOS ONE, Public Library of Science, vol. 12(10), pages 1-21, October.
    20. Gómez-Núñez, Antonio J. & Batagelj, Vladimir & Vargas-Quesada, Benjamín & Moya-Anegón, Félix & Chinchilla-Rodríguez, Zaida, 2014. "Optimizing SCImago Journal & Country Rank classification by community detection," Journal of Informetrics, Elsevier, vol. 8(2), pages 369-383.
    21. Michel Zitt, 2015. "Meso-level retrieval: IR-bibliometrics interplay and hybrid citation-words methods in scientific fields delineation," Scientometrics, Springer;Akadémiai Kiadó, vol. 102(3), pages 2223-2245, March.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Jarneving, Bo, 2007. "Complete graphs and bibliographic coupling: A test of the applicability of bibliographic coupling for the identification of cognitive cores on the field level," Journal of Informetrics, Elsevier, vol. 1(4), pages 338-356.
    2. Jarneving, Bo, 2007. "Bibliographic coupling and its application to research-front and other core documents," Journal of Informetrics, Elsevier, vol. 1(4), pages 287-307.
    3. Christian Sternitzke & Isumo Bergmann, 2009. "Similarity measures for document mapping: A comparative study on the level of an individual scientist," Scientometrics, Springer;Akadémiai Kiadó, vol. 78(1), pages 113-130, January.
    4. García-Lillo, Francisco & Seva-Larrosa, Pedro & Sánchez-García, Eduardo, 2023. "What is going on in entrepreneurship research? A bibliometric and SNA analysis," Journal of Business Research, Elsevier, vol. 158(C).
    5. Viergutz, Tim & Schulze-Ehlers, Birgit, 2018. "The use of hybrid scientometric clustering for systematic literature reviews in business and economics," DARE Discussion Papers 1804, Georg-August University of Göttingen, Department of Agricultural Economics and Rural Development (DARE).
    6. David N. Matzig & Clemens Schmid & Felix Riede, 2023. "Mapping the field of cultural evolutionary theory and methods in archaeology using bibliometric methods," Palgrave Communications, Palgrave Macmillan, vol. 10(1), pages 1-17, December.
    7. Ding, Ying, 2011. "Community detection: Topological vs. topical," Journal of Informetrics, Elsevier, vol. 5(4), pages 498-514.
    8. van Eck, N.J.P. & Waltman, L., 2009. "How to Normalize Co-Occurrence Data? An Analysis of Some Well-Known Similarity Measures," ERIM Report Series Research in Management ERS-2009-001-LIS, Erasmus Research Institute of Management (ERIM), ERIM is the joint research institute of the Rotterdam School of Management, Erasmus University and the Erasmus School of Economics (ESE) at Erasmus University Rotterdam.
    9. Yang, Siluo & Han, Ruizhen & Wolfram, Dietmar & Zhao, Yuehua, 2016. "Visualizing the intellectual structure of information science (2006–2015): Introducing author keyword coupling analysis," Journal of Informetrics, Elsevier, vol. 10(1), pages 132-150.
    10. Ying Huang & Wolfgang Glänzel & Lin Zhang, 2021. "Tracing the development of mapping knowledge domains," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(7), pages 6201-6224, July.
    11. Jun-Ping Qiu & Ke Dong & Hou-Qiang Yu, 2014. "Comparative study on structure and correlation among author co-occurrence networks in bibliometrics," Scientometrics, Springer;Akadémiai Kiadó, vol. 101(2), pages 1345-1360, November.
    12. Duong, Quang Huy & Zhou, Li & Meng, Meng & Nguyen, Truong Van & Ieromonachou, Petros & Nguyen, Duy Tiep, 2022. "Understanding product returns: A systematic literature review using machine learning and bibliometric analysis," International Journal of Production Economics, Elsevier, vol. 243(C).
    13. Michel Zitt, 2015. "Meso-level retrieval: IR-bibliometrics interplay and hybrid citation-words methods in scientific fields delineation," Scientometrics, Springer;Akadémiai Kiadó, vol. 102(3), pages 2223-2245, March.
    14. Perianes-Rodriguez, Antonio & Waltman, Ludo & van Eck, Nees Jan, 2016. "Constructing bibliometric networks: A comparison between full and fractional counting," Journal of Informetrics, Elsevier, vol. 10(4), pages 1178-1195.
    15. Bo Jarneving, 2001. "The cognitive structure of current cardiovascular research," Scientometrics, Springer;Akadémiai Kiadó, vol. 50(3), pages 365-389, March.
    16. Yun, Jinhyuk & Ahn, Sejung & Lee, June Young, 2020. "Return to basics: Clustering of scientific literature using structural information," Journal of Informetrics, Elsevier, vol. 14(4).
    17. Lola García-Santiago & Felix Moya-Anegón, 2009. "Using co-outlinks to mine heterogeneous networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 79(3), pages 681-702, June.
    18. Nicolaisen, Jeppe & Frandsen, Tove Faber, 2012. "Consensus formation in science modeled by aggregated bibliographic coupling," Journal of Informetrics, Elsevier, vol. 6(2), pages 276-284.
    19. Bar-Ilan, Judit, 2008. "Informetrics at the beginning of the 21st century—A review," Journal of Informetrics, Elsevier, vol. 2(1), pages 1-52.
    20. Chaoqun Ni & Cassidy R. Sugimoto & Jiepu Jiang, 2013. "Venue-author-coupling: A measure for identifying disciplines through author communities," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 64(2), pages 265-279, February.
