Printed from https://ideas.repec.org/a/spr/scient/v116y2018i2d10.1007_s11192-017-2569-6.html

Use of locality sensitive hashing (LSH) algorithm to match Web of Science and Scopus

Author

Listed:
  • Mehmet Ali Abdulhayoglu

    (KU Leuven)

  • Bart Thijs

    (KU Leuven)

Abstract

A novel hashing algorithm is applied to match two prominent bibliographic databases at the paper level. Such matching has been studied and carried out many times in the literature, but only on the basis of journal information, owing to the massive volume of indexed publications. With a paper-level match, missing or erroneous items in one source can be completed from the other, and the overlap can be measured more reliably. In this context, we focus on measuring the overlap between Clarivate Analytics Web of Science (WoS) and Elsevier’s Scopus at the paper level. Our focus is on detecting exact matches; that is, no false positives are tolerated at all. To this end, we follow a twofold matching procedure. First, a locality sensitive hashing algorithm, which provides fast approximate nearest neighbours and similarities, is applied to obtain WoS-Scopus pair suggestions. Second, for each suggested pair, different heuristics are applied to identify those pairs of records that indeed refer to the same publication. We observe that at least 74% of WoS publications are also indexed by Scopus. The percentage increases to 92% when only the cited publications are retained. The overlapping WoS records are also presented by Institute for Scientific Information subject categories (SCs). Of those, three large SCs whose overlap ratios are relatively low are chosen and examined in detail. Last but not least, it takes only about an hour to match 14.2 million versus 19.6 million publications from the publication years 2004–2013 in a high-performance computing environment.
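
The twofold procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not say which LSH family or which verification heuristics the authors use, so everything below is an assumption made for illustration: MinHash signatures over character trigrams of the title with banding for the candidate step, and an equal-year plus near-identical-title check for the verification step. The function names (`shingles`, `minhash_signature`, `candidate_pairs`, `is_same_publication`), the record fields, and the two toy records are hypothetical.

```python
import hashlib
import re
from collections import defaultdict
from difflib import SequenceMatcher


def shingles(text, n=3):
    """Return the set of character n-grams of a normalised title."""
    norm = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return {norm[i:i + n] for i in range(max(1, len(norm) - n + 1))}


def _hash(seed, token):
    """Seeded 64-bit hash; each seed acts as one MinHash permutation."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=32):
    """MinHash signature: the minimum hash value per seed."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_perm))


def candidate_pairs(wos, scopus, num_perm=32, bands=16):
    """Step 1 (assumed variant): suggest pairs sharing at least one LSH band."""
    rows = num_perm // bands
    buckets = defaultdict(list)
    for rec_id, rec in scopus.items():
        sig = minhash_signature(shingles(rec["title"]), num_perm)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(rec_id)
    pairs = set()
    for rec_id, rec in wos.items():
        sig = minhash_signature(shingles(rec["title"]), num_perm)
        for b in range(bands):
            for other in buckets.get((b, sig[b * rows:(b + 1) * rows]), ()):
                pairs.add((rec_id, other))
    return pairs


def is_same_publication(a, b, min_title_ratio=0.9):
    """Step 2 (assumed heuristic): same year and near-identical title."""
    if a["year"] != b["year"]:
        return False
    ratio = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= min_title_ratio


# Toy records for illustration only; not data from the paper.
wos = {
    "W1": {"title": "Use of locality sensitive hashing (LSH) algorithm "
                    "to match Web of Science and Scopus", "year": 2018},
}
scopus = {
    "S1": {"title": "Use of Locality Sensitive Hashing (LSH) Algorithm "
                    "to Match Web of Science and Scopus.", "year": 2018},
    "S2": {"title": "A tale of two databases: the use of Web of Science "
                    "and Scopus in academic papers", "year": 2020},
}

suggested = candidate_pairs(wos, scopus)
matches = {(w, s) for w, s in suggested if is_same_publication(wos[w], scopus[s])}
print(matches)
```

In this sketch the strict second step is what keeps false positives out: the banding step is deliberately permissive and may suggest unrelated pairs, which the verification then rejects. More bands with fewer rows per band would raise recall at the cost of more pairs to verify.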

Suggested Citation

  • Mehmet Ali Abdulhayoglu & Bart Thijs, 2018. "Use of locality sensitive hashing (LSH) algorithm to match Web of Science and Scopus," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 1229-1245, August.
  • Handle: RePEc:spr:scient:v:116:y:2018:i:2:d:10.1007_s11192-017-2569-6
    DOI: 10.1007/s11192-017-2569-6

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-017-2569-6
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.



    Citations



    Cited by:

    1. Nur Chasanah & Indra Gunawan & Bassam Baroudi, 2024. "International development project success: A literature review," Journal of International Development, John Wiley & Sons, Ltd., vol. 36(1), pages 146-171, January.
    2. Andrea Caputo & Mariya Kargina, 2022. "A user-friendly method to merge Scopus and Web of Science data during bibliometric analysis," Journal of Marketing Analytics, Palgrave Macmillan, vol. 10(1), pages 82-88, March.
    3. Guillaume Cabanac & Ingo Frommholz & Philipp Mayr, 2018. "Bibliometric-enhanced information retrieval: preface," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 1225-1227, August.
    4. Matthew Harsh & Ravtosh Bal & Alex Weryha & Justin Whatley & Charles C. Onu & Lisa M. Negro, 2021. "Mapping computer science research in Africa: using academic networking sites for assessing research activity," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(1), pages 305-334, January.
    5. Sahar Mohamadi & Abbas Abbasi & Habib-Allah Ranaei Kordshouli & Kazem Askarifar, 2022. "Conceptualizing sustainable–responsible tourism indicators: an interpretive structural modeling approach," Environment, Development and Sustainability: A Multidisciplinary Approach to the Theory and Practice of Sustainable Development, Springer, vol. 24(1), pages 399-425, January.
    6. Junwen Zhu & Weishu Liu, 2020. "A tale of two databases: the use of Web of Science and Scopus in academic papers," Scientometrics, Springer;Akadémiai Kiadó, vol. 123(1), pages 321-335, April.
    7. Kristina Galjanić & Ivan Marović & Nikša Jajac, 2022. "Decision Support Systems for Managing Construction Projects: A Scientific Evolution Analysis," Sustainability, MDPI, vol. 14(9), pages 1-23, April.
    8. Tanja Mihalic & Sahar Mohamadi & Abbas Abbasi & Lóránt Dénes Dávid, 2021. "Mapping a Sustainable and Responsible Tourism Paradigm: A Bibliometric and Citation Network Analysis," Sustainability, MDPI, vol. 13(2), pages 1-22, January.
    9. Christian Thiele & Gerrit Hirschfeld & Ruth Brachel, 2021. "Clinical trial registries as Scientometric data: A novel solution for linking and deduplicating clinical trials from multiple registries," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(12), pages 9733-9750, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Amador Durán-Sánchez & María de la Cruz del Río-Rama & José Álvarez-García & Cristiana Oliveira, 2022. "Analysis of Worldwide Research on Craft Beer," SAGE Open, vol. 12(2), pages 21582440221, June.
    2. Amador Durán-Sánchez & José Álvarez-García & María de la Cruz del Río-Rama & Beatriz Rosado-Cebrián, 2019. "Science Mapping of the Knowledge Base on Tourism Innovation," Sustainability, MDPI, vol. 11(12), pages 1-17, June.
    3. Deming Lin & Tianhui Gong & Wenbin Liu & Martin Meyer, 2020. "An entropy-based measure for the evolution of h index research," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 2283-2298, December.
    4. José Álvarez-García & Amador Durán-Sánchez & María de la Cruz del Río-Rama & Ronny Correa-Quezada, 2019. "Older Adults and Digital Society: Scientific Coverage," IJERPH, MDPI, vol. 16(11), pages 1-16, June.
    5. De Andrés Fazio, Salvador & Urquía Grande, Elena & Pérez Estébanez, Raquel, 2022. "The “secret life” of the Statement of Cash Flow: A bibliometric analysis," Cuadernos de Gestión, Universidad del País Vasco - Instituto de Economía Aplicada a la Empresa (IEAE).
    6. García-Pérez, Miguel A., 2011. "Strange attractors in the Web of Science database," Journal of Informetrics, Elsevier, vol. 5(1), pages 214-218.
    7. Jakub Rybacki & Dobromił Serwa, 2021. "What Makes a Successful Scientist in a Central Bank? Evidence From the RePEc Database," Central European Journal of Economic Modelling and Econometrics, Central European Journal of Economic Modelling and Econometrics, vol. 13(3), pages 331-357, September.
    8. Mojtaba Ashour & Amir Mahdiyar & Syarmila Hany Haron, 2021. "A Comprehensive Review of Deterrents to the Practice of Sustainable Interior Architecture and Design," Sustainability, MDPI, vol. 13(18), pages 1-19, September.
    9. Salih Selek & Ayman Saleh, 2014. "Use of h index and g index for American academic psychiatry," Scientometrics, Springer;Akadémiai Kiadó, vol. 99(2), pages 541-548, May.
    10. Omar Mubin & Abdullah Al Mahmud & Muneeb Ahmad, 2017. "HCI down under: reflecting on a decade of the OzCHI conference," Scientometrics, Springer;Akadémiai Kiadó, vol. 112(1), pages 367-382, July.
    11. William W. Hood & Concepción S. Wilson, 2003. "Informetric studies using databases: Opportunities and challenges," Scientometrics, Springer;Akadémiai Kiadó, vol. 58(3), pages 587-608, November.
    12. Mingers, John & Yang, Liying, 2017. "Evaluating journal quality: A review of journal citation indicators and ranking in business and management," European Journal of Operational Research, Elsevier, vol. 257(1), pages 323-337.
    13. Frode Eika Sandnes, 2021. "A bibliometric study of human–computer interaction research activity in the Nordic-Baltic Eight countries," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(6), pages 4733-4767, June.
    14. Nobuyuki Shirakawa & Takao Furukawa & Minoru Nomura & Kumi Okuwada, 2012. "Global competition and technological transition in electrical, electronic, information and communication engineering: quantitative analysis of periodicals and conference proceedings of the IEEE," Scientometrics, Springer;Akadémiai Kiadó, vol. 91(3), pages 895-910, June.
    15. Marek Gągolewski & Przemysław Grzegorzewski, 2009. "A geometric approach to the construction of scientific impact indices," Scientometrics, Springer;Akadémiai Kiadó, vol. 81(3), pages 617-634, December.
    16. José Álvarez-García & Claudia Patricia Maldonado-Erazo & María de la Cruz Del Río-Rama & Francisco Javier Castellano-Álvarez, 2019. "Cultural Heritage and Tourism Basis for Regional Development: Mapping of Scientific Coverage," Sustainability, MDPI, vol. 11(21), pages 1-21, October.
    17. Osman Issah & Lúcia Lima Rodrigues, 2021. "Corporate Social Responsibility and Corporate Tax Aggressiveness: A Scientometric Analysis of the Existing Literature to Map the Future," Sustainability, MDPI, vol. 13(11), pages 1-23, June.
    18. Bilal Manzoor & Idris Othman & Juan Carlos Pomares, 2021. "Digital Technologies in the Architecture, Engineering and Construction (AEC) Industry—A Bibliometric—Qualitative Literature Review of Research Activities," IJERPH, MDPI, vol. 18(11), pages 1-26, June.
    19. Bar-Ilan, Judit, 2008. "Informetrics at the beginning of the 21st century—A review," Journal of Informetrics, Elsevier, vol. 2(1), pages 1-52.
    20. García-Pérez, Miguel A., 2012. "An extension of the h index that covers the tail and the top of the citation curve and allows ranking researchers with similar h," Journal of Informetrics, Elsevier, vol. 6(4), pages 689-699.
