
Evaluating author name disambiguation for digital libraries: a case of DBLP

Author

Listed:
  • Jinseok Kim

    (Institute for Research on Innovation and Science, University of Michigan)

Abstract

Author name ambiguity in a digital library may affect the findings of research that mines the library's authorship data. This study evaluates author name disambiguation in DBLP, a digital library that is widely used but whose disambiguation performance has been insufficiently evaluated. In doing so, this study takes a triangulation approach: author name disambiguation for a digital library can be better evaluated when its performance is assessed on multiple labeled datasets and compared to baselines. Tested on three types of labeled data containing 5,000 to 6 million disambiguated names, DBLP is shown to assign author names quite accurately to distinct authors, resulting in pairwise precision, recall, and F1 measures around 0.90 or above overall. DBLP’s author name disambiguation performs well even on large ambiguous name blocks but poorly at distinguishing authors with the same names. Compared to other disambiguation algorithms, DBLP’s disambiguation performance is quite competitive, possibly due to its hybrid approach combining algorithmic disambiguation and manual error correction. A discussion follows on the strengths and weaknesses of the labeled datasets used in this study, to inform future efforts to evaluate author name disambiguation on a digital library scale.
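
The pairwise precision, recall, and F1 measures mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common definition, in which every pair of records assigned to the same author counts as one "same-author" decision: precision is the share of predicted pairs that are also pairs in the labeled data, recall is the share of labeled pairs that are recovered, and F1 is their harmonic mean. The function names, the toy records, and the choice to return 0.0 when there are no pairs are assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def same_author_pairs(labels):
    """Return the set of record pairs that a labeling assigns to one author."""
    clusters = defaultdict(list)
    for record_id, author_id in labels.items():
        clusters[author_id].append(record_id)
    pairs = set()
    for records in clusters.values():
        # Sorting makes each pair order-independent, e.g. always ("r1", "r2").
        pairs.update(combinations(sorted(records), 2))
    return pairs


def pairwise_scores(predicted, truth):
    """Pairwise precision, recall, and F1 of predicted labels against truth."""
    predicted_pairs = same_author_pairs(predicted)
    true_pairs = same_author_pairs(truth)
    correct = len(predicted_pairs & true_pairs)
    # Assumption: a score is 0.0 when its denominator is empty.
    precision = correct / len(predicted_pairs) if predicted_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data: five records, two true authors (A and B).
truth = {"r1": "A", "r2": "A", "r3": "A", "r4": "B", "r5": "B"}
# A disambiguation that wrongly moves r3 from author A to author B's cluster.
predicted = {"r1": "x", "r2": "x", "r3": "y", "r4": "y", "r5": "y"}

precision, recall, f1 = pairwise_scores(predicted, truth)
print(precision, recall, f1)  # 0.5 0.5 0.5
```

In this toy case both labelings contain four same-author pairs and share two of them, so all three measures equal 0.5. Enumerating pairs explicitly is quadratic in cluster size, so a sketch like this suits small samples rather than the library-scale data described above.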

Suggested Citation

  • Jinseok Kim, 2018. "Evaluating author name disambiguation for digital libraries: a case of DBLP," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(3), pages 1867-1886, September.
  • Handle: RePEc:spr:scient:v:116:y:2018:i:3:d:10.1007_s11192-018-2824-5
    DOI: 10.1007/s11192-018-2824-5

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-018-2824-5
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    Citations

    Cited by:

    1. Jinseok Kim & Jenna Kim, 2018. "The impact of imbalanced training data on machine learning for author name disambiguation," Scientometrics, Springer;Akadémiai Kiadó, vol. 117(1), pages 511-526, October.
    2. João M. Fernandes & António Costa & Paulo Cortez, 2022. "Author placement in Computer Science: a study based on the careers of ACM Fellows," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(1), pages 351-368, January.
    3. Ciriaco Andrea D’Angelo & Nees Jan Eck, 2020. "Collecting large-scale publication data at the level of individual researchers: a practical proposal for author name disambiguation," Scientometrics, Springer;Akadémiai Kiadó, vol. 123(2), pages 883-907, May.
    4. Jinseok Kim, 2019. "A fast and integrative algorithm for clustering performance evaluation in author name disambiguation," Scientometrics, Springer;Akadémiai Kiadó, vol. 120(2), pages 661-681, August.
    5. Li Zhang & Wei Lu & Jinqing Yang, 2023. "LAGOS‐AND: A large gold standard dataset for scholarly author name disambiguation," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 74(2), pages 168-185, February.
    6. Jinseok Kim & Jinmo Kim & Jason Owen-Smith, 2019. "Generating automatically labeled data for author name disambiguation: an iterative clustering method," Scientometrics, Springer;Akadémiai Kiadó, vol. 118(1), pages 253-280, January.
    7. Andrea Ancona & Roy Cerqueti & Gianluca Vagnani, 2023. "A novel methodology to disambiguate organization names: an application to EU Framework Programmes data," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(8), pages 4447-4474, August.
    8. Jinseok Kim & Jason Owen-Smith, 2021. "ORCID-linked labeled data for evaluating author name disambiguation at scale," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(3), pages 2057-2083, March.
    9. Antonio De Nicola & Gregorio D’Agostino, 2021. "Assessment of gender divide in scientific communities," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(5), pages 3807-3840, May.
    10. Jinseok Kim & Jana Diesner, 2019. "Formational bounds of link prediction in collaboration networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(2), pages 687-706, May.
    11. Mahsa Kaveh & Mahdieh Mirzabeigi & Hajar Sotudeh & Amirsaeid Moloodi, 2022. "The effects of the challenges in the transliteration of Persian names into English on the recall of retrieved results in the web of science," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(2), pages 1099-1128, February.
    12. Jinseok Kim & Jenna Kim & Jason Owen‐Smith, 2021. "Ethnicity‐based name partitioning for author name disambiguation using supervised machine learning," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 72(8), pages 979-994, August.
    13. Jinseok Kim & Jenna Kim, 2020. "Effect of forename string on author name disambiguation," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 71(7), pages 839-855, July.
    14. Xiaozan Lyu & Rodrigo Costas, 2021. "Studying the characteristics of scientific communities using individual-level bibliometrics: the case of Big Data research," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(8), pages 6965-6987, August.
    15. Shuo Xu & Ling Li & Xin An, 2023. "Do academic inventors have diverse interests?," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(2), pages 1023-1053, February.
    16. Xu, Shuo & Hao, Liyuan & Yang, Guancan & Lu, Kun & An, Xin, 2021. "A topic models based framework for detecting and forecasting emerging technologies," Technological Forecasting and Social Change, Elsevier, vol. 162(C).
    17. Baruffaldi, Stefano & Poege, Felix, 2020. "A Firm Scientific Community: Industry Participation and Knowledge Diffusion," IZA Discussion Papers 13419, Institute of Labor Economics (IZA).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Jinseok Kim & Jason Owen-Smith, 2021. "ORCID-linked labeled data for evaluating author name disambiguation at scale," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(3), pages 2057-2083, March.
    2. Jinseok Kim & Jinmo Kim & Jason Owen-Smith, 2019. "Generating automatically labeled data for author name disambiguation: an iterative clustering method," Scientometrics, Springer;Akadémiai Kiadó, vol. 118(1), pages 253-280, January.
    3. Jinseok Kim, 2019. "A fast and integrative algorithm for clustering performance evaluation in author name disambiguation," Scientometrics, Springer;Akadémiai Kiadó, vol. 120(2), pages 661-681, August.
    4. Jan Schulz, 2016. "Using Monte Carlo simulations to assess the impact of author name disambiguation quality on different bibliometric analyses," Scientometrics, Springer;Akadémiai Kiadó, vol. 107(3), pages 1283-1298, June.
    5. Ciriaco Andrea D’Angelo & Nees Jan Eck, 2020. "Collecting large-scale publication data at the level of individual researchers: a practical proposal for author name disambiguation," Scientometrics, Springer;Akadémiai Kiadó, vol. 123(2), pages 883-907, May.
    6. Kim, Jinseok & Diesner, Jana, 2015. "The effect of data pre-processing on understanding the evolution of collaboration networks," Journal of Informetrics, Elsevier, vol. 9(1), pages 226-236.
    7. Jinseok Kim & Jenna Kim & Jason Owen‐Smith, 2021. "Ethnicity‐based name partitioning for author name disambiguation using supervised machine learning," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 72(8), pages 979-994, August.
    8. Jinseok Kim & Jenna Kim, 2018. "The impact of imbalanced training data on machine learning for author name disambiguation," Scientometrics, Springer;Akadémiai Kiadó, vol. 117(1), pages 511-526, October.
    9. Jinseok Kim & Liang Tao & Seok-Hyoung Lee & Jana Diesner, 2016. "Evolution and structure of scientific co-publishing network in Korea between 1948–2011," Scientometrics, Springer;Akadémiai Kiadó, vol. 107(1), pages 27-41, April.
    10. Mark-Christoph Müller & Florian Reitz & Nicolas Roy, 2017. "Data sets for author name disambiguation: an empirical analysis and a new resource," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(3), pages 1467-1500, June.
    11. Shuiqing Huang & Bo Yang & Sulan Yan & Ronald Rousseau, 2014. "Institution name disambiguation for research assessment," Scientometrics, Springer;Akadémiai Kiadó, vol. 99(3), pages 823-838, June.
    12. Hao Wu & Bo Li & Yijian Pei & Jun He, 2014. "Unsupervised author disambiguation using Dempster–Shafer theory," Scientometrics, Springer;Akadémiai Kiadó, vol. 101(3), pages 1955-1972, December.
    13. Rehs, Andreas, 2021. "A supervised machine learning approach to author disambiguation in the Web of Science," Journal of Informetrics, Elsevier, vol. 15(3).
    14. Jinseok Kim & Jenna Kim, 2020. "Effect of forename string on author name disambiguation," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 71(7), pages 839-855, July.
    15. Song, Min & Kim, Erin Hea-Jin & Kim, Ha Jin, 2015. "Exploring author name disambiguation on PubMed-scale," Journal of Informetrics, Elsevier, vol. 9(4), pages 924-941.
    16. Helena Mihaljević & Lucía Santamaría, 2021. "Disambiguation of author entities in ADS using supervised learning and graph theory methods," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(5), pages 3893-3917, May.
    17. Janaína Gomide & Hugo Kling & Daniel Figueiredo, 2017. "Name usage pattern in the synonym ambiguity problem in bibliographic data," Scientometrics, Springer;Akadémiai Kiadó, vol. 112(2), pages 747-766, August.
    18. Milojević, Staša, 2013. "Accuracy of simple, initials-based methods for author name disambiguation," Journal of Informetrics, Elsevier, vol. 7(4), pages 767-773.
    19. Dangzhi Zhao & Andreas Strotmann, 2020. "Telescopic and panoramic views of library and information science research 2011–2018: a comparison of four weighting schemes for author co-citation analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 124(1), pages 255-270, July.
    20. Lutz Bornmann & Werner Marx, 2014. "How to evaluate individual researchers working in the natural and life sciences meaningfully? A proposal of methods based on percentiles of citations," Scientometrics, Springer;Akadémiai Kiadó, vol. 98(1), pages 487-509, January.
