Has Large-Scale Named-Entity Network Analysis Been Resting on a Flawed Assumption?

Authors

Listed:
  • Brent D Fegley
  • Vetle I Torvik

Abstract

The assumption that a name uniquely identifies an entity introduces two types of errors: splitting treats one entity as two or more (because of name variants); lumping treats multiple entities as if they were one (because of shared names). Here we investigate the extent to which splitting and lumping affect commonly-used measures of large-scale named-entity networks within two disambiguated bibliographic datasets: one for co-author names in biomedicine (PubMed, 2003–2007); the other for co-inventor names in U.S. patents (USPTO, 2003–2007). In both cases, we find that splitting has relatively little effect, whereas lumping has a dramatic effect on network measures. For example, in the biomedical co-authorship network, lumping (based on last name and both initials) drives several measures down: the global clustering coefficient by a factor of 4 (from 0.265 to 0.066); degree assortativity by a factor of ∼13 (from 0.763 to 0.06); and average shortest path by a factor of 1.3 (from 5.9 to 4.5). These results can be explained in part by the fact that lumping artificially creates many intransitive relationships and high-degree vertices. This effect of lumping is much less dramatic but persists with measures that give less weight to high-degree vertices, such as the mean local clustering coefficient and log-based degree assortativity. Furthermore, the log-log distribution of collaborator counts follows a much straighter line (power law) with splitting and lumping errors than without, particularly at the low and the high counts. This suggests that part of the power law often observed for collaborator counts in science and technology reflects an artifact: name ambiguity.
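
The lumping mechanism described in the abstract can be illustrated with a small sketch. This is not the authors' code or data: the two toy papers, the author names, and the helper functions `build_graph` and `global_clustering` are hypothetical, chosen only to show how merging two distinct people who share a name creates a high-degree vertex with intransitive relationships and so lowers the global clustering coefficient.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(papers, key):
    """Build an undirected co-authorship graph.

    `key` maps an author record to a vertex id, so the same papers can be
    viewed with disambiguated ids or with (ambiguous) name strings.
    """
    adj = defaultdict(set)
    for authors in papers:
        ids = {key(a) for a in authors}
        for u, v in combinations(ids, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj


def global_clustering(adj):
    """Global clustering coefficient: closed triples / all connected triples."""
    closed = 0
    triples = 0
    for nbrs in adj.values():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        # A pair of neighbours that are themselves linked closes the triple.
        closed += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed / triples if triples else 0.0


# Hypothetical records: (person id, name). Persons 1 and 2 are different
# people who happen to share the name "Smith J".
papers = [
    [(1, "Smith J"), (3, "Lee S"), (4, "Park H")],
    [(2, "Smith J"), (5, "Diaz M"), (6, "Ng T")],
]

true_graph = build_graph(papers, key=lambda a: a[0])    # disambiguated ids
lumped_graph = build_graph(papers, key=lambda a: a[1])  # name as identity

print(global_clustering(true_graph), global_clustering(lumped_graph))
print(max(len(n) for n in true_graph.values()),
      max(len(n) for n in lumped_graph.values()))
```

In this toy case the disambiguated network is two separate triangles, while the lumped network joins them through a single "Smith J" vertex whose four neighbours are mostly not linked to each other. The numbers are illustrative only and say nothing about the magnitudes reported for PubMed or USPTO.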

Suggested Citation

  • Brent D Fegley & Vetle I Torvik, 2013. "Has Large-Scale Named-Entity Network Analysis Been Resting on a Flawed Assumption?," PLOS ONE, Public Library of Science, vol. 8(7), pages 1-16, July.
  • Handle: RePEc:plo:pone00:0070299
    DOI: 10.1371/journal.pone.0070299

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0070299
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0070299&type=printable
    Download Restriction: no


    Citations

    Cited by:

    1. Jinseok Kim & Jenna Kim, 2018. "The impact of imbalanced training data on machine learning for author name disambiguation," Scientometrics, Springer;Akadémiai Kiadó, vol. 117(1), pages 511-526, October.
    2. Jinseok Kim & Liang Tao & Seok-Hyoung Lee & Jana Diesner, 2016. "Evolution and structure of scientific co-publishing network in Korea between 1948–2011," Scientometrics, Springer;Akadémiai Kiadó, vol. 107(1), pages 27-41, April.
    3. Jinseok Kim, 2019. "A fast and integrative algorithm for clustering performance evaluation in author name disambiguation," Scientometrics, Springer;Akadémiai Kiadó, vol. 120(2), pages 661-681, August.
    4. Jinseok Kim & Jason Owen-Smith, 2021. "ORCID-linked labeled data for evaluating author name disambiguation at scale," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(3), pages 2057-2083, March.
    5. Jinseok Kim & Jana Diesner, 2019. "Formational bounds of link prediction in collaboration networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(2), pages 687-706, May.
    6. Ventura, Samuel L. & Nugent, Rebecca & Fuchs, Erica R.H., 2015. "Seeing the non-stars: (Some) sources of bias in past disambiguation approaches and a new public tool leveraging labeled records," Research Policy, Elsevier, vol. 44(9), pages 1672-1701.
    7. Janaína Gomide & Hugo Kling & Daniel Figueiredo, 2017. "Name usage pattern in the synonym ambiguity problem in bibliographic data," Scientometrics, Springer;Akadémiai Kiadó, vol. 112(2), pages 747-766, August.
    8. YIN Deyun & MOTOHASHI Kazuyuki, 2018. "Inventor Name Disambiguation with Gradient Boosting Decision Tree and Inventor Mobility in China (1985-2016)," Discussion papers 18018, Research Institute of Economy, Trade and Industry (RIETI).
    9. Deyun Yin & Kazuyuki Motohashi & Jianwei Dang, 2020. "Large-scale name disambiguation of Chinese patent inventors (1985–2016)," Scientometrics, Springer;Akadémiai Kiadó, vol. 122(2), pages 765-790, February.
    10. Dangzhi Zhao & Andreas Strotmann, 2020. "Telescopic and panoramic views of library and information science research 2011–2018: a comparison of four weighting schemes for author co-citation analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 124(1), pages 255-270, July.
    11. Jinseok Kim, 2018. "Evaluating author name disambiguation for digital libraries: a case of DBLP," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(3), pages 1867-1886, September.
    12. Kim, Jinseok & Diesner, Jana, 2015. "The effect of data pre-processing on understanding the evolution of collaboration networks," Journal of Informetrics, Elsevier, vol. 9(1), pages 226-236.
    13. Li, Guan-Cheng & Lai, Ronald & D’Amour, Alexander & Doolin, David M. & Sun, Ye & Torvik, Vetle I. & Yu, Amy Z. & Fleming, Lee, 2014. "Disambiguation and co-authorship networks of the U.S. patent inventor database (1975–2010)," Research Policy, Elsevier, vol. 43(6), pages 941-955.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. J. Sylvan Katz & Guillermo Armando Ronda-Pupo, 2019. "Cooperation, scale-invariance and complex innovation systems: a generalization," Scientometrics, Springer;Akadémiai Kiadó, vol. 121(2), pages 1045-1065, November.
    2. Chao Lu & Yingyi Zhang & Yong‐Yeol Ahn & Ying Ding & Chenwei Zhang & Dandan Ma, 2020. "Co‐contributorship network and division of labor in individual scientific collaborations," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 71(10), pages 1162-1178, October.
    3. He, Bing & Ding, Ying & Tang, Jie & Reguramalingam, Vignesh & Bollen, Johan, 2013. "Mining diversity subgraph in multidisciplinary scientific collaboration networks: A meso perspective," Journal of Informetrics, Elsevier, vol. 7(1), pages 117-128.
    4. Liu, Junwan & Guo, Xiaofei & Xu, Shuo & Song, Yinglu & Ding, Kaiyue, 2023. "A new interpretation of scientific collaboration patterns from the perspective of symbiosis: An investigation for long-term collaboration in publications," Journal of Informetrics, Elsevier, vol. 17(1).
    5. Bordons, María & Aparicio, Javier & González-Albo, Borja & Díaz-Faes, Adrián A., 2015. "The relationship between the research performance of scientists and their position in co-authorship networks in three fields," Journal of Informetrics, Elsevier, vol. 9(1), pages 135-144.
    6. Zheng Xie, 2019. "A cooperative game model for the multimodality of coauthorship networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 121(1), pages 503-519, October.
    7. Jinseok Kim & Jana Diesner, 2019. "Formational bounds of link prediction in collaboration networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(2), pages 687-706, May.
    8. Zhai, Li & Yan, Xiangbin, 2022. "A directed collaboration network for exploring the order of scientific collaboration," Journal of Informetrics, Elsevier, vol. 16(4).
    9. Hamid Bouabid & Hind Achachi, 2022. "Size of science team at university and internal co-publications: science policy implications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(12), pages 6993-7013, December.
    10. He, Chaocheng & Liu, Fuzhen & Dong, Ke & Wu, Jiang & Zhang, Qingpeng, 2023. "Research on the formation mechanism of research leadership relations: An exponential random graph model analysis approach," Journal of Informetrics, Elsevier, vol. 17(2).
    11. Inoue, Masaaki & Pham, Thong & Shimodaira, Hidetoshi, 2020. "Joint estimation of non-parametric transitivity and preferential attachment functions in scientific co-authorship networks," Journal of Informetrics, Elsevier, vol. 14(3).
    12. Binglu Wang & Yi Bu & Yang Xu, 2018. "A quantitative exploration on reasons for citing articles from the perspective of cited authors," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 675-687, August.
    13. Abbasi, Alireza & Hossain, Liaquat & Leydesdorff, Loet, 2012. "Betweenness centrality as a driver of preferential attachment in the evolution of research collaboration networks," Journal of Informetrics, Elsevier, vol. 6(3), pages 403-412.
    14. Fengchao Liu & Na Zhang & Cong Cao, 2017. "An evolutionary process of global nanotechnology collaboration: a social network analysis of patents at USPTO," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(3), pages 1449-1465, June.
    15. Xie, Zonglin & Xie, Zheng & Li, Jianping & Yang, Qian, 2018. "Exploring the influence of social activity on scientific career," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 500(C), pages 189-198.
    16. Zheng Xie & Zonglin Xie & Miao Li & Jianping Li & Dongyun Yi, 2017. "Modeling the coevolution between citations and coauthorship of scientific papers," Scientometrics, Springer;Akadémiai Kiadó, vol. 112(1), pages 483-507, July.
    17. Li Zhai & Xiujuan Li & Xiangbin Yan & Weiguo Fan, 2014. "Evolutionary analysis of collaboration networks in the field of information systems," Scientometrics, Springer;Akadémiai Kiadó, vol. 101(3), pages 1657-1677, December.
    18. Noémi Gaskó & Rodica Ioana Lung & Mihai Alexandru Suciu, 2016. "A new network model for the study of scientific collaborations: Romanian computer science and mathematics co-authorship networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 108(2), pages 613-632, August.
    19. Wu, Leyan & Yi, Fan & Bu, Yi & Lu, Wei & Huang, Yong, 2024. "Toward scientific collaboration: A cost-benefit perspective," Research Policy, Elsevier, vol. 53(2).
    20. He, Bing & Ding, Ying & Yan, Erjia, 2012. "Mining patterns of author orders in scientific publications," Journal of Informetrics, Elsevier, vol. 6(3), pages 359-367.
