IDEAS home Printed from https://ideas.repec.org/a/eee/respol/v44y2015i9p1672-1701.html

Seeing the non-stars: (Some) sources of bias in past disambiguation approaches and a new public tool leveraging labeled records

Author

Listed:
  • Ventura, Samuel L.
  • Nugent, Rebecca
  • Fuchs, Erica R.H.

Abstract

To date, methods used to disambiguate inventors in the United States Patent and Trademark Office (USPTO) database have been rule- and threshold-based (requiring and leveraging expert knowledge) or semi-supervised algorithms trained on statistically generated artificial labels. Using a large, hand-disambiguated set of 98,762 labeled USPTO inventor records from the field of optoelectronics consisting of four sub-samples of inventors with varying characteristics (Akinsanmi et al., 2014) and a second large, hand-disambiguated set of 53,378 labeled inventor records corresponding to a subset of academics in the life sciences (Azoulay et al., 2012), we provide the first supervised learning approach for USPTO inventor disambiguation. Using these two sets of inventor records, we also provide extensive evaluations of both our algorithm and three examples of prior approaches to USPTO disambiguation arguably representative of the range of approaches used to date. We show that the three past disambiguation algorithms we evaluate demonstrate biases depending on the feature distribution of the target disambiguation population. Both the rule- and threshold-based methods and the semi-supervised approach perform poorly (10–22% false negative error rates) on a random sample of optoelectronics inventors – arguably the closest of our sub-samples to what might be expected of the majority of inventors in the USPTO (based on disambiguation-relevant metrics). The supervised learning approach, using random forests and trained on our labeled optoelectronics dataset, consistently maintains error rates below 3% across all of our available samples. We make public both our labeled optoelectronics inventor records and our code to build supervised learning models and disambiguate inventors (see http://www.cmu.edu/epp/disambiguation). Our code also allows users to implement supervised learning approaches with their own representative labeled training data.
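
The false negative and error rates quoted in the abstract compare an algorithm's output with hand labels. One common way to score such a comparison is over pairs of records: a false negative is a pair belonging to the same inventor that the algorithm splits, and a false positive is a pair from different inventors that it merges. The sketch below illustrates that pairwise scoring only; it is not the authors' released code, the paper's exact definitions may differ, and the function name and toy records are hypothetical.

```python
from itertools import combinations


def pairwise_error_rates(records):
    """Return (false_negative_rate, false_positive_rate) over record pairs.

    `records` is a list of (true_id, predicted_id) tuples, one per
    inventor record. Both ids are hypothetical labels: `true_id` comes
    from hand disambiguation, `predicted_id` from the algorithm under test.
    """
    true_matches = false_negatives = 0
    true_nonmatches = false_positives = 0

    for (t1, p1), (t2, p2) in combinations(records, 2):
        same_true = t1 == t2
        same_pred = p1 == p2
        if same_true:
            true_matches += 1
            if not same_pred:
                # Same inventor, but the algorithm split the records.
                false_negatives += 1
        else:
            true_nonmatches += 1
            if same_pred:
                # Different inventors, but the algorithm merged them.
                false_positives += 1

    fnr = false_negatives / true_matches if true_matches else 0.0
    fpr = false_positives / true_nonmatches if true_nonmatches else 0.0
    return fnr, fpr


# Toy example: inventor A has three records, inventor B has two.
toy = [("A", "x"), ("A", "x"), ("A", "y"), ("B", "y"), ("B", "z")]
fnr, fpr = pairwise_error_rates(toy)
print(fnr, round(fpr, 3))  # 0.75 0.167
```

In the toy data, three of the four true-match pairs are split and one of the six true-non-match pairs is merged, which gives the printed rates.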

Suggested Citation

  • Ventura, Samuel L. & Nugent, Rebecca & Fuchs, Erica R.H., 2015. "Seeing the non-stars: (Some) sources of bias in past disambiguation approaches and a new public tool leveraging labeled records," Research Policy, Elsevier, vol. 44(9), pages 1672-1701.
  • Handle: RePEc:eee:respol:v:44:y:2015:i:9:p:1672-1701
    DOI: 10.1016/j.respol.2014.12.010

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0048733314002406
    Download Restriction: Full text for ScienceDirect subscribers only


    References listed on IDEAS

    1. Benjamin F. Jones, 2009. "The Burden of Knowledge and the "Death of the Renaissance Man": Is Innovation Getting Harder?," The Review of Economic Studies, Review of Economic Studies Ltd, vol. 76(1), pages 283-317.
    2. Milojević, Staša, 2013. "Accuracy of simple, initials-based methods for author name disambiguation," Journal of Informetrics, Elsevier, vol. 7(4), pages 767-773.
    3. Bronwyn H. Hall & Grid Thoma & Salvatore Torrisi, 2006. "The market value of patents and R&D: Evidence from European firms," KITeS Working Papers 186, KITeS, Centre for Knowledge, Internationalization and Technology Studies, Universita' Bocconi, Milano, Italy, revised Nov 2006.
    4. Manuel Trajtenberg & Gil Shiff & Ran Melamed, 2009. "The "Names Game": Harnessing Inventors, Patent Data for Economic Research," Annals of Economics and Statistics, GENES, issue 93-94, pages 67-77.
    5. Klevorick, Alvin K. & Levin, Richard C. & Nelson, Richard R. & Winter, Sidney G., 1995. "On the sources and significance of interindustry differences in technological opportunities," Research Policy, Elsevier, vol. 24(2), pages 185-205, March.
    6. Raffo, Julio & Lhuillery, Stéphane, 2009. "How to play the "Names Game": Patent retrieval comparing different heuristics," Research Policy, Elsevier, vol. 38(10), pages 1617-1627, December.
    7. Grid Thoma & Salvatore Torrisi & Alfonso Gambardella & Dominique Guellec & Bronwyn H. Hall & Dietmar Harhoff, 2010. "Harmonizing and Combining Large Datasets - An Application to Firm-Level Patent and Accounting Data," NBER Working Papers 15851, National Bureau of Economic Research, Inc.
    8. Li Tang & John P. Walsh, 2010. "Bibliometric fingerprints: name disambiguation based on approximate structure equivalence of cognitive maps," Scientometrics, Springer;Akadémiai Kiadó, vol. 84(3), pages 763-784, September.
    9. Nicolas Carayol & Lorenzo Cassi, 2009. "Who's Who in Patents. A Bayesian approach," Cahiers du GREThA (2007-2019) 2009-07, Groupe de Recherche en Economie Théorique et Appliquée (GREThA).
    10. Ernest Miguélez & Ismael Gómez-Miguélez, 2011. "Singling out individual inventors from patent data," IREA Working Papers 201105, University of Barcelona, Research Institute of Applied Economics, revised May 2011.
    11. John Bound & Clint Cummins & Zvi Griliches & Bronwyn H. Hall & Adam B. Jaffe, 1984. "Who Does R&D and Who Patents?," NBER Chapters, in: R&D, Patents, and Productivity, pages 21-54, National Bureau of Economic Research, Inc.
    12. Brent D Fegley & Vetle I Torvik, 2013. "Has Large-Scale Named-Entity Network Analysis Been Resting on a Flawed Assumption?," PLOS ONE, Public Library of Science, vol. 8(7), pages 1-16, July.
    13. Bronwyn H. Hall & Adam B. Jaffe & Manuel Trajtenberg, 2001. "The NBER Patent Citation Data File: Lessons, Insights and Methodological Tools," NBER Working Papers 8498, National Bureau of Economic Research, Inc.
    14. Mauricio Sadinle & Stephen E. Fienberg, 2013. "A Generalized Fellegi--Sunter Framework for Multiple Record Linkage With Application to Homicide Record Systems," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 108(502), pages 385-397, June.
    15. Lynne G. Zucker & Michael R. Darby & Jason Fong, 2014. "Communitywide Database Designs for Tracking Innovation Impact: Comets, Stars and Nanobank," Annals of Economics and Statistics, GENES, issue 115-116, pages 277-311.
    16. Pierre Azoulay & Joshua S. Graff Zivin & Jialan Wang, 2010. "Superstar Extinction," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 125(2), pages 549-589.
    17. Jasjit Singh & Lee Fleming, 2010. "Lone Inventors as Sources of Breakthroughs: Myth or Reality?," Management Science, INFORMS, vol. 56(1), pages 41-56, January.
    18. Zvi Griliches, 1984. "R&D, Patents, and Productivity," NBER Books, National Bureau of Economic Research, Inc, number gril84-1, March.
    19. Jasjit Singh, 2005. "Collaborative Networks as Determinants of Knowledge Diffusion Patterns," Management Science, INFORMS, vol. 51(5), pages 756-770, May.
    20. Wesley M. Cohen & Richard R. Nelson & John P. Walsh, 2000. "Protecting Their Intellectual Assets: Appropriability Conditions and Why U.S. Manufacturing Firms Patent (or Not)," NBER Working Papers 7552, National Bureau of Economic Research, Inc.
    21. Li, Guan-Cheng & Lai, Ronald & D’Amour, Alexander & Doolin, David M. & Sun, Ye & Torvik, Vetle I. & Yu, Amy Z. & Fleming, Lee, 2014. "Disambiguation and co-authorship networks of the U.S. patent inventor database (1975–2010)," Research Policy, Elsevier, vol. 43(6), pages 941-955.
    22. Matt Marx & Deborah Strumsky & Lee Fleming, 2009. "Mobility, Skills, and the Michigan Non-Compete Experiment," Management Science, INFORMS, vol. 55(6), pages 875-889, June.
    23. Francesco Lissoni & Bulat Sanditov & Gianluca Tarasconi, 2006. "The Keins Database on Academic Inventors: Methodology and Contents," KITeS Working Papers 181, KITeS, Centre for Knowledge, Internationalization and Technology Studies, Universita' Bocconi, Milano, Italy, revised Sep 2006.

    Citations



    Cited by:

    1. Suominen, Arho & Toivanen, Hannes & Seppänen, Marko, 2017. "Firms' knowledge profiles: Mapping patent data with unsupervised learning," Technological Forecasting and Social Change, Elsevier, vol. 115(C), pages 131-142.
    2. Abhishek Arora & Xinmei Yang & Shao-Yu Jheng & Melissa Dell, 2023. "Linking Representations with Multimodal Contrastive Learning," Papers 2304.03464, arXiv.org, revised Apr 2023.
    3. David Autor & David Dorn & Gordon H. Hanson & Gary Pisano & Pian Shu, 2020. "Foreign Competition and Domestic Innovation: Evidence from US Patents," American Economic Review: Insights, American Economic Association, vol. 2(3), pages 357-374, September.
    4. Jens Prüfer & Patricia Prüfer, 2020. "Data science for entrepreneurship research: studying demand dynamics for entrepreneurial skills in the Netherlands," Small Business Economics, Springer, vol. 55(3), pages 651-672, October.
    5. Gandal, Neil & Branstetter, Lee & Kunievsky, Nadav, 2017. "Network-Mediated Knowledge Spillovers: A Cross-Country Comparative Analysis of Information Security Innovations," CEPR Discussion Papers 12268, C.E.P.R. Discussion Papers.
    6. Xinmei Yang & Abhishek Arora & Shao-Yu Jheng & Melissa Dell, 2023. "Quantifying Character Similarity with Vision Transformers," Papers 2305.14672, arXiv.org.
    7. Deyun Yin & Kazuyuki Motohashi & Jianwei Dang, 2020. "Large-scale name disambiguation of Chinese patent inventors (1985–2016)," Scientometrics, Springer;Akadémiai Kiadó, vol. 122(2), pages 765-790, February.
    8. Stefano Breschi & Francesco Lissoni & Ernest Miguelez, 2017. "Foreign-origin inventors in the USA: testing for diaspora and brain gain effects," Journal of Economic Geography, Oxford University Press, vol. 17(5), pages 1009-1038.
    9. Yang, Chia-Hsuan & Nugent, Rebecca & Fuchs, Erica R.H., 2016. "Gains from others’ losses: Technology trajectories and the global division of firms," Research Policy, Elsevier, vol. 45(3), pages 724-745.
    10. Ashish Arora & Michelle Gittelman & Sarah Kaplan & John Lynch & Will Mitchell & Nicolaj Siggelkow & Chunmian Ge & Ke-Wei Huang & Ivan P. L. Png, 2016. "Engineer/scientist careers: Patents, online profiles, and misclassification bias," Strategic Management Journal, Wiley Blackwell, vol. 37(1), pages 232-253, January.
    11. Gandal, Neil & Kunievsky, Nadav & Branstetter, Lee, 2021. "Network-Mediated Knowledge Spillovers in ICT/Information Security," Review of Network Economics, De Gruyter, vol. 19(2), pages 85-114, January.
    12. Niccolò Innocenti & Francesco Capone & Luciana Lazzeretti & Sergio Petralia, 2022. "The role of inventors’ networks and variety for breakthrough inventions," Papers in Regional Science, Wiley Blackwell, vol. 101(1), pages 37-57, February.
    13. Dorner, Matthias & Harhoff, Dietmar & Gaessler, Fabian & Hoisl, Karin & Poege, Felix, 2019. "Linked Inventor Biography Data 1980-2014 : (INV-BIO ADIAB 8014)," FDZ Datenreport. Documentation on Labour Market Data 201803_en, Institut für Arbeitsmarkt- und Berufsforschung (IAB), Nürnberg [Institute for Employment Research, Nuremberg, Germany].
    14. Francesco Capone & Luciana Lazzeretti & Niccolò Innocenti, 2021. "Innovation and diversity: the role of knowledge networks in the inventive capacity of cities," Small Business Economics, Springer, vol. 56(2), pages 773-788, February.
    15. Vittorio Fuccella & Domenico De Stefano & Maria Prosperina Vitale & Susanna Zaccarin, 2016. "Improving co-authorship network structures by combining multiple data sources: evidence from Italian academic statisticians," Scientometrics, Springer;Akadémiai Kiadó, vol. 107(1), pages 167-184, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Li, Guan-Cheng & Lai, Ronald & D’Amour, Alexander & Doolin, David M. & Sun, Ye & Torvik, Vetle I. & Yu, Amy Z. & Fleming, Lee, 2014. "Disambiguation and co-authorship networks of the U.S. patent inventor database (1975–2010)," Research Policy, Elsevier, vol. 43(6), pages 941-955.
    2. Michele Pezzoni & Francesco Lissoni & Gianluca Tarasconi, 2014. "How to kill inventors: testing the Massacrator© algorithm for inventor disambiguation," Scientometrics, Springer;Akadémiai Kiadó, vol. 101(1), pages 477-504, October.
    3. Deyun Yin & Kazuyuki Motohashi & Jianwei Dang, 2020. "Large-scale name disambiguation of Chinese patent inventors (1985–2016)," Scientometrics, Springer;Akadémiai Kiadó, vol. 122(2), pages 765-790, February.
    4. Ajay Bhaskarabhatla & Luis Cabral & Deepak Hegde & Thomas Peeters, 2021. "Are Inventors or Firms the Engines of Innovation?," Management Science, INFORMS, vol. 67(6), pages 3899-3920, June.
    5. Deyun Yin & Kazuyuki Motohashi, 2018. "Inventor Name Disambiguation with Gradient Boosting Decision Tree and Inventor Mobility in China (1985-2016)," Discussion papers 18018, Research Institute of Economy, Trade and Industry (RIETI).
    6. Marc Gruber & Dietmar Harhoff & Karin Hoisl, 2013. "Knowledge Recombination Across Technological Boundaries: Scientists vs. Engineers," Management Science, INFORMS, vol. 59(4), pages 837-851, April.
    7. Carayol, Nicolas & Bergé, Laurent & Cassi, Lorenzo & Roux, Pascale, 2019. "Unintended triadic closure in social networks: The strategic formation of research collaborations between French inventors," Journal of Economic Behavior & Organization, Elsevier, vol. 163(C), pages 218-238.
    8. Markus Simeth & Michele Cincera, 2016. "Corporate Science, Innovation, and Firm Value," Management Science, INFORMS, vol. 62(7), pages 1970-1981, July.
    9. Chris Forman & Nicolas van Zeebroeck, 2012. "From Wires to Partners: How the Internet Has Fostered R&D Collaborations Within Firms," Management Science, INFORMS, vol. 58(8), pages 1549-1568, August.
    10. Massimiliano Ferrara & Roberto Mavilia & Bruno Antonio Pansera, 2017. "Extracting knowledge patterns with a social network analysis approach: an alternative methodology for assessing the impact of power inventors," Scientometrics, Springer;Akadémiai Kiadó, vol. 113(3), pages 1593-1625, December.
    11. Dr Chiara Rosazza Bondibene, 2012. "A Study of Patent Thickets," National Institute of Economic and Social Research (NIESR) Discussion Papers 401, National Institute of Economic and Social Research.
    12. Jasjit Singh & Ajay Agrawal, 2011. "Recruiting for Ideas: How Firms Exploit the Prior Inventions of New Hires," Management Science, INFORMS, vol. 57(1), pages 129-150, January.
    13. Cohen, Wesley M., 2010. "Fifty Years of Empirical Studies of Innovative Activity and Performance," Handbook of the Economics of Innovation, in: Bronwyn H. Hall & Nathan Rosenberg (ed.), Handbook of the Economics of Innovation, edition 1, volume 1, chapter 0, pages 129-213, Elsevier.
    14. Carlino, Gerald & Kerr, William R., 2015. "Agglomeration and Innovation," Handbook of Regional and Urban Economics, in: Gilles Duranton & J. V. Henderson & William C. Strange (ed.), Handbook of Regional and Urban Economics, edition 1, volume 5, chapter 0, pages 349-404, Elsevier.
    15. Mohd Shadab Danish & Pritam Ranjan & Ruchi Sharma, 2022. "Assessing the Impact of Patent Attributes on the Value of Discrete and Complex Innovations," Papers 2208.07222, arXiv.org.
    16. Choi, Mincheol & Lee, Chang-Yang, 2021. "Technological diversification and R&D productivity: The moderating effects of knowledge spillovers and core-technology competence," Technovation, Elsevier, vol. 104(C).
    17. Silvestri, Daniela & Riccaboni, Massimo & Della Malva, Antonio, 2018. "Sailing in all winds: Technological search over the business cycle," Research Policy, Elsevier, vol. 47(10), pages 1933-1944.
    18. Crescenzi, Riccardo & Nathan, Max & Rodríguez-Pose, Andrés, 2016. "Do inventors talk to strangers? On proximity and collaborative knowledge creation," Research Policy, Elsevier, vol. 45(1), pages 177-194.
    19. Martin Kalthaus, 2020. "Knowledge recombination along the technology life cycle," Journal of Evolutionary Economics, Springer, vol. 30(3), pages 643-704, July.
