
Text matching to measure patent similarity


  • Sam Arts
  • Bruno Cassiman
  • Juan Carlos Gomez


We propose using text matching to measure the technological similarity between patents. Technology experts from different fields validate the new similarity measure and its improvement on measures based on the United States Patent Classification System, and identify its limitations. As an application, we replicate prior findings on the localization of knowledge spillovers by constructing a case-control group of text-matched patents. We also provide open access to the code and data to calculate the similarity between any two utility patents granted by the United States Patent and Trademark Office between 1976 and 2013, or between any two patent portfolios.
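The abstract does not say how the text matching is computed, so the following is only a minimal sketch under stated assumptions: each patent is reduced to a set of keywords from its text, and similarity is the Jaccard overlap of the two sets. The stop-word list, the function names, and the portfolio pooling rule are all hypothetical choices made for illustration; they are not taken from the authors' released code.

```python
import re

# Hypothetical stop-word list for illustration only; not from the paper.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "with", "is"}


def keywords(text: str) -> set[str]:
    """Lowercase the text, keep alphabetic tokens, drop stop words and 1-letter tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if len(t) > 1 and t not in STOP_WORDS}


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of keywords common to both sets; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def patent_similarity(text_a: str, text_b: str) -> float:
    """Assumed similarity between two patents: Jaccard overlap of their keyword sets."""
    return jaccard(keywords(text_a), keywords(text_b))


def portfolio_similarity(texts_a: list[str], texts_b: list[str]) -> float:
    """Assumed portfolio rule: pool the keywords of every patent, then compare the pools."""
    pool_a = set().union(*(keywords(t) for t in texts_a))
    pool_b = set().union(*(keywords(t) for t in texts_b))
    return jaccard(pool_a, pool_b)


def closest_control(focal: str, candidates: dict[str, str]) -> str:
    """Return the id of the candidate patent whose text is most similar to the focal patent."""
    return max(candidates, key=lambda pid: patent_similarity(focal, candidates[pid]))


if __name__ == "__main__":
    focal = "Method for coating a steel pipe"
    pool = {
        "P1": "Steel pipe coating apparatus",
        "P2": "Wireless antenna for a mobile handset",
    }
    print(patent_similarity(focal, pool["P1"]))
    print(closest_control(focal, pool))
```

In a case-control design like the one the abstract mentions, a function such as `closest_control` would pick, for each focal patent, the most textually similar patent from a pool of candidates. Any further restrictions on that pool are left out here.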

Suggested Citation

  • Sam Arts & Bruno Cassiman & Juan Carlos Gomez, 2017. "Text matching to measure patent similarity," Working Papers of Department of Management, Strategy and Innovation, Leuven 590543, KU Leuven, Faculty of Economics and Business (FEB), Department of Management, Strategy and Innovation, Leuven.
  • Handle: RePEc:ete:msiper:590543

    Download full text from publisher

    File URL:
    File Function: MSI_1708
    Download Restriction: no


    Cited by:

    1. Juan Carlos Gomez, 2019. "Analysis of the effect of data properties in automated patent classification," Scientometrics, Springer;Akadémiai Kiadó, vol. 121(3), pages 1239-1268, December.
    2. Nancy Kong & Uwe Dulleck & Adam Jaffe & Shupeng Sun & Sowmya Vajjala, 2020. "Linguistic Metrics for Patent Disclosure: Evidence from University versus Corporate Patents," CESifo Working Paper Series 8571, CESifo.
    3. Sijie Feng, 2020. "The proximity of ideas: An analysis of patent text using machine learning," PLOS ONE, Public Library of Science, vol. 15(7), pages 1-19, July.
    4. Schnitzer, Monika & Watzinger, Martin, 2019. "Standing on the shoulders of science," CEPR Discussion Papers 13766, C.E.P.R. Discussion Papers.
    5. Kuan, Chung-Huei & Chen, Dar-Zen & Huang, Mu-Hsuan, 2019. "Bibliographically coupled patents: Their temporal pattern and combined relevance," Journal of Informetrics, Elsevier, vol. 13(4).
    6. Nicholas Argyres & Luis A. Rios & Brian S. Silverman, 2020. "Organizational change and the dynamics of innovation: Formal R&D structure and intrafirm inventor networks," Strategic Management Journal, Wiley Blackwell, vol. 41(11), pages 2015-2049, November.
    7. Holger Graf & Matthias Menter, 2020. "Public research and the quality of inventions: the role and impact of entrepreneurial universities and regional network embeddedness," Jena Economic Research Papers 2020-011, Friedrich-Schiller-University Jena.
    8. Kyle W. Higham & Gaétan de Rassenfosse & Adam B. Jaffe, 2020. "Patent Quality: Towards a Systematic Framework for Analysis and Measurement," NBER Working Papers 27598, National Bureau of Economic Research, Inc.
    9. Michaël Bikard, 2020. "Idea twins: Simultaneous discoveries as a research tool," Strategic Management Journal, Wiley Blackwell, vol. 41(8), pages 1528-1543, August.
    10. Bernardo S Buarque & Ronald B Davies & Ryan M Hynes & Dieter F Kogler, 2020. "OK Computer: the creation and integration of AI in Europe," Cambridge Journal of Regions, Economy and Society, Cambridge Political Economy Society, vol. 13(1), pages 175-192.
    11. Jie Chen & Jialin Chen & Shu Zhao & Yanping Zhang & Jie Tang, 2020. "Exploiting word embedding for heterogeneous topic model towards patent recommendation," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 2091-2108, December.
    12. Sam Arts & Lee Fleming, 2018. "Paradise of Novelty—Or Loss of Human Capital? Exploring New Fields and Inventive Output," Organization Science, INFORMS, vol. 29(6), pages 1074-1092, December.
    13. Parraguez, Pedro & Škec, Stanko & e Carmo, Duarte Oliveira & Maier, Anja, 2020. "Quantifying technological change as a combinatorial process," Technological Forecasting and Social Change, Elsevier, vol. 151(C).
    14. Cesare Righi & Timothy Simcoe, 2020. "Patenting Inventions or Inventing Patents? Strategic Use of Continuations at the USPTO," NBER Working Papers 27686, National Bureau of Economic Research, Inc.
    15. Changyong Lee & Gyumin Lee, 2019. "Technology opportunity analysis based on recombinant search: patent landscape analysis for idea generation," Scientometrics, Springer;Akadémiai Kiadó, vol. 121(2), pages 603-632, November.

    More about this item


    Keywords: text mining; matching; patent; patent classification; technological similarity
