
Measuring document similarity with weighted averages of word embeddings

Author

Listed:
  • Seegmiller, Bryan
  • Papanikolaou, Dimitris
  • Schmidt, Lawrence D.W.

Abstract

We detail a methodology for estimating the textual similarity between two documents while accounting for the possibility that two different words can have a similar meaning. We illustrate the method’s usefulness in facilitating comparisons between documents with very different formats and vocabularies by textually linking occupation task and industry output descriptions with related technologies as described in patent texts; we also examine economic applications of the resultant document similarity measures. In a final application we demonstrate that the method also works well relative to alternatives for comparing documents within the same domain by showing that pairwise textual similarity between occupations’ task descriptions strongly predicts the probability that a given worker will transition from one occupation to another. Finally, we offer some suggestions on other potential uses and guidance in implementing the method.
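
The core idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes TF-IDF-style weights (the abstract does not say which weights are used), and the tiny embedding table, the token lists, and all function names are hypothetical. Each document becomes a weighted average of its word vectors, and two documents are compared by cosine similarity.

```python
import math
from collections import Counter

# Toy 3-dimensional word embeddings. The values are made up for illustration;
# a real application would load pretrained vectors instead.
EMBEDDINGS = {
    "repair": [0.9, 0.1, 0.0],
    "fix":    [0.9, 0.1, 0.0],
    "engine": [0.1, 0.9, 0.0],
    "motor":  [0.1, 0.9, 0.0],
    "bake":   [0.0, 0.1, 0.9],
    "bread":  [0.0, 0.2, 0.8],
}

def idf_weights(corpus):
    """Smoothed inverse document frequency per word (assumed weighting scheme)."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    return {word: math.log(n_docs / df) + 1.0 for word, df in doc_freq.items()}

def document_vector(tokens, idf):
    """Weighted average of word embeddings, with weight = term count * IDF."""
    counts = Counter(t for t in tokens if t in EMBEDDINGS)
    dim = len(next(iter(EMBEDDINGS.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, tf in counts.items():
        weight = tf * idf.get(word, 1.0)
        weight_sum += weight
        for i, x in enumerate(EMBEDDINGS[word]):
            total[i] += weight * x
    if weight_sum == 0.0:
        return total  # no known words: return the zero vector
    return [x / weight_sum for x in total]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical tokenized documents: a task description, a patent, a recipe.
task = ["repair", "engine"]
patent = ["fix", "motor"]
recipe = ["bake", "bread"]

idf = idf_weights([task, patent, recipe])
sim_related = cosine_similarity(document_vector(task, idf), document_vector(patent, idf))
sim_unrelated = cosine_similarity(document_vector(task, idf), document_vector(recipe, idf))
print(sim_related > sim_unrelated)
```

The task description and the patent share no words, so a plain bag-of-words comparison would score them as unrelated. Because their words have nearby embeddings, the averaged vectors still score as highly similar, which is the property the abstract emphasizes.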

Suggested Citation

  • Seegmiller, Bryan & Papanikolaou, Dimitris & Schmidt, Lawrence D.W., 2023. "Measuring document similarity with weighted averages of word embeddings," Explorations in Economic History, Elsevier, vol. 87(C).
  • Handle: RePEc:eee:exehis:v:87:y:2023:i:c:s0014498322000729
    DOI: 10.1016/j.eeh.2022.101494
    Citations


    Cited by:

    1. Lipowski, Cäcilia & Salomons, Anna & Zierahn-Weilage, Ulrich, 2024. "Expertise at work: New technologies, new skills, and worker impacts," ZEW Discussion Papers 24-044, ZEW - Leibniz Centre for European Economic Research.
    2. von Bodman, Nicolas, 2024. "The impact of prospectus language on IPO underpricing: A textual analysis of European IPOs," Junior Management Science (JUMS), Junior Management Science e. V., vol. 9(4), pages 1934-1963.
    3. Andersson, David E. & La Mela, Matti & Tell, Fredrik, 2024. "Family first: Defining, constructing, and applying historical patent families," Explorations in Economic History, Elsevier, vol. 94(C).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Christina Langer & Simon Wiederhold, 2023. "The Value of Early-Career Skills," Working Papers 222, Bavarian Graduate Program in Economics (BGPE).
    2. John Carter Braxton & Kyle F. Herkenhoff & Jonathan Rothbaum & Lawrence Schmidt, 2021. "Changing Income Risk across the US Skill Distribution: Evidence from a Generalized Kalman Filter," Opportunity and Inclusive Growth Institute Working Papers 55, Federal Reserve Bank of Minneapolis.
    3. Julius Koschnick, 2025. "Teacher-directed scientific change:The case of the English Scientific Revolution," Working Papers 0274, European Historical Economics Society (EHES).
    4. David Autor & Caroline Chin & Anna Salomons & Bryan Seegmiller, 2024. "New Frontiers: The Origins and Content of New Work, 1940–2018," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 139(3), pages 1399-1465.
    5. Kang, Yankun & Leng, Xuan & Liao, Yunxiang & Zheng, Shilin, 2024. "Information disclosure, spillovers, and knowledge accumulation," China Economic Review, Elsevier, vol. 84(C).
    6. Lipowski, Cäcilia & Salomons, Anna & Zierahn-Weilage, Ulrich, 2024. "Expertise at work: New technologies, new skills, and worker impacts," ZEW Discussion Papers 24-044, ZEW - Leibniz Centre for European Economic Research.
    7. Freund, L. B., 2022. "Superstar Teams: The Micro Origins and Macro Implications of Coworker Complementarities," Janeway Institute Working Papers 2235, Faculty of Economics, University of Cambridge.
    8. Nicholas Bloom & Tarek Alexander Hassan & Aakash Kalyani & Josh Lerner & Ahmed Tahoun, 2021. "The diffusion of disruptive technologies," CEP Discussion Papers dp1798, Centre for Economic Performance, LSE.
    9. Samuel Muehlemann, 2024. "AI Adoption and Workplace Training," Economics of Education Working Paper Series 0232, University of Zurich, Department of Business Administration (IBW).
    10. Caselli, Mauro & Fracasso, Andrea & Scicchitano, Sergio & Traverso, Silvio & Tundis, Enrico, 2025. "What workers and robots do: An activity-based analysis of the impact of robotization on changes in local employment," Research Policy, Elsevier, vol. 54(1).
    11. Marin, Giovanni & Vona, Francesco, 2023. "Finance and the reallocation of scientific, engineering and mathematical talent," Research Policy, Elsevier, vol. 52(5).
    12. Klaus Gugler & Florian Szücs & Ulrich Wohak, 2023. "Start-up Acquisitions, Venture Capital and Innovation: A Comparative Study of Google, Apple, Facebook, Amazon and Microsoft," Department of Economics Working Papers wuwp340, Vienna University of Economics and Business, Department of Economics.
    13. Hege, Ulrich & Li, Kai & Zhang, Yifei, 2025. "Climate Innovation and Carbon Emissions: Evidence from Supply Chain Networks," TSE Working Papers 25-1608, Toulouse School of Economics (TSE).
    14. Guo, Yuchen Mo & Falck, Oliver & Langer, Christina & Lindlacher, Valentin & Wiederhold, Simon, 2024. "Training, Automation, and Wages: Worker-Level Evidence," VfS Annual Conference 2024 (Berlin): Upcoming Labor Market Challenges 302366, Verein für Socialpolitik / German Economic Association.
    15. Hensvik, Lena & Skans, Oskar Nordström, 2023. "The skill-specific impact of past and projected occupational decline," Labour Economics, Elsevier, vol. 81(C).
    16. Samuel Cole & Zachary Cowell & John M. Nunley & R. Alan Seals Jr, 2022. "The Distribution of Occupational Tasks in the United States: Implications for a Diverse and Aging Population," Papers 2205.00497, arXiv.org.
    17. Antonio Martins-Neto & Nanditha Mathew & Pierre Mohnen & Tania Treibich, 2024. "Is There Job Polarization in Developing Economies? A Review and Outlook," The World Bank Research Observer, World Bank, vol. 39(2), pages 259-288.
    18. Ludger Woessmann, 2024. "Skills and Earnings: A Multidimensional Perspective on Human Capital," CESifo Working Paper Series 11428, CESifo.
    19. Max Nathan & Anna Rosso, 2017. "Innovative events," Development Working Papers 429, Centro Studi Luca d'Agliano, University of Milano, revised 08 Apr 2019.
    20. Sergio Ocampo, 2019. "A task-based theory of occupations with multidimensional heterogeneity," 2019 Meeting Papers 477, Society for Economic Dynamics.
