Measuring document similarity with weighted averages of word embeddings

Authors

  • Seegmiller, Bryan
  • Papanikolaou, Dimitris
  • Schmidt, Lawrence D.W.

Abstract

We detail a methodology for estimating the textual similarity between two documents while accounting for the possibility that two different words can have a similar meaning. We illustrate the method’s usefulness in facilitating comparisons between documents with very different formats and vocabularies by textually linking occupation task and industry output descriptions with related technologies as described in patent texts; we also examine economic applications of the resultant document similarity measures. In a final application we demonstrate that the method also works well relative to alternatives for comparing documents within the same domain by showing that pairwise textual similarity between occupations’ task descriptions strongly predicts the probability that a given worker will transition from one occupation to another. Finally, we offer some suggestions on other potential uses and guidance in implementing the method.
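As a rough illustration of the idea in the abstract, the sketch below represents each document as a weighted average of its word embeddings and compares documents by cosine similarity. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the weighting scheme or the embedding source, so the TF-IDF-style weights, the three-dimensional toy embeddings, and all function names here are hypothetical.

```python
import math
from collections import Counter

# Toy 3-dimensional embeddings, invented for illustration only.
# A real application would load pretrained vectors instead.
EMBEDDINGS = {
    "repair": [0.9, 0.1, 0.0],
    "fix":    [0.8, 0.2, 0.0],
    "engine": [0.1, 0.9, 0.1],
    "motor":  [0.2, 0.8, 0.1],
    "bake":   [0.0, 0.1, 0.9],
    "bread":  [0.1, 0.0, 0.8],
}


def idf_weights(corpus):
    """Smoothed inverse document frequency for each term (assumed weighting)."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def document_vector(tokens, idf):
    """Weighted average of the embeddings of the tokens found in EMBEDDINGS."""
    counts = Counter(t for t in tokens if t in EMBEDDINGS)
    dim = len(next(iter(EMBEDDINGS.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for term, count in counts.items():
        weight = count * idf.get(term, 1.0)  # term frequency times IDF
        weight_sum += weight
        for i, value in enumerate(EMBEDDINGS[term]):
            total[i] += weight * value
    if weight_sum == 0.0:
        return total  # no known words: return the zero vector
    return [value / weight_sum for value in total]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical documents: a task description, a patent snippet, a recipe.
task = ["repair", "engine"]
patent = ["fix", "motor"]
recipe = ["bake", "bread"]

idf = idf_weights([task, patent, recipe])
sim_related = cosine_similarity(document_vector(task, idf), document_vector(patent, idf))
sim_unrelated = cosine_similarity(document_vector(task, idf), document_vector(recipe, idf))
```

In this toy setup the task and patent documents share no words, so a plain bag-of-words cosine similarity between them would be zero, yet `sim_related` comes out well above `sim_unrelated` because the synonyms were given nearby embeddings.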

Suggested Citation

  • Seegmiller, Bryan & Papanikolaou, Dimitris & Schmidt, Lawrence D.W., 2023. "Measuring document similarity with weighted averages of word embeddings," Explorations in Economic History, Elsevier, vol. 87(C).
  • Handle: RePEc:eee:exehis:v:87:y:2023:i:c:s0014498322000729
    DOI: 10.1016/j.eeh.2022.101494

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0014498322000729
    Download Restriction: Full text for ScienceDirect subscribers only


    References listed on IDEAS

    1. Bryan Kelly & Dimitris Papanikolaou & Amit Seru & Matt Taddy, 2021. "Measuring Technological Innovation over the Long Run," American Economic Review: Insights, American Economic Association, vol. 3(3), pages 303-320, September.
    2. Stephen Hansen & Tejas Ramdas & Raffaella Sadun & Joe Fuller, 2021. "The Demand for Executive Skills," NBER Working Papers 28959, National Bureau of Economic Research, Inc.
    3. Gerard Hoberg & Gordon Phillips, 2016. "Text-Based Network Industries and Endogenous Product Differentiation," Journal of Political Economy, University of Chicago Press, vol. 124(5), pages 1423-1465.
    4. David Autor & Caroline Chin & Anna M. Salomons & Bryan Seegmiller, 2022. "New Frontiers: The Origins and Content of New Work, 1940–2018," NBER Working Papers 30389, National Bureau of Economic Research, Inc.
    5. Enghin Atalay & Phai Phongthiengtham & Sebastian Sotelo & Daniel Tannenbaum, 2020. "The Evolution of Work in the United States," American Economic Journal: Applied Economics, American Economic Association, vol. 12(2), pages 1-34, April.
    6. Acemoglu, Daron & Autor, David, 2011. "Skills, Tasks and Technologies: Implications for Employment and Earnings," Handbook of Labor Economics, in: O. Ashenfelter & D. Card (ed.), Handbook of Labor Economics, edition 1, volume 4, chapter 12, pages 1043-1171, Elsevier.
    7. Leonid Kogan & Dimitris Papanikolaou & Lawrence D. W. Schmidt & Bryan Seegmiller, 2021. "Technology, Vintage-Specific Human Capital, and Labor Displacement: Evidence from Linking Patents with Occupations," NBER Working Papers 29552, National Bureau of Economic Research, Inc.
    8. Barbara Biasi & Song Ma, 2022. "The Education-Innovation Gap," CESifo Working Paper Series 9653, CESifo.
    9. Barbara Biasi & Song Ma, 2022. "The Education-Innovation Gap," NBER Working Papers 29853, National Bureau of Economic Research, Inc.
    10. Scott Deerwester & Susan T. Dumais & George W. Furnas & Thomas K. Landauer & Richard Harshman, 1990. "Indexing by latent semantic analysis," Journal of the American Society for Information Science, Association for Information Science & Technology, vol. 41(6), pages 391-407, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Christina Langer & Simon Wiederhold, 2023. "The Value of Early-Career Skills," Working Papers 222, Bavarian Graduate Program in Economics (BGPE).
    2. John Carter Braxton & Kyle F. Herkenhoff & Jonathan Rothbaum & Lawrence Schmidt, 2021. "Changing Income Risk across the US Skill Distribution: Evidence from a Generalized Kalman Filter," Opportunity and Inclusive Growth Institute Working Papers 55, Federal Reserve Bank of Minneapolis.
    3. Bloom, Nicholas & Hassan, Tarek Alexander & Kalyani, Aakash & Lerner, Josh & Tahoun, Ahmed, 2021. "The diffusion of disruptive technologies," LSE Research Online Documents on Economics 113870, London School of Economics and Political Science, LSE Library.
    4. Marin, Giovanni & Vona, Francesco, 2023. "Finance and the reallocation of scientific, engineering and mathematical talent," Research Policy, Elsevier, vol. 52(5).
    5. Klaus Gugler & Florian Szücs & Ulrich Wohak, 2023. "Start-up Acquisitions, Venture Capital and Innovation: A Comparative Study of Google, Apple, Facebook, Amazon and Microsoft," Department of Economics Working Papers wuwp340, Vienna University of Economics and Business, Department of Economics.
    6. Hensvik, Lena & Skans, Oskar Nordström, 2023. "The skill-specific impact of past and projected occupational decline," Labour Economics, Elsevier, vol. 81(C).
    7. Antonio Martins-Neto & Nanditha Mathew & Pierre Mohnen & Tania Treibich, 2021. "Is There Job Polarization in Developing Economies? A Review and Outlook," CESifo Working Paper Series 9444, CESifo.
    8. Max Nathan & Anna Rosso, 2017. "Innovative events," Development Working Papers 429, Centro Studi Luca d'Agliano, University of Milano, revised 08 Apr 2019.
    9. Sergio Ocampo, 2019. "A task-based theory of occupations with multidimensional heterogeneity," 2019 Meeting Papers 477, Society for Economic Dynamics.
    10. David Autor & Caroline Chin & Anna M. Salomons & Bryan Seegmiller, 2022. "New Frontiers: The Origins and Content of New Work, 1940–2018," NBER Working Papers 30389, National Bureau of Economic Research, Inc.
    11. Consoli, Davide & Marin, Giovanni & Rentocchini, Francesco & Vona, Francesco, 2023. "Routinization, within-occupation task changes and long-run employment dynamics," Research Policy, Elsevier, vol. 52(1).
    12. Koomen, Miriam & Backes-Gellner, Uschi, 2022. "Occupational tasks and wage inequality in West Germany: A decomposition analysis," Labour Economics, Elsevier, vol. 79(C).
    13. Baslandze, Salomé & Argente, David & Hanley, Douglas & Moreira, Sara, 2020. "Patents to Products: Product Innovation and Firm Dynamics," CEPR Discussion Papers 14692, C.E.P.R. Discussion Papers.
    14. Qiguo Gong, 2023. "Machine endowment cost model: task assignment between humans and machines," Palgrave Communications, Palgrave Macmillan, vol. 10(1), pages 1-8, December.
    15. Ekaterina Prytkova & Fabien Petit & Deyu Li & Sugat Chaturvedi & Tommaso Ciarli, 2024. "The Employment Impact of Emerging Digital Technologies," CEPEO Working Paper Series 24-01, UCL Centre for Education Policy and Equalising Opportunities, revised Feb 2024.
    16. Marchand, Joseph, 2020. "Routine Tasks were Demanded from Workers during an Energy Boom," Working Papers 2020-8, University of Alberta, Department of Economics.
    17. Tyna Eloundou & Sam Manning & Pamela Mishkin & Daniel Rock, 2023. "GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models," Papers 2303.10130, arXiv.org, revised Aug 2023.
    18. Stähler, Nikolai, 2021. "The Impact of Aging and Automation on the Macroeconomy and Inequality," Journal of Macroeconomics, Elsevier, vol. 67(C).
    19. Nathan, Max & Rosso, Anna, 2022. "Innovative events: product launches, innovation and firm performance," Research Policy, Elsevier, vol. 51(1).
    20. David J Deming & Kadeem Noray, 2020. "Earnings Dynamics, Changing Job Skills, and STEM Careers," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 135(4), pages 1965-2005.
