Printed from https://ideas.repec.org/a/gam/jftint/v11y2019i5p114-d230438.html

Word Sense Disambiguation Using Cosine Similarity Collaborates with Word2vec and WordNet

Author

Listed:
  • Korawit Orkphol

    (Information Security Research Center, College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China)

  • Wu Yang

    (Information Security Research Center, College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China)

Abstract

Words have different meanings (i.e., senses) depending on the context. Disambiguating the correct sense is an important and challenging task in natural language processing. An intuitive approach is to select the sense whose definition has the highest similarity to the context, using the sense definitions provided by WordNet, a large lexical database of English in which nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms interlinked through conceptual-semantic and lexical relations. Traditional unsupervised approaches compute similarity by counting words that overlap between the context and the sense definitions, which must match exactly. Similarity should instead be computed from how words are related, by representing the context and the sense definitions in a vector space model and analyzing the distributional semantic relationships among them with latent semantic analysis (LSA). As a corpus of text grows, however, LSA consumes much more memory and does not scale flexibly to training on a huge corpus. A word-embedding approach has an advantage here. Word2vec is a popular word-embedding approach that represents words in a fixed-size vector space through either the skip-gram or the continuous bag-of-words (CBOW) model, and it captures semantic and syntactic word similarities from a huge corpus of text more effectively than LSA. Our method uses Word2vec to construct a context sentence vector and sense definition vectors, then scores each word sense by the cosine similarity between those sentence vectors. Each sense definition is also expanded with sense relations retrieved from WordNet. If a score does not exceed a specific threshold, it is combined with the probability of that sense's distribution learned from SEMCOR, a large sense-tagged corpus. The senses with high scores are taken as the possible answers.
Our method achieves 50.9% accuracy (48.7% without the probability of sense distribution), which is higher than the baselines (i.e., the original, simplified, adapted, and LSA Lesk algorithms) and outperforms many unsupervised systems that participated in the SENSEVAL-3 English lexical sample task.
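The scoring pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy three-dimensional embeddings, the averaging of word vectors into a sentence vector, the example glosses, priors, and threshold are all hypothetical stand-ins; a real system would use trained Word2vec vectors, WordNet glosses expanded with sense relations, and sense priors estimated from SEMCOR.

```python
import math

# Toy embeddings standing in for trained Word2vec vectors (assumption:
# a sentence vector is the average of its word vectors).
EMBEDDINGS = {
    "bank":    [0.9, 0.1, 0.0],
    "money":   [0.8, 0.3, 0.1],
    "deposit": [0.7, 0.2, 0.2],
    "river":   [0.1, 0.9, 0.2],
    "water":   [0.2, 0.8, 0.3],
    "slope":   [0.1, 0.7, 0.4],
}

def sentence_vector(words):
    """Average the embeddings of the words that have a vector."""
    vecs = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def disambiguate(context, senses, sense_priors, threshold=0.9):
    """Score each sense gloss against the context by cosine similarity.
    If no score clears the threshold, combine each score with the
    sense's prior probability (e.g. learned from a sense-tagged corpus)
    before picking the best sense."""
    ctx_vec = sentence_vector(context)
    scores = {s: cosine(ctx_vec, sentence_vector(gloss))
              for s, gloss in senses.items()}
    if max(scores.values()) < threshold:
        scores = {s: scores[s] * sense_priors[s] for s in scores}
    return max(scores, key=scores.get)

# Hypothetical glosses for two senses of "bank", in the spirit of
# definitions expanded with related WordNet senses.
senses = {
    "bank.n.financial": ["money", "deposit"],
    "bank.n.river":     ["river", "water", "slope"],
}
priors = {"bank.n.financial": 0.7, "bank.n.river": 0.3}

print(disambiguate(["money", "deposit"], senses, priors))
# → bank.n.financial
```

The threshold acts as a confidence gate: when the purely distributional score is decisive, the sense prior is ignored; otherwise the prior breaks ties in favor of the more frequent sense.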

Suggested Citation

  • Korawit Orkphol & Wu Yang, 2019. "Word Sense Disambiguation Using Cosine Similarity Collaborates with Word2vec and WordNet," Future Internet, MDPI, vol. 11(5), pages 1-16, May.
  • Handle: RePEc:gam:jftint:v:11:y:2019:i:5:p:114-:d:230438

    Download full text from publisher

    File URL: https://www.mdpi.com/1999-5903/11/5/114/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/1999-5903/11/5/114/
    Download Restriction: no

    Citations

    Citations are extracted by the CitEc Project.

    Cited by:

    1. Gugulica, Madalina & Burghardt, Dirk, 2023. "Mapping indicators of cultural ecosystem services use in urban green spaces based on text classification of geosocial media data," Ecosystem Services, Elsevier, vol. 60(C).
    2. Ana Laura Lezama-Sánchez & Mireya Tovar Vidal & José A. Reyes-Ortiz, 2022. "An Approach Based on Semantic Relationship Embeddings for Text Classification," Mathematics, MDPI, vol. 10(21), pages 1-15, November.

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:gam:jftint:v:11:y:2019:i:5:p:114-:d:230438. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows you to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    We have no bibliographic references for this item. You can help add them by using this form.

    If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic, or download information, contact: MDPI Indexing Manager (email available below). General contact details of provider: https://www.mdpi.com .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.