Printed from https://ideas.repec.org/a/sae/sagope/v15y2025i2p21582440251333182.html

Calculating Semantic Frequency of GSL Words Using a BERT Model in Large Corpora

Authors

Listed:
  • Liu Lei
  • Gong Tongxi
  • Shi Jianjun
  • Guo Yi

Abstract

There has always been a pressing need to provide semantic information for words in high-frequency word lists, but technical limitations have hindered this goal. This study addresses this challenge by leveraging a large language model, such as BERT, to semantically annotate large corpora and identify the high-frequency senses of headwords from the General Service List (GSL). We aim to explore three key questions: (1) Can BERT automatically annotate large corpora and accurately calculate sense frequencies? (2) What are the high-frequency senses of GSL words? (3) Can this approach be verified? Using a BERT-based framework, we annotated 1,891 GSL headwords (10,925 senses) in the 100-million-word British National Corpus (BNC), representing each sense with a 1,024-dimensional vector. From this, we identified 3,695 high-frequency senses for the GSL words. Three main conclusions are drawn from this study. First, BERT demonstrates high accuracy in sense annotation, achieving 92% precision when disambiguating the senses of GSL words. Second, a relatively small number of high-frequency senses account for a significant portion of corpus coverage. Specifically, these high-frequency senses (33.8% of the total) cover approximately 60% of all GSL word occurrences in the BNC. Third, the high-frequency senses selected via this method can be verified by their consistent coverage across different corpora. This study illustrates a pioneering method for semantic annotation in large corpora, which can be easily applied to calculate semantic frequencies for other word lists.
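
The abstract does not spell out how occurrences are assigned to senses or how coverage is computed, so the following is only a minimal sketch under stated assumptions: each sense is represented by a vector, each corpus occurrence of a headword is assigned to the sense whose vector is closest by cosine similarity, sense frequency is the count of assigned occurrences, and coverage is the share of occurrences accounted for by a selected set of senses. The toy 3-dimensional vectors stand in for the 1,024-dimensional vectors mentioned above, and all names (`nearest_sense`, `sense_frequencies`, `coverage`, the `bank/...` labels) are hypothetical illustrations, not the authors' code or data.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_sense(embedding, sense_vectors):
    """Assumption: an occurrence takes the sense with the most similar vector."""
    return max(sense_vectors, key=lambda s: cosine(embedding, sense_vectors[s]))


def sense_frequencies(occurrences, sense_vectors):
    """Count how many occurrences are assigned to each sense."""
    return Counter(nearest_sense(e, sense_vectors) for e in occurrences)


def coverage(freqs, selected):
    """Share of all occurrences accounted for by the selected senses."""
    total = sum(freqs.values())
    return sum(freqs[s] for s in selected) / total if total else 0.0


# Toy sense vectors (hypothetical; real ones would come from a BERT encoder).
SENSE_VECTORS = {
    "bank/finance": [1.0, 0.0, 0.0],
    "bank/river": [0.0, 1.0, 0.0],
    "bank/tilt": [0.0, 0.0, 1.0],
}

# Toy contextual embeddings, one per corpus occurrence of the headword.
OCCURRENCES = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.1, 0.2],
    [0.1, 0.9, 0.0],
    [0.2, 0.1, 0.9],
]

freqs = sense_frequencies(OCCURRENCES, SENSE_VECTORS)
top = [sense for sense, _ in freqs.most_common(1)]  # the single most frequent sense
print(freqs)
print(coverage(freqs, top))
```

The same ratio logic underlies the figures reported above: 3,695 high-frequency senses out of 10,925 annotated senses is about 33.8% of the total.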

Suggested Citation

  • Liu Lei & Gong Tongxi & Shi Jianjun & Guo Yi, 2025. "Calculating Semantic Frequency of GSL Words Using a BERT Model in Large Corpora," SAGE Open, vol. 15(2), pages 21582440251, April.
  • Handle: RePEc:sae:sagope:v:15:y:2025:i:2:p:21582440251333182
    DOI: 10.1177/21582440251333182

    Download full text from publisher

    File URL: https://journals.sagepub.com/doi/10.1177/21582440251333182
    Download Restriction: no
