Printed from https://ideas.repec.org/p/hig/wpaper/35hum2013.html

Frequency dictionary of inflectional paradigms: core Russian vocabulary

Author

Listed:
  • Olga Lyashevskaya

    (National Research University Higher School of Economics, Moscow / University of Helsinki)

Abstract

A new kind of frequency dictionary is a valuable reference for researchers and students of Russian. It shows the grammatical profiles of nouns, adjectives, and verbs, namely the distribution of grammatical forms in the inflectional paradigm. The dictionary is based on data from the Russian National Corpus (RNC) and covers a core vocabulary (the 5,000 most frequently used lexemes). Russian is a morphologically rich language: its noun paradigms comprise two dozen case and number forms, while verb paradigms include up to 160 grammatical forms. The dictionary departs from traditional frequency lexicography in several ways: 1) word forms are arranged in paradigms, so their frequencies can be compared and ranked; 2) the dictionary focuses on the grammatical profiles of individual lexemes rather than on the overall distribution of grammatical features (e.g., the fact that Future forms are used less frequently than Past forms); 3) the grammatical profiles of lexical units can be compared against the mean scores of their lexico-semantic class; 4) in each part of speech or semantic class, lexemes with certain biases in the grammatical profile can be easily detected (e.g., verbs used mostly in the Imperative or the Past neuter, or nouns often used in the plural); and 5) the distribution of homonymous word forms and grammatical variants can be followed over time and within certain genres and registers. The dictionary will be a source for research on Russian grammar, paradigm structure, form acquisition, grammatical semantics, and the variation of grammatical forms. The main challenge for this initiative is the intra-paradigm and inter-paradigm homonymy of word forms in the corpus data. Manual disambiguation is accurate but covers only approximately five million words in the RNC, so the data may be sparse and possibly unreliable. Automatic disambiguation is slightly less accurate, but the larger corpus it makes available provides more reliable data for rare word forms.
A user can switch between a "basic" version, which is based on a smaller collection of manually disambiguated texts, and an "expanded" version, which is based on the main corpus, a newspaper corpus, a corpus of poetry, and the spoken corpus (320 million words in total). The article addresses some general issues, such as establishing the common basis of comparison, a level of granularity for the grammatical profile, and units of measurement. We suggest certain solutions related to the selection of data, corpus data processing, and maintaining the online version of the frequency dictionary.
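
The notion of a grammatical profile described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function names (`grammatical_profile`, `class_mean_profile`, `biased_forms`), the threshold, the lexeme labels, and the counts are all hypothetical. It assumes that a profile is simply the relative frequency of each form within a lexeme's paradigm, and that a "bias" is a share that exceeds the class mean by a fixed margin.

```python
def grammatical_profile(form_counts):
    """Turn raw counts of grammatical forms into relative frequencies."""
    total = sum(form_counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in form_counts.items()}


def class_mean_profile(profiles):
    """Average the profiles of all lexemes in one class, tag by tag."""
    tags = sorted({tag for p in profiles for tag in p})
    return {
        tag: sum(p.get(tag, 0.0) for p in profiles) / len(profiles)
        for tag in tags
    }


def biased_forms(profile, mean_profile, threshold=0.2):
    """Return tags whose share exceeds the class mean by more than `threshold`.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    return sorted(
        tag
        for tag, share in profile.items()
        if share - mean_profile.get(tag, 0.0) > threshold
    )


# Invented toy counts for three hypothetical nouns of one semantic class.
toy_counts = {
    "lexeme_a": {"sg": 90, "pl": 10},
    "lexeme_b": {"sg": 50, "pl": 50},
    "lexeme_c": {"sg": 10, "pl": 90},
}

profiles = {lex: grammatical_profile(c) for lex, c in toy_counts.items()}
mean_profile = class_mean_profile(list(profiles.values()))

for lex, profile in profiles.items():
    print(lex, profile, biased_forms(profile, mean_profile))
```

With these toy numbers, `lexeme_c` is flagged as plural-biased and `lexeme_a` as singular-biased relative to the class mean. A real profile would use a much finer tag set (case and number forms for nouns, up to 160 forms for verbs) and would have to deal with the homonymy issues the abstract mentions.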

Suggested Citation

  • Olga Lyashevskaya, 2013. "Frequency dictionary of inflectional paradigms: core Russian vocabulary," HSE Working papers WP BRP 35/HUM/2013, National Research University Higher School of Economics.
  • Handle: RePEc:hig:wpaper:35hum2013

    Download full text from publisher

    File URL: http://www.hse.ru/data/2013/06/27/1285976210/35HUM2013.pdf
    Download Restriction: no

    More about this item

    Keywords

    frequency dictionary; grammatical profile; inflection; grammatical homonymy; grammatical variation; Russian; Russian National Corpus;

    JEL classification:

    • Z - Other Special Topics

