
Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German

Authors
  • Sascha Wolfer

    (Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany)

  • Alexander Koplenig

    (Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany)

  • Marc Kupietz

    (Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany)

  • Carolin Müller-Spitzer

    (Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany)

Abstract

We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
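The case study described above (vocabulary size and hapax legomena as more folds are included) can be illustrated with a minimal sketch. It is not the authors' script and it does not read the actual DeReKoGram files: the function name `vocabulary_growth` and the toy fold contents are assumptions made for illustration, and each fold is represented simply as an in-memory mapping from word form to frequency.

```python
from collections import Counter

def vocabulary_growth(folds):
    """Return (n_folds, vocabulary_size, hapax_count) after adding each fold.

    `folds` is an iterable of Counter-like mappings (word form -> frequency).
    This in-memory representation is an assumption for illustration only.
    """
    cumulative = Counter()
    growth = []
    for n_folds, fold in enumerate(folds, start=1):
        cumulative.update(fold)  # add this fold's frequencies to the running totals
        vocabulary_size = len(cumulative)
        # A hapax legomenon occurs exactly once in the data seen so far.
        hapax_count = sum(1 for freq in cumulative.values() if freq == 1)
        growth.append((n_folds, vocabulary_size, hapax_count))
    return growth

# Hypothetical toy folds; the real dataset has 16 folds of 1-, 2-, and 3-grams.
toy_folds = [
    Counter({"der": 5, "Hund": 1, "läuft": 1}),
    Counter({"der": 4, "Hund": 2, "Katze": 1}),
    Counter({"die": 3, "Katze": 1, "schläft": 1}),
]

growth = vocabulary_growth(toy_folds)
for n_folds, vocabulary_size, hapax_count in growth:
    print(n_folds, vocabulary_size, hapax_count)
```

A word that is a hapax in one fold can stop being one once another fold is added (here "Katze"), which is why the hapax count is recomputed on the cumulative frequencies rather than summed per fold.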

Suggested Citation

  • Sascha Wolfer & Alexander Koplenig & Marc Kupietz & Carolin Müller-Spitzer, 2023. "Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German," Data, MDPI, vol. 8(11), pages 1-10, November.
  • Handle: RePEc:gam:jdataj:v:8:y:2023:i:11:p:170-:d:1277877
    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/8/11/170/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/8/11/170/
    Download Restriction: no
