IDEAS home Printed from https://ideas.repec.org/a/gam/jdataj/v8y2023i5p89-d1145798.html
   My bibliography  Save this article

A Comprehensive Dataset of Spelling Errors and Users’ Corrections in Croatian Language

Author

Listed:
  • Gordan Gledec

    (Department of Applied Computing, Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, HR-10000 Zagreb, Croatia)

  • Marko Horvat

    (Department of Applied Computing, Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, HR-10000 Zagreb, Croatia)

  • Miljenko Mikuc

    (Department of Telecommunications, Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, HR-10000 Zagreb, Croatia)

  • Bruno Blašković

    (Department of Electrical Engineering Fundamentals and Measurements, Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, HR-10000 Zagreb, Croatia)

Abstract

This paper presents a unique and extensive dataset containing over 33 million entries with pairs in the form “spelling error → correction” from ispravi.me, the most popular Croatian online spellchecking service, collected since 2008. The dataset, compiled from the contribution of nearly 900,000 users, is a valuable resource for researchers and developers in the field of natural language processing (NLP), improving spellcheck accuracy, and language learning applications. The dataset may be used to accomplish several goals: (1) improving spellchecking accuracy by incorporating common user corrections and reducing false positives and negatives; (2) helping language learners identify common errors and learn correct spelling through targeted feedback; (3) analyzing data trends and patterns to uncover the most common spelling errors and their underlying causes; (4) identifying and evaluating factors that influence typing input; (5) improving NLP applications such as text recognition and machine translation. Tasks specific to the Croatian language include the creation of a letter-level confusion matrix and the refinement of word suggestions based on historical usage of the service. This comprehensive dataset provides researchers and practitioners with a wealth of information, opening the path for advancements in spellchecking, language learning, and NLP applications in the Croatian language.

Suggested Citation

  • Gordan Gledec & Marko Horvat & Miljenko Mikuc & Bruno Blašković, 2023. "A Comprehensive Dataset of Spelling Errors and Users’ Corrections in Croatian Language," Data, MDPI, vol. 8(5), pages 1-11, May.
  • Handle: RePEc:gam:jdataj:v:8:y:2023:i:5:p:89-:d:1145798
    as

    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/8/5/89/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/8/5/89/
    Download Restriction: no
    ---><---

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:gam:jdataj:v:8:y:2023:i:5:p:89-:d:1145798. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    We have no bibliographic references for this item. You can help adding them by using this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: MDPI Indexing Manager (email available below). General contact details of provider: https://www.mdpi.com .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.