IDEAS home Printed from https://ideas.repec.org/a/gam/jdataj/v3y2018i4p53-d185030.html

Towards the Construction of a Gold Standard Biomedical Corpus for the Romanian Language

Author

Listed:
  • Maria Mitrofan

    (Romanian Academy Research Institute for Artificial Intelligence, 13 Calea 13 Septembrie, Bucharest 050711, Romania; these authors contributed equally to this work.)

  • Verginica Barbu Mititelu

    (Romanian Academy Research Institute for Artificial Intelligence, 13 Calea 13 Septembrie, Bucharest 050711, Romania; these authors contributed equally to this work.)

  • Grigorina Mitrofan

    (National Institute of Diabetes and Metabolic Diseases “N.C. Paulescu”, 5-7 Ion Movilă Street, Bucharest 020475, Romania)

Abstract

Gold standard corpora (GSCs) are essential for the supervised training and evaluation of systems that perform natural language processing (NLP) tasks. Currently, most of the resources used in biomedical NLP tasks are in English. Little work has been reported for other languages, including Romanian, so access to such language resources is poor. In this paper, we present the construction of the first morphologically and terminologically annotated biomedical corpus of the Romanian language (MoNERo), meant to serve as a gold standard for biomedical part-of-speech (POS) tagging and biomedical named entity recognition (bioNER). It contains 14,012 tokens distributed across three medical subdomains (cardiology, diabetes and endocrinology), extracted from books, journals and blog posts. The corpus was automatically annotated with POS tags from a Romanian tag set of 715 labels, and this automatic annotation was then manually checked. For bioNER, four entity types (diseases, anatomy, procedures, and chemicals and drugs) were manually annotated, with a Cohen's kappa coefficient of 92.8%; this annotation identified 1877 medical named entities. The corpus is publicly available and can be used to facilitate the development of NLP algorithms for the Romanian language.
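
The abstract reports inter-annotator agreement as a Cohen's kappa coefficient. As a rough illustration of what that statistic measures, the following minimal sketch computes kappa for two annotators who labeled the same tokens. The function name, the label strings and the toy data are assumptions made for illustration only; they are not taken from the MoNERo corpus or from the authors' tooling.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    own label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    # Observed agreement: share of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy annotations (not real corpus data); "O" marks a
# token outside any entity.
annotator_a = ["DISO", "DISO", "ANAT", "CHEM", "O", "O", "PROC", "O"]
annotator_b = ["DISO", "DISO", "ANAT", "O", "O", "O", "PROC", "O"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

In this toy case the annotators agree on 7 of 8 tokens, but kappa is lower than the raw agreement because some of that agreement would be expected by chance. Whether the reported 92.8% was computed per token or per entity is not stated in the abstract.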

Suggested Citation

  • Maria Mitrofan & Verginica Barbu Mititelu & Grigorina Mitrofan, 2018. "Towards the Construction of a Gold Standard Biomedical Corpus for the Romanian Language," Data, MDPI, vol. 3(4), pages 1-12, November.
  • Handle: RePEc:gam:jdataj:v:3:y:2018:i:4:p:53-:d:185030

    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/3/4/53/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/3/4/53/
    Download Restriction: no

    Citations

    Cited by:

    1. Daniela Gîfu & Diana Trandabăț & Kevin Cohen & Jingbo Xia, 2019. "Special Issue on the Curative Power of Medical Data," Data, MDPI, vol. 4(2), pages 1-4, June.
