Provision and Characterization of a Corpus for Pharmaceutical, Biomedical Named Entity Recognition for Pharmacovigilance: Evaluation of Language Registers and Training Data Sufficiency

Author

Listed:
  • Jürgen Dietrich

    (Bayer AG, Pharmaceuticals, Medical Affairs & Pharmacovigilance, Data Science & Insights)

  • Philipp Kazzer

    (Syncwork AG, Systems Development)

Abstract

Introduction and Objective: Machine learning (ML) systems are widely used for automatic entity recognition in pharmacovigilance. Publicly available datasets do not allow the use of annotated entities independently, focusing on small entity subsets or on single language registers (informal or scientific language). The objective of the current study was to create a dataset that enables independent usage of entities, explores the performance of predictive ML models on different registers, and introduces a method to investigate entity cut-off performance.

Methods: A dataset has been created combining different registers with 18 different entities. We applied this dataset to compare the performance of integrated models with models created with single language registers only. We introduced fractional stratified k-fold cross-validation to determine model performance on entity level by using training dataset fractions. We investigated the course of entity performance with fractions of training datasets and evaluated entity peak and cut-off performance.

Results: The dataset combines 1400 records (scientific language: 790; informal language: 610) with 2622 sentences and 9989 entity occurrences and combines data from external (801 records) and internal sources (599 records). We demonstrated that single language register models underperform compared to integrated models trained with multiple language registers.

Conclusions: A manually annotated dataset with a variety of different pharmaceutical and biomedical entities was created and is made available to the research community. Our results show that models that combine different registers provide better maintainability, have higher robustness, and have similar or higher performance. Fractional stratified k-fold cross-validation allows the evaluation of training data sufficiency on the entity level.
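The abstract names fractional stratified k-fold cross-validation but does not spell out the procedure. The following Python sketch shows one plausible reading, not the authors' implementation: records are split into k folds stratified by language register, and for each held-out fold the training part is subsampled at several fractions while keeping the register proportions. The function names, the choice of register as the stratification label, the fraction values, and the nested subsampling (each smaller fraction is a subset of the larger one) are all assumptions made for illustration. Only the record counts (790 scientific, 610 informal) come from the abstract.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign record indices to k folds so each fold has a similar label mix."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this label round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def fractional_stratified_kfold(labels, k=5, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Yield (fold_no, fraction, train_idx, test_idx) for every fold and fraction.

    Hypothetical helper: the test fold stays fixed per fold, while the
    training records are subsampled per label at each requested fraction.
    """
    folds = stratified_folds(labels, k, seed)
    rng = random.Random(seed + 1)
    for fold_no, test_idx in enumerate(folds):
        # Collect the training records of this fold, grouped by label.
        train_by_label = defaultdict(list)
        for other_no, fold in enumerate(folds):
            if other_no != fold_no:
                for idx in fold:
                    train_by_label[labels[idx]].append(idx)
        for label in sorted(train_by_label):
            rng.shuffle(train_by_label[label])
        for fraction in fractions:
            train_idx = []
            for label in sorted(train_by_label):
                pool = train_by_label[label]
                keep = max(1, round(fraction * len(pool)))
                # Taking a prefix of the shuffled pool makes the subsets nested.
                train_idx.extend(pool[:keep])
            yield fold_no, fraction, sorted(train_idx), sorted(test_idx)


# Register labels with the record counts reported in the abstract.
labels = ["scientific"] * 790 + ["informal"] * 610
for fold_no, fraction, train_idx, test_idx in fractional_stratified_kfold(labels):
    if fold_no == 0:
        print(fold_no, fraction, len(train_idx), len(test_idx))
```

In an actual experiment, a named entity recognition model would be trained on `train_idx` and scored per entity on `test_idx` for each fraction. Plotting the per-entity score against the fraction would then indicate whether performance has levelled off or whether more annotated data is likely to help.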

Suggested Citation

  • Jürgen Dietrich & Philipp Kazzer, 2023. "Provision and Characterization of a Corpus for Pharmaceutical, Biomedical Named Entity Recognition for Pharmacovigilance: Evaluation of Language Registers and Training Data Sufficiency," Drug Safety, Springer, vol. 46(8), pages 765-779, August.
  • Handle: RePEc:spr:drugsa:v:46:y:2023:i:8:d:10.1007_s40264-023-01322-3
    DOI: 10.1007/s40264-023-01322-3

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s40264-023-01322-3
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


