
Transformer-Based Composite Language Models for Text Evaluation and Classification

Author

Listed:
  • Mihailo Škorić

    (Faculty of Mining and Geology, University of Belgrade, Djusina 7, 11120 Belgrade, Serbia)

  • Miloš Utvić

    (Faculty of Philology, University of Belgrade, Studentski Trg 3, 11000 Belgrade, Serbia)

  • Ranka Stanković

    (Faculty of Mining and Geology, University of Belgrade, Djusina 7, 11120 Belgrade, Serbia)

Abstract

Parallel natural language processing systems were previously tested successfully on the tasks of part-of-speech tagging and authorship attribution through mini-language modeling, achieving significantly better results than independent methods for seven European languages. The aim of this paper is to present the advantages of using composite language models in the processing and evaluation of texts written in an arbitrary highly inflective, morphologically rich natural language, particularly Serbian. A perplexity-based dataset, the main asset for the methodology assessment, was created using a series of generative pre-trained transformers trained on different representations of the Serbian language corpus, together with a set of sentences classified into three groups (expert translations, corrupted translations, and machine translations). The paper describes a comparative analysis of the calculated perplexities in order to measure the classification capability of different models on two binary classification tasks. In the course of the experiment, we tested three standalone language models (baselines) and two composite language models (based on the perplexities output by all three standalone models). The presented results single out a complex stacked classifier, which uses a multitude of features extracted from perplexity vectors, as the optimal composite-language-model architecture for both tasks.
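The abstract's pipeline, computing a perplexity per sentence from each standalone language model, stacking those values into a perplexity vector, and feeding the vector to a composite classifier, can be sketched in minimal form as follows. This is an illustrative reconstruction only: the function names, the per-token log-probability inputs, and the simple majority-vote decision rule are assumptions, not the authors' actual GPT models or stacked classifier.

```python
import math

def perplexity(log_probs):
    """Perplexity of a sentence, given its per-token natural-log probabilities
    under one language model: exp of the negative mean log-probability."""
    return math.exp(-sum(log_probs) / len(log_probs))

def perplexity_vector(log_probs_per_model):
    """Stack the perplexities assigned by several standalone models
    into a single feature vector for the composite classifier."""
    return [perplexity(lp) for lp in log_probs_per_model]

def composite_classify(vec, thresholds):
    """Toy composite decision (stand-in for the paper's stacked classifier):
    flag a sentence (e.g. as a corrupted translation) when a majority of
    standalone models assign it an above-threshold perplexity."""
    votes = sum(p > t for p, t in zip(vec, thresholds))
    return votes > len(vec) / 2

# Hypothetical usage: three models score one sentence.
logs = [[math.log(0.5)] * 4,    # model 1: perplexity 2.0
        [math.log(0.25)] * 4,   # model 2: perplexity 4.0
        [math.log(0.1)] * 4]    # model 3: perplexity 10.0
vec = perplexity_vector(logs)
flagged = composite_classify(vec, thresholds=[5.0, 5.0, 5.0])
```

In the paper the final classifier is a trained stacked model over many features derived from such vectors; the threshold vote above only shows where the perplexity vector enters the decision.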

Suggested Citation

  • Mihailo Škorić & Miloš Utvić & Ranka Stanković, 2023. "Transformer-Based Composite Language Models for Text Evaluation and Classification," Mathematics, MDPI, vol. 11(22), pages 1-25, November.
  • Handle: RePEc:gam:jmathe:v:11:y:2023:i:22:p:4660-:d:1281438

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/11/22/4660/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/11/22/4660/
    Download Restriction: no

