Printed from https://ideas.repec.org/a/aac/ijirss/v8y2025i8p220-233id10583.html

Improving post-editing of Kazakh translations with fine-tuned large language models: Dataset and evaluation

Authors

  • Diana Rakhimova

  • Aliya Zhiger

  • Madina Mansurova

  • Valentin Malykh

  • Magzhan Kairanbay

Abstract

Machine translation for low-resource languages like Kazakh faces significant challenges due to limited training data, complex morphology, and cultural-linguistic nuances. This paper presents the first comprehensive study on fine-tuning large language models for automated post-editing of Kazakh translations. We introduce KazPE, a systematically annotated dataset containing 10,010 training sentences and 315 test sentences across six domains (medical, scientific, journalistic, oral, fiction, and legal) with detailed error categorization covering 11 linguistic dimensions. Our approach fine-tunes GPT-4.1-mini using supervised learning to improve translation quality through targeted error correction. Human evaluation demonstrates that our fine-tuned model achieves a mean quality score of 0.84 compared to 0.80 for the baseline, representing a 4% relative improvement. The most significant gains occur in morphological-lexical error handling and domain-specific contexts, with legal and medical texts showing improvements of +2.8% and +1.6% respectively. Error analysis reveals that fine-tuning effectively addresses Kazakh’s agglutinative morphology and specialized terminology while maintaining performance on error-free sentences. This work establishes the first systematic evaluation framework for Kazakh translation post-editing, providing valuable insights for improving machine translation systems for morphologically rich, low-resource languages. Our dataset, models, and evaluation framework are made publicly available to support future research in Turkic language processing.
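The reported scores can be compared in two ways, and the short Python sketch below shows both. It assumes the rounded values 0.80 and 0.84 from the abstract are exact, which may not hold: with these inputs the gap is 4 percentage points in absolute terms and 5% relative to the baseline, so the "4%" figure may have been computed from unrounded scores or may describe the absolute difference. The function and variable names are illustrative and do not come from the paper.

```python
def absolute_gain(baseline: float, tuned: float) -> float:
    """Difference between the two scores, on the same 0-1 scale."""
    return tuned - baseline


def relative_gain(baseline: float, tuned: float) -> float:
    """Difference expressed as a fraction of the baseline score."""
    return (tuned - baseline) / baseline


# Rounded mean quality scores as quoted in the abstract (assumed exact here).
baseline_score = 0.80
tuned_score = 0.84

abs_points = absolute_gain(baseline_score, tuned_score) * 100  # percentage points
rel_pct = relative_gain(baseline_score, tuned_score) * 100     # percent of baseline

print(f"absolute gain: {abs_points:.1f} percentage points")
print(f"relative gain: {rel_pct:.1f}% of baseline")
```

Keeping the two measures separate makes it easier to compare this result with the per-domain gains, which the abstract also states in percent without saying which of the two measures is meant.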

Suggested Citation

  • Diana Rakhimova & Aliya Zhiger & Madina Mansurova & Valentin Malykh & Magzhan Kairanbay, 2025. "Improving post-editing of Kazakh translations with fine-tuned large language models: Dataset and evaluation," International Journal of Innovative Research and Scientific Studies, Innovative Research Publishing, vol. 8(8), pages 220-233.
  • Handle: RePEc:aac:ijirss:v:8:y:2025:i:8:p:220-233:id:10583

    Download full text from publisher

    File URL: https://ijirss.com/index.php/ijirss/article/view/10583/2529
    Download Restriction: no

