Bridging the Gap in Arabic Legal NLP: A Novel Large-Scale Corpus and Benchmark for Domain-Adapted Summarisation-Classification

Author

Listed:
  • Omar T. Sayed

    (Computer Science Department, Faculty of Computing and Artificial Intelligence, Capital University (Formerly Helwan University), Helwan 11795, Egypt)

  • Amal E. Aboutabl

    (Computer Science Department, Faculty of Computing and Artificial Intelligence, Capital University (Formerly Helwan University), Helwan 11795, Egypt)

  • Amr S. Ghoneim

    (Computer Science Department, Faculty of Computing and Artificial Intelligence, Capital University (Formerly Helwan University), Helwan 11795, Egypt)

Abstract

Significant progress in legal natural language processing (NLP) has enabled advancements in tasks such as legal judgment prediction, case retrieval, and question answering. However, the development of analogous technologies for Arabic legal texts remains severely constrained by the scarcity of large-scale, publicly available benchmarks for summarisation and classification. This paper addresses this gap by introducing a novel, comprehensive dataset of 9699 Arabic legal cases sourced from the Saudi Board of Grievances. This corpus is unique in pairing full-length court decisions with expertly human-crafted abstractive summaries and multi-class category labels (Administrative, Commercial, and Criminal), establishing a dedicated benchmark for Arabic legal NLP. The dataset was constructed via a robust, reproducible pipeline that ensures high textual fidelity, incorporating specialised optical character recognition (OCR) via Google Document AI and precise structural segmentation into facts, reasons, and summaries. To establish robust baselines, we conduct an extensive empirical evaluation of seven summarisation models—encompassing four extractive algorithms (TextRank, LexRank, Latent Semantic Analysis, and Luhn) and three transformer-based abstractive architectures (AraT5v2, AraBART, and mBART)—each evaluated in both base and fine-tuned configurations. Results across ROUGE, BERTScore, BLEU metrics and human evaluation demonstrate substantial performance gains achieved through domain-specific fine-tuning, with the fine-tuned AraBART model achieving the strongest performance among all evaluated models. Furthermore, we present a novel analysis of the downstream utility of generated summaries by evaluating their performance on legal category classification using five machine learning models. 
This investigation reveals a strong positive correlation between summarisation quality and classification accuracy, empirically demonstrating that domain-adapted abstractive summarisation not only enhances intrinsic evaluation scores but also significantly boosts extrinsic task performance. By providing this essential dataset and comprehensive benchmarking, our work contributes a much-needed resource to the field, facilitating future research and innovations in Arabic legal text analysis.

Suggested Citation

  • Omar T. Sayed & Amal E. Aboutabl & Amr S. Ghoneim, 2026. "Bridging the Gap in Arabic Legal NLP: A Novel Large-Scale Corpus and Benchmark for Domain-Adapted Summarisation-Classification," Data, MDPI, vol. 11(7), pages 1-24, June.
  • Handle: RePEc:gam:jdataj:v:11:y:2026:i:7:p:154-:d:1974000
    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/11/7/154/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/11/7/154/
    Download Restriction: no