Authors
- Hermilo Santiago-Benito
(Facultad de Informática, Universidad Autónoma de Querétaro, Av. de las Ciencias S/N, Campus Juriquilla, Querétaro 76230, Mexico)
- Diana-Margarita Córdova-Esparza
(Facultad de Informática, Universidad Autónoma de Querétaro, Av. de las Ciencias S/N, Campus Juriquilla, Querétaro 76230, Mexico)
- Juan Terven
(Centro de Investigación en Ciencia Aplicada y Tecnología Avanzada—Unidad Querétaro, Instituto Politécnico Nacional, Cerro Blanco No. 141, Col. Colinas del Cimatario, Querétaro 76090, Mexico)
- Noé-Alejandro Castro-Sánchez
(Centro Nacional de Investigación y Desarrollo Tecnológico, Tecnológico Nacional de México, Interior Internado Palmira S/N, Palmira, Cuernavaca 62493, Mexico)
- Teresa García-Ramirez
(Centro de Investigación en Ciencia Aplicada y Tecnología Avanzada—Unidad Querétaro, Instituto Politécnico Nacional, Cerro Blanco No. 141, Col. Colinas del Cimatario, Querétaro 76090, Mexico)
- Julio-Alejandro Romero-González
(Centro de Investigación en Ciencia Aplicada y Tecnología Avanzada—Unidad Querétaro, Instituto Politécnico Nacional, Cerro Blanco No. 141, Col. Colinas del Cimatario, Querétaro 76090, Mexico)
- José M. Álvarez-Alvarado
(Facultad de Ingeniería, Universidad Autónoma de Querétaro, Querétaro 76010, Mexico)
Abstract
This article introduces a freely available Spanish–Mixtec parallel corpus designed to foster natural language processing (NLP) development for an indigenous language that remains digitally low-resourced. The dataset comprises 14,587 sentence pairs and covers Mixtec variants from Guerrero (Tlacoachistlahuaca, Northern Guerrero, and Xochapa) and Oaxaca (Western Coast, Southern Lowland, Santa María Yosoyúa, Central, Lower Cañada, Western Central, San Antonio Huitepec, Upper Western, and Southwestern Central). Texts are classified into four main domains: education, law, health, and religion. We compiled the data in two phases: first, an online search of government portals, religious organizations, and Mixtec language blogs; second, on-site retrieval of physical texts from the library of the Autonomous University of Querétaro. Physical materials were digitized by scanning and optical character recognition, then manually corrected to fix character misreadings and to remove duplicates and irrelevant segments. We conducted a preliminary evaluation of the collected data to validate its usability in automatic translation systems. From Spanish to Mixtec, a fine-tuned GPT-4o-mini model yielded a BLEU score of 0.22 and a TER score of 122.86, while two fine-tuned open-source models, mBART-50 and M2M-100, yielded BLEU scores of 4.2 and 2.63 and TER scores of 98.99 and 104.87, respectively. All code demonstrating data usage, along with the final corpus itself, is publicly accessible via GitHub and Figshare. We anticipate that this resource will enable further research into machine translation, speech recognition, and other NLP applications while contributing to the broader goal of preserving and revitalizing the Mixtec language.
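The duplicate-removal step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the Unicode NFC normalization, and the case-insensitive comparison are assumptions made for illustration, and the target-side strings are placeholders rather than real Mixtec text.

```python
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace, keeping diacritics intact."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(pairs):
    """Drop empty and duplicate (Spanish, Mixtec) pairs, preserving order.

    Two pairs count as duplicates when both sides match after
    normalization and case folding (an assumption of this sketch).
    """
    seen = set()
    cleaned = []
    for es, mix in pairs:
        es, mix = normalize(es), normalize(mix)
        if not es or not mix:
            continue  # skip pairs with an empty side
        key = (es.casefold(), mix.casefold())
        if key in seen:
            continue  # skip a pair already kept
        seen.add(key)
        cleaned.append((es, mix))
    return cleaned


if __name__ == "__main__":
    # Placeholder data: the second element stands in for a Mixtec sentence.
    sample = [
        ("Hola  mundo", "placeholder A"),
        ("hola mundo", "placeholder a"),
        ("", "placeholder C"),
        ("Buenos días", "placeholder B"),
    ]
    for es, mix in clean_pairs(sample):
        print(f"{es}\t{mix}")
```

Normalizing to NFC before comparing matters for text with combining marks, since OCR output and web text can encode the same accented character in different ways.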
Suggested Citation
Hermilo Santiago-Benito & Diana-Margarita Córdova-Esparza & Juan Terven & Noé-Alejandro Castro-Sánchez & Teresa García-Ramirez & Julio-Alejandro Romero-González & José M. Álvarez-Alvarado, 2025.
"Mixtec–Spanish Parallel Text Dataset for Language Technology Development,"
Data, MDPI, vol. 10(7), pages 1-15, June.
Handle:
RePEc:gam:jdataj:v:10:y:2025:i:7:p:94-:d:1684415