Automated Classification of Crime Narratives Using Machine Learning and Language Models in Official Statistics

Author

Listed:

Klaus Lehmann
(Instituto Nacional de Estadísticas (INE), Morandé 801, Santiago 8340148, Chile)
Elio Villaseñor
(Instituto Nacional de Estadística y Geografía (INEGI), Heroe de Nacozari 2301, Aguascalientes 20276, Mexico)
Alejandro Pimentel
(Instituto Nacional de Estadística y Geografía (INEGI), Heroe de Nacozari 2301, Aguascalientes 20276, Mexico)
Javiera Preuss
(Instituto Nacional de Estadísticas (INE), Morandé 801, Santiago 8340148, Chile)
Nicolás Berhó
(Instituto Nacional de Estadísticas (INE), Morandé 801, Santiago 8340148, Chile)
Oswaldo Diaz
(Instituto Nacional de Estadística y Geografía (INEGI), Heroe de Nacozari 2301, Aguascalientes 20276, Mexico)
Ignacio Agloni
(Instituto Nacional de Estadísticas (INE), Morandé 801, Santiago 8340148, Chile)

Abstract

This paper presents the implementation of a language model–based strategy for the automatic coding of crime narratives in the production of official statistics. To address the high workload and inconsistencies associated with manual coding, we developed and evaluated three models: an XGBoost classifier with bag-of-words and word-embedding features, an LSTM network using pretrained Spanish word embeddings as a language model, and a fine-tuned BERT language model (BETO). The deep learning models outperformed the traditional baseline, with BETO achieving the highest accuracy. The new ENUSC (Encuesta Nacional Urbana de Seguridad Ciudadana) workflow integrates the selected model into an API for automated classification and applies a certainty threshold to separate cases suitable for automation from those requiring expert review. This hybrid strategy reduced the manual review workload by 68.4% while preserving high quality standards. The study represents the first documented application of deep learning to the automated classification of victimization narratives in official statistics and demonstrates its feasibility and impact in a real-world production environment. The results indicate that deep learning can significantly improve the efficiency and consistency of crime statistics coding, offering a scalable solution for other national statistical offices.
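
The certainty-threshold routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`softmax`, `route`), the label set, and the threshold value of 0.9 are all hypothetical, and the sketch assumes that certainty is taken as the highest softmax probability of the classifier's output scores, which the abstract does not specify.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Convert raw classifier scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(
    logits: List[float],
    labels: List[str],
    threshold: float = 0.9,  # hypothetical value, not taken from the paper
) -> Tuple[str, float, str]:
    """Return (predicted label, certainty, routing decision).

    Assumption: certainty is the top softmax probability. Cases at or
    above the threshold are coded automatically; the rest go to experts.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    certainty = probs[best]
    decision = "auto" if certainty >= threshold else "manual_review"
    return labels[best], certainty, decision


# Hypothetical crime categories, for illustration only.
LABELS = ["robbery", "burglary", "other"]

confident = route([5.0, 0.0, 0.0], LABELS)
ambiguous = route([1.0, 0.9, 0.8], LABELS)
```

In this sketch, a narrative whose scores clearly favor one category is coded automatically, while one with near-equal scores is sent to expert review. Raising the threshold shifts more cases to manual review, and lowering it automates more cases at a higher risk of miscoding.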

Suggested Citation

Klaus Lehmann & Elio Villaseñor & Alejandro Pimentel & Javiera Preuss & Nicolás Berhó & Oswaldo Diaz & Ignacio Agloni, 2025. "Automated Classification of Crime Narratives Using Machine Learning and Language Models in Official Statistics," Stats, MDPI, vol. 8(3), pages 1-22, July.

Handle: RePEc:gam:jstats:v:8:y:2025:i:3:p:68-:d:1713855

Most related items

These are the items that most often cite the same works as this one and are cited by the same works as this one.

Simerjot Kaur & Andrea Stefanucci & Sameena Shah, 2023. "InProC: Industry and Product/Service Code Classification," Papers 2305.13532, arXiv.org.
