Printed from https://ideas.repec.org/a/gam/jstats/v8y2025i3p68-d1713855.html

Automated Classification of Crime Narratives Using Machine Learning and Language Models in Official Statistics

Author

Listed:
  • Klaus Lehmann

    (Instituto Nacional de Estadísticas (INE), Morandé 801, Santiago 8340148, Chile)

  • Elio Villaseñor

    (Instituto Nacional de Estadística y Geografía (INEGI), Heroe de Nacozari 2301, Aguascalientes 20276, Mexico)

  • Alejandro Pimentel

    (Instituto Nacional de Estadística y Geografía (INEGI), Heroe de Nacozari 2301, Aguascalientes 20276, Mexico)

  • Javiera Preuss

    (Instituto Nacional de Estadísticas (INE), Morandé 801, Santiago 8340148, Chile)

  • Nicolás Berhó

    (Instituto Nacional de Estadísticas (INE), Morandé 801, Santiago 8340148, Chile)

  • Oswaldo Diaz

    (Instituto Nacional de Estadística y Geografía (INEGI), Heroe de Nacozari 2301, Aguascalientes 20276, Mexico)

  • Ignacio Agloni

    (Instituto Nacional de Estadísticas (INE), Morandé 801, Santiago 8340148, Chile)

Abstract

This paper presents the implementation of a language model–based strategy for automatically coding crime narratives in the production of official statistics. To address the high workload and inconsistencies associated with manual coding, we developed and evaluated three models: an XGBoost classifier with bag-of-words and word-embedding features, an LSTM network using pretrained Spanish word embeddings as a language model, and a fine-tuned BERT language model (BETO). The deep learning models outperformed the traditional baseline, with BETO achieving the highest accuracy. The new ENUSC (Encuesta Nacional Urbana de Seguridad Ciudadana) workflow integrates the selected model into an API for automated classification, incorporating a certainty threshold to distinguish between cases suitable for automation and those requiring expert review. This hybrid strategy reduced the manual review workload by 68.4% while preserving high quality standards. This study represents the first documented application of deep learning to the automated classification of victimization narratives in official statistics, demonstrating its feasibility and impact in a real-world production environment. Our results show that deep learning can significantly improve the efficiency and consistency of crime statistics coding, offering a scalable solution for other national statistical offices.
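The certainty-threshold routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `route_by_certainty`, the class labels, the probabilities, and the threshold of 0.90 are all hypothetical, and the sketch assumes that "certainty" means the highest predicted class probability for a narrative.

```python
from typing import Dict, List, Tuple


def route_by_certainty(
    class_probs: List[Dict[str, float]],
    threshold: float,
) -> Tuple[List[Tuple[int, str]], List[int]]:
    """Split cases into auto-coded (index, label) pairs and indices for expert review.

    Assumption: certainty is the largest predicted class probability.
    """
    automated: List[Tuple[int, str]] = []
    manual: List[int] = []
    for i, probs in enumerate(class_probs):
        # Pick the most probable label and its probability.
        label, certainty = max(probs.items(), key=lambda kv: kv[1])
        if certainty >= threshold:
            automated.append((i, label))  # confident enough to code automatically
        else:
            manual.append(i)  # send to an expert coder
    return automated, manual


# Hypothetical classifier outputs for four narratives (illustrative only).
example_probs = [
    {"robbery": 0.97, "burglary": 0.02, "other": 0.01},
    {"robbery": 0.55, "burglary": 0.40, "other": 0.05},
    {"robbery": 0.03, "burglary": 0.93, "other": 0.04},
    {"robbery": 0.34, "burglary": 0.33, "other": 0.33},
]

automated, manual = route_by_certainty(example_probs, threshold=0.90)

# Share of cases that no longer need manual review under this threshold.
workload_reduction = len(automated) / len(example_probs)

print(automated)
print(manual)
print(workload_reduction)
```

Raising the threshold sends more cases to expert review and lowers the share coded automatically, so in practice the value would be tuned against a quality target on held-out data.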

Suggested Citation

  • Klaus Lehmann & Elio Villaseñor & Alejandro Pimentel & Javiera Preuss & Nicolás Berhó & Oswaldo Diaz & Ignacio Agloni, 2025. "Automated Classification of Crime Narratives Using Machine Learning and Language Models in Official Statistics," Stats, MDPI, vol. 8(3), pages 1-22, July.
  • Handle: RePEc:gam:jstats:v:8:y:2025:i:3:p:68-:d:1713855

    Download full text from publisher

    File URL: https://www.mdpi.com/2571-905X/8/3/68/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2571-905X/8/3/68/
    Download Restriction: no
