
L'analyse lexicale au service de la cliodynamique : traitement par intelligence artificielle de la base Google Ngram

Authors
  • Jérôme Baray

    (IRG - Institut de Recherche en Gestion - UPEM - Université Paris-Est Marne-la-Vallée - UPEC UP12 - Université Paris-Est Créteil Val-de-Marne - Paris 12)

  • Albert da Silva
  • Jean-Marc Leblanc

    (CEDITEC - Centre d'Etudes des discours, Images, Textes, Ecrits, Communications - UPEC UP12 - Université Paris-Est Créteil Val-de-Marne - Paris 12)

Abstract

Cliodynamics is a fairly recent research field that treats history as an object of scientific study. Thanks to its transdisciplinary nature, cliodynamics tries to explain historical dynamical processes, such as the rise or collapse of empires and civilizations, economic cycles, population booms, and fashions, through mathematical modeling, data mining, econometrics and cultural sociology. "Big data" aggregating historical, archaeological or economic information is the material that feeds these quantitative models. Cliodynamics can also include empirical analysis to validate the assumptions and predictions of dynamic models against historical data. It is part of the cliometrics approach, or "new economic history", which studies history through econometrics.

Objectives

We designed a robust lexical analysis method able to handle a very large series of dated corpora whose content evolves over time (big data), with the challenge of identifying societal evolutions and major historical periods from a cliodynamics perspective. The lexical analysis also examined what can be learned from the Google Books Ngram database, which records the number of annual occurrences of words in the publications scanned for the Google Books search engine. This database is assumed to have compiled about 20% of all books ever published in the major languages. We focused our study on English-language books published in the United States and Great Britain, with the objective of tracking how word frequencies evolved from 1860 to 2008.

Method

The first step was to build a dictionary of the most commonly used English words, disregarding two-way terms, prepositions, articles and pronouns. This dictionary gathered 1,592 words covering many aspects of social and cultural life, with terms related to politics, religion, arts and sciences, industry, objects, family and sentiments.
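The dictionary-building step above can be sketched as a simple stop-word filter; the stop-word list and candidate vocabulary below are illustrative stand-ins, not the authors' actual 1,592-word dictionary.

```python
# Minimal sketch of step 1: keeping content words by filtering out
# function words (articles, prepositions, pronouns). The lists here
# are hypothetical examples, not the study's real word lists.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "he", "she", "it"}

candidates = ["the", "king", "queen", "science", "of", "religion", "she", "factory"]
dictionary = [w for w in candidates if w.lower() not in STOP_WORDS]
print(dictionary)  # ['king', 'queen', 'science', 'religion', 'factory']
```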
In a second step, the percentage representation of each word of the dictionary was computed for each year, after loading the huge Google Books Ngram (1-gram) database into PostgreSQL. Words such as "king" or "queen" are very well represented in the 19th century, reflecting the reign and power of royalty in Europe, but their use declined in the 20th century: word frequencies in books evolve constantly over time. The third step was to perform a centered and standardized principal component analysis (PCA) on the table giving the percentage representation of words by year from 1860 to 2008. A clustering of the "years" was then carried out using a neural network (a Kohonen self-organizing map). The results reveal 8 distinct historical periods, organized along 3 major tendencies in discourse: Humanist versus Scientific; Chaos versus Organization; Individualist versus Collectivist.
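The second and third steps can be sketched on a toy years-by-words table. The matrix sizes, the self-organizing map's unit count, and the learning schedule below are illustrative assumptions, and the standardized PCA is computed here via SVD of the z-scored matrix rather than with the authors' actual tooling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy counts table: rows = years, columns = dictionary words. The real
# study covers 1860-2008 and 1,592 words extracted from the Google Books
# 1-gram files loaded into PostgreSQL; the sizes here are illustrative.
n_years, n_words = 30, 8
counts = rng.integers(1, 100, size=(n_years, n_words)).astype(float)

# Step 2: percentage representation of each word within each year.
share = 100.0 * counts / counts.sum(axis=1, keepdims=True)

# Step 3a: centered and standardized PCA, done via SVD of the z-scores.
Z = (share - share.mean(axis=0)) / share.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U * S                    # PCA coordinates of each "year"
explained = S**2 / (S**2).sum()   # variance share of each component

# Step 3b: minimal 1-D Kohonen self-organizing map grouping the years;
# the number of units and the decay schedule are illustrative choices.
n_units = 4
W = rng.random((n_units, n_words))
n_epochs = 200
for epoch in range(n_epochs):
    lr = 0.5 * (1.0 - epoch / n_epochs)                     # learning rate decay
    radius = max(0.5, (n_units / 2.0) * (1.0 - epoch / n_epochs))
    for z in Z:
        bmu = int(np.argmin(((W - z) ** 2).sum(axis=1)))    # best matching unit
        d = np.arange(n_units) - bmu
        influence = np.exp(-(d**2) / (2.0 * radius**2))     # neighborhood pull
        W += lr * influence[:, None] * (z - W)

# Each "year" is assigned to one map unit, i.e. one candidate period.
clusters = [int(np.argmin(((W - z) ** 2).sum(axis=1))) for z in Z]
```

On the real 1860-2008 matrix, contiguous runs of years falling on the same map unit would correspond to the historical periods the paper reports.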

Suggested Citation

  • Jérôme Baray & Albert da Silva & Jean-Marc Leblanc, 2017. "L'analyse lexicale au service de la cliodynamique : traitement par intelligence artificielle de la base Google Ngram," Post-Print hal-01648487, HAL.
  • Handle: RePEc:hal:journl:hal-01648487
    Note: View the original document on HAL open archive server: https://hal.science/hal-01648487

    Download full text from publisher

    File URL: https://hal.science/hal-01648487/document
    Download Restriction: no

    More about this item

    Keywords

    Google Ngram; artificial intelligence; big data; cliodynamics; lexical analysis;



    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.