A French Corpus for Event Detection on Twitter

Authors
  • Béatrice Mazoyer

    (médialab - médialab (Sciences Po) - Sciences Po - Sciences Po)

  • Julia Cagé

    (ECON - Département d'économie (Sciences Po) - Sciences Po - Sciences Po - CNRS - Centre National de la Recherche Scientifique)

  • Nicolas Hervé

    (INA - Institut National de l'Audiovisuel)

  • Céline Hudelot

    (MICS - Mathématiques et Informatique pour la Complexité et les Systèmes - CentraleSupélec - Université Paris-Saclay)

Abstract

We present Event2018, a corpus annotated for event detection tasks. It consists of 38 million tweets in French (retweets excluded), including more than 130,000 tweets manually annotated by three annotators as related or unrelated to a given event. The 257 events were selected both from press articles and from subjects trending on Twitter during the annotation period (July to August 2018). In total, more than 95,000 tweets were annotated as related to one of the selected events. We also provide the titles and URLs of 15,500 news articles automatically detected as related to these events. In addition to this corpus, we report the results of our event detection experiments on both this dataset and another publicly available dataset of tweets in English. We ran extensive tests with different types of text embeddings and a standard Topic Detection and Tracking algorithm, and we describe our evaluation method. We show that tf-idf vectors yield the best performance for this task on both corpora. These results are intended to serve as a baseline for researchers wishing to test their own event detection systems on our corpus.
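
To make the baseline described in the abstract more concrete, the following is a minimal sketch of tf-idf vectors combined with a nearest-neighbour, threshold-based clustering in the spirit of Topic Detection and Tracking. It is not the authors' implementation: the function names, the smoothed idf formula, the whitespace tokenizer, the 0.3 threshold, and the toy English tweets are all assumptions made for illustration, and the idf weights are computed over the whole batch rather than over a stream.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption); real tweets need more care.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse, L2-normalised tf-idf vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed idf (assumption): log((1 + N) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    # Vectors are already normalised, so the dot product is the cosine.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def first_story_clustering(docs, threshold=0.3):
    """Give each document the label of its nearest earlier neighbour,
    or open a new cluster when no earlier document is similar enough."""
    vectors = tfidf_vectors(docs)
    labels = []
    next_label = 0
    for i, vec in enumerate(vectors):
        best_sim, best_j = 0.0, None
        for j in range(i):
            sim = cosine(vec, vectors[j])
            if sim > best_sim:
                best_sim, best_j = sim, j
        if best_j is not None and best_sim >= threshold:
            labels.append(labels[best_j])
        else:
            labels.append(next_label)
            next_label += 1
    return labels


# Toy input (hypothetical, not taken from the corpus).
tweets = [
    "france wins the world cup final",
    "world cup final france champions",
    "heatwave alert in paris today",
    "paris heatwave alert continues today",
]
labels = first_story_clustering(tweets)
print(labels)  # [0, 0, 1, 1]
```

The threshold controls how readily a tweet opens a new event: a value close to 1 puts almost every tweet in its own cluster, while a low value merges loosely related tweets. The pairwise comparison here is quadratic in the number of tweets, so a corpus of the size described above would need a sliding window or an approximate nearest-neighbour index.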

Suggested Citation

  • Béatrice Mazoyer & Julia Cagé & Nicolas Hervé & Céline Hudelot, 2020. "A French Corpus for Event Detection on Twitter," Post-Print hal-03947820, HAL.
  • Handle: RePEc:hal:journl:hal-03947820
    Note: View the original document on HAL open archive server: https://sciencespo.hal.science/hal-03947820

    Download full text from publisher

    File URL: https://sciencespo.hal.science/hal-03947820/document
    Download Restriction: no
    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Stephanie L. Chan, 2021. "The Social Value of Public Information When Not Everyone is Privately Informed," Working Papers 2021-09-18, Wang Yanan Institute for Studies in Economics (WISE), Xiamen University.
    2. Joan Calzada & Nestor Duch-Brown & Ricard Gil, 2021. "Do search engines increase concentration in media markets?," UB School of Economics Working Papers 2021/415, University of Barcelona School of Economics.
    3. Bertin Martens & Luis Aguiar & Estrella Gomez Herrera & Frank Muller, 2018. "The digital transformation of news media and the rise of disinformation and fake news," JRC Working Papers on Digital Economy 2018-02, Joint Research Centre.
    4. Han, Xintong & Li, Yushen & Wang, Tong, 2023. "Peer recognition, badge policies, and content contribution: An empirical study," Journal of Economic Behavior & Organization, Elsevier, vol. 214(C), pages 691-707.
    5. García-Uribe, Sandra, 2022. "Multidimensional media slant: Complementarities in news reporting by US newspapers," Information Economics and Policy, Elsevier, vol. 61(C).
    6. Charles Angelucci & Julia Cage & Michael Sinkinson, 2020. "Media Competition and News Diets," SciencePo Working papers Main hal-03393063, HAL.
    7. Julia Cagé & Moritz Hengel & Nicolas Hervé & Camille Urvoy, 2022. "Hosting Media Bias: Evidence from the Universe of French Broadcasts, 2002-2020," SciencePo Working papers Main hal-03878119, HAL.
    8. Boxell, Levi & Steinert-Threlkeld, Zachary, 2022. "Taxing dissent: The impact of a social media tax in Uganda," World Development, Elsevier, vol. 158(C).
    9. Choi, Jay Pil & Yang, Sangwoo, 2021. "Investigative journalism and media capture in the digital age," Information Economics and Policy, Elsevier, vol. 57(C).
    10. Christian Peukert & Margaritha Windisch, 2023. "The Economics of Copyright in the Digital Age," CESifo Working Paper Series 10687, CESifo.
    11. Bisceglia, Michele, 2023. "The unbundling of journalism," European Economic Review, Elsevier, vol. 158(C).
    12. Louis-Sidois, Charles & Mougin, Elisa, 2023. "Silence the media or the story? Theory and evidence of media capture," European Economic Review, Elsevier, vol. 158(C).
    13. Joan Calzada & Ricard Gil, 2020. "What Do News Aggregators Do? Evidence from Google News in Spain and Germany," Marketing Science, INFORMS, vol. 39(1), pages 134-167, January.
