
Race, religion and the city: twitter word frequency patterns reveal dominant demographic dimensions in the United States

Author

Listed:
  • Eszter Bokányi

    (Department of Physics of Complex Systems, Eötvös Loránd University, Budapest, Hungary)

  • Dániel Kondor

    (Department of Physics of Complex Systems, Eötvös Loránd University, Budapest, Hungary;
    SENSEable City Laboratory, Massachusetts Institute of Technology, Cambridge, USA)

  • László Dobos

    (Department of Physics of Complex Systems, Eötvös Loránd University, Budapest, Hungary)

  • Tamás Sebők

    (Department of Physics of Complex Systems, Eötvös Loránd University, Budapest, Hungary)

  • József Stéger

    (Department of Physics of Complex Systems, Eötvös Loránd University, Budapest, Hungary)

  • István Csabai

    (Department of Physics of Complex Systems, Eötvös Loránd University, Budapest, Hungary)

  • Gábor Vattay

    (Department of Physics of Complex Systems, Eötvös Loránd University, Budapest, Hungary)

Abstract

Recently, numerous approaches have emerged in the social sciences to exploit the opportunities made possible by the vast amounts of data generated by online social networks (OSNs). Having access to information about users on such a scale opens up a range of possibilities—from predicting individuals’ demographics and health status to their beliefs and political opinions—all without the limitations associated with often slow and expensive paper-based polls. A question that remains to be satisfactorily addressed, however, is how demography is represented in OSN content—that is, what are the relevant aspects that constitute detectable large-scale patterns in language? Here, we study language use in the United States using a corpus of text compiled from over half a billion geotagged messages from the online microblogging platform Twitter. Our intention is to reveal the most important spatial patterns in language use in an unsupervised manner and relate them to demographics. Our approach is based on Latent Semantic Analysis augmented with the Robust Principal Component Analysis methodology, which permits identification of the data’s main sources of variation with an automatic filtering of noise and outliers without influencing results by a priori assumptions. We find spatially correlated patterns that can be interpreted based on the words associated with them. The main language features can be related to slang use, urbanization, travel, religion and ethnicity, the patterns of which are shown to correlate plausibly with traditional census data. Apart from the standard measure of linear correlation, some relations seem to be better explained by Boolean implications, suggesting a threshold-like behaviour where demographic variables influence the users’ word use. 
Our findings validate the concept of demography being represented in OSN language use and show that the traits observed are inherently present in the word frequencies without any previous assumptions about the dataset. They therefore could form the basis of further research focusing on the evaluation of demographic data estimation from other big data sources, or on the dynamical processes that result in the patterns identified here.
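
As a rough illustration of the kind of decomposition the abstract describes, the sketch below separates a small synthetic region-by-word matrix into a low-rank part and a sparse outlier part, then reads latent dimensions off the low-rank part with a singular value decomposition, as in Latent Semantic Analysis. It is not the authors' code. The toy data, the function names (make_toy_matrix, robust_pca, latent_dimensions) and all parameter values are assumptions made for illustration, and the solver is principal component pursuit with an augmented Lagrangian iteration, a common Robust PCA formulation that may differ from the variant used in the paper.

```python
import numpy as np

def soft_threshold(X, tau):
    """Shrink every entry of X toward zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def singular_value_threshold(X, tau):
    """Shrink the singular values of X by tau, which promotes low rank."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

def robust_pca(M, max_iter=2000, tol=1e-7):
    """Split M into low-rank L and sparse S with M ~= L + S.

    Principal component pursuit solved by an augmented Lagrangian
    iteration; lam and mu use commonly quoted default values.
    """
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))         # weight of the sparsity term
    mu = m * n / (4.0 * np.abs(M).sum())   # penalty parameter
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                   # Lagrange multipliers
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = singular_value_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * norm_M:
            break
    return L, S

def make_toy_matrix(n_regions=60, n_words=50, rank=2,
                    outlier_frac=0.05, seed=0):
    """Hypothetical data: a rank-`rank` signal plus a few large outliers."""
    rng = np.random.default_rng(seed)
    L_true = (rng.standard_normal((n_regions, rank))
              @ rng.standard_normal((rank, n_words)))
    S_true = np.zeros((n_regions, n_words))
    n_out = int(outlier_frac * L_true.size)
    idx = rng.choice(L_true.size, size=n_out, replace=False)
    S_true.flat[idx] = rng.choice([-10.0, 10.0], size=n_out)
    return L_true + S_true, L_true, S_true

def latent_dimensions(L, k):
    """LSA-style step: region scores and word loadings of the top k components."""
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]

if __name__ == "__main__":
    M, L_true, _ = make_toy_matrix()
    L_hat, S_hat = robust_pca(M)
    scores, loadings = latent_dimensions(L_hat, k=2)
    rel_err = np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true)
    print("relative error of recovered low-rank part:", rel_err)
    print("region scores:", scores.shape, "word loadings:", loadings.shape)
```

In this toy setting each row of the loadings assigns a weight to every word, and the matching column of the scores gives the strength of that pattern in every region. Relating such scores to census variables would be a separate step that is not shown here.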

Suggested Citation

  • Eszter Bokányi & Dániel Kondor & László Dobos & Tamás Sebők & József Stéger & István Csabai & Gábor Vattay, 2016. "Race, religion and the city: twitter word frequency patterns reveal dominant demographic dimensions in the United States," Palgrave Communications, Palgrave Macmillan, vol. 2(1), pages 1-9, December.
  • Handle: RePEc:pal:palcom:v:2:y:2016:i:1:d:10.1057_palcomms.2016.10
    DOI: 10.1057/palcomms.2016.10

    Download full text from publisher

    File URL: http://link.springer.com/10.1057/palcomms.2016.10
    File Function: Abstract
    Download Restriction: Access to full text is restricted to subscribers.


    Citations

    Cited by:

    1. Carlo Corradini & Emma Folmer & Anna Rebmann, 2022. "Listening to the buzz: Exploring the link between firm creation and regional innovative atmosphere as reflected by social media," Environment and Planning A, vol. 54(2), pages 347-369, March.
    2. Till Koebe & Alejandra Arias-Salazar & Timo Schmid, 2023. "Releasing survey microdata with exact cluster locations and additional privacy safeguards," Palgrave Communications, Palgrave Macmillan, vol. 10(1), pages 1-13, December.
    3. Chu-Ren Huang & Sicong Dong & Yike Yang & He Ren, 2021. "From language to meteorology: kinesis in weather events and weather verbs across Sinitic languages," Palgrave Communications, Palgrave Macmillan, vol. 8(1), pages 1-13, December.
    4. Fabio Lamanna & Maxime Lenormand & María Henar Salas-Olmedo & Gustavo Romanillos & Bruno Gonçalves & José J Ramasco, 2018. "Immigrant community integration in world cities," PLOS ONE, Public Library of Science, vol. 13(3), pages 1-19, March.
    5. Li Ying & Li Linlin & Li Qianqian, 2022. "The clues in the news media coverage: detecting Chinese collective action trend from a text analytics research framework," Quality & Quantity: International Journal of Methodology, Springer, vol. 56(2), pages 729-749, April.
    6. Nirmalya Thakur & Kesha A. Patel & Audrey Poon & Rishika Shah & Nazif Azizi & Changhee Han, 2023. "A Comprehensive Analysis and Investigation of the Public Discourse on Twitter about Exoskeletons from 2017 to 2023," Future Internet, MDPI, vol. 15(10), pages 1-46, October.
    7. Mattia Mazzoli & Boris Diechtiareff & Antònia Tugores & Willian Wives & Natalia Adler & Pere Colet & José J Ramasco, 2020. "Migrant mobility flows characterized with digital data," PLOS ONE, Public Library of Science, vol. 15(3), pages 1-20, March.
