
Three Families of Automated Text Analysis

Author

Listed:
  • van Loon, Austin

Abstract

Since the beginning of this millennium, data in the form of human-generated text in a machine-readable format has become increasingly available to social scientists, presenting a unique window into social life. However, harnessing vast quantities of this highly unstructured data in a systematic way presents a unique combination of analytical and methodological challenges. Luckily, our understanding of how to overcome these challenges has also developed greatly over this same period. In this article, I present a novel typology of the methods social scientists have used to analyze text data at scale in the interest of testing and developing social theory. I describe three “families” of methods: analyses of (1) term frequency, (2) document structure, and (3) semantic similarity. For each family, I discuss its logical and statistical foundations, its analytical strengths and weaknesses, and its prominent variants and applications.

Suggested Citation

  • van Loon, Austin, 2022. "Three Families of Automated Text Analysis," SocArXiv htnej, Center for Open Science.
  • Handle: RePEc:osf:socarx:htnej
    DOI: 10.31219/osf.io/htnej

    Download full text from publisher

    File URL: https://osf.io/download/6274752dc622401fd41bfa08/
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Anna Calissano & Simone Vantini & Marika Arena, 2020. "Monitoring rare categories in sentiment and opinion analysis: a Milan mega event on Twitter platform," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 29(4), pages 787-812, December.
    2. Dehler-Holland, Joris & Schumacher, Kira & Fichtner, Wolf, 2021. "Topic Modeling Uncovers Shifts in Media Framing of the German Renewable Energy Act," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 2(1).
    3. Mohamed M. Mostafa, 2023. "A one-hundred-year structural topic modeling analysis of the knowledge structure of international management research," Quality & Quantity: International Journal of Methodology, Springer, vol. 57(4), pages 3905-3935, August.
    4. Ben Cormier & Mark S. Manger, 2022. "Power, ideas, and World Bank conditionality," The Review of International Organizations, Springer, vol. 17(3), pages 397-425, July.
    5. Peter Grajzl & Peter Murrell, 2021. "Characterizing a legal–intellectual culture: Bacon, Coke, and seventeenth-century England," Cliometrica, Journal of Historical Economics and Econometric History, Association Française de Cliométrie (AFC), vol. 15(1), pages 43-88, January.
    6. Simon Fritzsch & Philipp Scharner & Gregor Weiß, 2021. "Estimating the relation between digitalization and the market value of insurers," Journal of Risk & Insurance, The American Risk and Insurance Association, vol. 88(3), pages 529-567, September.
    7. Matthew Gentzkow & Bryan T. Kelly & Matt Taddy, 2017. "Text as Data," NBER Working Papers 23276, National Bureau of Economic Research, Inc.
    8. Giovanna Maria Dora Dore, 2023. "A Natural Language Processing Analysis of Newspapers Coverage of Hong Kong Protests Between 1998 and 2020," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 169(1), pages 143-166, September.
    9. Wen Shi & Diyi Liu & Jing Yang & Jing Zhang & Sanmei Wen & Jing Su, 2020. "Social Bots’ Sentiment Engagement in Health Emergencies: A Topic-Based Analysis of the COVID-19 Pandemic Discussions on Twitter," IJERPH, MDPI, vol. 17(22), pages 1-18, November.
    10. Weiss, Max & Zoorob, Michael, 2021. "Political frames of public health crises: Discussing the opioid epidemic in the US Congress," Social Science & Medicine, Elsevier, vol. 281(C).
    11. Arthur Dyevre & Nicolas Lampach, 2021. "Issue attention on international courts: Evidence from the European Court of Justice," The Review of International Organizations, Springer, vol. 16(4), pages 793-815, October.
    12. Maksym Polyakov & Morteza Chalak & Md. Sayed Iftekhar & Ram Pandit & Sorada Tapsuwan & Fan Zhang & Chunbo Ma, 2018. "Authorship, Collaboration, Topics, and Research Gaps in Environmental and Resource Economics 1991–2015," Environmental & Resource Economics, Springer;European Association of Environmental and Resource Economists, vol. 71(1), pages 217-239, September.
    13. Zhang, Han, 2021. "How Using Machine Learning Classification as a Variable in Regression Leads to Attenuation Bias and What to Do About It," SocArXiv 453jk, Center for Open Science.
    14. Camilla Salvatore & Silvia Biffignandi & Annamaria Bianchi, 2022. "Corporate Social Responsibility Activities Through Twitter: From Topic Model Analysis to Indexes Measuring Communication Characteristics," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 164(3), pages 1217-1248, December.
    15. Martin Haselmayer & Marcelo Jenny, 2017. "Sentiment analysis of political communication: combining a dictionary approach with crowdcoding," Quality & Quantity: International Journal of Methodology, Springer, vol. 51(6), pages 2623-2646, November.
    16. Lüdering Jochen & Winker Peter, 2016. "Forward or Backward Looking? The Economic Discourse and the Observed Reality," Journal of Economics and Statistics (Jahrbuecher fuer Nationaloekonomie und Statistik), De Gruyter, vol. 236(4), pages 483-515, August.
    17. Nicolas Jouvin & Pierre Latouche & Charles Bouveyron & Guillaume Bataillon & Alain Livartowski, 2021. "Greedy clustering of count data through a mixture of multinomial PCA," Computational Statistics, Springer, vol. 36(1), pages 1-33, March.
    18. Dybowski, T.P. & Adämmer, P., 2018. "The economic effects of U.S. presidential tax communication: Evidence from a correlated topic model," European Journal of Political Economy, Elsevier, vol. 55(C), pages 511-525.
    19. Andreas Rehs, 2020. "A structural topic model approach to scientific reorientation of economics and chemistry after German reunification," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(2), pages 1229-1251, November.
    20. Dehler-Holland, Joris & Okoh, Marvin & Keles, Dogan, 2022. "Assessing technology legitimacy with topic models and sentiment analysis – The case of wind power in Germany," Technological Forecasting and Social Change, Elsevier, vol. 175(C).
