
Disambiguating and Specifying Social Actors in Big Data: Using Wikipedia as a Data Source for Demographic Information

Authors

  • Philipp Poschmann
  • Jan Goldenstein

Abstract

Despite the recent and ongoing progress in using text-mining tools to automatically analyze large text corpora, there remains significant potential to facilitate the study of social action in social science research. In this context, particularly the disambiguation (who is referred to in a text?) and specification (which demographic characteristics are present?) of social actors—currently a manual job—remains a challenge. This article demonstrates a reliable and accurate software architecture for social scientists who are interested in automatically detecting, disambiguating, and demographically specifying social actors (i.e., persons and organizations) in large text collections. The backbone of our software architecture is the online encyclopedia Wikipedia as a currently unexploited data source of a large amount of accurately prepared information. We illustrate how our software architecture detects and disambiguates social actors in large text corpora and retrieves their respective demographic information. Overall, we evaluate the reliability and accuracy of our software architecture across seven different social settings and facilitate an intuitive sense of the comprehensive applicability of our software architecture. We end by not only highlighting the benefits of our software architecture for social science research but also pointing to the limitations of using Wikipedia as a data source.

Suggested Citation

  • Philipp Poschmann & Jan Goldenstein, 2022. "Disambiguating and Specifying Social Actors in Big Data: Using Wikipedia as a Data Source for Demographic Information," Sociological Methods & Research, vol. 51(2), pages 887-925, May.
  • Handle: RePEc:sae:somere:v:51:y:2022:i:2:p:887-925
    DOI: 10.1177/0049124119882481

    Download full text from publisher

    File URL: https://journals.sagepub.com/doi/10.1177/0049124119882481
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Shane Greenstein & Grace Gu & Feng Zhu, 2021. "Ideology and Composition Among an Online Crowd: Evidence from Wikipedians," Management Science, INFORMS, vol. 67(5), pages 3067-3086, May.
    2. Shane Greenstein & Feng Zhu, 2016. "Open Content, Linus’ Law, and Neutral Point of View," Information Systems Research, INFORMS, vol. 27(3), pages 618-635.
    3. Bernhardt, Lea & Dewenter, Ralf & Thomas, Tobias, 2023. "Measuring partisan media bias in US newscasts from 2001 to 2012," European Journal of Political Economy, Elsevier, vol. 78(C).
    4. Mueller, Hannes & Rauh, Christopher, 2018. "Reading Between the Lines: Prediction of Political Violence Using Newspaper Text," American Political Science Review, Cambridge University Press, vol. 112(2), pages 358-375, May.
    5. Bennani, Hamza, 2018. "Media coverage and ECB policy-making: Evidence from an augmented Taylor rule," Journal of Macroeconomics, Elsevier, vol. 57(C), pages 26-38.
    6. McCannon, Bryan & Zhou, Yang & Hall, Joshua, 2021. "Measuring a Contract’s Breadth: A Text Analysis," Working Papers 11013, George Mason University, Mercatus Center.
    7. Wei Luo & Julia Adams & Hannah Brueckner, 2018. "The Ladies Vanish? American Sociology and the Genealogy of its Missing Women on Wikipedia," Working Papers 20180012, New York University Abu Dhabi, Department of Social Science, revised Jan 2018.
    8. Giovanni Facchini & Anna Maria Mayda & Riccardo Puglisi, 2017. "Illegal immigration and media exposure: evidence on individual attitudes," IZA Journal of Migration and Development, Springer;Forschungsinstitut zur Zukunft der Arbeit GmbH (IZA), vol. 7(1), pages 1-36, December.
    9. Pal Sudeshna, 2011. "Media Freedom and Socio-Political Instability," Peace Economics, Peace Science, and Public Policy, De Gruyter, vol. 17(1), pages 1-23, March.
    10. Aaltonen, Aleksi Ville & Seiler, Stephan, 2014. "Quantifying spillovers in open source content production: evidence from Wikipedia," LSE Research Online Documents on Economics 60284, London School of Economics and Political Science, LSE Library.
    11. Munday, Tim & Brookes, James, 2021. "Mark my words: the transmission of central bank communication to the general public via the print media," Bank of England working papers 944, Bank of England.
    12. Federico Boffa & Amedeo Piolatto & Giacomo A. M. Ponzetto, 2016. "Political Centralization and Government Accountability," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 131(1), pages 381-422.
    13. Shane Greenstein & Yuan Gu & Feng Zhu, 2016. "Ideological Segregation among Online Collaborators: Evidence from Wikipedians," Harvard Business School Working Papers 17-028, Harvard Business School, revised Mar 2017.
    14. Hong, T., 2021. "Revisiting the Trade Policy Uncertainty Index," Cambridge Working Papers in Economics 2174, Faculty of Economics, University of Cambridge.
    15. Charles Ayoubi & Boris Thurm, 2023. "Knowledge diffusion and morality: Why do we freely share valuable information with Strangers?," Journal of Economics & Management Strategy, Wiley Blackwell, vol. 32(1), pages 75-99, January.
    16. Dewenter, Ralf & Dulleck, Uwe & Thomas, Tobias, 2018. "The political coverage index and its application to government capture," Research Papers 6, EcoAustria – Institute for Economic Research.
    17. Shapiro, Jesse M., 2016. "Special interests and the media: Theory and an application to climate change," Journal of Public Economics, Elsevier, vol. 144(C), pages 91-108.
    18. Demidov, Denis & Frahm, Klaus M. & Shepelyansky, Dima L., 2020. "What is the central bank of Wikipedia?," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 542(C).
    19. Larsen, Vegard H. & Thorsrud, Leif Anders & Zhulanova, Julia, 2021. "News-driven inflation expectations and information rigidities," Journal of Monetary Economics, Elsevier, vol. 117(C), pages 507-520.
    20. Martin Baumgaertner & Johannes Zahner, 2021. "Whatever it takes to understand a central banker - Embedding their words using neural networks," MAGKS Papers on Economics 202130, Philipps-Universität Marburg, Faculty of Business Administration and Economics, Department of Economics (Volkswirtschaftliche Abteilung).
