Printed from https://ideas.repec.org/a/sae/somere/v51y2022i2p887-925.html

Disambiguating and Specifying Social Actors in Big Data: Using Wikipedia as a Data Source for Demographic Information

Authors

  • Philipp Poschmann
  • Jan Goldenstein

Abstract

Despite the recent and ongoing progress in using text-mining tools to automatically analyze large text corpora, there remains significant potential to facilitate the study of social action in social science research. In this context, particularly the disambiguation (who is referred to in a text?) and specification (which demographic characteristics are present?) of social actors—currently a manual job—remains a challenge. This article demonstrates a reliable and accurate software architecture for social scientists who are interested in automatically detecting, disambiguating, and demographically specifying social actors (i.e., persons and organizations) in large text collections. The backbone of our software architecture is the online encyclopedia Wikipedia as a currently unexploited data source of a large amount of accurately prepared information. We illustrate how our software architecture detects and disambiguates social actors in large text corpora and retrieves their respective demographic information. Overall, we evaluate the reliability and accuracy of our software architecture across seven different social settings and facilitate an intuitive sense of the comprehensive applicability of our software architecture. We end by not only highlighting the benefits of our software architecture for social science research but also pointing to the limitations of using Wikipedia as a data source.

Suggested Citation

  • Philipp Poschmann & Jan Goldenstein, 2022. "Disambiguating and Specifying Social Actors in Big Data: Using Wikipedia as a Data Source for Demographic Information," Sociological Methods & Research, vol. 51(2), pages 887-925, May.
  • Handle: RePEc:sae:somere:v:51:y:2022:i:2:p:887-925
    DOI: 10.1177/0049124119882481

    Download full text from publisher

    File URL: https://journals.sagepub.com/doi/10.1177/0049124119882481
    Download Restriction: no


    References listed on IDEAS

    1. van Atteveldt, Wouter & Kleinnijenhuis, Jan & Ruigrok, Nel, 2008. "Parsing, Semantic Networks, and Political Authority: Using Syntactic Analysis to Extract Semantic Relations from Dutch Newspaper Articles," Political Analysis, Cambridge University Press, vol. 16(4), pages 428-446.
    2. Matthew Gentzkow & Jesse M. Shapiro, 2010. "What Drives Media Slant? Evidence From U.S. Daily Newspapers," Econometrica, Econometric Society, vol. 78(1), pages 35-71, January.
    3. Jim Giles, 2005. "Internet encyclopaedias go head to head," Nature, vol. 438(7070), pages 900-901, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Shane Greenstein & Grace Gu & Feng Zhu, 2021. "Ideology and Composition Among an Online Crowd: Evidence from Wikipedians," Management Science, INFORMS, vol. 67(5), pages 3067-3086, May.
    2. Shane Greenstein & Feng Zhu, 2016. "Open Content, Linus’ Law, and Neutral Point of View," Information Systems Research, INFORMS, vol. 27(3), pages 618-635.
    3. Stefano DellaVigna & Ruben Durante & Brian Knight & Eliana La Ferrara, 2016. "Market-Based Lobbying: Evidence from Advertising Spending in Italy," American Economic Journal: Applied Economics, American Economic Association, vol. 8(1), pages 224-256, January.
    4. Bernhardt, Lea & Dewenter, Ralf & Thomas, Tobias, 2020. "Measuring partisan media bias in US Newscasts from 2001-2012," Working Paper 183/2020, Helmut Schmidt University, Hamburg, revised 15 Nov 2022.
    5. Simon P. Anderson & John McLaren, 2012. "Media Mergers And Media Bias With Rational Consumers," Journal of the European Economic Association, European Economic Association, vol. 10(4), pages 831-859, August.
    6. Mueller, Hannes & Rauh, Christopher, 2018. "Reading Between the Lines: Prediction of Political Violence Using Newspaper Text," American Political Science Review, Cambridge University Press, vol. 112(2), pages 358-375, May.
    7. Bennani, Hamza, 2018. "Media coverage and ECB policy-making: Evidence from an augmented Taylor rule," Journal of Macroeconomics, Elsevier, vol. 57(C), pages 26-38.
    8. McCannon, Bryan & Zhou, Yang & Hall, Joshua, 2021. "Measuring a Contract’s Breadth: A Text Analysis," Working Papers 11013, George Mason University, Mercatus Center.
    9. Nathan, Max & Rosso, Anna, 2014. "Mapping information economy businesses with big data: findings from the UK," LSE Research Online Documents on Economics 60615, London School of Economics and Political Science, LSE Library.
    10. Ghosh, Saptarshi P. & Jain, Nidhi & Martinelli, César & Roy, Jaideep, 2023. "Responsive democracy and commercial media," Economics Letters, Elsevier, vol. 222(C).
    11. Wei Luo & Julia Adams & Hannah Brueckner, 2018. "The Ladies Vanish? American Sociology and the Genealogy of its Missing Women on Wikipedia," Working Papers 20180012, New York University Abu Dhabi, Department of Social Science, revised Jan 2018.
    12. Hungerman, Daniel & Rinz, Kevin & Weninger, Tim & Yoon, Chungeun, 2018. "Political campaigns and church contributions," Journal of Economic Behavior & Organization, Elsevier, vol. 155(C), pages 403-426.
    13. Giovanni Facchini & Anna Maria Mayda & Riccardo Puglisi, 2017. "Illegal immigration and media exposure: evidence on individual attitudes," IZA Journal of Migration and Development, Springer;Forschungsinstitut zur Zukunft der Arbeit GmbH (IZA), vol. 7(1), pages 1-36, December.
    14. Pal Sudeshna, 2011. "Media Freedom and Socio-Political Instability," Peace Economics, Peace Science, and Public Policy, De Gruyter, vol. 17(1), pages 1-23, March.
    15. Sharma, Priyanka & Wagman, Liad, 2020. "Advertising and Voter Data in Asymmetric Political Contests," Information Economics and Policy, Elsevier, vol. 52(C).
    16. Aaltonen, Aleksi Ville & Seiler, Stephan, 2014. "Quantifying spillovers in open source content production: evidence from Wikipedia," LSE Research Online Documents on Economics 60284, London School of Economics and Political Science, LSE Library.
    17. David Bholat & Stephen Hans & Pedro Santos & Cheryl Schonhardt-Bailey, 2015. "Text mining for central banks," Handbooks, Centre for Central Banking Studies, Bank of England, number 33, April.
    18. Munday, Tim & Brookes, James, 2021. "Mark my words: the transmission of central bank communication to the general public via the print media," Bank of England working papers 944, Bank of England.
    19. Federico Boffa & Amedeo Piolatto & Giacomo A. M. Ponzetto, 2016. "Political Centralization and Government Accountability," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 131(1), pages 381-422.
