
Algorithmic thinking in the public interest: navigating technical, legal, and ethical hurdles to web scraping in the social sciences

Authors

  • Alex Luscombe (University of Toronto)
  • Kevin Dick (Carleton University)
  • Kevin Walby (University of Winnipeg)

Abstract

Web scraping, defined as the automated extraction of information online, is an increasingly important means of producing data in the social sciences. We contribute to emerging social science literature on computational methods by elaborating on web scraping as a means of automated access to information. We begin by situating the practice of web scraping in context, providing an overview of how it works and how it compares to other methods in the social sciences. Next, we assess the benefits and challenges of scraping as a technique of information production. In terms of benefits, we highlight how scraping can help researchers answer new questions, supersede limits in official data, overcome access hurdles, and reinvigorate the values of sharing, openness, and trust in the social sciences. In terms of challenges, we discuss three: technical, legal, and ethical. By adopting “algorithmic thinking in the public interest” as a way of navigating these hurdles, researchers can improve the state of access to information on the Internet while also contributing to scholarly discussions about the legality and ethics of web scraping. Example software accompanying this article is available within the supplementary materials.
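
As a rough illustration of the "automated extraction of information" the abstract describes, the following minimal Python sketch checks a robots.txt policy before extracting records from a page. It is not the article's supplementary software. The robots.txt rules, the HTML snippet, the `research-bot` user agent, the `example.org` URLs, and the names `RecordExtractor`, `allowed`, and `extract_records` are all hypothetical, and the content is inlined so the sketch runs without network access.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and page content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

PAGE_HTML = """\
<html><body>
<ul>
  <li class="record">Report A</li>
  <li class="record">Report B</li>
  <li class="other">Navigation</li>
</ul>
</body></html>
"""


class RecordExtractor(HTMLParser):
    """Collect the text of every <li class="record"> element."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._in_record = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "li" and ("class", "record") in attrs:
            self._in_record = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_record = False

    def handle_data(self, data):
        if self._in_record and data.strip():
            self.records.append(data.strip())


def allowed(url, user_agent="research-bot"):
    """Return True if the inlined robots.txt rules permit fetching url."""
    rules = RobotFileParser()
    rules.parse(ROBOTS_TXT.splitlines())
    return rules.can_fetch(user_agent, url)


def extract_records(html):
    """Parse an HTML string and return the text of the record items."""
    parser = RecordExtractor()
    parser.feed(html)
    return parser.records


if __name__ == "__main__":
    if allowed("https://example.org/reports/"):
        print(extract_records(PAGE_HTML))
```

A robots.txt check of this kind covers only part of the technical hurdle. It does not settle the legal or ethical questions the abstract raises, which depend on a site's terms of use, the jurisdiction, and the nature of the data.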

Suggested Citation

  • Alex Luscombe & Kevin Dick & Kevin Walby, 2022. "Algorithmic thinking in the public interest: navigating technical, legal, and ethical hurdles to web scraping in the social sciences," Quality & Quantity: International Journal of Methodology, Springer, vol. 56(3), pages 1023-1044, June.
  • Handle: RePEc:spr:qualqt:v:56:y:2022:i:3:d:10.1007_s11135-021-01164-0
    DOI: 10.1007/s11135-021-01164-0

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11135-021-01164-0
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    Citations


    Cited by:

    1. Potter, Andrew & Soroka, Anthony & Naim, Mohamed, 2022. "Regional resilience for rail freight transport," Journal of Transport Geography, Elsevier, vol. 104(C).
    2. Tobias Blanke, 2024. "Reassembling digital archives—strategies for counter-archiving," Palgrave Communications, Palgrave Macmillan, vol. 11(1), pages 1-12, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. AJ Alvero & Jasmine Pal & Katelyn M. Moussavian, 2022. "Linguistic, cultural, and narrative capital: computational and human readings of transfer admissions essays," Journal of Computational Social Science, Springer, vol. 5(2), pages 1709-1734, November.
    2. Damani K. White-Lewis & KerryAnn O’Meara & Kiernan Mathews & Nicholas Havey, 2023. "Leaving the Institution or Leaving the Academy? Analyzing the Factors that Faculty Weigh in Actual Departure Decisions," Research in Higher Education, Springer;Association for Institutional Research, vol. 64(3), pages 473-494, May.
    3. Mónica D. Oliveira & Inês Mataloto & Panos Kanavos, 2019. "Multi-criteria decision analysis for health technology assessment: addressing methodological challenges to improve the state of the art," The European Journal of Health Economics, Springer;Deutsche Gesellschaft für Gesundheitsökonomie (DGGÖ), vol. 20(6), pages 891-918, August.
    4. Stijn Daenekindt & Julian Schaap, 2022. "Using word embedding models to capture changing media discourses: a study on the role of legitimacy, gender and genre in 24,000 music reviews, 1999–2021," Journal of Computational Social Science, Springer, vol. 5(2), pages 1615-1636, November.
    5. Christopher Wratil & Sara B Hobolt, 2019. "Public deliberations in the Council of the European Union: Introducing and validating DICEU," European Union Politics, vol. 20(3), pages 511-531, September.
    6. Nils Augustin & Andreas Eckhardt & Alexander Willem Jong, 2023. "Understanding decentralized autonomous organizations from the inside," Electronic Markets, Springer;IIM University of St. Gallen, vol. 33(1), pages 1-14, December.
    7. Michal Ovádek & Nicolas Lampach & Arthur Dyevre, 2020. "What’s the talk in Brussels? Leveraging daily news coverage to measure issue attention in the European Union," European Union Politics, vol. 21(2), pages 204-232, June.
    8. Jennifer Pan & Margaret E. Roberts, 2020. "Censorship’s Effect on Incidental Exposure to Information: Evidence From Wikipedia," SAGE Open, vol. 10(1), pages 21582440198, February.
    9. Sanders, James & Lisi, Giulio & Schonhardt-Bailey, Cheryl, 2018. "Themes and topics in parliamentary oversight hearings: a new direction in textual data analysis," LSE Research Online Documents on Economics 87624, London School of Economics and Political Science, LSE Library.
    10. Yuriy Gorodnichenko & Viacheslav Sheremirov & Oleksandr Talavera, 2018. "Price Setting in Online Markets: Does IT Click?," Journal of the European Economic Association, European Economic Association, vol. 16(6), pages 1764-1811.
    11. Sheedy, Kevin D., 2010. "Intrinsic inflation persistence," Journal of Monetary Economics, Elsevier, vol. 57(8), pages 1049-1061, November.
    12. Magnus Schückes & Tobias Gutmann, 2021. "Why do startups pursue initial coin offerings (ICOs)? The role of economic drivers and social identity on funding choice," Small Business Economics, Springer, vol. 57(2), pages 1027-1052, August.
    13. Arthur Schram & Boris Van Leeuwen & Theo Offerman, 2013. "Superstars Need Social Benefits: An Experiment on Network Formation," Working Papers 1306, Departament Empresa, Universitat Autònoma de Barcelona, revised Jul 2013.
    14. Sandra Wankmüller, 2023. "A comparison of approaches for imbalanced classification problems in the context of retrieving relevant documents for an analysis," Journal of Computational Social Science, Springer, vol. 6(1), pages 91-163, April.
    15. Jurić Tado, 2022. "Forecasting Migration and Integration Trends Using Digital Demography – A Case Study of Emigration Flows from Croatia to Austria and Germany," Comparative Southeast European Studies, De Gruyter, vol. 70(1), pages 125-152, March.
    16. McCannon, Bryan & Zhou, Yang & Hall, Joshua, 2021. "Measuring a Contract’s Breadth: A Text Analysis," Working Papers 11013, George Mason University, Mercatus Center.
    17. Minchul Lee & Min Song, 2020. "Incorporating citation impact into analysis of research trends," Scientometrics, Springer;Akadémiai Kiadó, vol. 124(2), pages 1191-1224, August.
    18. Grajzl, Peter & Murrell, Peter, 2021. "A machine-learning history of English caselaw and legal ideas prior to the Industrial Revolution I: generating and interpreting the estimates," Journal of Institutional Economics, Cambridge University Press, vol. 17(1), pages 1-19, February.
    19. Stam, Wouter, 2009. "When does community participation enhance the performance of open source software companies?," Research Policy, Elsevier, vol. 38(8), pages 1288-1299, October.
    20. David Bholat & Stephen Hans & Pedro Santos & Cheryl Schonhardt-Bailey, 2015. "Text mining for central banks," Handbooks, Centre for Central Banking Studies, Bank of England, number 33, April.
