Printed from https://ideas.repec.org/p/arx/papers/2308.02231.html

Should we trust web-scraped data?

Author

Listed:
  • Jens Foerderer

Abstract

The increasing adoption of econometric and machine-learning approaches by empirical researchers has led to widespread use of one data collection method: web scraping. Web scraping refers to the use of automated computer programs to access websites and download their content. The key argument of this paper is that naïve web scraping procedures can lead to sampling bias in the collected data. The paper describes three sources of sampling bias in web-scraped data: web content being volatile (i.e., subject to change), personalized (i.e., presented in response to request characteristics), and unindexed (i.e., lacking a population register). In a series of examples, I illustrate the prevalence and magnitude of sampling bias. To support researchers and reviewers, this paper provides recommendations on anticipating, detecting, and overcoming sampling bias in web-scraped data.

Suggested Citation

  • Jens Foerderer, 2023. "Should we trust web-scraped data?," Papers 2308.02231, arXiv.org.
  • Handle: RePEc:arx:papers:2308.02231

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2308.02231
    File Function: Latest version
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Mingming Shi & Jun Zhou & Zhou Jiang, 2019. "Consumer heterogeneity and online vs. offline retail spatial competition," Frontiers of Business Research in China, Springer, vol. 13(1), pages 1-19, December.
    2. Jinyang Zheng & Zhengling Qi & Yifan Dou & Yong Tan, 2019. "How Mega Is the Mega? Exploring the Spillover Effects of WeChat Using Graphical Model," Information Systems Research, INFORMS, vol. 30(4), pages 1343-1362, December.
    3. Cong Peng, 2019. "Does e-commerce reduce traffic congestion? Evidence from Alibaba Single Day shopping event," CEP Discussion Papers dp1646, Centre for Economic Performance, LSE.
    4. Jens Foerderer, 2020. "Interfirm Exchange and Innovation in Platform Ecosystems: Evidence from Apple’s Worldwide Developers Conference," Management Science, INFORMS, vol. 66(10), pages 4772-4787, October.
    5. Tianshu Sun & Lanfei Shi & Siva Viswanathan & Elena Zheleva, 2019. "Motivating Effective Mobile App Adoptions: Evidence from a Large-Scale Randomized Field Experiment," Information Systems Research, INFORMS, vol. 30(2), pages 523-539, June.
    6. Herweg, Fabian & Helfrich, Magdalena, 2017. "Salience in Retailing: Vertical Restraints on Internet Sales," CEPR Discussion Papers 11948, C.E.P.R. Discussion Papers.
    7. Xiong Xiong & Zhang Jin & Feng Xu & Jin Xi, 2016. "Review on Financial Innovations in Big Data Era," Journal of Systems Science and Information, De Gruyter, vol. 4(6), pages 489-504, December.
    8. Shengjun Mao & Sanjeev Dewan & Yi-Jen (Ian) Ho, 2023. "Personalized Ranking at a Mobile App Distribution Platform," Information Systems Research, INFORMS, vol. 34(3), pages 811-827, September.
    9. Morgan Swink & Kejia Hu & Xiande Zhao, 2022. "Analytics applications, limitations, and opportunities in restaurant supply chains," Production and Operations Management, Production and Operations Management Society, vol. 31(10), pages 3710-3726, October.
    10. Peng, Cong, 2019. "Does e-commerce reduce traffic congestion? Evidence from Alibaba Single Day shopping event," LSE Research Online Documents on Economics 103411, London School of Economics and Political Science, LSE Library.
    11. Ratchford, Brian & Soysal, Gonca & Zentner, Alejandro & Gauri, Dinesh K., 2022. "Online and offline retailing: What we know and directions for future research," Journal of Retailing, Elsevier, vol. 98(1), pages 152-177.
    12. Shao, Xiaofeng, 2021. "Omnichannel retail move in a dual-channel supply chain," European Journal of Operational Research, Elsevier, vol. 294(3), pages 936-950.
    13. Gordon Burtch & Anindya Ghose & Sunil Wattal, 2016. "Secret Admirers: An Empirical Examination of Information Hiding and Contribution Dynamics in Online Crowdfunding," Information Systems Research, INFORMS, vol. 27(3), pages 478-496, September.
    14. Zhuang, Hejun & Popkowski Leszczyc, Peter T.L. & Lin, Yuanfang, 2018. "Why is Price Dispersion Higher Online than Offline? The Impact of Retailer Type and Shopping Risk on Price Dispersion," Journal of Retailing, Elsevier, vol. 94(2), pages 136-153.
    15. Helfrich, Magdalena & Herweg, Fabian, 2020. "Context-dependent preferences and retailing: Vertical restraints on internet sales," Journal of Behavioral and Experimental Economics (formerly The Journal of Socio-Economics), Elsevier, vol. 87(C).
    16. Chun, Hyunbae & Joo, Hailey Hayeon & Kang, Jisoo & Lee, Yoonsoo, 2020. "Diffusion of E-Commerce and Retail Job Apocalypse: Evidence from Credit Card Data on Online Spending," CEI Working Paper Series 2020-7, Center for Economic Institutions, Institute of Economic Research, Hitotsubashi University.
    17. Xulia González & Daniel Miles-Touya, 2018. "Price dispersion, chain heterogeneity, and search in online grocery markets," SERIEs: Journal of the Spanish Economic Association, Springer;Spanish Economic Association, vol. 9(1), pages 115-139, March.
    18. Naveen Kumar & Liangfei Qiu & Subodha Kumar, 2018. "Exit, Voice, and Response on Digital Platforms: An Empirical Investigation of Online Management Response Strategies," Information Systems Research, INFORMS, vol. 29(4), pages 849-870, December.
    19. Mochen Yang & Gediminas Adomavicius & Gordon Burtch & Yuqing Ren, 2018. "Mind the Gap: Accounting for Measurement Error and Misclassification in Variables Generated via Data Mining," Information Systems Research, INFORMS, vol. 29(1), pages 4-24, March.
    20. Mingfeng Lin & Siva Viswanathan, 2016. "Home Bias in Online Investments: An Empirical Study of an Online Crowdfunding Market," Management Science, INFORMS, vol. 62(5), pages 1393-1414, May.
