Printed from https://ideas.repec.org/p/zbw/zewdip/18033.html

Web mining of firm websites: A framework for web scraping and a pilot study for Germany

Author

Listed:
  • Kinne, Jan
  • Axenbeck, Janna

Abstract

Nowadays, almost all (relevant) firms have their own websites which they use to publish information about their products and services. Using the example of innovation in firms, we outline a framework for extracting information from firm websites using web scraping and data mining. For this purpose, we present an easy and free-to-use web scraping tool for large-scale data retrieval from firm websites. We apply this tool in a large-scale pilot study to provide information on the data source (i.e. the population of firm websites in Germany), which has as yet not been studied rigorously in terms of its qualitative and quantitative properties. We find, inter alia, that the use of websites and websites' characteristics (number of subpages and hyperlinks, text volume, language used) differs according to firm size, age, location, and sector. Web-based studies also have to contend with distinct outliers and the fact that low broadband availability appears to prevent firms from operating a website. Finally, we propose two approaches based on neural network language models and social network analysis to derive firm-level information from the extracted web data.
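
As an illustration of the website characteristics named in the abstract (hyperlinks, text volume), the following is a minimal sketch of how such measures could be derived from a single page's HTML. It is not the authors' tool: the class `PageStats`, the function `page_characteristics`, and the sample page are hypothetical, and only the Python standard library is used so that no network access is needed.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Collects hyperlink targets and visible text from one HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []        # href values of <a> tags
        self.text_parts = []   # visible text fragments
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def page_characteristics(html):
    """Return simple per-page measures: hyperlink count and text volume in characters."""
    parser = PageStats()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    return {
        "n_hyperlinks": len(parser.links),
        "text_chars": len(text),
        "links": parser.links,
    }


# Assumed sample page, for illustration only.
SAMPLE_HTML = """
<html><head><style>p {color: red}</style></head>
<body><p>We build pumps.</p>
<a href="/products">Products</a>
<a href="https://example.org/partner">Partner</a>
</body></html>
"""

stats = page_characteristics(SAMPLE_HTML)
print(stats["n_hyperlinks"], stats["text_chars"])
```

In a large-scale setting, measures like these would presumably be aggregated per firm across all downloaded subpages; language detection and the handling of outliers mentioned in the abstract are outside the scope of this sketch.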

Suggested Citation

  • Kinne, Jan & Axenbeck, Janna, 2018. "Web mining of firm websites: A framework for web scraping and a pilot study for Germany," ZEW Discussion Papers 18-033, ZEW - Leibniz Centre for European Economic Research.
  • Handle: RePEc:zbw:zewdip:18033

    Download full text from publisher

    File URL: https://www.econstor.eu/bitstream/10419/181864/1/1029682763.pdf
    Download Restriction: no


    More about this item

    Keywords

    Web Mining; Web Scraping; R&D; R&I; STI; Innovation; Indicators; Text Mining;

    JEL classification:

    • O30 - Economic Development, Innovation, Technological Change, and Growth - - Innovation; Research and Development; Technological Change; Intellectual Property Rights - - - General
    • C81 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - Methodology for Collecting, Estimating, and Organizing Microeconomic Data; Data Access
    • C88 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - Other Computer Software
