Printed from https://ideas.repec.org/p/zbw/zewdip/18033.html

Web mining of firm websites: A framework for web scraping and a pilot study for Germany

Authors

  • Kinne, Jan
  • Axenbeck, Janna

Abstract

Nowadays, almost all (relevant) firms have their own websites which they use to publish information about their products and services. Using the example of innovation in firms, we outline a framework for extracting information from firm websites using web scraping and data mining. For this purpose, we present an easy and free-to-use web scraping tool for large-scale data retrieval from firm websites. We apply this tool in a large-scale pilot study to provide information on the data source (i.e. the population of firm websites in Germany), which has as yet not been studied rigorously in terms of its qualitative and quantitative properties. We find, inter alia, that the use of websites and websites' characteristics (number of subpages and hyperlinks, text volume, language used) differs according to firm size, age, location, and sector. Web-based studies also have to contend with distinct outliers and the fact that low broadband availability appears to prevent firms from operating a website. Finally, we propose two approaches based on neural network language models and social network analysis to derive firm-level information from the extracted web data.
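
As a minimal illustration of two of the website characteristics named in the abstract (number of hyperlinks and text volume), the following Python sketch derives them from a single HTML page using only the standard library. It is not the authors' tool: the class `PageStats`, the function `page_characteristics`, and the sample markup are hypothetical names chosen for this example, and a real study would additionally need crawling, subpage counting, and language detection.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Collects hyperlink targets and visible text from one HTML page.

    Hypothetical helper for illustration; not part of the paper's tool.
    """

    def __init__(self):
        super().__init__()
        self.links = []        # href values of <a> tags
        self.text_parts = []   # non-empty visible text fragments
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:  # ignore anchors without a target
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text outside <script>/<style> blocks.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def page_characteristics(html):
    """Return hyperlink count and text volume (in characters) for one page."""
    parser = PageStats()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    return {
        "n_hyperlinks": len(parser.links),
        "text_chars": len(text),
        "links": parser.links,
    }


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{}</style></head><body>"
        "<p>Hello world</p><a href='/about'>About</a>"
        "<script>var x = 1;</script></body></html>"
    )
    print(page_characteristics(sample))
```

Parsing is kept separate from downloading so the same function can be applied to pages retrieved by any scraper; aggregating the per-page results per firm would then give firm-level characteristics.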

Suggested Citation

  • Kinne, Jan & Axenbeck, Janna, 2018. "Web mining of firm websites: A framework for web scraping and a pilot study for Germany," ZEW Discussion Papers 18-033, ZEW - Leibniz Centre for European Economic Research.
  • Handle: RePEc:zbw:zewdip:18033

    Download full text from publisher

    File URL: https://www.econstor.eu/bitstream/10419/181864/1/1029682763.pdf
    Download Restriction: no


    Citations



    Cited by:

    1. Proeger, Till & Meub, Lukas & Pölert, Hauke, 2021. "Analyse des Digitalisierungsgrads von Bildungseinrichtungen auf Basis von Webscraping - eine methodische Vorstudie," Göttinger Beiträge zur Handwerksforschung 56, Volkswirtschaftliches Institut für Mittelstand und Handwerk an der Universität Göttingen (ifh).
    2. Janna Axenbeck & Patrick Breithaupt, 2021. "Innovation indicators based on firm websites—Which website characteristics predict firm-level innovation activity?," PLOS ONE, Public Library of Science, vol. 16(4), pages 1-23, April.
    3. Jan Kinne & David Lenz, 2021. "Predicting innovative firms using web mining and deep learning," PLOS ONE, Public Library of Science, vol. 16(4), pages 1-18, April.
    4. Axenbeck, Janna & Breithaupt, Patrick, 2019. "Web-based innovation indicators: Which firm website characteristics relate to firm-level innovation activity?," ZEW Discussion Papers 19-063, ZEW - Leibniz Centre for European Economic Research.
    5. Abbasiharofteh, Milad & Kinne, Jan & Krüger, Miriam, 2021. "The strength of weak and strong ties in bridging geographic and cognitive distances," ZEW Discussion Papers 21-049, ZEW - Leibniz Centre for European Economic Research.
    6. German Data Forum RatSWD (ed.), 2020. "Big data in social, behavioural, and economic sciences: Data access and research data management," RatSWD Output Series, German Data Forum (RatSWD), volume 6, number 6-4en, December.
    7. Meub, Lukas & Proeger, Till & Bizer, Kilian, 2022. "Vernetzung von Unternehmen und Forschungseinrichtungen in regionalen Innovationssystemen durch Webscraping," Göttinger Beiträge zur Handwerksforschung 62, Volkswirtschaftliches Institut für Mittelstand und Handwerk an der Universität Göttingen (ifh).
    8. Rammer, Christian & Es-Sadki, Nordine, 2022. "Using big data for generating firm-level innovation indicators: A literature review," ZEW Discussion Papers 22-007, ZEW - Leibniz Centre for European Economic Research.
    9. Proeger, Till & Meub, Lukas & Bizer, Kilian, 2021. "Webscraping als Instrument zur tagesaktuellen und umfassenden digitalen Analyse des Handwerks," Göttinger Beiträge zur Handwerksforschung 55, Volkswirtschaftliches Institut für Mittelstand und Handwerk an der Universität Göttingen (ifh).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Jan Kinne & Janna Axenbeck, 2020. "Web mining for innovation ecosystem mapping: a framework and a large-scale pilot study," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 2011-2041, December.
    2. Gilles Duranton & William R. Kerr, 2015. "The Logic of Agglomeration," NBER Working Papers 21452, National Bureau of Economic Research, Inc.
    3. Axenbeck, Janna & Breithaupt, Patrick, 2019. "Web-based innovation indicators: Which firm website characteristics relate to firm-level innovation activity?," ZEW Discussion Papers 19-063, ZEW - Leibniz Centre for European Economic Research.
    4. Annekatrin Niebuhr & Jan Cornelius Peters & Alex Schmidke, 2020. "Spatial sorting of innovative firms and heterogeneous effects of agglomeration on innovation in Germany," The Journal of Technology Transfer, Springer, vol. 45(5), pages 1343-1375, October.
    5. Rammer, Christian & Fernández, Gastón P. & Czarnitzki, Dirk, 2021. "Artificial intelligence and industrial innovation: Evidence from firm-level data," ZEW Discussion Papers 21-036, ZEW - Leibniz Centre for European Economic Research.
    6. Tobias Schlegel & Curdin Pfister & Dietmar Harhoff & Uschi Backes-Gellner, 2022. "Innovation effects of universities of applied sciences: an assessment of regional heterogeneity," The Journal of Technology Transfer, Springer, vol. 47(1), pages 63-118, February.
    7. Jonathan I. Dingel & Felix Tintelnot, 2020. "Spatial Economics for Granular Settings," NBER Working Papers 27287, National Bureau of Economic Research, Inc.
    8. Karen Miranda & Oscar Martínez Ibáñez & Miguel Manjón Antolín, 2015. "Estimating Individual Effects and their Spatial Spillovers in Linear Panel Data Models," Post-Print hal-01430809, HAL.
    9. Proost, Stef & Thisse, Jacques-François, 2015. "Skilled Cities, Regional Disparities, and Efficient Transport: The state of the art and a research agenda," CEPR Discussion Papers 10790, C.E.P.R. Discussion Papers.
    10. Enrico Moretti, 2019. "The Effect of High-Tech Clusters on the Productivity of Top Inventors," NBER Working Papers 26270, National Bureau of Economic Research, Inc.
    11. Carlino, Gerald & Kerr, William R., 2015. "Agglomeration and Innovation," Handbook of Regional and Urban Economics, in: Gilles Duranton & J. V. Henderson & William C. Strange (ed.), Handbook of Regional and Urban Economics, edition 1, volume 5, chapter 0, pages 349-404, Elsevier.
    12. Olof Ejermo & Katrin Hussinger & Basheer Kalash & Torben Schubert, 2022. "Innovation in Malmö after the Öresund Bridge," Journal of Regional Science, Wiley Blackwell, vol. 62(1), pages 5-20, January.
    13. Kristian Behrens & Sergey Kichko & Jacques-Francois Thisse, 2021. "Working from Home: Too Much of a Good Thing?," CESifo Working Paper Series 8831, CESifo.
    14. Stephen J. Redding, 2020. "Trade and Geography," NBER Working Papers 27821, National Bureau of Economic Research, Inc.
    15. Holl, Adelheid & Peters, Bettina & Rammer, Christian, 2020. "Local knowledge spillovers and innovation persistence of firms," ZEW Discussion Papers 20-005, ZEW - Leibniz Centre for European Economic Research.
    16. Chris Forman & Avi Goldfarb, 2021. "Concentration and Agglomeration of IT Innovation and Entrepreneurship: Evidence from Patenting," NBER Chapters, in: The Role of Innovation and Entrepreneurship in Economic Growth, pages 95-121, National Bureau of Economic Research, Inc.
    17. Giulia Faggio & Teresa Schlüter & Philipp vom Berge, 2018. "Interaction of Public and Private Employment: Evidence from a German Government Move," SERC Discussion Papers 0229, Centre for Economic Performance, LSE.
    18. Fritsch, Michael & Wyrwich, Michael, 2021. "Is innovation (increasingly) concentrated in large cities? An international comparison," Research Policy, Elsevier, vol. 50(6).
    19. Combes, Pierre-Philippe & Gobillon, Laurent, 2015. "The Empirics of Agglomeration Economies," Handbook of Regional and Urban Economics, in: Gilles Duranton & J. V. Henderson & William C. Strange (ed.), Handbook of Regional and Urban Economics, edition 1, volume 5, chapter 0, pages 247-348, Elsevier.
    20. Michael Fritsch & Martin Obschonka & Fabian Wahl & Michael Wyrwich, 2021. "Cultural Imprinting: Ancient Origins of Entrepreneurship and Innovation in Germany," Jena Economic Research Papers 2021-012, Friedrich-Schiller-University Jena.

    More about this item

    Keywords

    Web Mining; Web Scraping; R&D; R&I; STI; Innovation; Indicators; Text Mining

    JEL classification:

    • O30 - Economic Development, Innovation, Technological Change, and Growth - - Innovation; Research and Development; Technological Change; Intellectual Property Rights - - - General
    • C81 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - Methodology for Collecting, Estimating, and Organizing Microeconomic Data; Data Access
    • C88 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - Other Computer Software
