Printed from https://ideas.repec.org/p/zbw/zewdip/18033.html

Web mining of firm websites: A framework for web scraping and a pilot study for Germany

Author

Listed:
  • Kinne, Jan
  • Axenbeck, Janna

Abstract

Nowadays, almost all (relevant) firms have their own websites which they use to publish information about their products and services. Using the example of innovation in firms, we outline a framework for extracting information from firm websites using web scraping and data mining. For this purpose, we present an easy and free-to-use web scraping tool for large-scale data retrieval from firm websites. We apply this tool in a large-scale pilot study to provide information on the data source (i.e. the population of firm websites in Germany), which has as yet not been studied rigorously in terms of its qualitative and quantitative properties. We find, inter alia, that the use of websites and websites' characteristics (number of subpages and hyperlinks, text volume, language used) differs according to firm size, age, location, and sector. Web-based studies also have to contend with distinct outliers and the fact that low broadband availability appears to prevent firms from operating a website. Finally, we propose two approaches based on neural network language models and social network analysis to derive firm-level information from the extracted web data.

Suggested Citation

  • Kinne, Jan & Axenbeck, Janna, 2018. "Web mining of firm websites: A framework for web scraping and a pilot study for Germany," ZEW Discussion Papers 18-033, ZEW - Leibniz Centre for European Economic Research.
  • Handle: RePEc:zbw:zewdip:18033

    Download full text from publisher

    File URL: https://www.econstor.eu/bitstream/10419/181864/1/1029682763.pdf
    Download Restriction: no

    References listed on IDEAS

    1. Janz, Norbert & Ebling, Günther & Gottschalk, Sandra & Peters, Bettina & Schmidt, Tobias, 2002. "Innovationsverhalten der deutschen Wirtschaft: Indikatorenbericht zur Innovationserhebung 2001," The Annual German Innovation Survey, Key Figures Reports 111699, ZEW - Leibniz Centre for European Economic Research.
    2. Sanjay K. Arora & Jan Youtie & Philip Shapira & Lidan Gao & TingTing Ma, 2013. "Entry strategies in an emerging technology: a pilot web-based study of graphene firms," Scientometrics, Springer;Akadémiai Kiadó, vol. 95(3), pages 1189-1207, June.
    3. Mohammad Arzaghi & J. Vernon Henderson, 2008. "Networking off Madison Avenue," Review of Economic Studies, Oxford University Press, vol. 75(4), pages 1011-1038.
    4. Bersch, Johannes & Gottschalk, Sandra & Müller, Bettina & Niefert, Michaela, 2014. "The Mannheim Enterprise Panel (MUP) and firm statistics for Germany," ZEW Discussion Papers 14-104, ZEW - Leibniz Centre for European Economic Research.
    5. Max Nathan & Anna Rosso, 2017. "Innovative events," Development Working Papers 429, Centro Studi Luca d'Agliano, University of Milano, revised 08 Apr 2019.
    6. Rammer, Christian & Kinne, Jan & Blind, Knut, 2016. "Microgeography of innovation in the city: Location patterns of innovative firms in Berlin," ZEW Discussion Papers 16-080, ZEW - Leibniz Centre for European Economic Research.
    7. Rammer, Christian & Berger, Marius & Doherr, Thorsten & Hud, Martin & Hünermund, Paul & Iferd, Younes & Peters, Bettina & Schubert, Torben, 2017. "Innovationsverhalten der deutschen Wirtschaft: Indikatorenbericht zur Innovationserhebung 2016," The Annual German Innovation Survey, Key Figures Reports 155758, ZEW - Leibniz Centre for European Economic Research.

    Citations

    Cited by:

    1. Axenbeck, Janna & Breithaupt, Patrick, 2019. "Web-based innovation indicators: Which firm website characteristics relate to firm-level innovation activity?," ZEW Discussion Papers 19-063, ZEW - Leibniz Centre for European Economic Research.
    2. German Data Forum RatSWD (ed.), 2020. "Big data in social, behavioural, and economic sciences: Data access and research data management," RatSWD Output Series, German Data Forum (RatSWD), volume 6, number 6-4en, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Axenbeck, Janna & Breithaupt, Patrick, 2019. "Web-based innovation indicators: Which firm website characteristics relate to firm-level innovation activity?," ZEW Discussion Papers 19-063, ZEW - Leibniz Centre for European Economic Research.
    2. Kinne, Jan & Resch, Bernd, 2017. "Analysing and predicting micro-location patterns of software firms," ZEW Discussion Papers 17-063, ZEW - Leibniz Centre for European Economic Research.
    3. Jan Kinne & Janna Axenbeck, 2020. "Web mining for innovation ecosystem mapping: a framework and a large-scale pilot study," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 2011-2041, December.
    4. Christian Rammer & Jan Kinne & Knut Blind, 2020. "Knowledge proximity and firm innovation: A microgeographic analysis for Berlin," Urban Studies, Urban Studies Journal Limited, vol. 57(5), pages 996-1014, April.
    5. Konon, Alexander & Fritsch, Michael & Kritikos, Alexander S., 2018. "Business cycles and start-ups across industries: An empirical analysis of German regions," Journal of Business Venturing, Elsevier, vol. 33(6), pages 742-761.
    6. Stephen J. Redding, 2010. "The Empirics Of New Economic Geography," Journal of Regional Science, Wiley Blackwell, vol. 50(1), pages 297-311, February.
    7. Billings, Stephen B. & Johnson, Erik B., 2012. "A non-parametric test for industrial specialization," Journal of Urban Economics, Elsevier, vol. 71(3), pages 312-331.
    8. Mark J. O. Bagley, 2019. "Networks, geography and the survival of the firm," Journal of Evolutionary Economics, Springer, vol. 29(4), pages 1173-1209, September.
    9. Gabriel Ahlfeldt & Pantelis Koutroumpis & Tommaso Valletti, 2017. "Speed 2.0: Evaluating Access to Universal Digital Highways," Journal of the European Economic Association, European Economic Association, vol. 15(3), pages 586-625.
    10. Martin, Philippe & Mayer, Thierry & Mayneris, Florian, 2011. "Spatial concentration and plant-level productivity in France," Journal of Urban Economics, Elsevier, vol. 69(2), pages 182-195, March.
    11. Li, Jing, 2014. "The influence of state policy and proximity to medical services on health outcomes," Journal of Urban Economics, Elsevier, vol. 80(C), pages 97-109.
    12. Edward L. Glaeser & Scott Duke Kominers & Michael Luca & Nikhil Naik, 2018. "Big Data And Big Cities: The Promises And Limitations Of Improved Measures Of Urban Life," Economic Inquiry, Western Economic Association International, vol. 56(1), pages 114-137, January.
    13. Vicente Romero de Ávila Serrano, 2019. "The Intrametropolitan Geography of Knowledge-Intensive Business Services (KIBS): A Comparative Analysis of Six European and U.S. City-Regions," Economic Development Quarterly, vol. 33(4), pages 279-295, November.
    14. Aaron Chatterji & Edward Glaeser & William Kerr, 2014. "Clusters of Entrepreneurship and Innovation," Innovation Policy and the Economy, University of Chicago Press, vol. 14(1), pages 129-166.
    15. Isabel Tecu, 2013. "The Location of Industrial Innovation: Does Manufacturing Matter?," Working Papers 13-09, Center for Economic Studies, U.S. Census Bureau.
    16. Hiroyasu Inoue & Kentaro Nakajima & Yukiko Umeno Saito, 2019. "Localization of collaborations in knowledge creation," The Annals of Regional Science, Springer;Western Regional Science Association, vol. 62(1), pages 119-140, February.
    17. William R. Kerr, 2010. "Breakthrough Inventions and Migrating Clusters of Innovation," NBER Chapters, in: Cities and Entrepreneurship, National Bureau of Economic Research, Inc.
    18. Li, Yin & Arora, Sanjay & Youtie, Jan & Shapira, Philip, 2018. "Using web mining to explore Triple Helix influences on growth in small and mid-size firms," Technovation, Elsevier, vol. 76, pages 3-14.
    19. William R. Kerr & Scott Duke Kominers, 2015. "Agglomerative Forces and Cluster Shapes," The Review of Economics and Statistics, MIT Press, vol. 97(4), pages 877-899, October.
    20. Proost, Stef & Thisse, Jacques-François, 2015. "Skilled Cities, Regional Disparities, and Efficient Transport: The state of the art and a research agenda," CEPR Discussion Papers 10790, C.E.P.R. Discussion Papers.

    More about this item

    Keywords

    Web Mining; Web Scraping; R&D; R&I; STI; Innovation; Indicators; Text Mining;

    JEL classification:

    • O30 - Economic Development, Innovation, Technological Change, and Growth - - Innovation; Research and Development; Technological Change; Intellectual Property Rights - - - General
    • C81 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - Methodology for Collecting, Estimating, and Organizing Microeconomic Data; Data Access
    • C88 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - Other Computer Software
