Printed from https://ideas.repec.org/p/arx/papers/2307.09332.html

Company2Vec -- German Company Embeddings based on Corporate Websites

Author

Listed:
  • Christopher Gerling

Abstract

With Company2Vec, the paper proposes a novel application of representation learning. The model analyzes business activities from unstructured company website data using Word2Vec and dimensionality reduction. Company2Vec preserves semantic language structures and thus creates efficient company embeddings, even for fine-grained industries. These semantic embeddings can be used for various applications in banking. Direct relations between companies and words allow semantic business analytics (e.g. the top-n words for a company). Furthermore, industry prediction is presented both as a supervised learning application and as an evaluation method. Because the embeddings are vectors, the similarity of two companies can be measured with the cosine distance. Company2Vec hence offers a more fine-grained comparison of companies than the standard industry labels (NACE). This property is relevant for unsupervised learning tasks such as clustering: an alternative industry segmentation is shown with k-means clustering on the company embeddings. Finally, the paper proposes three algorithms for (1) firm-centric, (2) industry-centric and (3) portfolio-centric peer-firm identification.
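
The cosine-based comparison mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. The company names, the three-dimensional toy vectors and the helper names `cosine_similarity` and `top_n_peers` are all assumptions made for illustration. The sketch ranks companies by cosine similarity (one minus the cosine distance) to a target company, which is one plausible reading of a firm-centric peer search.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (1 - cosine distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def top_n_peers(embeddings, company, n=2):
    """Return the n companies most similar to `company`, best first."""
    target = embeddings[company]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != company
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical toy embeddings; real embeddings would have far more dimensions.
toy_embeddings = {
    "bakery_a": [0.9, 0.1, 0.0],
    "bakery_b": [0.8, 0.2, 0.1],
    "bank_a": [0.0, 0.1, 0.9],
    "insurer_a": [0.1, 0.2, 0.8],
}

peers = top_n_peers(toy_embeddings, "bakery_a", n=2)
for name, score in peers:
    print(f"{name}: {score:.3f}")
```

In this toy data the second bakery ranks first for `bakery_a`, ahead of both financial firms. Unlike a shared industry label, the score also orders companies within and across industries.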

Suggested Citation

  • Christopher Gerling, 2023. "Company2Vec -- German Company Embeddings based on Corporate Websites," Papers 2307.09332, arXiv.org.
  • Handle: RePEc:arx:papers:2307.09332

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2307.09332
    File Function: Latest version
    Download Restriction: no

    References listed on IDEAS

    1. Lee, Charles M.C. & Ma, Paul & Wang, Charles C.Y., 2015. "Search-based peer firms: Aggregating investor perceptions through internet co-searches," Journal of Financial Economics, Elsevier, vol. 116(2), pages 410-431.
    2. Samuel Rönnqvist & Peter Sarlin, 2015. "Bank networks from text: interrelations, centrality and determinants," Quantitative Finance, Taylor & Francis Journals, vol. 15(10), pages 1619-1635, October.
    3. Samuel Ronnqvist & Peter Sarlin, 2014. "Bank Networks from Text: Interrelations, Centrality and Determinants," Papers 1406.7752, arXiv.org, revised Jul 2015.
    4. Gerard Hoberg & Gordon Phillips, 2016. "Text-Based Network Industries and Endogenous Product Differentiation," Journal of Political Economy, University of Chicago Press, vol. 124(5), pages 1423-1465.

    Citations

    Cited by:

    1. Christopher Gerling & Stefan Lessmann, 2023. "Multimodal Document Analytics for Banking Process Automation," Papers 2307.11845, arXiv.org, revised Nov 2023.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Fang, Yi & Lin, Hao & Lu, Liping, 2025. "Measuring systemic risk from textual Analysis: Evidence from Chinese Banks," International Review of Economics & Finance, Elsevier, vol. 103(C).
    2. Dimitrios Vamvourellis & Máté Toth & Snigdha Bhagat & Dhruv Desai & Dhagash Mehta & Stefano Pasquali, 2023. "Company Similarity using Large Language Models," Papers 2308.08031, arXiv.org.
    3. Xi Zhang & Jiawei Shi & Di Wang & Binxing Fang, 2018. "Exploiting Investors Social Network for Stock Prediction in China's Market," Papers 1801.00597, arXiv.org.
    4. Baptiste Colas & Carl Brousseau, 2025. "Industry classification misfits: identification, consequences and guidance," Review of Accounting Studies, Springer, vol. 30(4), pages 3295-3343, December.
    5. Andy Fodor & Randy D. Jorgensen & John D. Stowe, 2021. "Financial clusters, industry groups, and stock return correlations," Journal of Financial Research, Southern Finance Association;Southwestern Finance Association, vol. 44(1), pages 121-144, April.
    6. Schwenkler, G. & Zheng, H., 2025. "News-driven peer co-movement in crypto markets," Journal of Corporate Finance, Elsevier, vol. 93(C).
    7. Ge, S., 2020. "Text-Based Linkages and Local Risk Spillovers in the Equity Market," Cambridge Working Papers in Economics 20115, Faculty of Economics, University of Cambridge.
    8. Chen, Zilin & Guo, Li & Tu, Jun, 2021. "Media connection and return comovement," Journal of Economic Dynamics and Control, Elsevier, vol. 130(C).
    9. Aobdia, Daniel & Cheng, Lin, 2018. "Unionization, product market competition, and strategic disclosure," Journal of Accounting and Economics, Elsevier, vol. 65(2), pages 331-357.
    10. Samuel Ronnqvist & Peter Sarlin, 2015. "Detect & Describe: Deep learning of bank stress in the news," Papers 1507.07870, arXiv.org.
    11. Xia, Jingjing, 2023. "Redrawing the line: Narrowly beating analyst forecasts and journalists’ co-coverage choices in earnings-related news articles," Journal of Contemporary Accounting and Economics, Elsevier, vol. 19(3).
    12. You, Li & Zhang, Zongyi & Wang, Wei & Zhao, Xuezhou, 2024. "Learning from innovation award winners? Technology spillovers and firm innovation," International Review of Financial Analysis, Elsevier, vol. 92(C).
    13. Jingyu Li & Xiaoyan Yuan & Qiwei Xie & Guowen Li, 2026. "Homogeneity of corporate risk perceptions and systemic financial risk," Humanities and Social Sciences Communications, Palgrave Macmillan, vol. 13(1), pages 1-16, December.
    14. Zheng, Hannan & Schwenkler, Gustavo, 2020. "The network of firms implied by the news," ESRB Working Paper Series 108, European Systemic Risk Board.
    15. Samuel Ronnqvist & Peter Sarlin, 2016. "Bank distress in the news: Describing events through deep learning," Papers 1603.05670, arXiv.org, revised Dec 2016.
    16. Fang, Libing & Sun, Boyang & Li, Huijing & Yu, Honghai, 2018. "Systemic risk network of Chinese financial institutions," Emerging Markets Review, Elsevier, vol. 35(C), pages 190-206.
    17. Nan Li, 2025. "Labor market peer firms: understanding firms’ labor market linkages through employees’ internet “also viewed” firms," Review of Accounting Studies, Springer, vol. 30(1), pages 384-435, March.
    18. Yi Cao & Long Chen & Jennifer Wu Tucker & Chi Wan, 2025. "Can generative AI help identify peer firms?," Review of Accounting Studies, Springer, vol. 30(4), pages 3344-3386, December.
    19. Liu, Wei & Ma, Qianting & Liu, Xiaoxing, 2022. "Research on the dynamic evolution and its influencing factors of stock correlation network in the Chinese new energy market," Finance Research Letters, Elsevier, vol. 45(C).
    20. Zhibin Niu & Junqi Wu & Dawei Cheng & Jiawan Zhang, 2021. "Regshock: Interactive Visual Analytics of Systemic Risk in Financial Networks," Papers 2104.11863, arXiv.org.
