
Is All the Information in the Price? LLM Embeddings versus the EMH in Stock Clustering

Authors

  • Bingyang Wang
  • Grant Johnson
  • Maria Hybinette
  • Tucker Balch

Abstract

This paper investigates whether artificial intelligence can enhance stock clustering compared to traditional methods. We consider this in the context of the semi-strong Efficient Markets Hypothesis (EMH), which posits that prices fully reflect all public information and, accordingly, that clusters based on price information cannot be improved upon. We benchmark three clustering approaches: (i) price-based clusters derived from historical return correlations, (ii) human-informed clusters defined by the Global Industry Classification Standard (GICS), and (iii) AI-driven clusters constructed from large language model (LLM) embeddings of stock-related news headlines. At each date, each method provides a classification in which each stock is assigned to a cluster. To evaluate a clustering, we transform it into a synthetic factor model following the Arbitrage Pricing Theory (APT) framework. This enables consistent evaluation of predictive performance in a roll-forward, out-of-sample test. Using S&P 500 constituents from 2022 through 2024, we find that price-based clustering consistently outperforms both rule-based and AI-based methods, reducing root mean squared error (RMSE) by 15.9% relative to GICS and 14.7% relative to LLM embeddings. Our contributions are threefold: (i) a generalizable methodology that converts any equity grouping (manual, machine, or market-driven) into a real-time factor model for evaluation; (ii) the first direct comparison of price-based, human rule-based, and AI-based clustering under identical conditions; and (iii) empirical evidence reinforcing that short-horizon return information is largely contained in prices. These results support the EMH while offering practitioners a practical diagnostic for monitoring evolving sector structures and providing academics with a framework for testing alternative hypotheses about how quickly markets absorb information.
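
The evaluation idea in the abstract, turning a clustering into a factor model and scoring it by out-of-sample RMSE, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function name `cluster_factor_rmse`, the equal-weighted cluster return used as the synthetic factor, the no-intercept OLS betas, the single train/test split standing in for a roll-forward test, and the simulated returns.

```python
import numpy as np

def cluster_factor_rmse(returns, labels, train_len):
    """Out-of-sample RMSE of a one-factor-per-cluster model (illustrative only).

    returns:   (T, N) array of periodic stock returns.
    labels:    length-N array of cluster ids, held fixed over time here.
    train_len: number of leading rows used to estimate betas.
    """
    returns = np.asarray(returns, dtype=float)
    labels = np.asarray(labels)
    sq_errors = []
    for c in np.unique(labels):
        members = returns[:, labels == c]      # (T, n_c) returns of cluster members
        factor = members.mean(axis=1)          # assumed factor: equal-weighted cluster return
        f_train, f_test = factor[:train_len], factor[train_len:]
        r_train, r_test = members[:train_len], members[train_len:]
        # No-intercept OLS slope per stock: beta_i = <f, r_i> / <f, f>
        denom = f_train @ f_train
        if denom > 0:
            betas = (f_train @ r_train) / denom
        else:
            betas = np.zeros(members.shape[1])
        pred = np.outer(f_test, betas)         # (T_test, n_c) predicted returns
        sq_errors.append(((r_test - pred) ** 2).ravel())
    return float(np.sqrt(np.concatenate(sq_errors).mean()))

# Simulated example: 8 stocks driven by two latent factors plus noise.
rng = np.random.default_rng(0)
T, train_len = 200, 150
latent = rng.normal(0.0, 0.02, size=(T, 2))
true_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
returns = latent[:, true_labels] + rng.normal(0.0, 0.005, size=(T, 8))

# A grouping that mixes the two true clusters should predict worse.
mixed_labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])
rmse_true = cluster_factor_rmse(returns, true_labels, train_len)
rmse_mixed = cluster_factor_rmse(returns, mixed_labels, train_len)
print(f"RMSE with true grouping:  {rmse_true:.5f}")
print(f"RMSE with mixed grouping: {rmse_mixed:.5f}")
```

In this toy setting, any grouping (correlation-based, GICS-style, or embedding-based) can be passed as `labels` and compared on the same footing. Note that each stock contributes to its own cluster factor in this sketch, and the factor is observed contemporaneously; a stricter design would exclude the stock from its factor and re-estimate labels and betas at each date.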

Suggested Citation

  • Bingyang Wang & Grant Johnson & Maria Hybinette & Tucker Balch, 2025. "Is All the Information in the Price? LLM Embeddings versus the EMH in Stock Clustering," Papers 2509.01590, arXiv.org.
  • Handle: RePEc:arx:papers:2509.01590

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2509.01590
    File Function: Latest version
    Download Restriction: no

    References listed on IDEAS

    1. Stephen A. Ross, 2013. "The Arbitrage Theory of Capital Asset Pricing," World Scientific Book Chapters, in: Leonard C MacLean & William T Ziemba (ed.), HANDBOOK OF THE FUNDAMENTALS OF FINANCIAL DECISION MAKING Part I, chapter 1, pages 11-30, World Scientific Publishing Co. Pte. Ltd.
    2. Yunjong Eo & Luis Uzeda & Benjamin Wong, 2023. "Understanding trend inflation through the lens of the goods and services sectors," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 38(5), pages 751-766, August.
    3. Taufiq Choudhry & Bashir Nur Osoble, 2015. "Nonlinear Interdependence Between the US and Emerging Markets' Industrial Stock Sectors," International Journal of Finance & Economics, John Wiley & Sons, Ltd., vol. 20(1), pages 61-79, January.
    4. Sanjeev Bhojraj & Charles M. C. Lee & Derek K. Oler, 2003. "What's My Line? A Comparison of Industry Classification Schemes for Capital Market Research," Journal of Accounting Research, John Wiley & Sons, Ltd., vol. 41(5), pages 745-774, December.
    5. Fama, Eugene F, 1970. "Efficient Capital Markets: A Review of Theory and Empirical Work," Journal of Finance, American Finance Association, vol. 25(2), pages 383-417, May.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Baoqiang Zhan & Shu Zhang & Helen S. Du & Xiaoguang Yang, 2022. "Exploring Statistical Arbitrage Opportunities Using Machine Learning Strategy," Computational Economics, Springer;Society for Computational Economics, vol. 60(3), pages 861-882, October.
    2. Shi, Huai-Long & Zhou, Wei-Xing, 2022. "Factor volatility spillover and its implications on factor premia," Journal of International Financial Markets, Institutions and Money, Elsevier, vol. 80(C).
    3. Sellin, Peter, 1998. "Monetary Policy and the Stock Market: Theory and Empirical Evidence," Working Paper Series 72, Sveriges Riksbank (Central Bank of Sweden).
    4. John H. Cochrane, 1999. "New facts in finance," Economic Perspectives, Federal Reserve Bank of Chicago, vol. 23(Q III), pages 36-58.
    5. Nathan Jensen, 2007. "International institutions and market expectations: Stock price responses to the WTO ruling on the 2002 U.S. steel tariffs," The Review of International Organizations, Springer, vol. 2(3), pages 261-280, September.
    6. Saggese, Pietro & Belmonte, Alessandro & Dimitri, Nicola & Facchini, Angelo & Böhme, Rainer, 2023. "Arbitrageurs in the Bitcoin ecosystem: Evidence from user-level trading patterns in the Mt. Gox exchange platform," Journal of Economic Behavior & Organization, Elsevier, vol. 213(C), pages 251-270.
    7. Guo, Weiwei & Intini, Silvia & Jahanshahloo, Hossein, 2025. "Bitcoin arbitrage and exchange default risk," Finance Research Letters, Elsevier, vol. 71(C).
    8. Robin Maximilian Stetzka & Stefan Winter, 2023. "How rational is gambling?," Journal of Economic Surveys, Wiley Blackwell, vol. 37(4), pages 1432-1488, September.
    9. Raphael Gwahula, 2018. "Examining Key Macroeconomic Factors Influencing the Stock Market Performance: Evidence from Tanzania," International Journal of Academic Research in Accounting, Finance and Management Sciences, Human Resource Management Academic Research Society, International Journal of Academic Research in Accounting, Finance and Management Sciences, vol. 8(2), pages 228-234, April.
    10. Gikas Hardouvelis & George Papanastasopoulos & Dimitrios D. Thomakos & Tao Wang, 2007. "Accruals, Net Stock Issues and Value-Glamour Anomalies: New Evidence on their Relation," Working Paper series 47_07, Rimini Centre for Economic Analysis.
    11. Adam Zaremba & Jacob Koby Shemer, 2018. "Price-Based Investment Strategies," Springer Books, Springer, number 978-3-319-91530-2, March.
    12. Gabriel Frahm, 0. "Arbitrage Pricing Theory In Ergodic Markets," International Journal of Theoretical and Applied Finance (IJTAF), World Scientific Publishing Co. Pte. Ltd., vol. 21(05), pages 1-28.
    13. Lucena, Pierre & Fugueiredo, Antonio Carlos, 2004. "Pressupostos de Eficiência de Mercado: um estudo empírico na Bovespa [Assumptions of Market Efficiency: an empirical analysis at Bovespa/Brazil]," MPRA Paper 40884, University Library of Munich, Germany.
    14. Tobias Wiest, 2023. "Momentum: what do we know 30 years after Jegadeesh and Titman’s seminal paper?," Financial Markets and Portfolio Management, Springer;Swiss Society for Financial Market Research, vol. 37(1), pages 95-114, March.
    15. Carmen López-Martín & Sonia Benito Muela & Raquel Arguedas, 2021. "Efficiency in cryptocurrency markets: new evidence," Eurasian Economic Review, Springer;Eurasia Business and Economics Society, vol. 11(3), pages 403-431, September.
    16. Stefan Nagel, 2013. "Empirical Cross-Sectional Asset Pricing," Annual Review of Financial Economics, Annual Reviews, vol. 5(1), pages 167-199, November.
    17. Bebchuk, Lucian A. & Cohen, Alma & Wang, Charles C.Y., 2013. "Learning and the disappearing association between governance and returns," Journal of Financial Economics, Elsevier, vol. 108(2), pages 323-348.
    18. Abdull-Baseet Abusaba & Dr. Muba Seif & Dr. Komba Gabriel, 2025. "Impact of Macroeconomic Variables on Stock Market Prices in Sub-Saharan Africa," International Journal of Health, Medicine and Nursing Practice, CARI Journals Limited, vol. 7(1), pages 24-44.
    19. Attiya Yasmeen Javid, 2000. "Alternative Capital Asset Pricing Models: A Review of Theory and Evidence," PIDE Research Report 2000:3, Pakistan Institute of Development Economics.
    20. You‐How Go & Wee‐Yeap Lau, 2023. "What do we know about informational efficiency? Three puzzles and the new direction forward," Journal of Economic Surveys, Wiley Blackwell, vol. 37(4), pages 1489-1525, September.
