News Sentiment Embeddings for Stock Price Forecasting

Author

Listed:
  • Ayaan Qayyum

Abstract

This paper discusses how headline data can be used to predict stock prices. The stock in question is the SPDR S&P 500 ETF Trust (SPY), which tracks the performance of the 500 largest publicly traded corporations in the United States. A key focus is using news headlines from the Wall Street Journal (WSJ) to predict the movement of stock prices on a daily timescale. OpenAI-based text embedding models are used to create vector encodings of each headline, and principal component analysis (PCA) is used to extract the key features. The challenge of this work is to capture the nuanced time-dependent and time-independent impacts of news on stock prices while handling potential lag effects and market noise. Financial and economic data were collected to improve model performance; these sources include the U.S. Dollar Index (DXY) and Treasury yields. Over 390 machine-learning inference models were trained. The preliminary results show that headline data embeddings improve stock price prediction by at least 40% compared to training and optimizing a machine learning system without headline data embeddings.
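
The pipeline described in the abstract (embed headlines, reduce with PCA, combine with financial series, account for lag) can be illustrated with a minimal sketch. This is not the paper's implementation: the data below are random placeholders rather than real WSJ embeddings or SPY prices, the one-vector-per-day layout, the one-day lag, the number of components, and the ordinary least squares predictor are all assumptions made for illustration, and the helper names `pca_reduce` and `lagged` are hypothetical.

```python
import numpy as np


def pca_reduce(X, k):
    """Project the rows of X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


def lagged(features, lag):
    """Shift rows down by `lag` days so row t holds the values from day t - lag."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    out = np.full(features.shape, np.nan, dtype=float)
    out[lag:] = features[:-lag]
    return out


# Placeholder data (assumption): random numbers stand in for real inputs.
rng = np.random.default_rng(0)
n_days, emb_dim, k, lag = 200, 64, 5, 1
embeddings = rng.normal(size=(n_days, emb_dim))  # one headline vector per day
macro = rng.normal(size=(n_days, 2))             # e.g. DXY and a Treasury yield
returns = rng.normal(size=n_days)                # e.g. SPY daily returns

# Reduce the embeddings, then use only the previous day's information.
components = pca_reduce(embeddings, k)
X = np.hstack([lagged(components, lag), lagged(macro, lag)])[lag:]
y = returns[lag:]

# Fit ordinary least squares with an intercept column.
A = np.hstack([np.ones((X.shape[0], 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef
```

Lagging the features by one day keeps the sketch from using same-day information to predict same-day movement. Because the inputs here are random, the fitted coefficients carry no meaning; only the shape of the pipeline is of interest.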

Suggested Citation

  • Ayaan Qayyum, 2025. "News Sentiment Embeddings for Stock Price Forecasting," Papers 2507.01970, arXiv.org.
  • Handle: RePEc:arx:papers:2507.01970

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2507.01970
    File Function: Latest version
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Kamaladdin Fataliyev & Aneesh Chivukula & Mukesh Prasad & Wei Liu, 2021. "Stock Market Analysis with Text Data: A Review," Papers 2106.12985, arXiv.org, revised Jul 2021.
    2. Schnaubelt, Matthias & Fischer, Thomas G. & Krauss, Christopher, 2018. "Separating the signal from the noise - financial machine learning for Twitter," FAU Discussion Papers in Economics 14/2018, Friedrich-Alexander University Erlangen-Nuremberg, Institute for Economics.
    3. Schnaubelt, Matthias & Fischer, Thomas G. & Krauss, Christopher, 2020. "Separating the signal from the noise – Financial machine learning for Twitter," Journal of Economic Dynamics and Control, Elsevier, vol. 114(C).
    4. Weiguang Han & Boyi Zhang & Qianqian Xie & Min Peng & Yanzhao Lai & Jimin Huang, 2023. "Select and Trade: Towards Unified Pair Trading with Hierarchical Reinforcement Learning," Papers 2301.10724, arXiv.org, revised Feb 2023.
    5. Baoqiang Zhan & Shu Zhang & Helen S. Du & Xiaoguang Yang, 2022. "Exploring Statistical Arbitrage Opportunities Using Machine Learning Strategy," Computational Economics, Springer;Society for Computational Economics, vol. 60(3), pages 861-882, October.
    6. Marcos Delprato, 2025. "Private and public school efficiency gaps in Latin America-A combined DEA and machine learning approach based on PISA 2022," Papers 2509.25353, arXiv.org.
    7. Kentaro Imajo & Kentaro Minami & Katsuya Ito & Kei Nakagawa, 2020. "Deep Portfolio Optimization via Distributional Prediction of Residual Factors," Papers 2012.07245, arXiv.org.
    8. Fischer, Thomas G., 2018. "Reinforcement learning in financial markets - a survey," FAU Discussion Papers in Economics 12/2018, Friedrich-Alexander University Erlangen-Nuremberg, Institute for Economics.
    9. Qian, Yihe & Zhang, Yang, 2025. "Long-term forecasting in asset pricing: Machine learning models’ sensitivity to macroeconomic shifts and firm-specific factors," The North American Journal of Economics and Finance, Elsevier, vol. 78(C).
    10. Knoll, Julian & Stübinger, Johannes & Grottke, Michael, 2017. "Exploiting social media with higher-order Factorization Machines: Statistical arbitrage on high-frequency data of the S&P 500," FAU Discussion Papers in Economics 13/2017, Friedrich-Alexander University Erlangen-Nuremberg, Institute for Economics.
    11. Moews, Ben & Ibikunle, Gbenga, 2020. "Predictive intraday correlations in stable and volatile market environments: Evidence from deep learning," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 547(C).
    12. Guillaume Coqueret & Tony Guida, 2020. "Training trees on tails with applications to portfolio choice," Post-Print hal-04144665, HAL.
    13. Zhou, Hao & Kalev, Petko S., 2019. "Algorithmic and high frequency trading in Asia-Pacific, now and the future," Pacific-Basin Finance Journal, Elsevier, vol. 53(C), pages 186-207.
    14. Fischer, Thomas & Krauss, Christopher, 2017. "Deep learning with long short-term memory networks for financial market predictions," FAU Discussion Papers in Economics 11/2017, Friedrich-Alexander University Erlangen-Nuremberg, Institute for Economics.
    15. Mercadier, Mathieu & Lardy, Jean-Pierre, 2019. "Credit spread approximation and improvement using random forest regression," European Journal of Operational Research, Elsevier, vol. 277(1), pages 351-365.
    16. Nazemi, Abdolreza & Rezazadeh, Hani & Fabozzi, Frank J. & Höchstötter, Markus, 2022. "Deep learning for modeling the collection rate for third-party buyers," International Journal of Forecasting, Elsevier, vol. 38(1), pages 240-252.
    17. Alexander Jakob Dautel & Wolfgang Karl Härdle & Stefan Lessmann & Hsin-Vonn Seow, 2020. "Forex exchange rate forecasting using deep recurrent neural networks," Digital Finance, Springer, vol. 2(1), pages 69-96, September.
    18. Flori, Andrea & Regoli, Daniele, 2021. "Revealing Pairs-trading opportunities with long short-term memory networks," European Journal of Operational Research, Elsevier, vol. 295(2), pages 772-791.
    19. Kriebel, Johannes & Stitz, Lennart, 2022. "Credit default prediction from user-generated text in peer-to-peer lending using deep learning," European Journal of Operational Research, Elsevier, vol. 302(1), pages 309-323.
    20. Mert Edali, 2022. "Pattern‐oriented analysis of system dynamics models via random forests," System Dynamics Review, System Dynamics Society, vol. 38(2), pages 135-166, April.
