
Position: Standard Benchmarks Fail -- LLM Agents Present Overlooked Risks for Financial Applications

Authors

  • Zichen Chen
  • Jiaao Chen
  • Jianda Chen
  • Misha Sra

Abstract

Current financial LLM agent benchmarks are inadequate. They prioritize task performance while ignoring fundamental safety risks. Threats like hallucinations, temporal misalignment, and adversarial vulnerabilities pose systemic risks in high-stakes financial environments, yet existing evaluation frameworks fail to capture these risks. We take a firm position: traditional benchmarks are insufficient to ensure the reliability of LLM agents in finance. To address this, we analyze existing financial LLM agent benchmarks, finding safety gaps and introducing ten risk-aware evaluation metrics. Through an empirical evaluation of both API-based and open-weight LLM agents, we reveal hidden vulnerabilities that remain undetected by conventional assessments. To move the field forward, we propose the Safety-Aware Evaluation Agent (SAEA), grounded in a three-level evaluation framework that assesses agents at the model level (intrinsic capabilities), workflow level (multi-step process reliability), and system level (integration robustness). Our findings highlight the urgent need to redefine LLM agent evaluation standards by shifting the focus from raw performance to safety, robustness, and real-world resilience.

Suggested Citation

  • Zichen Chen & Jiaao Chen & Jianda Chen & Misha Sra, 2025. "Position: Standard Benchmarks Fail -- LLM Agents Present Overlooked Risks for Financial Applications," Papers 2502.15865, arXiv.org.
  • Handle: RePEc:arx:papers:2502.15865

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2502.15865
    File Function: Latest version
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Shijie Han & Changhai Zhou & Yiqing Shen & Tianning Sun & Yuhua Zhou & Xiaoxia Wang & Zhixiao Yang & Jingshu Zhang & Hongguang Li, 2025. "FinSphere: A Conversational Stock Analysis Agent Equipped with Quantitative Tools based on Real-Time Database," Papers 2501.12399, arXiv.org.
    2. Nikolay Hristov & Markus Roth, 2019. "Uncertainty Shocks and Financial Crisis Indicators," CESifo Working Paper Series 7839, CESifo.
    3. Daisuke Ikeda & Toan Phan & Timothy Sablik, 2020. "Asset Bubbles and Global Imbalances," Richmond Fed Economic Brief, Federal Reserve Bank of Richmond, vol. 20, pages 1-4, January.
    4. Alessandra Canepa & Fawaz Khaled, 2018. "Housing, Housing Finance and Credit Risk," IJFS, MDPI, vol. 6(2), pages 1-23, May.
    5. Carmen M. Reinhart & Kenneth S. Rogoff, 2014. "Recovery from Financial Crises: Evidence from 100 Episodes," American Economic Review, American Economic Association, vol. 104(5), pages 50-55, May.
    6. Hertrich Markus, 2019. "A Novel Housing Price Misalignment Indicator for Germany," German Economic Review, De Gruyter, vol. 20(4), pages 759-794, December.
    7. Roy, Saktinil & Kemme, David M., 2012. "Causes of banking crises: Deregulation, credit booms and asset bubbles, then and now," International Review of Economics & Finance, Elsevier, vol. 24(C), pages 270-294.
    8. David Lodge & Marta Rodriguez-Vives, 2013. "How long can austerity persist? The factors that sustain fiscal consolidations," European Journal of Government and Economics, Europa Grande, vol. 2(1), pages 5-24, June.
    9. R. Barrell & D. Karim & C. Macchiarelli, 2020. "Towards an understanding of credit cycles: do all credit booms cause crises?," The European Journal of Finance, Taylor & Francis Journals, vol. 26(10), pages 978-993, July.
    10. Bofinger, Peter & Franz, Wolfgang & Schmidt, Christoph M. & Weder di Mauro, Beatrice & Wiegard, Wolfgang, 2010. "Chancen für einen stabilen Aufschwung. Jahresgutachten 2010/11 [Chances for a stable upturn. Annual Report 2010/11]," Annual Economic Reports / Jahresgutachten, German Council of Economic Experts / Sachverständigenrat zur Begutachtung der gesamtwirtschaftlichen Entwicklung, volume 127, number 201011, September.
    11. Stijn Claessens & M. Ayhan Kose, 2013. "Financial Crises: Explanations, Types and Implications," CAMA Working Papers 2013-06, Centre for Applied Macroeconomic Analysis, Crawford School of Public Policy, The Australian National University.
    12. Hyein Shim & Maria H. Kim & Doojin Ryu, 2017. "Effects of intraday weather changes on asset returns and volatilities," Zbornik radova Ekonomskog fakulteta u Rijeci/Proceedings of Rijeka Faculty of Economics, University of Rijeka, Faculty of Economics and Business, vol. 35(2), pages 301-330.
    13. Yizhan Shu & Chenyu Yu & John M. Mulvey, 2024. "Downside risk reduction using regime-switching signals: a statistical jump model approach," Journal of Asset Management, Palgrave Macmillan, vol. 25(5), pages 493-507, September.
    14. Egon Smeral, 2009. "Mögliche Auswirkungen der Finanz- und Konjunkturkrise auf den österreichischen Tourismus," WIFO Studies, WIFO, number 34879.
    15. Guillermo Calvo & Fabrizio Coricelli & Pablo Ottonello, 2014. "Jobless Recoveries during Financial Crises: Is Inflation the Way Out?," Central Banking, Analysis, and Economic Policies Book Series, in: Sofía Bauducco & Lawrence Christiano & Claudio Raddatz (ed.),Macroeconomic and Financial Stability: challenges for Monetary Policy, edition 1, volume 19, chapter 11, pages 331-381, Central Bank of Chile.
    16. Elsas, Ralf & Hackethal, Andreas & Holzhäuser, Markus, 2010. "The anatomy of bank diversification," Journal of Banking & Finance, Elsevier, vol. 34(6), pages 1274-1287, June.
    17. Gerard Caprio & Patrick Honohan, 2008. "Banking Crises," Center for Development Economics 2008-09, Department of Economics, Williams College.
    18. Robert Fay & James Ketcheson, 2016. "The US Labour Market: How Much Slack Remains?," Staff Analytical Notes 16-9, Bank of Canada.
    19. Pais, Amelia & Stork, Philip A., 2011. "Contagion risk in the Australian banking and property sectors," Journal of Banking & Finance, Elsevier, vol. 35(3), pages 681-697, March.
    20. Pieter A. Gautier, 2009. "Coordination Frictions and The Financial Crisis," Tinbergen Institute Discussion Papers 09-028/3, Tinbergen Institute.
