Printed from https://ideas.repec.org/p/arx/papers/2503.16974.html

Assessing Consistency and Reproducibility in the Outputs of Large Language Models: Evidence Across Diverse Finance and Accounting Tasks

Author

Listed:
  • Julian Junyan Wang
  • Victor Xiaoqi Wang

Abstract

This study provides the first comprehensive assessment of consistency and reproducibility in Large Language Model (LLM) outputs in finance and accounting research. We evaluate how consistently LLMs produce outputs given identical inputs through extensive experimentation with 50 independent runs across five common tasks: classification, sentiment analysis, summarization, text generation, and prediction. Using three OpenAI models (GPT-3.5-turbo, GPT-4o-mini, and GPT-4o), we generate over 3.4 million outputs from diverse financial source texts and data, covering MD&As, FOMC statements, finance news articles, earnings call transcripts, and financial statements. Our findings reveal substantial but task-dependent consistency, with binary classification and sentiment analysis achieving near-perfect reproducibility, while complex tasks show greater variability. More advanced models are not uniformly more consistent or reproducible; the patterns are task-specific. LLMs significantly outperform expert human annotators in consistency and maintain high agreement even where human experts significantly disagree. Simple aggregation strategies across 3-5 runs dramatically improve consistency, and aggregation may also improve accuracy for sentiment analysis when using newer models. Simulation analysis reveals that despite measurable inconsistency in LLM outputs, downstream statistical inferences remain remarkably robust. These findings address concerns about what we term "G-hacking," the selective reporting of favorable outcomes from multiple Generative AI runs, by demonstrating that such risks are relatively low for finance and accounting tasks.
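
The aggregation idea in the abstract can be illustrated with a small sketch. The abstract does not specify the paper's consistency metric or aggregation rule, so everything below is an assumption for illustration: consistency is measured as mean pairwise agreement between runs, aggregation is a per-item majority vote over groups of 3 runs, and the labels are made-up data rather than results from the paper.

```python
from collections import Counter
from itertools import combinations


def majority_vote(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def pairwise_agreement(runs):
    """Mean share of items on which two runs agree, over all pairs of runs."""
    pairs = list(combinations(runs, 2))
    total = 0.0
    for a, b in pairs:
        total += sum(x == y for x, y in zip(a, b)) / len(a)
    return total / len(pairs)


def aggregate(runs, k):
    """Split runs into consecutive groups of k and majority-vote each item."""
    groups = [runs[i:i + k] for i in range(0, len(runs) - k + 1, k)]
    return [[majority_vote(item) for item in zip(*g)] for g in groups]


# Hypothetical data (not from the paper): 6 runs labelling the same 4 texts.
runs = [
    ["pos", "neg", "neu", "pos"],
    ["pos", "neg", "pos", "pos"],
    ["pos", "neg", "neu", "neg"],
    ["pos", "neg", "neu", "pos"],
    ["pos", "pos", "neu", "pos"],
    ["pos", "neg", "neu", "pos"],
]

raw = pairwise_agreement(runs)    # agreement between individual runs
voted = aggregate(runs, 3)        # two aggregated "runs" of 3 votes each
agg = pairwise_agreement(voted)   # agreement between aggregated runs

print(f"single-run agreement: {raw:.2f}")
print(f"3-run majority-vote agreement: {agg:.2f}")
```

In this toy data each stray label is outvoted within its group, so the aggregated outputs agree with each other more often than the individual runs do. Whether that holds on real outputs depends on how the disagreements are distributed.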

Suggested Citation

  • Julian Junyan Wang & Victor Xiaoqi Wang, 2025. "Assessing Consistency and Reproducibility in the Outputs of Large Language Models: Evidence Across Diverse Finance and Accounting Tasks," Papers 2503.16974, arXiv.org, revised Mar 2025.
  • Handle: RePEc:arx:papers:2503.16974

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2503.16974
    File Function: Latest version
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Ingrid E. Fisher & Margaret R. Garnsey & Mark E. Hughes, 2016. "Natural Language Processing in Accounting, Auditing and Finance: A Synthesis of the Literature with a Roadmap for Future Research," Intelligent Systems in Accounting, Finance and Management, John Wiley & Sons, Ltd., vol. 23(3), pages 157-214, July.
    2. Dong, Mengming Michael & Stratopoulos, Theophanis C. & Wang, Victor Xiaoqi, 2024. "A scoping review of ChatGPT research in accounting and finance," International Journal of Accounting Information Systems, Elsevier, vol. 55(C).
    3. Yuqi Nie & Yaxuan Kong & Xiaowen Dong & John M. Mulvey & H. Vincent Poor & Qingsong Wen & Stefan Zohren, 2024. "A Survey of Large Language Models for Financial Applications: Progress, Prospects and Challenges," Papers 2406.11903, arXiv.org.
    4. Liu, Pu & Nguyen, Hazel T., 2020. "CEO characteristics and tone at the top inconsistency," Journal of Economics and Business, Elsevier, vol. 108(C).
    5. Huijue Kelly Duan & Hanxin Hu & Yangin (Ben) Yoon & Miklos Vasarhelyi, 2022. "Increasing the utility of performance audit reports: Using textual analytics tools to improve government reporting," Intelligent Systems in Accounting, Finance and Management, John Wiley & Sons, Ltd., vol. 29(4), pages 201-218, October.
    6. Blankespoor, Elizabeth & deHaan, Ed & Marinovic, Iván, 2020. "Disclosure processing costs, investors’ information choice, and equity market outcomes: A review," Journal of Accounting and Economics, Elsevier, vol. 70(2).
    7. Brian J. Bushee & Ian D. Gow & Daniel J. Taylor, 2018. "Linguistic Complexity in Firm Disclosures: Obfuscation or Information?," Journal of Accounting Research, Wiley Blackwell, vol. 56(1), pages 85-121, March.
    8. Frankel, Richard & Jennings, Jared & Lee, Joshua, 2016. "Using unstructured and qualitative disclosures to explain accruals," Journal of Accounting and Economics, Elsevier, vol. 62(2), pages 209-227.
    9. Volkan Muslu & Sunay Mutlu & Suresh Radhakrishnan & Albert Tsang, 2019. "Corporate Social Responsibility Report Narratives and Analyst Forecast Accuracy," Journal of Business Ethics, Springer, vol. 154(4), pages 1119-1142, February.
    10. Bakarich, Kathleen M. & Hossain, Mahmud & Hossain, Mahmud & Weintrop, Joseph, 2019. "Different time, different tone: Company life cycle," Journal of Contemporary Accounting and Economics, Elsevier, vol. 15(1), pages 69-86.
    11. Wolfgang Breuer & Andreas Knetsch & Astrid Juliane Salzmann, 2020. "What Does It Mean When Managers Talk About Trust?," Journal of Business Ethics, Springer, vol. 166(3), pages 473-488, October.
    12. Sui, Cong & Wang, Shuhan & Zheng, Wei, 2024. "Sentiment as a shipping market predictor: Testing market-specific language models," Transportation Research Part E: Logistics and Transportation Review, Elsevier, vol. 189(C).
    13. Song, Piaopeng & Lu, Hanglin & Zhang, Yongjie, 2024. "Unveiling tone manipulation in MD&A: Evidence from ChatGPT experiments," Finance Research Letters, Elsevier, vol. 67(PA).
    14. Eachempati, Prajwal & Srivastava, Praveen Ranjan & Kumar, Ajay & Tan, Kim Hua & Gupta, Shivam, 2021. "Validating the impact of accounting disclosures on stock market: A deep neural network approach," Technological Forecasting and Social Change, Elsevier, vol. 170(C).
    15. Richard Frankel & Jared Jennings & Joshua Lee, 2022. "Disclosure Sentiment: Machine Learning vs. Dictionary Methods," Management Science, INFORMS, vol. 68(7), pages 5514-5532, July.
    16. Acheampong, Albert & Elshandidy, Tamer, 2021. "Does soft information determine credit risk? Text-based evidence from European banks," Journal of International Financial Markets, Institutions and Money, Elsevier, vol. 75(C).
    17. Kemal Kirtac & Guido Germano, 2025. "Large language models in finance : what is financial sentiment?," Papers 2503.03612, arXiv.org, revised Mar 2025.
    18. Kalelkar, Rachana & Xu, Hongkang & Nguyen, Duong & Chen, Zheng, 2024. "Generalist CEOs and the readability of the 10-K report," Advances in accounting, Elsevier, vol. 65(C).
    19. Craja, Patricia & Kim, Alisa & Lessmann, Stefan, 2020. "Deep Learning application for fraud detection in financial statements," IRTG 1792 Discussion Papers 2020-007, Humboldt University of Berlin, International Research Training Group 1792 "High Dimensional Nonstationary Time Series".
    20. John Donovan & Jared Jennings & Kevin Koharki & Joshua Lee, 2021. "Measuring credit risk using qualitative disclosure," Review of Accounting Studies, Springer, vol. 26(2), pages 815-863, June.
