IDEAS home Printed from https://ideas.repec.org/a/eee/riibaf/v79y2025ics0275531925003356.html

Extraction of characteristic information from financial super-long texts and prediction of corporate violations

Author

Listed:
  • Lu, Hanglin
  • Zhang, Yongjie
  • Xu, Jinchang

Abstract

Annual report texts contain clues about corporate misconduct. Predicting misconduct through AI-based analysis of these texts can help investors better avoid risks. However, due to the current limitations of AI language models, embedding the semantic vectors of long text paragraphs from annual reports faces a trade-off between "globality" and "accuracy." By using machine learning models (DecisionTree, RandomForest, LightGBM), our study compares the effectiveness of annual report text information at four segmentation granularities in predicting corporate misconduct. We find that, with single-granularity encoding, the Bert-Sentence-Stack semantic extraction method provides more effective annual report text encodings for predicting misconduct, achieving a best AUC of 0.7250. Furthermore, by implementing multi-granularity feature fusion, we achieve a winning combination of "globality" and "accuracy" with a maximum AUC of 0.7701. Compared to using financial features alone, multi-granularity text feature fusion increases the prediction AUC for corporate misconduct by about 12%, indicating that multi-granularity text semantic features provide valuable incremental information. This study offers new insights and solutions for the integration and utilization of long financial texts and information mining.
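
The abstract does not spell out how the fusion or the evaluation is implemented, so the following is only a minimal sketch of one plausible reading: segment vectors at each granularity are mean-pooled, the pooled vectors are concatenated into one feature row, and predictions are scored with AUC. The function names (`pool`, `fuse`, `auc`), the mean pooling, the concatenation, and the toy two-dimensional vectors are all assumptions made for illustration and are not taken from the paper.

```python
from statistics import fmean


def pool(segment_vectors):
    """Mean-pool a list of equal-length segment vectors into one vector.

    Assumption: mean pooling stands in for whatever aggregation the paper uses.
    """
    return [fmean(column) for column in zip(*segment_vectors)]


def fuse(granularity_vectors):
    """Concatenate one pooled vector per granularity into a single feature row.

    Assumption: plain concatenation is used here as the fusion step.
    """
    fused = []
    for vec in granularity_vectors:
        fused.extend(vec)
    return fused


def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy "embeddings" for one report at two hypothetical granularities.
sentence_level = [[0.0, 2.0], [2.0, 4.0]]  # two sentence vectors
paragraph_level = [[1.0, 1.0]]             # one paragraph vector
features = fuse([pool(sentence_level), pool(paragraph_level)])

# Toy labels (1 = misconduct) and model scores, only to exercise auc().
toy_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

In a real pipeline the fused rows would be passed to a classifier such as the tree-based models named in the abstract, and the AUC would be computed on held-out firm-years rather than on toy scores.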

Suggested Citation

  • Lu, Hanglin & Zhang, Yongjie & Xu, Jinchang, 2025. "Extraction of characteristic information from financial super-long texts and prediction of corporate violations," Research in International Business and Finance, Elsevier, vol. 79(C).
  • Handle: RePEc:eee:riibaf:v:79:y:2025:i:c:s0275531925003356
    DOI: 10.1016/j.ribaf.2025.103079

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0275531925003356
    Download Restriction: Full text for ScienceDirect subscribers only

    File URL: https://libkey.io/10.1016/j.ribaf.2025.103079?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Zhang, Zejun & Wang, Zhao & Cai, Lixin, 2025. "Predicting financial fraud in Chinese listed companies: An enterprise portrait and machine learning approach," Pacific-Basin Finance Journal, Elsevier, vol. 90(C).
    2. Meng Luo & Chaoqun Ma & Dongqing Chen & Xianhua Mi, 2026. "Financial Statement Fraud Detection by Integrating Supervisory Punishment Reports Into Machine Learning Methods: Evidence From China," Accounting and Finance, Accounting and Finance Association of Australia and New Zealand, vol. 66(1), pages 165-177, March.
    3. Huang, Shirley Hsueh-Li & Hu, Guo-Hsin & Hsu, Ming-Fu, 2025. "Identifying contextual content-based risk drivers for advanced risk management strategies," Research in International Business and Finance, Elsevier, vol. 73(PB).
    4. Wang, Wenjiao & Sun, Ziyuan & Wang, Lan, 2025. "Does ESG rating divergence exacerbate management tone manipulation? − Empirical evidence based on MD&A text," Journal of Business Research, Elsevier, vol. 197(C).
    5. Qi, Yu & Su, Hang, 2025. "Can artificial intelligence mitigate corporate fraud? Exploring the influence of institutional cross-holdings and financial misallocation," Pacific-Basin Finance Journal, Elsevier, vol. 92(C).
    6. Ziqiao Wang & Wei Zhang & Feng He & Xin Huang, 2026. "Turning the Wheels of Justice: How Judicial Reforms Deter Corporate Misconduct in China," Journal of Business Ethics, Springer, vol. 204(2), pages 243-272, March.
    7. Dong, Jinting & Liu, Bin & Chen, Yinying, 2024. "Top managers' environmental experience and corporate environmental violations: Evidence from China," International Review of Financial Analysis, Elsevier, vol. 95(PB).
    8. Xiaoqian Zhu & Huidong Wu & Yanpeng Chang & Jianping Li, 2025. "Accounting fraud detection through textual risk disclosures in annual reports: From the perspective of SEC guidelines," Accounting and Finance, Accounting and Finance Association of Australia and New Zealand, vol. 65(2), pages 1837-1862, June.
    9. Wang, Sumingyue & Wang, Xinlu & Xu, Liang, 2023. "Debt maturity structure and the quality of risk disclosures," Journal of Corporate Finance, Elsevier, vol. 83(C).
    10. Unsal, Omer & Hippler, William J., 2024. "Corporate misconduct and innovation: Evidence from the pharmaceutical industry," Research in International Business and Finance, Elsevier, vol. 71(C).
    11. Lutfa Tilat Ferdous & Tarek Rana & Richard Yeboah, 2025. "Decoding the impact of firm‐level ESG performance on financial disclosure quality," Business Strategy and the Environment, Wiley Blackwell, vol. 34(1), pages 162-186, January.
    12. Hasan, Mostafa Monzur & Bhuiyan, Md Borhan Uddin & Taylor, Grantley, 2025. "Reprint of: Corporate culture and carbon emission performance," The British Accounting Review, Elsevier, vol. 57(1).
    13. Luo, Yating & Zhang, Naiqian & Tong, Tong & Jia, Xiaofei, 2025. "Unveiling the four-pillar framework: Machine learning evidence on personality, firm, governance, and financial origins of managerial overconfidence in China," Pacific-Basin Finance Journal, Elsevier, vol. 92(C).
    14. Ziwei Wang & Chunfeng Wang & Zhenming Fang, 2024. "Learning from Failures of Co-owned Firms: Common Ownership and Information Disclosure Fraud," Journal of Business Ethics, Springer, vol. 195(1), pages 95-119, November.
    15. Xing Chen & Fenghua Wen & Jinli Xiao & Gary Gang Tian, 2025. "Weathering the Risk: How Climate Uncertainty Fuels Corporate Fraud," Journal of Business Ethics, Springer, vol. 201(2), pages 519-547, October.
    16. Wang, Ziwei & Yao, Shouyu & Sensoy, Ahmet & Goodell, John W. & Cheng, Feiyang, 2022. "Learning from failures: Director interlocks and corporate misconduct," International Review of Financial Analysis, Elsevier, vol. 84(C).
    17. Zhang, Yi & Liu, Tianxiang & Li, Weiping, 2024. "Corporate fraud detection based on linguistic readability vector: Application to financial companies in China," International Review of Financial Analysis, Elsevier, vol. 95(PB).
    18. Qifeng Zhao & Javier Cifuentes‐Faura & Long Wang & Qianfeng Luo, 2026. "Manipulators or Innovators? Corporate Misconduct and Green Innovation," Business Strategy and the Environment, Wiley Blackwell, vol. 35(3), pages 3898-3922, March.
    19. Chen, Tsung-Kang & Tseng, Yijie & Hao, Yun, 2025. "Readability of asset securitization reporting and bank holding company’s credit risk," The Quarterly Review of Economics and Finance, Elsevier, vol. 103(C).
    20. Li, Guowen & Wang, Shuai & Feng, Yuyao, 2024. "Making differences work: Financial fraud detection based on multi-subject perceptions," Emerging Markets Review, Elsevier, vol. 60(C).

