
FinBen: A Holistic Financial Benchmark for Large Language Models (2402.12659v2)

Published 20 Feb 2024 in cs.CL, cs.AI, and cs.CE

Abstract: LLMs have transformed NLP and shown promise in various fields, yet their potential in finance is underexplored due to a lack of comprehensive evaluation benchmarks, the rapid development of LLMs, and the complexity of financial tasks. In this paper, we introduce FinBen, the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks, covering seven critical aspects: information extraction (IE), textual analysis, question answering (QA), text generation, risk management, forecasting, and decision-making. FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading. Our evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals several key findings: While LLMs excel in IE and textual analysis, they struggle with advanced reasoning and complex tasks like text generation and forecasting. GPT-4 excels in IE and stock trading, while Gemini is better at text generation and forecasting. Instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks such as QA. FinBen has been used to host the first financial LLMs shared task at the FinNLP-AgentScen workshop during IJCAI-2024, attracting 12 teams. Their novel solutions outperformed GPT-4, showcasing FinBen's potential to drive innovation in financial LLMs. All datasets, results, and codes are released for the research community: https://github.com/The-FinAI/PIXIU.

Comprehensive Evaluation of LLMs in Finance Using the FinBen Benchmark

Introduction to FinBen

The finance industry stands on the cusp of a transformation, courtesy of advancements in Large Language Models (LLMs) that promise to enhance financial analytics, forecasting, and decision-making. Despite notable strides in the application of LLMs across various domains, their potential in finance has remained relatively uncharted, owing to the intricate nature of financial tasks and a paucity of comprehensive evaluation frameworks. To address this gap, the paper introduces FinBen, a pioneering benchmark designed to systematically assess LLMs' proficiency in the financial domain. FinBen's architecture, inspired by the Cattell-Horn-Carroll (CHC) theory of cognitive abilities, organizes a wide array of financial tasks into three spectra of difficulty. This enables a holistic evaluation of LLMs, shedding light on their capabilities and limitations within financial applications.

Benchmark Design and Evaluation Framework

FinBen enriches the landscape of financial benchmarks by offering a robust, open-source evaluation suite tailored to the financial sector's unique requirements. It features 36 datasets spanning 24 financial tasks, bridging crucial gaps observed in existing benchmarks.
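To make this organization concrete, here is a minimal sketch of how such a benchmark might register its tasks, grouped by the three difficulty spectra. The grouping mirrors the structure described in this summary, but the task keys and the `FinTrade` dataset name are illustrative placeholders, not the full FinBen inventory:

```python
# Illustrative task registry for a FinBen-style benchmark.
# Dataset names under each task are examples mentioned in this summary;
# the structure (three spectra) follows the paper's CHC-inspired design.
BENCHMARK = {
    "spectrum_1_foundational": {
        "sentiment_analysis": ["FPB", "FiQA-SA", "TSA"],
        "headline_classification": ["Headlines"],
    },
    "spectrum_2_advanced": {
        "summarization": ["ECTSUM"],
        "stock_movement_forecasting": ["BigData22"],
    },
    "spectrum_3_general": {
        "stock_trading": ["FinTrade"],  # placeholder name for the trading task
    },
}

def list_datasets(benchmark):
    """Flatten the registry into a single list of dataset names."""
    return [d for tasks in benchmark.values() for ds in tasks.values() for d in ds]

print(len(list_datasets(BENCHMARK)))  # 7 in this toy registry
```

A registry of this shape makes it easy to run a model over one spectrum at a time, or to report results aggregated per task category.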

Spectrum I: Foundational Tasks

  • Quantification, Extraction, and Numerical Understanding tasks form the foundational spectrum, aiming to gauge basic cognitive skills such as inductive reasoning and associative memory.
  • A variety of datasets, including FPB, FiQA-SA, and TSA, facilitate the evaluation of sentiment analysis, news headline classification, and more.
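Classification tasks at this level are typically scored with accuracy and F1-style metrics. The sketch below is self-contained; the three-way label set and the example data are illustrative, not drawn from the FinBen datasets themselves:

```python
# Scoring a sentiment-analysis task with accuracy and macro-F1.
LABELS = ["negative", "neutral", "positive"]

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold = ["positive", "neutral", "negative", "positive"]
pred = ["positive", "neutral", "positive", "positive"]
print(accuracy(gold, pred))  # 0.75
```

Macro-averaging weights each class equally, which matters for financial sentiment data where the neutral class often dominates.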

Spectrum II: Advanced Cognitive Engagement

  • Generation and Forecasting tasks, demanding higher cognitive skills like crystallized and fluid intelligence, constitute the second tier.
  • Datasets like ECTSUM and BigData22 challenge LLMs to produce coherent text outputs and predict future market behaviors, respectively.
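Stock movement forecasting is commonly framed as binary up/down classification, where the Matthews correlation coefficient (MCC) is a standard complement to accuracy on imbalanced data. A minimal sketch (the labels below are illustrative, not from BigData22):

```python
import math

def mcc(gold, pred):
    """Matthews correlation coefficient for binary labels (1 = rise, 0 = fall)."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 0, 1, 1]
print(round(mcc(gold, pred), 3))  # 0.333
```

Unlike accuracy, MCC stays near zero for a model that always predicts the majority direction, which is a common failure mode in market prediction.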

Spectrum III: General Intelligence

  • At the apex, the stock trading task represents the ultimate test of an LLM's general intelligence, embodying strategic decision-making and real-world application capabilities.
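Trading performance is conventionally judged by risk-adjusted metrics rather than classification scores. Below is a minimal sketch of two standard measures, the annualized Sharpe ratio and maximum drawdown, computed from a hypothetical series of per-period returns; the 252-period annualization factor assumes daily data:

```python
import math

def sharpe_ratio(returns, risk_free=0.0, periods=252):
    """Annualized Sharpe ratio from per-period returns (assumes daily data)."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    std = math.sqrt(var)
    return (mean / std) * math.sqrt(periods) if std else 0.0

def max_drawdown(returns):
    """Largest peak-to-trough decline of the cumulative equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst

rets = [0.01, -0.02, 0.015, -0.005, 0.02]  # hypothetical daily returns
print(sharpe_ratio(rets), max_drawdown(rets))
```

A strategy is usually required to beat a buy-and-hold baseline on both metrics: a high Sharpe ratio with a deep drawdown still implies unacceptable tail risk.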

Key Findings and Insights

The evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and Gemini, via the FinBen benchmark offers intriguing insights:

  • GPT-4 excels in foundational tasks such as quantification and numerical understanding, though it shows room for improvement on more complex extraction tasks.
  • Gemini demonstrates remarkable ability in generation and forecasting tasks, hinting at advanced cognitive engagement capabilities.
  • Instruction tuning yields significant performance boosts on simpler tasks, but offers limited benefit for complex tasks such as question answering.

These findings underscore the nuanced capabilities and potential improvement areas for LLMs within the financial domain, highlighting the imperative for continuous development and refinement.

Implications and Future Directions

The creation and deployment of the FinBen benchmark represent a significant stride towards understanding and harnessing the capabilities of LLMs in finance. By providing a comprehensive evaluation tool, FinBen facilitates the identification of strengths, weaknesses, and development opportunities for LLMs in financial applications.

Looking ahead, the continuous expansion of FinBen is envisioned to include additional languages and a wider array of financial tasks. This endeavor aims not only to extend the benchmark's utility and applicability but also to stimulate further advances in the development of financial LLMs. The journey toward fully realizing LLMs' potential in finance is complex and challenging, yet FinBen lays a foundational stone, guiding the path toward more intelligent, efficient, and robust financial analytical tools and methodologies.

Concluding Remarks

In a rapidly evolving landscape where finance intersects with cutting-edge AI technologies, benchmarks like FinBen play a pivotal role in advancing our understanding and capabilities. This comprehensive framework not only champions the assessment of LLMs in financial contexts but also paves the way for future innovations, fostering a symbiotic growth between finance and AI. As we continue to explore and expand the frontiers of AI in finance, benchmarks such as FinBen will remain indispensable in our quest to unlock the full potential of LLMs, driving towards more informed, efficient, and innovative financial ecosystems.

Authors (34)
  1. Qianqian Xie
  2. Weiguang Han
  3. Zhengyu Chen
  4. Ruoyu Xiang
  5. Xiao Zhang
  6. Yueru He
  7. Mengxi Xiao
  8. Dong Li
  9. Yongfu Dai
  10. Duanyu Feng
  11. Yijing Xu
  12. Haoqiang Kang
  13. Ziyan Kuang
  14. Chenhan Yuan
  15. Kailai Yang
  16. Zheheng Luo
  17. Tianlin Zhang
  18. Zhiwei Liu
  19. Guojun Xiong
  20. Zhiyang Deng
Citations (14)