CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models (2407.02301v1)

Published 2 Jul 2024 in cs.CL

Abstract: LLMs have achieved remarkable performance on various NLP tasks, yet their potential on more challenging, domain-specific tasks, such as finance, has not been fully explored. In this paper, we present CFinBench, a meticulously crafted evaluation benchmark, the most comprehensive to date, for assessing the financial knowledge of LLMs in a Chinese context. To better align with the career trajectory of Chinese financial practitioners, we build a systematic evaluation around 4 first-level categories: (1) Financial Subject: whether LLMs can memorize the necessary basic knowledge of financial subjects, such as economics, statistics and auditing. (2) Financial Qualification: whether LLMs can obtain the required financial certifications, such as certified public accountant, securities qualification and banking qualification. (3) Financial Practice: whether LLMs can perform practical financial jobs, such as tax consultant, junior accountant and securities analyst. (4) Financial Law: whether LLMs can meet the requirements of financial laws and regulations, such as tax law, insurance law and economic law. CFinBench comprises 99,100 questions spanning 43 second-level categories with 3 question types: single-choice, multiple-choice and judgment. We conduct extensive experiments on 50 representative LLMs of various model sizes on CFinBench. The results show that GPT-4 and some Chinese-oriented models lead the benchmark, with the highest average accuracy being 60.16%, highlighting the challenge presented by CFinBench. The dataset and evaluation code are available at https://cfinbench.github.io/.
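As a rough illustration (not the authors' released evaluation code), the three question types described above, single-choice, multiple-choice and judgment, could be scored as follows; the field names (`qtype`, `prediction`, `answer`) are hypothetical, not CFinBench's official schema:

```python
# Hypothetical sketch of scoring CFinBench-style questions.
# Field names and record layout are illustrative assumptions.

def score_question(qtype: str, prediction, answer) -> bool:
    """Return True if the model's prediction matches the gold answer."""
    if qtype == "single-choice":
        # One correct option letter, e.g. "B"
        return prediction == answer
    if qtype == "multiple-choice":
        # All correct options must be selected; order does not matter
        return set(prediction) == set(answer)
    if qtype == "judgment":
        # True/false judgment on a statement
        return bool(prediction) == bool(answer)
    raise ValueError(f"unknown question type: {qtype}")

def accuracy(records) -> float:
    """Fraction of records scored as correct."""
    correct = sum(
        score_question(r["qtype"], r["prediction"], r["answer"]) for r in records
    )
    return correct / len(records)

sample = [
    {"qtype": "single-choice", "prediction": "B", "answer": "B"},
    {"qtype": "multiple-choice", "prediction": ["A", "C"], "answer": ["C", "A"]},
    {"qtype": "judgment", "prediction": True, "answer": False},
]
print(f"accuracy = {accuracy(sample):.4f}")  # 2 of 3 correct
```

Under this scheme, multiple-choice is scored strictly (an exact set match rather than partial credit), which is one plausible reason multiple-choice questions would pull average accuracy down.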

Authors (12)
  1. Ying Nie (15 papers)
  2. Binwei Yan (1 paper)
  3. Tianyu Guo (33 papers)
  4. Hao Liu (497 papers)
  5. Haoyu Wang (309 papers)
  6. Wei He (188 papers)
  7. Binfan Zheng (5 papers)
  8. Weihao Wang (11 papers)
  9. Qiang Li (449 papers)
  10. Weijian Sun (4 papers)
  11. Yunhe Wang (145 papers)
  12. Dacheng Tao (826 papers)