FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models (2401.02982v4)
Abstract: Large language models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. However, their proficiency and reliability in the specialized domain of financial data analysis, particularly in data-driven thinking, remain uncertain. To bridge this gap, we introduce \texttt{FinDABench}, a comprehensive benchmark designed to evaluate the financial data analysis capabilities of LLMs. \texttt{FinDABench} assesses LLMs across three dimensions: 1) \textbf{Foundational Ability}, evaluating the models' ability to perform financial numerical calculations and corporate sentiment risk assessment; 2) \textbf{Reasoning Ability}, measuring the models' ability to quickly comprehend textual information and analyze abnormal financial reports; and 3) \textbf{Technical Skill}, examining the models' use of technical knowledge to address real-world data analysis challenges involving analysis generation and chart visualization from multiple perspectives. We will release \texttt{FinDABench} and the evaluation scripts at \url{https://github.com/cubenlp/BIBench}. \texttt{FinDABench} aims to provide a measure for in-depth analysis of LLM abilities and to foster the advancement of LLMs in the field of financial data analysis.
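Since the abstract describes a three-dimension evaluation but the released scripts are not reproduced here, the following is a minimal sketch of how such a benchmark harness could be structured. All names in it (`Task`, `load_tasks`, `evaluate`, the dimension labels, and the JSONL layout) are hypothetical illustrations under assumed conventions, not the benchmark's actual interface; the released scripts may score tasks quite differently (e.g., with graded rubrics rather than exact match).

```python
# Hypothetical sketch of a FinDABench-style evaluation loop (Python 3.9+).
# Task schema, file layout, and exact-match scoring are illustrative
# assumptions, not the benchmark's released interface.
import json
from dataclasses import dataclass
from typing import Callable

# The three dimensions named in the abstract.
DIMENSIONS = ("foundational", "reasoning", "technical")

@dataclass
class Task:
    dimension: str   # one of DIMENSIONS
    prompt: str      # question shown to the model
    reference: str   # gold answer used for scoring

def load_tasks(path: str) -> list[Task]:
    """Read one JSON object per line:
    {"dimension": ..., "prompt": ..., "reference": ...}."""
    with open(path, encoding="utf-8") as f:
        return [Task(**json.loads(line)) for line in f]

def evaluate(model: Callable[[str], str], tasks: list[Task]) -> dict[str, float]:
    """Return per-dimension exact-match accuracy."""
    hits = {d: 0 for d in DIMENSIONS}
    totals = {d: 0 for d in DIMENSIONS}
    for t in tasks:
        totals[t.dimension] += 1
        if model(t.prompt).strip() == t.reference.strip():
            hits[t.dimension] += 1
    return {d: hits[d] / totals[d] for d in DIMENSIONS if totals[d]}

if __name__ == "__main__":
    # Tiny in-memory demo; a real run would call load_tasks on the dataset
    # and wrap an actual LLM API call in `model`.
    demo = [
        Task("foundational", "What is 2 + 40?", "42"),
        Task("reasoning", "Is a sudden -50% revenue swing a red flag? yes/no", "yes"),
    ]
    echo_model = lambda _prompt: "42"  # stand-in for a real LLM call
    print(evaluate(echo_model, demo))  # {'foundational': 1.0, 'reasoning': 0.0}
```

Reporting accuracy per dimension rather than as a single aggregate mirrors the abstract's framing, where foundational, reasoning, and technical abilities are meant to be analyzed separately.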
Authors: Shu Liu, Shangqing Zhao, Chenghao Jia, Xinlin Zhuang, Zhaoguang Long, Man Lan, Qingquan Wu, Chong Yang, Aimin Zhou, Jie Zhou