FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models (2401.02982v4)

Published 1 Jan 2024 in cs.CL and cs.AI

Abstract: LLMs have demonstrated impressive capabilities across a wide range of tasks. However, their proficiency and reliability in the specialized domain of financial data analysis, particularly in data-driven reasoning, remain uncertain. To bridge this gap, we introduce \texttt{FinDABench}, a comprehensive benchmark designed to evaluate the financial data analysis capabilities of LLMs. \texttt{FinDABench} assesses LLMs across three dimensions: 1) \textbf{Foundational Ability}, evaluating the models' ability to perform financial numerical calculations and corporate sentiment risk assessment; 2) \textbf{Reasoning Ability}, determining the models' ability to quickly comprehend textual information and analyze abnormal financial reports; and 3) \textbf{Technical Skill}, examining the models' use of technical knowledge to address real-world data analysis challenges, including analysis generation and chart visualization from multiple perspectives. We will release \texttt{FinDABench} and the evaluation scripts at \url{https://github.com/cubenlp/BIBench}. \texttt{FinDABench} aims to provide a measure for in-depth analysis of LLM abilities and to foster the advancement of LLMs in the field of financial data analysis.
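
To make the evaluation setup concrete, below is a minimal, hypothetical sketch of an exact-match scoring loop for a Foundational Ability task such as financial numerical calculation. The example schema, the `query_model` stub, and the scoring rule are illustrative assumptions, not the actual FinDABench evaluation scripts; the repository linked above defines the real data formats and metrics.

```python
# Hypothetical scoring sketch for a FinDABench-style numerical task.
# The data schema, query_model stub, and exact-match metric are all
# assumptions; the actual evaluation scripts live in the linked repo.

EXAMPLES = [
    {"question": "Revenue was 120M CNY and costs were 90M CNY. "
                 "What is the net profit in millions?",
     "answer": "30"},
    {"question": "A stock rises from 50 to 55. What is the percentage gain?",
     "answer": "10%"},
]

def query_model(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g., an HTTP API request)."""
    return "30"  # dummy prediction so the sketch runs end to end

def exact_match_accuracy(examples: list[dict]) -> float:
    """Fraction of examples whose prediction matches the gold answer
    after whitespace and case normalization."""
    if not examples:
        return 0.0
    correct = sum(
        query_model(ex["question"]).strip().lower()
        == str(ex["answer"]).strip().lower()
        for ex in examples
    )
    return correct / len(examples)

if __name__ == "__main__":
    print(f"Exact-match accuracy: {exact_match_accuracy(EXAMPLES):.2%}")
```

Exact match is only the simplest plausible metric here; open-ended tasks such as analysis generation and chart visualization would require generation-quality scoring (for example, human or model-based judging) rather than string comparison.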

Authors (10)
  1. Shu Liu
  2. Shangqing Zhao
  3. Chenghao Jia
  4. Xinlin Zhuang
  5. Zhaoguang Long
  6. Man Lan
  7. Qingquan Wu
  8. Chong Yang
  9. Aimin Zhou
  10. Jie Zhou