- The paper introduces FinanceBench as a novel benchmark that evaluates LLMs on open-book financial QA tasks.
- It employs a dataset of 10,231 questions of three types (domain-relevant, novel human-generated, and metrics-generated) to test retrieval and numerical reasoning skills.
- Experiments reveal significant LLM limitations, underscoring the need for improved retrieval integrations and advanced numerical reasoning capabilities.
FinanceBench: A Benchmarking Breakthrough in Financial Question Answering
The paper "FinanceBench: A New Benchmark for Financial Question Answering" introduces FinanceBench, a novel benchmark designed to evaluate the capabilities of LLMs in handling open book financial question-answering tasks. The financial domain poses significant challenges for LLMs due to its requirement for domain-specific knowledge, up-to-date financial information, and complex numerical reasoning. FinanceBench is introduced as a means to systematically measure the effectiveness of LLMs in performing these specialized tasks, thus filling a significant evaluation gap in the financial sector.
Dataset Composition and Structure
FinanceBench contains 10,231 questions about publicly traded companies, designed to be ecologically valid, that is, representative of the questions financial analysts actually ask. The dataset includes three main types of questions:
- Domain-Relevant Questions: Generic questions applicable across financial analyses, probing typical company assessments such as dividend history and consistency of operating margins.
- Novel Generated Questions: These are crafted by human annotators to be relevant to a specific company, financial document, and industry, aiming to mimic realistic queries from financial analysts.
- Metrics-Generated Questions: Formulated from financial data extracted from key corporate documents, these questions require models to compute and reason over financial metrics, demanding sophisticated numerical reasoning (see the sketch after this list).
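To make the metrics-generated category concrete, here is a minimal sketch of what such an item and its required arithmetic might look like. The record shape, field names, and all figures are illustrative assumptions, not values taken from FinanceBench itself.

```python
from dataclasses import dataclass

@dataclass
class FinanceBenchItem:
    """Illustrative record shape; actual FinanceBench field names may differ."""
    company: str
    doc_name: str       # source filing, e.g. a 10-K
    question_type: str  # "domain-relevant" | "novel-generated" | "metrics-generated"
    question: str
    answer: str

# A metrics-generated question asks the model to derive a ratio from reported
# figures. The numbers below are made up for illustration.
revenue = 4_200.0          # total revenue, $M
operating_income = 630.0   # operating income, $M

operating_margin = operating_income / revenue
print(f"Operating margin: {operating_margin:.1%}")  # -> 15.0%

item = FinanceBenchItem(
    company="ExampleCo",
    doc_name="EXAMPLECO_2022_10K",
    question_type="metrics-generated",
    question="What was ExampleCo's FY2022 operating margin?",
    answer=f"{operating_margin:.1%}",
)
```

The arithmetic itself is trivial; the difficulty for an LLM lies in locating the correct figures in a long filing and combining them correctly.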
Key Findings and Experimental Results
The authors tested 16 configurations of four state-of-the-art LLM systems, including configurations with vector store retrieval and extended context windows. Models such as GPT-4-Turbo, Llama2, and Claude2 were evaluated using a 150-case sample from FinanceBench. Notably, even when tested with retrieval-augmented setups, significant performance limitations were observed:
- Retrieval Limitations: GPT-4-Turbo paired with a vector-store retrieval system incorrectly answered or refused to answer 81% of queries, typically when the retriever failed to surface contextually relevant passages. This highlights the difficulty of integrating retrieval-augmented capabilities into LLMs (a minimal pipeline sketch follows this list).
- Long Context Windows: Feeding source documents directly into extended context windows improved results, with GPT-4-Turbo reaching up to 79% correctness, but this approach adds latency and becomes impractical for lengthy financial documents.
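For readers unfamiliar with the retrieval-augmented configurations discussed above, the following is a minimal sketch of such a pipeline: chunk a filing, embed the chunks, retrieve the top-k passages for a question, and build a prompt from them. This is not the authors' evaluation harness; the chunking strategy and the bag-of-words similarity stand-in (used so the example stays dependency-free) are assumptions, and real systems use learned embeddings with a vector store.

```python
# Minimal retrieval-augmented QA sketch (not the paper's harness).
import math
from collections import Counter

def chunk(text: str, size: int = 800) -> list[str]:
    """Split a filing into fixed-size character chunks (chunk size is an assumption)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> Counter:
    """Toy embedding: a term-frequency vector. Real systems use neural embeddings."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble the retrieved passages and question into an LLM prompt."""
    context = "\n---\n".join(passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

# Usage: prompt = build_prompt(question, retrieve(question, chunk(filing_text)))
# The prompt would then be sent to the LLM under evaluation.
```

The 81% failure rate reported above suggests the weak link in such pipelines is often the retrieval step: if the relevant passage is not among the top-k chunks, the model must either guess or abstain.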
The overall results highlight a crucial insight: existing LLMs, despite their potential, struggle with financial QA tasks due to weaknesses in document retrieval and numerical reasoning.
Implications and Future Directions
The implications of this research are twofold. Practically, the paper underscores the need for robust system enhancements before LLMs are deployed in high-stakes financial environments. Theoretically, it motivates continued research into economically viable strategies for long-context processing and into improved domain-specific pre-training methodologies.
Future work could focus on developing models with stronger numerical reasoning and retrieval capabilities, alongside methods for incorporating real-time, domain-specific knowledge updates. Building specialized, high-quality datasets like FinanceBench is likewise essential for improving how well LLMs generalize across domains.
Conclusion
FinanceBench presents a rigorous benchmarking suite, revealing the current limitations of LLMs in answering financial questions effectively. This work provides a foundation for further research and development towards smarter, finance-aware AI solutions. By addressing the dataset's limitations and continuing to evolve LLM capabilities, researchers can advance the application of AI technologies in finance, thus aiding in the automation and augmentation of financial analysis.