FinBen Benchmark: Evaluating LLMs for Finance
- FinBen is an open-source benchmark that defines a comprehensive framework for evaluating LLMs across 24 financial NLP tasks and decision-making scenarios.
- It integrates 36 standardized datasets and applies a cognitive-hierarchy model to assess tasks from basic information extraction to sophisticated agent-based trading.
- Empirical evaluations reveal that while LLMs excel at foundational tasks, they face challenges in complex reasoning and forecasting, guiding future financial AI research.
FinBen is the first open-source, holistic benchmark designed to assess LLMs across the full range of tasks in financial natural language processing and decision-making. It enables reproducible evaluation of LLMs and agents on 36 diverse datasets grouped into 24 tasks, spanning information extraction, sentiment and argument mining, complex reasoning, text generation, forecasting, risk management, and decision-making, including agent-based simulations for stock trading. Its design, evaluation methodology, released datasets, and key findings are its central contributions to the empirical and methodological study of financial AI (Xie et al., 2024).
1. Scope and Structural Organization
FinBen is organized according to two main principles: broad coverage of financial use cases and a cognitive hierarchy inspired by the Cattell–Horn–Carroll (CHC) model. The benchmark decomposes financial intelligence into seven core aspects:
- Information Extraction (IE): Entity and relation extraction.
- Textual Analysis: Sentiment, argument mining, and causal inference.
- Question Answering (QA): Both numeric and conversational.
- Text Generation: Extractive and abstractive summarization.
- Risk Management: Credit, fraud, and distress identification; insurance claim analysis.
- Forecasting: Stock-movement, credit scoring, fraud, and financial distress prediction.
- Decision-making: Simulated stock-trading agent evaluation.
Within this hierarchy, tasks fall into three tiers: foundational (quantification, extraction, numerical understanding), advanced cognitive (generation, forecasting), and general intelligence (decision-making). The coverage includes 36 public datasets, each mapped to targeted evaluation tasks:
| Aspect | Task Types | Example Datasets (test size) |
|---|---|---|
| Information Extraction | NER, Relation Extraction, Causal Span/Label | NER (980), FiNER-ORD (1080), FinRED (1068) |
| Textual Analysis | Sentiment, argument, ESG, classification | FPB (970), Headlines (2283), FinArg-ACC (969) |
| Numerical Understanding | Numeric/conversational QA, numeric-span, analogies | FinQA (1147), ConvFinQA (1490), FNXL (318) |
| Text Generation | Earnings/news summarization | ECTSum (495), EDTSum (2000) |
| Forecasting | Stock-movement, credit/fraud/distress prediction | BigData22 (1470), German (1000), ccf (2278) |
| Risk Management | Insurance claim analysis | ProtoSeguro (2381), travelinsurance (3800) |
| Decision-making | Stock trading agent simulation | FinTrade (3384 trading days, 7 tickers) |
Tasks are distributed to emphasize both breadth (24 task types) and depth of reasoning, supporting analysis of LLM capabilities at various levels of financial cognition (Xie et al., 2024).
2. Datasets and Task Definitions
FinBen unifies and standardizes 36 publicly available datasets, each aligned with a clearly defined evaluation protocol. Task categories include:
- Classification: Sentiment, headline classification, argument mining, multi-class/ESG (e.g., FPB, MA, MLESG).
- Extraction: NER, relation and causal extraction (e.g., FinRED, CD).
- Numerical QA: Multi-step and conversational arithmetic (e.g., FinQA, TATQA, ConvFinQA, FSRL).
- Summarization: ROUGE/BERTScore/BARTScore on earnings calls and news (e.g., ECTSum, EDTSum).
- Forecasting: Binary/multi-class prediction for stock movement, fraud, risk (e.g., ACL18, polish).
- Decision-making: Portfolio management via discrete-time agent simulation (FinTrade).
Novel datasets are introduced in three areas: finance-specific summarization (ECTSum/EDTSum), multi-step QA (ConvFinQA), and agent-based trading (FinTrade). All data, problem splits, and prompt templates are released under open-source licenses, enabling reproducibility and extensibility (Xie et al., 2024).
3. Evaluation Methodologies and Metrics
Each task specifies a protocol grounded in NLP and financial ML conventions, structured as follows:
- Classification/Quantification: Zero-/few-shot prompted classification; metrics: F1, Accuracy.
- NER/Extraction: Sequence labeling evaluated via F1, Entity F1.
- Numerical QA: Direct answer generation; Exact-Match Accuracy (EMAcc) and F1.
- Summarization: Prompt-based; scored via ROUGE-N, BERTScore, and BARTScore.
- Forecasting: Prompted binary/multiclass; Accuracy and Matthews Correlation Coefficient (MCC).
- Stock Trading Agent: Portfolio return simulation; metrics: Cumulative Return (CR), Sharpe Ratio (SR), Daily Volatility (DV), Annualized Volatility (AV), Maximum Drawdown (MD).
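The trading metrics in the last bullet can be illustrated with a small sketch. It assumes daily simple returns, 252 trading days per year, and a zero risk-free rate by default. The function name `trading_metrics` and these conventions are illustrative choices, not FinBen's released implementation, whose exact definitions (for instance, whether CR sums log returns or compounds simple returns) may differ.

```python
import math


def trading_metrics(daily_returns, periods_per_year=252, risk_free_rate=0.0):
    """Summarize a series of daily simple returns (0.01 means +1%).

    Assumed conventions (not taken from the FinBen code base):
    compounded cumulative return, sample standard deviation,
    and annualization by sqrt(periods_per_year).
    """
    n = len(daily_returns)
    if n < 2:
        raise ValueError("need at least two daily returns")

    # Equity curve starting at 1.0, compounded day by day.
    equity = [1.0]
    for r in daily_returns:
        equity.append(equity[-1] * (1.0 + r))
    cumulative_return = equity[-1] - 1.0

    # Daily volatility: sample standard deviation of the daily returns.
    mean = sum(daily_returns) / n
    variance = sum((r - mean) ** 2 for r in daily_returns) / (n - 1)
    daily_vol = math.sqrt(variance)
    annual_vol = daily_vol * math.sqrt(periods_per_year)

    # Annualized Sharpe ratio; undefined when volatility is zero.
    daily_rf = risk_free_rate / periods_per_year
    if daily_vol > 0:
        sharpe = (mean - daily_rf) / daily_vol * math.sqrt(periods_per_year)
    else:
        sharpe = float("nan")

    # Maximum drawdown: largest peak-to-trough decline of the equity curve.
    peak = equity[0]
    max_drawdown = 0.0
    for value in equity:
        peak = max(peak, value)
        max_drawdown = max(max_drawdown, (peak - value) / peak)

    return {
        "CR": cumulative_return,
        "SR": sharpe,
        "DV": daily_vol,
        "AV": annual_vol,
        "MD": max_drawdown,
    }


metrics = trading_metrics([0.10, -0.50, 0.20])
```

Here the equity curve is 1.0, 1.1, 0.55, 0.66, so the drawdown is measured from the 1.1 peak to the 0.55 trough.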
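Key formulas (binary case, with $TP$, $FP$, $TN$, $FN$ the confusion-matrix counts):
- Precision: $P = \frac{TP}{TP + FP}$
- Recall: $R = \frac{TP}{TP + FN}$
- F1: $F_1 = \frac{2PR}{P + R}$
- Exact-Match: $\mathrm{EMAcc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$, where $\hat{y}_i$ is the predicted answer and $y_i$ the reference answer
- MCC: $\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
- CR, SR, and the other trading metrics follow standard financial evaluation definitions (Xie et al., 2024).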
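As a worked illustration of the classification and QA metrics, the following sketch computes precision, recall, F1, MCC, and exact-match accuracy from scratch for the binary case. The helper names and the whitespace/case normalization used for exact match are assumptions made for this example; FinBen's own scoring scripts may normalize answers differently.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1), using 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def exact_match_accuracy(golds, preds):
    """Share of predictions equal to the reference after light normalization
    (assumed here: lowercasing and collapsing whitespace)."""
    def normalize(text):
        return " ".join(str(text).lower().split())
    hits = sum(normalize(g) == normalize(p) for g, p in zip(golds, preds))
    return hits / len(golds)


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)   # 3, 1, 1, 3
p, r, f1 = precision_recall_f1(tp, fp, fn)          # 0.75, 0.75, 0.75
score = mcc(tp, fp, fn, tn)                         # 0.5
em = exact_match_accuracy(["12.5", "Yes"], [" 12.5 ", "no"])  # 0.5
```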
For agent-based trading, a “prompt-as-policy” paradigm is employed: at each trading step, the LLM receives multi-timescale memory summaries (short, mid, long, reflection windows) and is prompted for buy/sell/hold actions with justification. Lightweight retrieval-augmented generation (RAG) is used for long-context tasks (e.g., ConvFinQA, trading) via on-the-fly retrieval of tables or previous QAs (Xie et al., 2024).
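The prompt-as-policy loop described above might look roughly like the sketch below. Everything in it is hypothetical: the prompt wording, the names `build_trading_prompt`, `parse_action`, `run_episode`, and `toy_llm`, and the position rule (buy means long, sell means short, hold keeps the previous position) are assumptions for illustration and are not taken from the FinBen or FinMem code. A real run would replace `toy_llm` with a call to an actual model.

```python
import re

MEMORY_WINDOWS = ("short", "mid", "long", "reflection")


def build_trading_prompt(ticker, memory):
    """Render the multi-timescale memory summaries into one prompt string."""
    lines = [f"Ticker: {ticker}"]
    for window in MEMORY_WINDOWS:
        lines.append(f"[{window}] {memory.get(window, 'n/a')}")
    lines.append("Answer with one of buy, sell, hold, then a one-sentence justification.")
    return "\n".join(lines)


def parse_action(reply):
    """Extract the first buy/sell/hold keyword; default to hold if none is found."""
    match = re.search(r"\b(buy|sell|hold)\b", reply.lower())
    return match.group(1) if match else "hold"


def run_episode(ticker, prices, memories, llm):
    """Step through the price series, querying the LLM once per trading day.

    Assumed position rule: buy -> +1 (long), sell -> -1 (short),
    hold -> keep the previous position.
    """
    position = 0
    actions, daily_returns = [], []
    for t in range(len(prices) - 1):
        reply = llm(build_trading_prompt(ticker, memories[t]))
        action = parse_action(reply)
        if action == "buy":
            position = 1
        elif action == "sell":
            position = -1
        actions.append(action)
        daily_returns.append(position * (prices[t + 1] / prices[t] - 1.0))
    return actions, daily_returns


def toy_llm(prompt):
    """Stand-in for a real model: reacts only to the word 'rising'."""
    if "rising" in prompt:
        return "BUY: short-term momentum is positive."
    return "SELL: short-term momentum is negative."


actions, returns = run_episode(
    "XYZ",
    prices=[100.0, 110.0, 99.0],
    memories=[{"short": "price rising"}, {"short": "price falling"}],
    llm=toy_llm,
)
```

The resulting `returns` list is the kind of daily return series from which metrics such as CR, SR, and MD are then computed.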
4. Experimental Findings and Benchmarked Models
FinBen has been used to systematically evaluate 15 contemporary LLMs, including GPT-4, ChatGPT, Gemini, Baichuan2, LLaMA2 (7B/70B), ChatGLM3, InternLM, Falcon, Mixtral, Code LLaMA, FinGPT, FinMA, DISC-FinLLM, and CFGPT.
Key empirical results include:
- Foundational tasks: GPT-4 and ChatGPT lead in quantification (average F1 ≈ 0.78 on FPB; 0.86 on Headlines). Instruction-tuned LLMs (FinMA-7B) approach GPT-4 on simple tasks but not complex reasoning.
- Extraction: All models fail (F1 ≈ 0) on highly compositional extraction (FinRED, CD, FNXL, FSRL).
- Numerical QA: GPT-4 attains EM ≈ 63% (FinQA), 76% (ConvFinQA); most models approach random.
- Summarization: Gemini achieves the top ROUGE-1 (≈0.39) for abstractive news summarization; extractive performance is poor across LLMs.
- Forecasting: Gemini leads (MCC up to 0.04), but all LLMs lag far behind dedicated models in stock/credit/fraud prediction.
- Trading: LLMs outperform buy-and-hold baselines across tickers. GPT-4 achieves the highest Sharpe Ratio (>1) and lowest Max Drawdown (~18%), indicating the most robust risk-return management; Gemini follows. Models with <70B parameters fail to produce functional policies.
Instruction-tuning enhances performance on simple classification and sentiment but does not provide significant gains for numerical QA or higher-order reasoning. Notably, cross-lingual fine-tuning (CFGPT sft-7B, Chinese data) can substantially degrade English financial task performance (Xie et al., 2024).
5. Novel Contributions and Innovations
FinBen introduces several firsts in financial NLP benchmarking:
- The first open-source, finance-domain extractive and abstractive summarization datasets: ECTSum (earnings calls) and EDTSum (news).
- The first open-source, multi-step numerical QA datasets in finance: FinQA, TATQA, ConvFinQA.
- The first open-source, LLM-driven stock trading agent benchmark: FinTrade, leveraging the FinMem memory-augmented agent architecture.
- Integrated agent-based evaluation with explicit prompt-as-policy structure, multi-timescale memory, and comprehensive metric reporting.
- Retrieval-augmented evaluation for long-context tasks, seamlessly blending information retrieval and LLM inference pipelines (Xie et al., 2024).
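To make the retrieval-augmented step in the last bullet concrete, here is a minimal sketch that ranks candidate passages (for example table rows or earlier QA turns) by token overlap with the question and prepends the best matches to the prompt. The overlap scoring and the names `retrieve` and `build_rag_prompt` are assumptions chosen for simplicity; the source does not specify which retriever FinBen uses.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, passages, k=2):
    """Return up to k passages with the highest token overlap with the query.

    Ties are broken in favor of the earlier passage; passages with no
    overlap are dropped.
    """
    query_counts = Counter(tokenize(query))
    scored = []
    for index, passage in enumerate(passages):
        passage_counts = Counter(tokenize(passage))
        overlap = sum(min(count, passage_counts[tok]) for tok, count in query_counts.items())
        scored.append((overlap, -index, passage))
    scored.sort(reverse=True)
    return [passage for overlap, _, passage in scored[:k] if overlap > 0]


def build_rag_prompt(question, passages, k=2):
    """Prepend the retrieved context to the question."""
    context = retrieve(question, passages, k)
    context_block = "\n".join(f"- {p}" for p in context) or "- (no relevant context found)"
    return f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:"


rows = [
    "Revenue 2022: 120 million",
    "Net income 2022: 15 million",
    "Headcount 2022: 900 employees",
]
question = "What was net income in 2022?"
top = retrieve(question, rows, k=2)
prompt = build_rag_prompt(question, rows, k=2)
```

Only the selected rows enter the prompt, which keeps the context short for long tables or long conversation histories.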
6. Ecosystem and Community Impact
FinBen powered the first shared task for financial LLMs at the FinNLP-AgentScen workshop (IJCAI-2024), attracting 12 teams. The leading solutions, using FinBen’s multi-memory prompts and agent framework, exceeded the prior best published GPT-4 results, demonstrating the suite’s value for advancing LLM research in finance. The entire suite (datasets, code, prompt templates) is released under MIT/CC licenses, supporting full reproducibility.
Reproducing experiments typically requires GPU hardware (for 5–7B models, e.g., a TITAN RTX or RTX 3090), zero-/few-shot prompt scripts, and ≈20 GPU-hours per model for a complete run. The suite provides detailed instructions, enabling benchmarking, targeted fine-tuning, and future innovation at the interface of NLP, finance, and AI agent research (Xie et al., 2024).
7. Significance and Outlook
By integrating a broad suite of tasks and datasets under a cognitive-hierarchical framework, FinBen enables rigorous, head-to-head comparison of LLMs for financial intelligence. It exposes persistent challenges (e.g., compositional information extraction, advanced forecasting, agentic reasoning) and empirically quantifies the limits and progress of both general and instruction-tuned models. Its open, extensible architecture has established a paradigm for future benchmarks in both financial and more general agent-based LLM evaluation (Xie et al., 2024).