FinForge: Scalable Financial LM Benchmark
- FinForge is a scalable, semi-synthetic benchmark framework that measures financial reasoning by testing language models on questions synthesized from curated, authoritative corpora.
- It systematically combines expert-guided corpus curation with controlled LM-driven QA synthesis to produce over 5,000 validated question–answer pairs across 11 financial subdomains.
- Evaluations using FinForge reveal significant performance disparities in quantitative and counterfactual reasoning, emphasizing the need for domain-adaptive model improvements.
FinForge is a scalable, semi-synthetic benchmark generation framework designed to evaluate LMs on the depth and breadth of real-world financial reasoning. Addressing the lack of rigorous, domain-specific testbeds, FinForge systematically curates authoritative financial corpora, synthesizes question–answer (QA) pairs through a structured, multi-stage LLM workflow, and enforces both automated and expert-driven validation routines. The current benchmark release, FinForge-5k, comprises over 5,000 human-validated QA pairs across eleven finance subdomains, seeded from a verified corpus of 100,000 documents totaling 143 million tokens. Evaluations of leading LMs using FinForge-5k highlight critical performance disparities and elucidate current limitations in financial domain competence (Matlin et al., 11 Jan 2026).
1. Semi-Synthetic Benchmark Generation Pipeline
The FinForge pipeline consists of two primary stages: expert-guided corpus curation and controlled LLM–driven QA synthesis.
1.1 Expert-Guided Corpus Curation
Financial domain experts systematically decomposed the field into eleven subdomains by referencing authoritative curricula and industry standards. For each subdomain, analysts selected only web domains with stringent editorial oversight (e.g., central banks, academic presses, regulatory agencies), explicitly excluding unverified or opinion-based sources. Programmatic scraping—using Trafilatura, BeautifulSoup (for HTML content), and PyMuPDF4LLM (for PDFs)—combined sitemap traversal, keyword co-occurrence, and link-structure heuristics to extract documents. Stringent normalization routines removed boilerplate, advertisements, and navigation; downstream filtering enforced document length, linguistic integrity, and deduplication, yielding a high-quality, contemporaneous corpus.
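The downstream filtering step (length, linguistic integrity, deduplication) can be sketched as follows. The 200-token minimum, the alphabetic-ratio heuristic, and the exact-hash deduplication scheme are illustrative assumptions, not FinForge's published parameters:

```python
import hashlib


def passes_filters(doc: str, min_tokens: int = 200) -> bool:
    """Crude quality gate on length and linguistic integrity.

    The 200-token floor and 0.7 alphabetic ratio are assumed
    thresholds for illustration only.
    """
    tokens = doc.split()
    if len(tokens) < min_tokens:
        return False
    # Reject documents dominated by non-text noise (e.g., leftover markup).
    alpha_ratio = sum(t.isalpha() for t in tokens) / len(tokens)
    return alpha_ratio > 0.7


def deduplicate(docs: list[str]) -> list[str]:
    """Exact deduplication via content hashing (a common baseline)."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Toy corpus: a clean document, its duplicate, and a numeric-noise document.
corpus = ["alpha " * 250, "alpha " * 250, "12 34 " * 250]
kept = [d for d in deduplicate(corpus) if passes_filters(d)]
```

In practice near-duplicate detection (e.g., MinHash) would likely replace exact hashing, but the gating logic is the same.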
1.2 LLM–Driven Question Synthesis
The structured QA generation uses a five-stage process powered primarily by Gemini 2.5 Flash:
- Deep Document Analysis: Identification of key financial reasoning primitives such as causal mechanisms, competing hypotheses, necessary assumptions, and counterfactuals.
- Answer-Plan Drafting: An LM agent generates a blueprint specifying the focal concept, a difficulty rating (on a 1–5 scale from recall to expert multi-constraint reasoning), and the minimal required context.
- Question & Distractor Generation: Gemini 2.5 Flash composes multiple-choice items with a correct answer, three challenge-level distractors, and a rationale.
- Metadata Labeling: Each question is annotated with its subdomain, difficulty rating, and reasoning type (quantitative, conceptual, counterfactual).
- Automated Rubric-Based Validation (“LM-as-Judge”): Candidate items are evaluated for financial relevance, self-sufficiency, logical consistency, clarity, and complexity.
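The five stages above can be sketched as a sequential pipeline. The `call_lm` stub, prompts, and metadata values below are hypothetical placeholders standing in for the Gemini 2.5 Flash calls described in the source:

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Accumulates artifacts as a source document moves through the pipeline."""
    document: str
    analysis: str = ""
    plan: str = ""
    question: str = ""
    metadata: dict = field(default_factory=dict)
    valid: bool = False


def call_lm(prompt: str) -> str:
    # Hypothetical stand-in for a Gemini 2.5 Flash API call.
    return f"LM output for: {prompt[:40]}"


def run_pipeline(document: str) -> Candidate:
    c = Candidate(document)
    # 1. Deep document analysis: surface reasoning primitives.
    c.analysis = call_lm(f"Identify causal mechanisms and assumptions: {document}")
    # 2. Answer-plan drafting: focal concept, difficulty (1-5), minimal context.
    c.plan = call_lm(f"Draft answer plan: {c.analysis}")
    # 3. Question & distractor generation: one key plus three distractors.
    c.question = call_lm(f"Write MCQ with distractors and rationale: {c.plan}")
    # 4. Metadata labeling (values here are illustrative).
    c.metadata = {"subdomain": "Corporate Finance & Valuation",
                  "difficulty": 3, "reasoning": "quantitative"}
    # 5. LM-as-Judge rubric validation; all five dimensions must pass.
    verdict = call_lm(f"Judge relevance, self-sufficiency, consistency: {c.question}")
    c.valid = bool(verdict)
    return c
```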
This hybrid expert-LM methodology preserves domain fidelity while facilitating scale and reproducibility.
2. Corpus Composition and QA Pair Production
The completed reference corpus encompasses over 100,000 verified documents totaling 143 million tokens. For benchmark construction, stratified random sampling selected 10,000 source documents across the eleven subdomains. The sequential QA-generation pipeline produced 10,000 initial candidates. Following automated rubric-based filtering, rule-based checks, and expert exclusion, the final FinForge-5k set consists of 5,000 high-quality question–answer pairs.
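The stratified sampling step can be illustrated as below. Equal per-subdomain allocation is assumed here, since the source reports roughly balanced counts across the eleven subdomains:

```python
import random


def stratified_sample(docs_by_subdomain: dict[str, list[str]],
                      total: int, seed: int = 0) -> list[str]:
    """Draw an (approximately) equal number of documents per subdomain."""
    rng = random.Random(seed)
    per_stratum = total // len(docs_by_subdomain)
    sample = []
    for docs in docs_by_subdomain.values():
        sample.extend(rng.sample(docs, min(per_stratum, len(docs))))
    return sample


# Toy corpus: 11 subdomains with 100 documents each.
corpus = {f"subdomain_{i}": [f"doc_{i}_{j}" for j in range(100)]
          for i in range(11)}
selected = stratified_sample(corpus, total=110)
```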
3. Subdomain Taxonomy and Proportional Representation
FinForge uses a domain taxonomy spanning eleven specialized subfields, with each equally represented in the benchmark to ensure balanced coverage.
| Subdomain | Approximate QA Pairs |
|---|---|
| Alternative Investments & Real Estate | 450–500 |
| Behavioral & Quantitative Finance | 450–500 |
| Corporate Finance & Valuation | 450–500 |
| FinTech & Innovation | 450–500 |
| Financial Accounting & Reporting | 450–500 |
| Financial Ethics & Governance | 450–500 |
| Markets & Derivatives | 450–500 |
| Regulation & Compliance | 450–500 |
| Investment & Portfolio Management | 450–500 |
| Personal Finance & Wealth Management | 450–500 |
| Public & International Finance | 450–500 |
This balanced structure enables fine-grained analysis of LMs’ subdomain-specific proficiencies and gaps.
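Balanced coverage makes per-subdomain breakdowns straightforward. The sketch below groups graded items by their subdomain label; the per-item result schema (`subdomain`, `correct`) is an assumption for illustration:

```python
from collections import defaultdict


def subdomain_accuracy(results: list[dict]) -> dict[str, float]:
    """Accuracy per subdomain from per-item grading results.

    Each result is assumed to look like {'subdomain': str, 'correct': bool}.
    """
    tally = defaultdict(lambda: [0, 0])  # subdomain -> [correct, total]
    for r in results:
        tally[r["subdomain"]][0] += r["correct"]
        tally[r["subdomain"]][1] += 1
    return {s: c / t for s, (c, t) in tally.items()}


results = [
    {"subdomain": "Markets & Derivatives", "correct": True},
    {"subdomain": "Markets & Derivatives", "correct": False},
    {"subdomain": "FinTech & Innovation", "correct": True},
]
acc = subdomain_accuracy(results)
```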
4. Validation Protocol and Human Expert Oversight
The pipeline enforces multi-layered validation.
Automated Validation:
Gemini 2.5 Flash scores each candidate item on the five-dimension rubric—financial relevance, self-sufficiency, logical consistency, clarity, and complexity. Only items meeting all criteria are retained.
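The all-criteria gate might be implemented as follows. The boolean per-dimension schema is an assumption; the source does not publish the judge's output format:

```python
# The five rubric dimensions named in the FinForge validation protocol.
RUBRIC = ("financial_relevance", "self_sufficiency",
          "logical_consistency", "clarity", "complexity")


def retain(judge_scores: dict[str, bool]) -> bool:
    """Keep an item only if the LM judge passes it on all five dimensions."""
    return all(judge_scores.get(dim, False) for dim in RUBRIC)


candidate = {dim: True for dim in RUBRIC}
borderline = {**candidate, "clarity": False}
```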
Human Expert Review:
Downstream quality assurance involves a panel of three finance specialists who independently audit a stratified sample (10%, i.e., 500 questions). The panel evaluates factual accuracy, clarity, and practical relevance. Experts approve 70% of sampled items outright, with 30% flagged for missing context or subtle ambiguities. This demonstrates the necessity of human oversight to supplement automated validation, particularly in high-stakes financial domains.
5. Evaluation Metrics and Model Performance
The principal evaluation metric is multiple-choice accuracy: the fraction of items for which the model selects the keyed answer. (While Cohen’s kappa is a standard inter-rater agreement metric, only the accuracy statistic is directly reported.)
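A minimal implementation of this accuracy metric:

```python
def mc_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of multiple-choice items where the model picked the key."""
    assert len(predictions) == len(answers) and answers
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Three of four items answered correctly.
acc = mc_accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"])
```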
Empirical evaluations (Table 1 of the source) across eight leading open- and closed-source models exhibit considerable variance:
- Qwen-3-235B achieves the highest reported accuracy at 77.1%.
- DeepSeek-V3 follows at 73.9%, GPT-4o at 73.4%.
- Open-source models in the 32–80B parameter range (e.g., Qwen3-Next 80B) remain within 5% of proprietary state-of-the-art.
- Models below 32B parameters yield accuracy between 56% and 61%, highlighting a pronounced drop-off in sophisticated financial reasoning.
These results indicate a persistent performance gap, especially for smaller models and for tasks requiring nuanced domain reasoning.
6. Diagnostic Insights and Implications for LM Research
FinForge-5k’s fine-grained subdomain and difficulty annotations reveal that even the best-performing LMs underperform on tasks involving:
- Personal Finance & Wealth Management and Corporate Finance & Valuation—particularly those necessitating multi-constraint optimization (e.g., considering taxes, liquidity, risk simultaneously).
- Quantitative and counterfactual reasoning, where model errors manifest both as conceptual misapplications (selecting inappropriate financial methodologies) and as arithmetic mistakes that would disrupt real-world workflows.
These findings underscore that:
- Model parameter scale alone does not confer domain expertise; pretraining data composition and domain-adaptive fine-tuning are critical.
- External tool-based arithmetic assistance may reduce calculation errors, whereas conceptual shortcomings require focused domain instruction.
- The FinForge pipeline’s dynamic construction capability supports continual benchmark generation—accommodating regulatory changes, new instruments, and emergent research, thus preserving test novelty and minimizing data leakage.
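One way to realize the tool-based arithmetic assistance mentioned above is to route numeric sub-expressions a model emits to a deterministic evaluator instead of letting the model compute in-text. The sketch below uses Python's `ast` module for safe evaluation; this is an illustrative design, not the paper's method:

```python
import ast
import operator

# Whitelisted operations for a minimal, safe arithmetic tool.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow}


def calc(expression: str) -> float:
    """Deterministically evaluate an arithmetic expression an LM emits."""
    def eval_node(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](eval_node(node.left), eval_node(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -eval_node(node.operand)
        raise ValueError("unsupported expression")
    return eval_node(ast.parse(expression, mode="eval").body)


# e.g., a compound-interest factor the model would otherwise compute in-text:
factor = calc("(1 + 0.05) ** 10")
```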
7. Significance and Prospects
FinForge-5k establishes a transparent, reproducible, and scalable framework for rigorous evaluation of LMs in finance. By exposing the granular contours of current model capabilities and failures, FinForge functions as both a testbed and a guide for the design of domain-aware pretraining regimens, dedicated financial reasoning architectures, and continual evaluation workflows in high-stakes professional environments. A plausible implication is that sustained use of such semi-synthetic benchmarks will drive targeted advances in LM competence aligned with the evolving demands of financial research and industry applications (Matlin et al., 11 Jan 2026).