FinForge: Scalable Financial LM Benchmark
- FinForge is a scalable, semi-synthetic benchmark framework that measures financial reasoning by testing language models on questions synthesized from curated, authoritative corpora.
- It systematically combines expert-guided corpus curation with controlled LM-driven QA synthesis to produce over 5,000 validated question–answer pairs across 11 financial subdomains.
- Evaluations using FinForge reveal significant performance disparities in quantitative and counterfactual reasoning, emphasizing the need for domain-adaptive model improvements.
FinForge is a scalable, semi-synthetic benchmark generation framework designed to evaluate LMs on the depth and breadth of real-world financial reasoning. Addressing the lack of rigorous, domain-specific testbeds, FinForge systematically curates authoritative financial corpora, synthesizes question–answer (QA) pairs through a structured, multi-stage LLM workflow, and enforces both automated and expert-driven validation routines. The current benchmark release, FinForge-5k, comprises over 5,000 human-validated QA pairs across eleven finance subdomains, seeded from a verified corpus of 100,000 documents totaling 143 million tokens. Evaluations of leading LMs using FinForge-5k highlight critical performance disparities and elucidate current limitations in financial domain competence (Matlin et al., 11 Jan 2026).
1. Semi-Synthetic Benchmark Generation Pipeline
The FinForge pipeline consists of two primary stages: expert-guided corpus curation and controlled LLM–driven QA synthesis.
1.1 Expert-Guided Corpus Curation
Financial domain experts systematically decomposed the field into eleven subdomains by referencing authoritative curricula and industry standards. For each subdomain, analysts selected only web domains with stringent editorial oversight (e.g., central banks, academic presses, regulatory agencies), explicitly excluding unverified or opinion-based sources. Programmatic scraping—using Trafilatura, BeautifulSoup (for HTML content), and PyMuPDF4LLM (for PDFs)—combined sitemap traversal, keyword co-occurrence, and link-structure heuristics to extract documents. Stringent normalization routines removed boilerplate, advertisements, and navigation; downstream filtering enforced document length, linguistic integrity, and deduplication, yielding a high-quality, contemporaneous corpus.
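The downstream filtering step (length, linguistic integrity, deduplication) can be sketched as follows. The 200-token minimum, the alphabetic-ratio heuristic, and the exact-hash deduplication scheme are illustrative assumptions, not FinForge's published parameters:

```python
import hashlib


def passes_filters(doc: str, min_tokens: int = 200) -> bool:
    """Crude quality gate on length and linguistic integrity.

    The 200-token floor and 0.7 alphabetic ratio are assumed
    thresholds for illustration only.
    """
    tokens = doc.split()
    if len(tokens) < min_tokens:
        return False
    # Reject documents dominated by non-text noise (e.g., leftover markup).
    alpha_ratio = sum(t.isalpha() for t in tokens) / len(tokens)
    return alpha_ratio > 0.7


def deduplicate(docs: list[str]) -> list[str]:
    """Exact deduplication via content hashing (a common baseline)."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Toy corpus: a clean document, its duplicate, and a numeric-noise document.
corpus = ["alpha " * 250, "alpha " * 250, "12 34 " * 250]
kept = [d for d in deduplicate(corpus) if passes_filters(d)]
```

In practice near-duplicate detection (e.g., MinHash) would likely replace exact hashing, but the gating logic is the same.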
1.2 LLM–Driven Question Synthesis
The structured QA generation uses a five-stage process powered primarily by Gemini 2.5 Flash:
- Deep Document Analysis: Identification of key financial reasoning primitives such as causal mechanisms, competing hypotheses, necessary assumptions, and counterfactuals.
- Answer-Plan Drafting: An LM agent generates a blueprint specifying the focal concept, a difficulty rating (on a 1–5 scale from recall to expert multi-constraint reasoning), and the minimal required context.
- Question & Distractor Generation: Gemini 2.5 Flash composes multiple-choice items with a correct answer, three challenge-level distractors, and a rationale.
- Metadata Labeling: Each question is annotated with its subdomain, difficulty rating, and reasoning type (quantitative, conceptual, counterfactual).
- Automated Rubric-Based Validation (“LM-as-Judge”): Candidate items are evaluated for financial relevance, self-sufficiency, logical consistency, clarity, and complexity.
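The five stages above can be sketched as a sequential pipeline. The `call_lm` stub, prompts, and metadata values below are hypothetical placeholders standing in for the Gemini 2.5 Flash calls described in the source:

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Accumulates artifacts as a source document moves through the pipeline."""
    document: str
    analysis: str = ""
    plan: str = ""
    question: str = ""
    metadata: dict = field(default_factory=dict)
    valid: bool = False


def call_lm(prompt: str) -> str:
    # Hypothetical stand-in for a Gemini 2.5 Flash API call.
    return f"LM output for: {prompt[:40]}"


def run_pipeline(document: str) -> Candidate:
    c = Candidate(document)
    # 1. Deep document analysis: surface reasoning primitives.
    c.analysis = call_lm(f"Identify causal mechanisms and assumptions: {document}")
    # 2. Answer-plan drafting: focal concept, difficulty (1-5), minimal context.
    c.plan = call_lm(f"Draft answer plan: {c.analysis}")
    # 3. Question & distractor generation: one key plus three distractors.
    c.question = call_lm(f"Write MCQ with distractors and rationale: {c.plan}")
    # 4. Metadata labeling (values here are illustrative).
    c.metadata = {"subdomain": "Corporate Finance & Valuation",
                  "difficulty": 3, "reasoning": "quantitative"}
    # 5. LM-as-Judge rubric validation; all five dimensions must pass.
    verdict = call_lm(f"Judge relevance, self-sufficiency, consistency: {c.question}")
    c.valid = bool(verdict)
    return c
```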
This hybrid expert-LM methodology preserves domain fidelity while facilitating scale and reproducibility.
2. Corpus Composition and QA Pair Production
The completed reference corpus encompasses over 100,000 verified documents totaling 143 million tokens. For benchmark construction, stratified random sampling selected 10,000 source documents across the eleven subdomains. The sequential QA-generation pipeline produced 10,000 initial candidates. Following automated rubric-based filtering, rule-based checks, and expert exclusion, the final FinForge-5k set consists of 5,000 high-quality question–answer pairs.
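The stratified sampling step can be illustrated as below. Equal per-subdomain allocation is assumed here, since the source reports roughly balanced counts across the eleven subdomains:

```python
import random


def stratified_sample(docs_by_subdomain: dict[str, list[str]],
                      total: int, seed: int = 0) -> list[str]:
    """Draw an (approximately) equal number of documents per subdomain."""
    rng = random.Random(seed)
    per_stratum = total // len(docs_by_subdomain)
    sample = []
    for docs in docs_by_subdomain.values():
        sample.extend(rng.sample(docs, min(per_stratum, len(docs))))
    return sample


# Toy corpus: 11 subdomains with 100 documents each.
corpus = {f"subdomain_{i}": [f"doc_{i}_{j}" for j in range(100)]
          for i in range(11)}
selected = stratified_sample(corpus, total=110)
```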
3. Subdomain Taxonomy and Proportional Representation
FinForge uses a domain taxonomy spanning eleven specialized subfields, with each equally represented in the benchmark to ensure balanced coverage.
| Subdomain | Approximate QA Pairs |
|---|---|
| Alternative Investments & Real Estate | 450–500 |
| Behavioral & Quantitative Finance | 450–500 |
| Corporate Finance & Valuation | 450–500 |
| FinTech & Innovation | 450–500 |
| Financial Accounting & Reporting | 450–500 |
| Financial Ethics & Governance | 450–500 |
| Markets & Derivatives | 450–500 |
| Regulation & Compliance | 450–500 |
| Investment & Portfolio Management | 450–500 |
| Personal Finance & Wealth Management | 450–500 |
| Public & International Finance | 450–500 |
This balanced structure enables fine-grained analysis of LMs’ subdomain-specific proficiencies and gaps.
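Balanced coverage makes per-subdomain breakdowns straightforward. The sketch below groups graded items by their subdomain label; the per-item result schema (`subdomain`, `correct`) is an assumption for illustration:

```python
from collections import defaultdict


def subdomain_accuracy(results: list[dict]) -> dict[str, float]:
    """Accuracy per subdomain from per-item grading results.

    Each result is assumed to look like {'subdomain': str, 'correct': bool}.
    """
    tally = defaultdict(lambda: [0, 0])  # subdomain -> [correct, total]
    for r in results:
        tally[r["subdomain"]][0] += r["correct"]
        tally[r["subdomain"]][1] += 1
    return {s: c / t for s, (c, t) in tally.items()}


results = [
    {"subdomain": "Markets & Derivatives", "correct": True},
    {"subdomain": "Markets & Derivatives", "correct": False},
    {"subdomain": "FinTech & Innovation", "correct": True},
]
acc = subdomain_accuracy(results)
```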
4. Validation Protocol and Human Expert Oversight
The pipeline enforces multi-layered validation.
Automated Validation:
Gemini 2.5 Flash scores each candidate item on the five-dimension rubric—financial relevance, self-sufficiency, logical consistency, clarity, and complexity. Only items meeting all criteria are retained.
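The all-criteria gate might be implemented as follows. The boolean per-dimension schema is an assumption; the source does not publish the judge's output format:

```python
# The five rubric dimensions named in the FinForge validation protocol.
RUBRIC = ("financial_relevance", "self_sufficiency",
          "logical_consistency", "clarity", "complexity")


def retain(judge_scores: dict[str, bool]) -> bool:
    """Keep an item only if the LM judge passes it on all five dimensions."""
    return all(judge_scores.get(dim, False) for dim in RUBRIC)


candidate = {dim: True for dim in RUBRIC}
borderline = {**candidate, "clarity": False}
```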
Human Expert Review:
Downstream quality assurance involves a panel of three finance specialists who independently audit a stratified sample (10%, i.e., 500 questions). The panel evaluates factual accuracy, clarity, and practical relevance. Experts approve 70% of sampled items outright, with 30% flagged for missing context or subtle ambiguities. This demonstrates the necessity of human oversight to supplement automated validation, particularly in high-stakes financial domains.
5. Evaluation Metrics and Model Performance
The principal evaluation metric is multiple-choice accuracy: the fraction of items for which the model selects the keyed answer. (While Cohen’s kappa is a standard inter-rater agreement metric, only the accuracy statistic is directly reported.)
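A minimal implementation of this accuracy metric:

```python
def mc_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of multiple-choice items where the model picked the key."""
    assert len(predictions) == len(answers) and answers
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Three of four items answered correctly.
acc = mc_accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"])
```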
Empirical evaluations (Table 1 of the source) across eight leading open- and closed-source models exhibit considerable variance:
- Qwen-3-235B achieves the highest reported accuracy at 77.1%.
- DeepSeek-V3 follows at 73.9%, GPT-4o at 73.4%.
- Open-source models in the 32–80B parameter range (e.g., Qwen3-Next 80B) remain within 5% of proprietary state-of-the-art.
- Models below 32B parameters yield accuracy between 56% and 61%, highlighting a pronounced drop-off in sophisticated financial reasoning.
These results indicate a persistent performance gap, especially for smaller models and for tasks requiring nuanced domain reasoning.
6. Diagnostic Insights and Implications for LM Research
FinForge-5k’s fine-grained subdomain and difficulty annotations reveal that even the best-performing LMs underperform on tasks involving:
- Personal Finance & Wealth Management and Corporate Finance & Valuation—particularly those necessitating multi-constraint optimization (e.g., considering taxes, liquidity, risk simultaneously).
- Quantitative and counterfactual reasoning, where model errors manifest both as conceptual misapplications (selecting inappropriate financial methodologies) and as arithmetic mistakes that would disrupt real-world workflows.
These findings underscore that:
- Model parameter scale alone does not confer domain expertise; pretraining data composition and domain-adaptive fine-tuning are critical.
- External tool-based arithmetic assistance may reduce calculation errors, whereas conceptual shortcomings require focused domain instruction.
- The FinForge pipeline’s dynamic construction capability supports continual benchmark generation—accommodating regulatory changes, new instruments, and emergent research, thus preserving test novelty and minimizing data leakage.
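One way to realize the tool-based arithmetic assistance mentioned above is to route numeric sub-expressions a model emits to a deterministic evaluator instead of letting the model compute in-text. The sketch below uses Python's `ast` module for safe evaluation; this is an illustrative design, not the paper's method:

```python
import ast
import operator

# Whitelisted operations for a minimal, safe arithmetic tool.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow}


def calc(expression: str) -> float:
    """Deterministically evaluate an arithmetic expression an LM emits."""
    def eval_node(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](eval_node(node.left), eval_node(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -eval_node(node.operand)
        raise ValueError("unsupported expression")
    return eval_node(ast.parse(expression, mode="eval").body)


# e.g., a compound-interest factor the model would otherwise compute in-text:
factor = calc("(1 + 0.05) ** 10")
```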
7. Significance and Prospects
FinForge-5k establishes a transparent, reproducible, and scalable framework for rigorous evaluation of LMs in finance. By exposing the granular contours of current model capabilities and failures, FinForge functions as both a testbed and a guide for the design of domain-aware pretraining regimens, dedicated financial reasoning architectures, and continual evaluation workflows in high-stakes professional environments. A plausible implication is that sustained use of such semi-synthetic benchmarks will drive targeted advances in LM competence aligned with the evolving demands of financial research and industry applications (Matlin et al., 11 Jan 2026).