Fin-R1: Financial Reasoning LLM

Updated 15 October 2025
  • Fin-R1 is a financial domain-specific LLM that integrates supervised fine-tuning with reinforcement learning for precise multi-step financial reasoning.
  • It leverages a high-quality dataset with 60K chain-of-thought annotations to enforce structured logical progression and numerical analysis.
  • The model achieves state-of-the-art performance on specialized financial benchmarks, outperforming larger models on complex decision tasks.

Fin-R1 is an LLM for financial reasoning, designed to deliver robust multi-step reasoning and decision-making on complex financial tasks. The model is constructed with a two-stage training pipeline that combines supervised fine-tuning (SFT) and reinforcement learning (RL), tailored to the financial domain through a distilled financial reasoning dataset. Fin-R1 is parameter-efficient, built on a 7B-parameter backbone, and demonstrates state-of-the-art (SOTA) performance on specialized benchmarks (FinQA, ConvFinQA) while remaining competitive with, or outperforming, much larger models across a broad suite of financial reasoning and decision tasks (Liu et al., 20 Mar 2025).

1. Two-Stage Model and Data Architecture

Fin-R1 is built around a two-stage approach. First, dataset generation and distillation yield the Fin-R1-Data corpus: a high-quality financial reasoning dataset containing 60,091 chain-of-thought (CoT) and non-reasoning financial questions. The CoT annotations employ standardized <think> and <answer> tags, enforcing both interpretable step-by-step reasoning and an explicit answer segment, so that the model learns logical progression alongside numerical and contextual reasoning.

The model is initialized from the Qwen2.5-7B-Instruct backbone, striking a balance between parameter economy and learning capacity.

The training pipeline consists of:

  • Supervised Fine-Tuning (SFT): The model, initialized from Qwen2.5-7B-Instruct, is fine-tuned on triplets v = (x, c, y^*), where x is the financial query, c is the explicitly tagged reasoning trace, and y^* is the standardized answer.
  • Reinforcement Learning (RL): RL is performed using the Group Relative Policy Optimization (GRPO) algorithm, leveraging dual reward signals (formatting and answer accuracy), further aligning the model's outputs to domain requirements and precision.

2. Training Methodology and Objective Function

The SFT step encodes both reasoning and answer generation directly:

  • Each training instance exposes the model to explicit CoT tags, enforcing structured multi-step reasoning before the final answer is produced.
  • Format tags ensure outputs can be reliably parsed and evaluated downstream.
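
To make the tagging scheme concrete, here is a minimal sketch of how one SFT instance could be assembled; the function name, field names, and template are illustrative assumptions rather than the released preprocessing code:

```python
# Illustrative sketch (not the released pipeline): assemble one SFT instance
# from a (query, reasoning trace, answer) triplet using the <think>/<answer>
# tagging scheme described above.

def build_sft_example(query: str, reasoning: str, answer: str) -> dict:
    """Return a prompt/target pair for supervised fine-tuning."""
    target = f"<think>\n{reasoning}\n</think>\n<answer>\n{answer}\n</answer>"
    return {"prompt": query, "target": target}

example = build_sft_example(
    query="A bond's price falls from $102.50 to $98.40. What is the percentage change?",
    reasoning="Change = 98.40 - 102.50 = -4.10. Percentage change = -4.10 / 102.50 = -0.04, i.e. -4.0%.",
    answer="-4.0%",
)
print(example["target"])
```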

In the RL phase, policy optimization follows a comparative group-based approach:

  • For each sample, the model generates a group of candidate outputs \{o_1, \dots, o_G\} under the previous policy \pi_{\text{old}}.
  • Each output is rewarded for both correct format (presence of exactly one <think> and one <answer> segment) and answer correctness, validated by an external judge model (Qwen2.5-Max), with rewards allocated as 1 (semantic match) or 0 (otherwise).
  • The group-relative advantage for each candidate output is

A_i = \frac{r_i - \mu}{\sigma}

where r_i is the reward for candidate i and \mu, \sigma are the mean and standard deviation of rewards within the group.

  • The GRPO objective is given by

\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( r_i^{\text{ratio}} \cdot A_i,\ \text{clip}\left(r_i^{\text{ratio}},\, 1 - \varepsilon,\, 1 + \varepsilon\right) \cdot A_i \right) - \beta\, D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right]

where r_i^{\text{ratio}} = \frac{\pi_\theta(o_i \mid v)}{\pi_{\text{old}}(o_i \mid v)} and the D_{\text{KL}} term regularizes the updated policy against drifting from the reference policy \pi_{\text{ref}}.
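
A minimal NumPy sketch of these two formulas follows. The log-probabilities, reward values, and the clipping and KL coefficients are illustrative placeholders, and the KL term is taken as a precomputed scalar rather than estimated from a model:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """A_i = (r_i - mu) / sigma, normalized within one group of G candidates."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_objective(
    logp_new: np.ndarray,   # log pi_theta(o_i | v) for each candidate
    logp_old: np.ndarray,   # log pi_old(o_i | v), fixed during the update
    rewards: np.ndarray,    # combined format + accuracy reward per candidate
    kl_to_ref: float,       # estimate of D_KL(pi_theta || pi_ref)
    clip_eps: float = 0.2,  # epsilon in the clip term (placeholder value)
    beta: float = 0.04,     # KL coefficient (placeholder value)
) -> float:
    """Clipped, group-relative surrogate objective (to be maximized)."""
    adv = group_relative_advantages(rewards)
    ratio = np.exp(logp_new - logp_old)                  # r_i^ratio
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * adv, clipped * adv)   # per-candidate min term
    return float(surrogate.mean() - beta * kl_to_ref)

# Example: a group of G = 4 candidates; each reward combines the format signal
# and the accuracy signal (each 0 or 1), so values range over {0, 1, 2}.
rewards = np.array([2.0, 0.0, 1.0, 2.0])
logp_old = np.array([-12.1, -15.3, -13.0, -11.8])
logp_new = np.array([-11.9, -15.6, -13.1, -11.5])
print(grpo_objective(logp_new, logp_old, rewards, kl_to_ref=0.01))
```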

This dual-phase workflow yields a model highly attuned to both the formal structure and the substantive logic of financial Q&A.

3. Performance Metrics and Evaluation

Fin-R1 is evaluated across a battery of standard financial benchmarks:

  • On a portfolio of five standard datasets (FinQA, ConvFinQA, Ant-Finance, TFNS, Finance-Instruct-500K), it achieves an overall average score of 75.2.
  • SOTA results are demonstrated with 85.0 on ConvFinQA and 76.0 on FinQA: both tasks require chain-of-thought calculation, numerical precision, and multi-step symbolic manipulation.
  • Fin-R1 outperforms DeepSeek-R1-Distill-Llama-70B (69.2 overall) and even some larger 32B-parameter models, underscoring the effectiveness of its training pipeline and architecture at a smaller scale.

This performance attests not just to surface-level answer accuracy, but to consistent multi-step reasoningβ€”a crucial capability for high-stakes financial operations.

4. Financial Reasoning and Decision-Making Capabilities

Fin-R1 is optimized to handle reasoning patterns native to the financial domain:

  • Supports numerical computation, ratio assessment, decimal/percentage translation, and strict output formatting (for auditability).
  • Mastery of chain-of-thought reasoning enables not only single-step problem-solving but also multi-hop compliance reasoning, automated regulatory checks, and transparent unfolding of investment logic.
  • Output tags (<think>, <answer>) are strictly enforced by the reward mechanism, providing both human interpretability and traceability in automated decision environments where justification is mandatory (a minimal format check along these lines is sketched after this list).
  • Examples in the model outputs include broken-down calculation steps, internal uncertainty management, and precise outcome reporting.
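
As a concrete illustration of the format constraint, here is a minimal check of the kind the format reward could rely on, assuming the output must consist of exactly one <think>...</think> block followed by exactly one <answer>...</answer> block; the exact regular expression is an assumption, not the published reward code:

```python
import re

# Assumed structure: one <think>...</think> block followed by one
# <answer>...</answer> block, with nothing but whitespace around them.
_FORMAT_RE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL
)

def format_reward(output: str) -> float:
    """Return 1.0 if the output matches the required tag structure, else 0.0."""
    ok = (
        output.count("<think>") == 1
        and output.count("</think>") == 1
        and output.count("<answer>") == 1
        and output.count("</answer>") == 1
        and _FORMAT_RE.fullmatch(output) is not None
    )
    return 1.0 if ok else 0.0

print(format_reward("<think>Step 1: compute the ratio ...</think><answer>-4.0%</answer>"))  # 1.0
print(format_reward("The answer is -4.0%."))                                                # 0.0
```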

5. Comparative Analysis and State-of-the-Art Achievements

Fin-R1 achieves top-tier results among financial domain models of moderate size, with characteristics including:

  • Parameter efficiency: 7B scale yet competitive against, or superior to, 32B and 70B models.
  • Strong generalization to fragmented data, numerically precise tasks, and logical QA.
  • Benchmarks (FinQA, ConvFinQA) act as proxies for real-world scenarios such as document-based financial analysis, risk adjudication, and compliance checking.

The model's competitive advantage arises from the combination of a domain-tailored data pipeline, chain-of-thought enforcement, and targeted RL-facilitated alignment of output structure and content.

6. Open Source Availability

The Fin-R1 codebase is released at https://github.com/SUFE-AIFLM-Lab/Fin-R1, supporting:

  • Full reproducibility of the model experiments and evaluation statistics.
  • Community extension, adaptation to new financial data schemas, and further innovation on domain-specific LLM reasoning frameworks.

This facilitates adoption within both fintech engineering and academic research, enabling rapid iteration on domain grounding, explainability, and regulatory-oriented reasoning.

7. Broader Implications

Fin-R1 exemplifies a modern pipeline for financial AI:

  • Demonstrates that smaller, well-designed architectures can achieve SOTA on reasoning-intensive financial tasks when supplied with explicit chain-of-thought data and reinforced alignment.
  • Sets a new baseline for the deployment of reasoning LLMs in financial decision-making, compliance monitoring, robo-advisory, and automated reporting systems.

This approach is distinguished by explicit structural reasoning control, reinforcement learning alignment for regulated output forms, and codebase accessibility, positioning Fin-R1 as an influential model for continued integration of LLMs into high-stakes financial domains.
