R1-Style Finetuning Process for Multimodal Models

Updated 19 December 2025
  • R1-Style Finetuning Process is a rule-based reinforcement learning method that integrates GRPO for stable optimization in multimodal settings.
  • It uses deterministic, ground-truth rewards to enhance reasoning, structured tool use, and spatial understanding in language, vision-language, and multimodal models.
  • The process combines supervised fine-tuning with RL-based policy updates, yielding improved generalization and sample efficiency.

R1-Style Finetuning Process

R1-style finetuning is a reinforcement learning (RL)-based procedure designed to efficiently elicit advanced reasoning capabilities, structured tool use, and robust generalization in large language, vision-language, and multi-modal models. Core to this paradigm is the use of rule-based, deterministic rewards derived directly from ground-truth annotations or explicit task criteria, coupled with Group Relative Policy Optimization (GRPO)—a variant of proximal policy optimization (PPO) that normalizes policy gradients within a group of sampled rollouts to stabilize credit assignment. R1-style pipelines have become central in scientific advances across multimodal reasoning, spatial understanding, action planning, vision-based manipulation, and auto-formatted stepwise chain-of-thought generation.

1. General Principles and Motivation

R1-style finetuning extends earlier language-model reinforcement learning (as in DeepSeek-R1) to complex multimodal settings. The defining characteristics include:

  • Rule-based reward models: Instead of trained reward networks or human preference models, R1 applies deterministic reward functions directly grounded in task logic (e.g., exact match to ground-truth answer, numeric error tolerance, spatial IoU, etc.).
  • Group Relative Policy Optimization (GRPO): For each data point, multiple candidate outputs are sampled, evaluated, and their rewards normalized into group-wise advantages, which are then used to form clipped policy gradient estimators with a KL-regularization term to a reference policy.
  • Outcome-centric supervision: Rewards are assigned on final outcomes (e.g., task accuracy, format compliance) rather than process steps or tool usage policies, minimizing reward hacking pathways.
  • Curriculum integration: R1-style training typically follows or is interleaved with supervised fine-tuning (SFT) on high-quality seed data to anchor model syntax and feasibility, before RL-based optimization for strategy discovery and reward maximization.

This design aligns with findings that SFT alone can seed valid output structure but is fundamentally limited in sample efficiency and exploration, whereas R1-style RL, particularly when initialized from SFT, can efficiently elicit emergent reasoning and tool-use behaviors that generalize (Wu et al., 25 May 2025, Chen et al., 23 May 2025, DeepSeek-AI et al., 22 Jan 2025, Shen et al., 10 Apr 2025).

2. Formal Algorithmic Structure and Policy Objectives

The backbone of R1-style finetuning is the GRPO algorithm with per-group advantage normalization and PPO-style clipping. The canonical objective is (using multimodal notation):

$$\max_{\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big]$$

For each group of $G$ sampled rollouts $\{y_i\}_{i=1}^{G}$ with rewards $r_i$, the normalized advantage is

$$\hat{A}_i = \frac{r_i - \mathrm{mean}_j\, r_j}{\mathrm{std}_j\, r_j}$$

And the clipped GRPO surrogate is

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{\{y_i\} \sim \pi_{\mathrm{old}}}\!\left[\, \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\!\Big( r_{i,t}(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i \Big) \;-\; \beta\, D_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] \right]$$

where $r_{i,t}(\theta)$ is the token-level likelihood ratio between $\pi_\theta$ and the behavior policy $\pi_{\mathrm{old}}$ that generated the rollouts; a minimal sketch of this computation follows.
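
As a concrete illustration of the two expressions above, the following minimal PyTorch sketch computes group-normalized advantages and the clipped GRPO surrogate for one group of rollouts. Tensor names (token_logprobs, old_logprobs, ref_logprobs, rewards, mask) are illustrative assumptions rather than any particular codebase's API.

import torch

def grpo_surrogate(token_logprobs, old_logprobs, ref_logprobs, rewards, mask,
                   eps=0.2, beta=0.04):
    """Clipped GRPO surrogate for one group of G rollouts (a sketch).

    token_logprobs, old_logprobs, ref_logprobs: (G, T) per-token log-probs under
    the current, behavior (old), and frozen reference policies.
    rewards: (G,) scalar outcome rewards; mask: (G, T) float, 1 for valid tokens.
    """
    # Group-wise advantage normalization: A_i = (r_i - mean_j r_j) / std_j r_j
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv.unsqueeze(1)                                      # (G, 1), broadcast over tokens

    # Token-level likelihood ratio w.r.t. the behavior policy pi_old
    ratio = torch.exp(token_logprobs - old_logprobs)            # (G, T)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)       # (G, T)

    # Length-normalized average over tokens, then over the group
    per_rollout = (surrogate * mask).sum(dim=1) / mask.sum(dim=1)
    policy_term = per_rollout.mean()

    # KL penalty to the reference policy (simple log-ratio estimator)
    kl = ((token_logprobs - ref_logprobs) * mask).sum(dim=1) / mask.sum(dim=1)
    return policy_term - beta * kl.mean()   # maximize; negate for use as a loss

In practice the returned scalar is negated and minimized with AdamW, and π_old is refreshed from π_θ every few updates, mirroring step 6 of the pseudocode below.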

Pseudocode for the main R1-style loop is as follows (Wu et al., 25 May 2025):

Algorithm VTool-R1 Finetuning Loop

Input: pretrained VLM π_ref, toolset T, dataset 𝒟, group size G, hyperparams ε, β, lr α
Initialize π_θ ← π_ref
repeat:
  1. Sample minibatch {[I^n, x^n]} from 𝒟
  2. For each example n:
    a. Sample G rollouts {y_i^n} from π_old:
      i. First pass: y′_i ∼ π_old(·|I^n, x^n)
      ii. If y′_i contains tool call, execute I′_i = T(y′_i, I^n); else I′_i=∅
      iii. Second pass: y_i ∼ π_old(·|I^n⊕I′_i, x^n; T)
    b. Evaluate rewards r_i^n = r_ϕ(I^n, x^n, y_i^n)
  3. Normalize {r_i^n} to get advantages Â_i^n
  4. Compute gradient ∇_θ 𝓙_GRPO(θ)
  5. Update θ ← θ + α·∇_θ 𝓙_GRPO(θ)
  6. Occasionally sync π_old ← π_θ
until convergence

The formulation is architecture-agnostic: language-only models, VLMs, and even dense visual-representation transformers can all be optimized under this formalism, given suitable definitions of states, actions, and rewards (Wu et al., 25 May 2025, Pan et al., 29 May 2025, Masry et al., 13 Aug 2025, Li et al., 23 Jun 2025).

3. Reward Engineering and Environment Design

Crucial to R1-style RL is the engineering of verifiable, deterministic reward signals (a code sketch of representative reward terms follows at the end of this section):

  • Task Accuracy: Binary or scalar, comparing model output to ground truth. E.g., VTool-R1 uses $r_\phi(I, x, y) = 1$ if the final answer matches the reference and $0$ otherwise (Wu et al., 25 May 2025); SMART-R1 uses the official RealismMeta metric for traffic trajectory rollouts (Pei et al., 28 Sep 2025).
  • Format Validation: Many tasks require strict syntax (e.g., JSON, XML tags, <think>/<answer> block delimitation); format rewards are binary or soft and can be composed with the main task-accuracy reward (Shen et al., 10 Apr 2025, Wang et al., 2 Jun 2025, Chen et al., 21 Jul 2025).
  • Numerical Sensitivity: For chart and visual QA, rewards may softly grade near-miss numeric predictions, e.g., $r_{\text{num}}(p, a) = \max(0,\, 1 - \min(\delta, 1))$ with $\delta = |p - a| / (0.05\,|a|)$ for a 5% tolerance (Chen et al., 21 Jul 2025).
  • Semantic Similarity: Some systems quantify answer correctness via sentence-BERT or LLM-based similarity for open-ended or partially structured answers (Wang et al., 2 Jun 2025).
  • Spatial/Consistency Rewards: SVQA-R1 adjusts rewards to penalize failures of spatial symmetry by comparing performance on original and mirror-flipped examples, enhancing spatial grounding (Wang et al., 2 Jun 2025).
  • Length or Repetition Regularization: Explicit reward terms encourage chain-of-thought brevity and logical completeness and discourage token repetition (Xu et al., 29 Sep 2025, Tian et al., 16 Jun 2025).
  • Tool Use and Multi-step Environments: For tool-augmented models (VTool-R1, Ego-R1), outputs can contain callable code blocks whose effects are post-processed and fed back via the environment for multi-pass interaction (Wu et al., 25 May 2025, Tian et al., 16 Jun 2025).

Outcome-based, rather than process-based, reward is typically emphasized, as process-shaping rewards tend to be brittle or to encourage reward hacking (Wu et al., 25 May 2025, Shen et al., 10 Apr 2025).
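
As a concrete instance, the sketch below composes the exact-match, format, and 5%-tolerance numeric terms described above into a single verifiable reward. The <think>/<answer> tag pattern and the 0.1 format-bonus weight are illustrative assumptions rather than the exact choices of any cited system.

import re

def accuracy_reward(pred: str, gold: str) -> float:
    # Binary exact match on the final answer string
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def format_reward(output: str) -> float:
    # Binary check for <think>...</think><answer>...</answer> delimitation
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, output.strip(), flags=re.DOTALL) else 0.0

def numeric_reward(pred: float, gold: float, tol: float = 0.05) -> float:
    # r = max(0, 1 - min(delta, 1)) with delta = |p - a| / (tol * |a|):
    # credit decays linearly from 1 at an exact match to 0 once the relative
    # error reaches tol (5% by default).
    if gold == 0.0:
        return 1.0 if pred == 0.0 else 0.0
    delta = abs(pred - gold) / (tol * abs(gold))
    return max(0.0, 1.0 - min(delta, 1.0))

def total_reward(output: str, answer: str, gold: str) -> float:
    # Outcome-centric composition: task accuracy dominates, format adds a small bonus
    return accuracy_reward(answer, gold) + 0.1 * format_reward(output)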

4. Multi-stage Training Pipelines and Curricula

R1-style pipelines are almost always multi-phase:

  1. Supervised Fine-Tuning (SFT) Stage
    • Trains the model to obey stepwise "chain-of-thought" syntax, form the output structure, and ground generation in the data distribution via cross-entropy on seed datasets.
    • Typical datasets range from roughly 5k examples (DeepSeek-R1 cold-start) to millions of triplets (BigCharts, SMART-R1).
    • After SFT, models produce syntactically valid, interpretable outputs but lack robustness and cannot recover from novel error states (DeepSeek-AI et al., 22 Jan 2025, Wu et al., 25 May 2025, Masry et al., 13 Aug 2025, Li et al., 23 Jun 2025).
  2. R1-style RL (GRPO) Stage
    • The policy is initialized from the SFT checkpoint and updated to maximize outcome-based rewards.
    • RL is run for 1–3 epochs over held-out data, using the current policy to sample groups of outputs, evaluate verifiable rewards, normalize them, and apply the GRPO update with a KL penalty.
    • For certain domains (e.g., SMART-R1 for multi-agent traffic), an alternating SFT→RFT→SFT schedule is empirically optimal, with the post-RFT SFT restoring fidelity while preserving the exploration benefits of RL (Pei et al., 28 Sep 2025).
  3. Evaluation
    • Decoding is performed at temperature 1.0, typically with top-k or nucleus sampling; tool calls are executed "in the loop", and chain-of-thought structure is enforced by the prompt and output parser.
    • Task accuracy, out-of-distribution (OOD) generalization, and reasoning-chain length are all tracked as key indicators.

This curriculum anchors syntactic control, enables efficient RL convergence, and preserves overall capabilities and stability (DeepSeek-AI et al., 22 Jan 2025, Shen et al., 10 Apr 2025, Zhan et al., 23 Mar 2025); a schematic driver for the two-stage training schedule is sketched below.
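
The following sketch outlines the SFT→RL schedule as a simple driver loop, assuming hypothetical sft_step and grpo_step callables for the per-batch cross-entropy and GRPO updates; the learning rates and weight decay mirror the values reported in Sections 4–5, but the function itself is illustrative.

import torch

def train_r1_style(model, ref_model, sft_loader, rl_prompt_loader, reward_fn,
                   sft_step, grpo_step, sft_epochs=1, rl_epochs=2, group_size=8):
    """Two-stage curriculum: SFT anchors syntax, then GRPO maximizes verifiable rewards.

    sft_step and grpo_step are hypothetical callables: the first applies one
    cross-entropy update on a batch of (prompt, chain-of-thought, answer) seed
    data; the second samples `group_size` rollouts per prompt, scores them with
    reward_fn, and applies one clipped GRPO update against ref_model.
    """
    # Stage 1: supervised fine-tuning on seed chain-of-thought data
    sft_opt = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-2)
    for _ in range(sft_epochs):
        for batch in sft_loader:
            sft_step(model, batch, sft_opt)

    # Stage 2: GRPO with outcome-based rewards, initialized from the SFT checkpoint
    rl_opt = torch.optim.AdamW(model.parameters(), lr=1e-6, weight_decay=1e-2)
    for _ in range(rl_epochs):
        for prompts in rl_prompt_loader:
            grpo_step(model, ref_model, prompts, reward_fn,
                      group_size=group_size, optimizer=rl_opt)
    return model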

5. Architectural, Implementation, and Hyperparameter Best Practices

R1-style pipelines require modest or no architectural modification beyond standard VLM or LLM backbones:

  • Frozen Vision Towers and Lightweight Adapters: For stability, vision towers are often frozen, with LoRA modules attached to the language decoders (Shen et al., 10 Apr 2025).
  • No Critic Networks: GRPO is strictly actor-only; value baselines are estimated intra-group by mean-reward normalization (Wu et al., 25 May 2025, Masry et al., 13 Aug 2025, Chen et al., 21 Jul 2025).
  • Ablation-Proven Hyperparameters (collected in the configuration sketch after this list):
    • Learning rates: $10^{-6}$ (RL phase), $10^{-5}$ (SFT phase).
    • Group sizes: typically $G = 4$ to $8$ per prompt.
    • KL penalty ($\beta$): $0.01$–$0.1$ (tuned by grid search or inherited from prior art).
    • PPO clip range: $\epsilon = 0.1$–$0.2$.
    • Optimizers: AdamW with weight decay $10^{-2}$.
    • Gradient clipping: $1.0$ (global norm) to reduce GRPO-induced variance (Shen et al., 10 Apr 2025, Wu et al., 25 May 2025, Pan et al., 29 May 2025).
  • Batch/Hardware: Scaling is typically bottlenecked by long contexts and large image inputs, with micro-batch sizes primarily limited by 16k-token windows and large vision feature maps.
  • Platform-Agnostic Action Spaces: GUI and mobile-agent R1-style pipelines employ unified action-space encodings, supporting cross-device and cross-OS generalization (Luo et al., 14 Apr 2025, Gu et al., 25 Jun 2025).
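
Collected into a single configuration object, the settings above might look as follows; this is a sketch with illustrative field names, and the default values are drawn from the ranges reported in the cited works.

from dataclasses import dataclass

@dataclass
class R1StyleConfig:
    # RL (GRPO) phase
    rl_lr: float = 1e-6               # RL-phase learning rate
    group_size: int = 8               # G rollouts per prompt (typically 4-8)
    kl_beta: float = 0.04             # KL penalty to the reference policy (0.01-0.1)
    clip_eps: float = 0.2             # PPO-style clip range (0.1-0.2)
    # SFT phase
    sft_lr: float = 1e-5              # SFT-phase learning rate
    # Shared optimizer and infrastructure settings
    weight_decay: float = 1e-2        # AdamW weight decay
    max_grad_norm: float = 1.0        # global-norm gradient clipping
    max_seq_len: int = 16_384         # long-context window bounding micro-batch size
    freeze_vision_tower: bool = True  # frozen vision encoder, LoRA on the LM decoder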

6. Empirical Effects, Generalization, and Limitations

R1-style fine-tuning consistently yields state-of-the-art accuracy on reasoning, spatial, and grounded tasks and enables the emergence of multimodal chain-of-thought behaviors. Key empirical effects include improved task accuracy, out-of-distribution generalization, and sample efficiency relative to SFT-only baselines.

A plausible implication is that R1-style finetuning, when initialized from high-quality SFT and coupled with precise, rule-based reward design, can both replicate and exceed the capabilities previously achievable only through massive-scale, human-preference RLHF, and in a largely deterministic and highly reproducible fashion.


