R1-Style Finetuning Process for Multimodal Models

Updated 19 December 2025
  • R1-Style Finetuning Process is a rule-based reinforcement learning method that integrates GRPO for stable optimization in multimodal settings.
  • It uses deterministic, ground-truth rewards to enhance reasoning, structured tool use, and spatial understanding in language, vision-language, and multimodal models.
  • The process combines supervised fine-tuning with RL-based policy updates, yielding improved generalization and sample efficiency.

R1-Style Finetuning Process

R1-style finetuning is a reinforcement learning (RL)-based procedure designed to efficiently elicit advanced reasoning capabilities, structured tool use, and robust generalization in large language, vision-language, and multi-modal models. Core to this paradigm is the use of rule-based, deterministic rewards derived directly from ground-truth annotations or explicit task criteria, coupled with Group Relative Policy Optimization (GRPO)—a variant of proximal policy optimization (PPO) that normalizes policy gradients within a group of sampled rollouts to stabilize credit assignment. R1-style pipelines have become central in scientific advances across multimodal reasoning, spatial understanding, action planning, vision-based manipulation, and auto-formatted stepwise chain-of-thought generation.

1. General Principles and Motivation

R1-style finetuning extends earlier language-model reinforcement learning (as in DeepSeek-R1) to complex multimodal settings. The defining characteristics include:

  • Rule-based reward models: Instead of trained reward networks or human preference models, R1 applies deterministic reward functions directly grounded in task logic (e.g., exact match to ground-truth answer, numeric error tolerance, spatial IoU, etc.).
  • Group Relative Policy Optimization (GRPO): For each data point, multiple candidate outputs are sampled, evaluated, and their rewards normalized into group-wise advantages, which are then used to form clipped policy gradient estimators with a KL-regularization term to a reference policy.
  • Outcome-centric supervision: Rewards are assigned on final outcomes (e.g., task accuracy, format compliance) rather than process steps or tool usage policies, minimizing reward hacking pathways.
  • Curriculum integration: R1-style training typically follows or is interleaved with supervised fine-tuning (SFT) on high-quality seed data to anchor model syntax and feasibility, before RL-based optimization for strategy discovery and reward maximization.

This design aligns with findings that SFT alone can seed valid output structure but is fundamentally limited in sample efficiency and exploration, whereas R1-style RL, particularly when initialized from SFT, can efficiently elicit emergent reasoning and tool-use behaviors that generalize (Wu et al., 25 May 2025, Chen et al., 23 May 2025, DeepSeek-AI et al., 22 Jan 2025, Shen et al., 10 Apr 2025).

2. Formal Algorithmic Structure and Policy Objectives

The backbone of R1-style finetuning is the GRPO algorithm with per-group advantage normalization and PPO-style clipping. The canonical objective is (using multimodal notation):

$$\max_{\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big]$$

For each group of $G$ sampled rollouts $\{y_i\}_{i=1}^{G}$ with rewards $r_i$, the normalized advantage is

$$\hat{A}_i = \frac{r_i - \mathrm{mean}_j\, r_j}{\mathrm{std}_j\, r_j}$$

And the clipped GRPO surrogate is

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{\{y_i\} \sim \pi_{\mathrm{old}}}\!\left[\, \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\!\Big( r_{i,t}(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i \Big) \;-\; \beta\, D_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] \right]$$

where $r_{i,t}(\theta)$ is the token-level likelihood ratio between $\pi_\theta$ and the behavior policy $\pi_{\mathrm{old}}$ that generated the rollouts; a minimal sketch of this computation follows.
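
As a concrete illustration of the two expressions above, the following minimal PyTorch sketch computes group-normalized advantages and the clipped GRPO surrogate for one group of rollouts. Tensor names (token_logprobs, old_logprobs, ref_logprobs, rewards, mask) are illustrative assumptions rather than any particular codebase's API.

import torch

def grpo_surrogate(token_logprobs, old_logprobs, ref_logprobs, rewards, mask,
                   eps=0.2, beta=0.04):
    """Clipped GRPO surrogate for one group of G rollouts (a sketch).

    token_logprobs, old_logprobs, ref_logprobs: (G, T) per-token log-probs under
    the current, behavior (old), and frozen reference policies.
    rewards: (G,) scalar outcome rewards; mask: (G, T) float, 1 for valid tokens.
    """
    # Group-wise advantage normalization: A_i = (r_i - mean_j r_j) / std_j r_j
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv.unsqueeze(1)                                      # (G, 1), broadcast over tokens

    # Token-level likelihood ratio w.r.t. the behavior policy pi_old
    ratio = torch.exp(token_logprobs - old_logprobs)            # (G, T)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)       # (G, T)

    # Length-normalized average over tokens, then over the group
    per_rollout = (surrogate * mask).sum(dim=1) / mask.sum(dim=1)
    policy_term = per_rollout.mean()

    # KL penalty to the reference policy (simple log-ratio estimator)
    kl = ((token_logprobs - ref_logprobs) * mask).sum(dim=1) / mask.sum(dim=1)
    return policy_term - beta * kl.mean()   # maximize; negate for use as a loss

In practice the returned scalar is negated and minimized with AdamW, and π_old is refreshed from π_θ every few updates, mirroring step 6 of the pseudocode below.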

Pseudocode for the main R1-style loop is as follows (Wu et al., 25 May 2025):

Algorithm VTool-R1 Finetuning Loop

Input: pretrained VLM π_ref, toolset T, dataset 𝒟, group size G, hyperparams ε, β, lr α
Initialize π_θ ← π_ref
repeat:
  1. Sample minibatch {[I^n, x^n]} from 𝒟
  2. For each example n:
    a. Sample G rollouts {y_i^n} from π_old:
      i. First pass: y′_i ∼ π_old(·|I^n, x^n)
      ii. If y′_i contains tool call, execute I′_i = T(y′_i, I^n); else I′_i=∅
      iii. Second pass: y_i ∼ π_old(·|I^n⊕I′_i, x^n; T)
    b. Evaluate rewards r_i^n = r_ϕ(I^n, x^n, y_i^n)
  3. Normalize {r_i^n} to get advantages Â_i^n
  4. Compute gradient ∇_θ 𝓙_GRPO(θ)
  5. Update θ ← θ + α·∇_θ 𝓙_GRPO(θ)
  6. Occasionally sync π_old ← π_θ
until convergence

The formulation is architecture-agnostic: language-only models, VLMs, and even dense visual-representation transformers can all be optimized under this formalism, given suitable definitions of states, actions, and rewards (Wu et al., 25 May 2025, Pan et al., 29 May 2025, Masry et al., 13 Aug 2025, Li et al., 23 Jun 2025).

3. Reward Engineering and Environment Design

Crucial to R1-style RL is the engineering of verifiable, deterministic reward signals (a code sketch of representative reward terms follows at the end of this section):

  • Task Accuracy: Binary or scalar, comparing model output to ground truth. E.g., VTool-R1 uses $r_\phi(I, x, y) = 1$ if the final answer matches the reference and $0$ otherwise (Wu et al., 25 May 2025); SMART-R1 uses the official RealismMeta metric for traffic trajectory rollouts (Pei et al., 28 Sep 2025).
  • Format Validation: Many tasks require strict syntax (e.g., JSON, XML tags, <think>/<answer> block delimitation); format rewards are binary or soft and can be composed with the main task-accuracy reward (Shen et al., 10 Apr 2025, Wang et al., 2 Jun 2025, Chen et al., 21 Jul 2025).
  • Numerical Sensitivity: For chart and visual QA, rewards may softly grade near-miss numeric predictions, e.g., $r_{\text{num}}(p, a) = \max(0,\, 1 - \min(\delta, 1))$ with $\delta = |p - a| / (0.05\,|a|)$ for a 5% tolerance (Chen et al., 21 Jul 2025).
  • Semantic Similarity: Some systems quantify answer correctness via sentence-BERT or LLM-based similarity for open-ended or partially structured answers (Wang et al., 2 Jun 2025).
  • Spatial/Consistency Rewards: SVQA-R1 adjusts rewards to penalize failures of spatial symmetry by comparing performance on original and mirror-flipped examples, enhancing spatial grounding (Wang et al., 2 Jun 2025).
  • Length or Repetition Regularization: Explicit reward terms encourage chain-of-thought brevity and logical completeness and discourage token repetition (Xu et al., 29 Sep 2025, Tian et al., 16 Jun 2025).
  • Tool Use and Multi-step Environments: For tool-augmented models (VTool-R1, Ego-R1), outputs can contain callable code blocks whose effects are post-processed and fed back via the environment for multi-pass interaction (Wu et al., 25 May 2025, Tian et al., 16 Jun 2025).

Outcome-based, rather than process-based, reward is typically emphasized, as process-shaping rewards tend to be brittle or to encourage reward hacking (Wu et al., 25 May 2025, Shen et al., 10 Apr 2025).
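
As a concrete instance, the sketch below composes the exact-match, format, and 5%-tolerance numeric terms described above into a single verifiable reward. The <think>/<answer> tag pattern and the 0.1 format-bonus weight are illustrative assumptions rather than the exact choices of any cited system.

import re

def accuracy_reward(pred: str, gold: str) -> float:
    # Binary exact match on the final answer string
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def format_reward(output: str) -> float:
    # Binary check for <think>...</think><answer>...</answer> delimitation
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, output.strip(), flags=re.DOTALL) else 0.0

def numeric_reward(pred: float, gold: float, tol: float = 0.05) -> float:
    # r = max(0, 1 - min(delta, 1)) with delta = |p - a| / (tol * |a|):
    # credit decays linearly from 1 at an exact match to 0 once the relative
    # error reaches tol (5% by default).
    if gold == 0.0:
        return 1.0 if pred == 0.0 else 0.0
    delta = abs(pred - gold) / (tol * abs(gold))
    return max(0.0, 1.0 - min(delta, 1.0))

def total_reward(output: str, answer: str, gold: str) -> float:
    # Outcome-centric composition: task accuracy dominates, format adds a small bonus
    return accuracy_reward(answer, gold) + 0.1 * format_reward(output)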

4. Multi-stage Training Pipelines and Curricula

R1-style pipelines are almost always multi-phase:

  1. Supervised Fine-Tuning (SFT) Stage
    • Trains the model to obey stepwise "chain-of-thought" syntax, form the output structure, and ground generation in the data distribution via cross-entropy on seed datasets.
    • Typical datasets range from roughly 5k examples (DeepSeek-R1 cold-start) to millions of triplets (BigCharts, SMART-R1).
    • After SFT, models produce syntactically valid, interpretable outputs but lack robustness and cannot recover from novel error states (DeepSeek-AI et al., 22 Jan 2025, Wu et al., 25 May 2025, Masry et al., 13 Aug 2025, Li et al., 23 Jun 2025).
  2. R1-style RL (GRPO) Stage
    • The policy is initialized from the SFT checkpoint and updated to maximize outcome-based rewards.
    • RL is run for 1–3 epochs over held-out data, using the current policy to sample groups of outputs, evaluate verifiable rewards, normalize them, and apply the GRPO update with a KL penalty.
    • For certain domains (e.g., SMART-R1 for multi-agent traffic), an alternating SFT→RFT→SFT schedule is empirically optimal, with the post-RFT SFT restoring fidelity while preserving the exploration benefits of RL (Pei et al., 28 Sep 2025).
  3. Evaluation
    • Decoding is performed at temperature 1.0, typically with top-k or nucleus sampling; tool calls are executed "in the loop", and chain-of-thought structure is enforced by the prompt and output parser.
    • Task accuracy, out-of-distribution (OOD) generalization, and reasoning-chain length are all tracked as key indicators.

This curriculum anchors syntactic control, enables efficient RL convergence, and preserves overall capabilities and stability (DeepSeek-AI et al., 22 Jan 2025, Shen et al., 10 Apr 2025, Zhan et al., 23 Mar 2025); a schematic driver for the two-stage training schedule is sketched below.
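
The following sketch outlines the SFT→RL schedule as a simple driver loop, assuming hypothetical sft_step and grpo_step callables for the per-batch cross-entropy and GRPO updates; the learning rates and weight decay mirror the values reported in Sections 4–5, but the function itself is illustrative.

import torch

def train_r1_style(model, ref_model, sft_loader, rl_prompt_loader, reward_fn,
                   sft_step, grpo_step, sft_epochs=1, rl_epochs=2, group_size=8):
    """Two-stage curriculum: SFT anchors syntax, then GRPO maximizes verifiable rewards.

    sft_step and grpo_step are hypothetical callables: the first applies one
    cross-entropy update on a batch of (prompt, chain-of-thought, answer) seed
    data; the second samples `group_size` rollouts per prompt, scores them with
    reward_fn, and applies one clipped GRPO update against ref_model.
    """
    # Stage 1: supervised fine-tuning on seed chain-of-thought data
    sft_opt = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-2)
    for _ in range(sft_epochs):
        for batch in sft_loader:
            sft_step(model, batch, sft_opt)

    # Stage 2: GRPO with outcome-based rewards, initialized from the SFT checkpoint
    rl_opt = torch.optim.AdamW(model.parameters(), lr=1e-6, weight_decay=1e-2)
    for _ in range(rl_epochs):
        for prompts in rl_prompt_loader:
            grpo_step(model, ref_model, prompts, reward_fn,
                      group_size=group_size, optimizer=rl_opt)
    return model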

5. Architectural, Implementation, and Hyperparameter Best Practices

R1-style pipelines require modest or no architectural modification beyond standard VLM or LLM backbones:

  • Frozen Vision Towers and Lightweight Adapters: For stability, vision towers are often frozen, with LoRA modules attached to the language decoders (Shen et al., 10 Apr 2025).
  • No Critic Networks: GRPO is strictly actor-only; value baselines are estimated intra-group by mean-reward normalization (Wu et al., 25 May 2025, Masry et al., 13 Aug 2025, Chen et al., 21 Jul 2025).
  • Ablation-Proven Hyperparameters (collected in the configuration sketch after this list):
    • Learning rates: $10^{-6}$ (RL phase), $10^{-5}$ (SFT phase).
    • Group sizes: typically $G = 4$ to $8$ per prompt.
    • KL penalty ($\beta$): $0.01$–$0.1$ (tuned by grid search or inherited from prior art).
    • PPO clip range: $\epsilon = 0.1$–$0.2$.
    • Optimizers: AdamW with weight decay $10^{-2}$.
    • Gradient clipping: $1.0$ (global norm) to reduce GRPO-induced variance (Shen et al., 10 Apr 2025, Wu et al., 25 May 2025, Pan et al., 29 May 2025).
  • Batch/Hardware: Scaling is typically bottlenecked by long contexts and large image inputs, with micro-batch sizes primarily limited by 16k-token windows and large vision feature maps.
  • Platform-Agnostic Action Spaces: GUI and mobile-agent R1-style pipelines employ unified action-space encodings, supporting cross-device and cross-OS generalization (Luo et al., 14 Apr 2025, Gu et al., 25 Jun 2025).
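
Collected into a single configuration object, the settings above might look as follows; this is a sketch with illustrative field names, and the default values are drawn from the ranges reported in the cited works.

from dataclasses import dataclass

@dataclass
class R1StyleConfig:
    # RL (GRPO) phase
    rl_lr: float = 1e-6               # RL-phase learning rate
    group_size: int = 8               # G rollouts per prompt (typically 4-8)
    kl_beta: float = 0.04             # KL penalty to the reference policy (0.01-0.1)
    clip_eps: float = 0.2             # PPO-style clip range (0.1-0.2)
    # SFT phase
    sft_lr: float = 1e-5              # SFT-phase learning rate
    # Shared optimizer and infrastructure settings
    weight_decay: float = 1e-2        # AdamW weight decay
    max_grad_norm: float = 1.0        # global-norm gradient clipping
    max_seq_len: int = 16_384         # long-context window bounding micro-batch size
    freeze_vision_tower: bool = True  # frozen vision encoder, LoRA on the LM decoder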

6. Empirical Effects, Generalization, and Limitations

R1-style fine-tuning consistently yields state-of-the-art accuracy on reasoning, spatial, and grounded tasks and enables the emergence of multimodal chain-of-thought behaviors. Key empirical effects include improved task accuracy, out-of-distribution generalization, and sample efficiency relative to SFT-only baselines.

A plausible implication is that R1-style finetuning, when initialized from high-quality SFT and coupled with precise, rule-based reward design, can both replicate and exceed the capabilities previously achievable only through massive-scale, human-preference RLHF, and in a largely deterministic and highly reproducible fashion.


