
Reinforcement Finetuning (RFT) Overview

Updated 15 August 2025
  • RFT is a reinforcement learning strategy that fine-tunes pretrained models by maximizing reward signals instead of minimizing cross-entropy loss.
  • In practice, RFT relies on policy-gradient methods such as PPO, but it suffers from a vanishing gradient problem when the model's outputs exhibit low reward variance.
  • A two-stage pipeline that applies a brief supervised "preconditioning" phase before RFT restores reward diversity in the model's outputs and enables effective RL-based alignment.

Reinforcement Fine-Tuning (RFT) refers to the adaptation of pretrained models, particularly LLMs and vision-language models (VLMs), via reinforcement learning (RL) techniques that optimize model outputs against a programmatically defined reward signal. In contrast to supervised fine-tuning (SFT), which minimizes cross-entropy loss between generated and ground-truth outputs, RFT seeks to maximize a reward function that may reflect human preferences, task performance, factual correctness, or other alignment objectives. This approach has become central to model alignment pipelines for language, vision, and multimodal tasks, drawing widespread attention for both its empirical impact and its fundamental optimization challenges.

1. Theoretical Foundations and RFT Objective

RFT formulates the model adaptation problem as a policy-gradient RL task. Given a pretrained model parameterized by $\theta$, an input $x$, and an output $y$, the optimization objective is to maximize the expected reward

$$V(\theta) = \mathbb{E}_{x \sim D,\; y \sim p_{\theta}(\cdot \mid x)}\big[r(x, y)\big],$$

where $r(x, y)$ is a scalar reward function. The policy-gradient method computes

$$\nabla_\theta V(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\big[r(x, y)\, \nabla_\theta \log p_\theta(y \mid x)\big],$$

often estimated via sampling. In practice, RFT typically uses Proximal Policy Optimization (PPO) or related algorithms (e.g., Group Relative Policy Optimization, GRPO), employing surrogate objectives, clipped importance weights, and value baselines to stabilize training (Razin et al., 2023).
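
As a concrete illustration of the sampled gradient estimate, here is a minimal sketch (a toy example of the score-function/REINFORCE estimator, not drawn from the cited work) for a small categorical policy; the parameter values and reward vector are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy categorical policy p_theta(y | x) = softmax(theta) over 4 possible outputs,
# with an illustrative reward vector r(x, y).
theta = np.array([0.5, -0.2, 0.1, 0.0])
rewards = np.array([1.0, 0.0, 0.0, 0.0])

def policy_gradient_estimate(theta, rewards, n_samples=10_000):
    """Monte Carlo estimate of E_y[ r(x, y) * grad_theta log p_theta(y | x) ]."""
    p = softmax(theta)
    ys = rng.choice(len(theta), size=n_samples, p=p)
    grad = np.zeros_like(theta)
    for y in ys:
        score = -p.copy()        # grad_theta log softmax(theta)[y] = one_hot(y) - p
        score[y] += 1.0
        grad += rewards[y] * score
    return grad / n_samples

print(policy_gradient_estimate(theta, rewards))
```

In an LLM setting, $y$ is a sampled sequence and per-token log-probabilities replace the categorical log-softmax, but the estimator takes the same form.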

For example, PPO uses the clipped objective

$$\mathcal{L}_\text{policy}(\theta) = -\,\mathbb{E}\Big[\min\Big(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)}\, \hat{A},\; \mathrm{clip}\Big(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)},\, 1-\epsilon,\, 1+\epsilon\Big)\, \hat{A}\Big)\Big],$$

where $\hat{A}$ is an advantage estimate (commonly obtained via Generalized Advantage Estimation), and a KL penalty $\beta\, \mathrm{KL}\big[\pi_\theta(\cdot \mid s)\,\|\,\pi_\text{ref}(\cdot \mid s)\big]$ keeps the updated policy close to a reference.
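
In code, the clipped surrogate and reference-policy penalty reduce to a few array operations. The sketch below is illustrative only: the function name, array shapes, and hyperparameter values (`eps`, `beta`) are assumptions, and the KL term is approximated from samples rather than computed exactly.

```python
import numpy as np

def ppo_loss(logp_new, logp_old, advantages, logp_ref=None, eps=0.2, beta=0.02):
    """Clipped PPO surrogate with an optional penalty toward a reference policy.

    Inputs are 1-D arrays of per-token (or per-sample) log-probabilities and
    advantage estimates; shapes and hyperparameters are illustrative.
    """
    ratio = np.exp(logp_new - logp_old)                   # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    loss = -np.mean(np.minimum(unclipped, clipped))
    if logp_ref is not None:
        # Simple sample-based penalty for drifting away from the reference policy.
        loss += beta * np.mean(logp_new - logp_ref)
    return loss

# Illustrative usage with made-up values.
loss = ppo_loss(
    logp_new=np.array([-1.1, -0.7, -2.3]),
    logp_old=np.array([-1.0, -0.9, -2.0]),
    advantages=np.array([0.5, -0.2, 1.0]),
    logp_ref=np.array([-1.2, -0.8, -2.1]),
)
print(loss)
```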

2. Vanishing Gradient Phenomenon in RFT

A critical discovery is that in many settings the RFT gradient norm vanishes unless the model's output distribution exhibits sufficient diversity in reward values. The magnitude of the gradient for a given input $x$ is provably upper-bounded in terms of the reward standard deviation:

$$\big\|\nabla_\theta V(x; \theta)\big\| \;\leq\; 6\, L_\text{out}\, \gamma(x; \theta)\, \big[\sigma_{p_\theta(\cdot \mid x)}[r(x, \cdot)]\big]^{2/3},$$

where $L_\text{out}$ is the output sequence length, $\gamma(x; \theta)$ bounds the logit Jacobian norm, and $\sigma_{p_\theta(\cdot \mid x)}[r(x, \cdot)]$ is the standard deviation of the reward under the model's output distribution (Razin et al., 2023). If the most probable outputs $y$ receive almost identical rewards, even when the mean reward is sub-optimal, the resulting expected gradient is negligible and optimization stalls.
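
To make the bound concrete, the toy sketch below (a constructed illustration, not the paper's experimental setup) computes the exact policy gradient for a 4-way categorical policy that is increasingly peaked on two outputs sharing the same mediocre reward.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def exact_policy_gradient(theta, rewards):
    """Exact E_y[ r(y) * grad_theta log p_theta(y) ] for a categorical policy."""
    p = softmax(theta)
    score = np.eye(len(theta)) - p            # row y holds grad_theta log p_theta(y)
    return (p * rewards) @ score

# Output 2 is optimal (reward 1.0), but the policy concentrates on outputs 0 and 1,
# which share the same mediocre reward of 0.5.
rewards = np.array([0.5, 0.5, 1.0, 0.0])

for peak in [1.0, 3.0, 6.0]:
    theta = np.array([peak, peak, -peak, -peak])
    p = softmax(theta)
    mean_r = p @ rewards
    std_r = np.sqrt(p @ (rewards - mean_r) ** 2)
    grad_norm = np.linalg.norm(exact_policy_gradient(theta, rewards))
    print(f"peak={peak:3.1f}  mean_reward={mean_r:.3f}  "
          f"reward_std={std_r:.4f}  ||grad||={grad_norm:.2e}")
```

As the logits sharpen, the mean reward stays stuck at 0.5 (half the optimum), while the reward standard deviation and the gradient norm collapse together, consistent with the bound above.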

Experimental evidence on language generation (e.g., NarrativeQA, ToTTo) confirms that many inputs from pretrained models produce outputs with low reward standard deviation, and these "stuck" inputs show minimal improvement under RFT. In contrast, SFT is more effective in such cases because it provides direct supervision and yields nonvanishing gradient steps regardless of the output distribution's initial entropy.

3. Empirical Evidence and Controlled Experiments

Comprehensive experiments on the GRUE language generation benchmark and classification tasks (MNIST, CIFAR10, STS-B) demonstrate the prevalence and detrimental impact of vanishing gradients:

  • GRUE Tasks: Inputs with low reward standard deviation, even with low mean reward, demonstrate almost no reward improvement after RFT, while SFT produces clear and uniform gains.
  • Controlled-classification: With a finite set of possible outputs, exact computation of gradients is possible. RFT fails to improve for inputs lacking reward variability, whereas SFT achieves maximum accuracy efficiently.
  • Optimization time separation: Theoretically, RFT requires $\Omega(1/\sigma^2)$ optimization steps to converge (where $\sigma$ is the reward standard deviation at the pretrained initialization), compared with time logarithmic in $1/\sigma$ for SFT.

These results indicate that the optimization bottleneck is not merely due to sample inefficiency but is structurally inherent in the RL-based update.
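
The separation can be illustrated qualitatively with exact gradients on a toy categorical policy. The sketch below (a constructed example under simplified assumptions, not the benchmark setup) runs RFT-style and SFT-style updates from the same peaked initialization.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Four possible outputs; output 0 is the labeled/correct one and the only one
# that earns reward. The initial policy is peaked on a wrong output, so reward
# variance under the policy is near zero.
rewards = np.array([1.0, 0.0, 0.0, 0.0])
target = 0
theta_init = np.array([-8.0, 8.0, 0.0, 0.0])

def rft_grad(theta):
    """Exact gradient of the expected reward (policy gradient)."""
    p = softmax(theta)
    return (p * rewards) @ (np.eye(len(theta)) - p)

def sft_grad(theta):
    """Gradient of the log-likelihood of the labeled output."""
    p = softmax(theta)
    return np.eye(len(theta))[target] - p

theta_rft, theta_sft = theta_init.copy(), theta_init.copy()
for _ in range(500):
    theta_rft += 1.0 * rft_grad(theta_rft)   # gradient ascent on expected reward
    theta_sft += 1.0 * sft_grad(theta_sft)   # gradient ascent on log-likelihood

print("expected reward after RFT:", softmax(theta_rft) @ rewards)   # barely moves
print("expected reward after SFT:", softmax(theta_sft) @ rewards)   # near 1.0
```

With the same learning rate and step budget, the supervised update escapes the degenerate initialization almost immediately, while the policy-gradient update is throttled by the near-zero reward variance.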

4. Pipeline Solutions: Supervised Preconditioning and Remedies

Attempts to boost gradient signal with ad-hoc methods—higher learning rates, softmax temperature scaling, and entropy regularization—are shown to be largely ineffective or destabilizing for inputs with low reward variance. The dominant, effective remedy is a supervised fine-tuning (SFT) "preconditioning" phase:

  • SFT Phase: Brief application of cross-entropy supervised training on labeled data shifts the model's distribution, increasing reward standard deviation on many inputs and "unlocking" the RL gradient for subsequent RFT steps.
  • Minimal Label Demand: Experiments show that as little as 1% of the data and a small number of optimization steps suffice for this effect—a full-scale, expensive SFT phase is unnecessary.
  • Combined Pipeline: The recommended strategy is a two-stage approach preceded by a diagnostic pass: identify low-variance inputs, apply partial SFT to move their output distributions out of degenerate regions, then run RFT for alignment.

This approach allows practitioners to efficiently bypass the vanishing gradient barrier and maximize the impact of RL-based alignment.
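
One way to realize the diagnostic step is to estimate per-input reward standard deviation from a handful of rollouts and route near-zero-variance inputs to the brief SFT pass first. The sketch below is a minimal illustration: `sample_fn` and `reward_fn` are assumed placeholder interfaces, not part of any specific library, and the threshold is arbitrary.

```python
import numpy as np

def reward_std_per_input(inputs, sample_fn, reward_fn, k=16):
    """Estimate the reward standard deviation of each input's output distribution.

    Assumed interfaces (placeholders, not a real API):
      sample_fn(x, k) -> list of k sampled outputs for input x
      reward_fn(x, y) -> scalar reward for output y on input x
    """
    stds = {}
    for x in inputs:
        rs = np.array([reward_fn(x, y) for y in sample_fn(x, k)])
        stds[x] = rs.std()
    return stds

def split_for_preconditioning(stds, threshold=1e-3):
    """Route near-zero-variance inputs to a brief SFT pass before running RFT."""
    sft_first = [x for x, s in stds.items() if s < threshold]
    rft_ready = [x for x, s in stds.items() if s >= threshold]
    return sft_first, rft_ready
```

Only the low-variance subset needs labels, which keeps the preconditioning phase lightweight.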

5. Practical Deployment Considerations

The vanishing gradient phenomenon has important implications for any real-world RFT deployment:

  • Input Diagnostics: Implement reward variance estimation during the initial rollout sampling; identify and track inputs for which the variance is close to zero.
  • Computational Savings: By using minimal partial SFT, one can efficiently shift problematic inputs with negligible extra data labeling or compute.
  • Optimization Targeting: Focus RL updates on examples where a reward signal is present; skip or flag examples whose gradient is known to vanish, or adaptively schedule further SFT cycles (see the sketch after this list).
  • Scalability: The two-stage pipeline is compatible with massive pre-trained models and scales to industrial data volumes due to the low cost of preconditioning.
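
For the optimization-targeting item above, one concrete (assumed, GRPO-style) mechanism is to normalize each prompt's group of sampled rewards and mask out prompts whose rewards are all identical, since their policy-gradient contribution vanishes. The sketch below is illustrative, with made-up reward values.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Normalize each prompt's group of sampled rewards and flag usable prompts.

    rewards: array of shape (num_prompts, samples_per_prompt).
    Returns (advantages, mask); mask is False for prompts whose reward variance
    is effectively zero, i.e. whose policy-gradient contribution would vanish.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    advantages = (rewards - mean) / (std + eps)
    mask = std.squeeze(-1) > eps
    return advantages, mask

rewards = np.array([
    [1.0, 0.0, 1.0, 0.0],   # mixed rewards: carries gradient signal
    [1.0, 1.0, 1.0, 1.0],   # identical rewards: no signal, candidate for SFT
])
adv, mask = group_advantages(rewards)
print(adv)
print("prompts with usable signal:", mask)
```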

The table below summarizes the two pipeline stages:

| Phase | Objective | Key property |
| --- | --- | --- |
| Preconditioning (SFT) | Minimize cross-entropy on labeled data | Increases reward std. dev., unlocks RL signal |
| Main RFT | Maximize expected reward via policy gradient | Effective only if reward std. dev. is nonzero |

6. Broader Implications and Future Research

The challenge of vanishing gradients due to low reward variance is fundamental to RFT and RLHF approaches for LLM alignment. Because it is intrinsic to the reward-maximization formulation and the softmax parameterization, analogous phenomena are expected in other RL-based structured prediction settings whenever the model's policy is highly peaked and the reward assignment is coarse. This points toward:

  • Diagnosis and monitoring: Systematic reward variance analysis during RL for all alignment pipelines.
  • Algorithm development: Exploration of alternative objectives, e.g., reward standard deviation-boosting regularizers or off-policy approaches less sensitive to initial reward collapse.
  • Generalization: The insight that initial output diversity is crucial for RL progress applies to other domains, including vision-language models and multimodal alignment.
  • Efficient pipeline design: The result that SFT preconditioning can be lightweight stands to reduce resource requirements and accelerate development cycles for large model alignment.

In summary, the vanishing gradient problem in RFT highlights a precise link between the model's output diversity, reward landscape, and optimization efficacy. The recommended practical approach—diagnosis, partial SFT, and then RFT—forms a robust pipeline for scaling reinforcement learning–based alignment in neural sequence models (Razin et al., 2023).

References

1. Razin et al. (2023). Vanishing Gradients in Reinforcement Finetuning of Language Models.