
Reinforcement Finetuning (RFT) Overview

Updated 15 August 2025
  • RFT is a reinforcement learning strategy that fine-tunes pretrained models by maximizing reward signals instead of minimizing cross-entropy loss.
  • RFT typically relies on policy-gradient methods such as PPO, but these updates can suffer from vanishing gradients when the reward has low variance under the model's output distribution.
  • A two-stage pipeline that applies brief supervised preconditioning before RFT restores reward variance on problematic inputs and makes the subsequent RL-based alignment effective.

Reinforcement Fine-Tuning (RFT) refers to the adaptation of pretrained models, particularly LLMs and vision-language models (VLMs), via reinforcement learning (RL) techniques that optimize model outputs against a programmatically defined reward signal. In contrast to supervised fine-tuning (SFT), which minimizes cross-entropy loss between generated and ground-truth outputs, RFT seeks to maximize a reward function that may reflect human preferences, task performance, factual correctness, or other alignment objectives. This approach has become central to model alignment pipelines for language, vision, and multimodal tasks, drawing widespread attention for both its empirical impact and its fundamental optimization challenges.

1. Theoretical Foundations and RFT Objective

RFT formulates the model adaptation problem as a policy-gradient RL task. Given a pretrained model parameterized by $\theta$, an input $x$, and an output $y$, the optimization objective is to maximize the expected reward
$$V(\theta) = \mathbb{E}_{x \sim D,\, y \sim p_{\theta}(\cdot|x)}\big[r(x, y)\big],$$
where $r(x,y)$ is a scalar reward function. The policy-gradient method computes

$$\nabla_\theta V(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot|x)}\big[r(x, y)\, \nabla_\theta \log p_\theta(y|x)\big],$$

often estimated via sampling. In practice, RFT typically uses Proximal Policy Optimization (PPO) or related algorithms (e.g., Group Relative Policy Optimization, GRPO), employing surrogate objectives, clipped importance weights, and value baselines to stabilize training (Razin et al., 2023).
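
As a concrete illustration, the following minimal sketch (not taken from the cited work) shows the sampled score-function estimate of $\nabla_\theta V$ for a toy single-step categorical policy in PyTorch; the vocabulary size and the even-token reward are arbitrary placeholders.

```python
import torch

vocab_size, batch_size = 50, 32
logits = torch.randn(batch_size, vocab_size, requires_grad=True)  # toy policy head outputs

def reward_fn(y):
    # Placeholder programmatic reward: 1.0 when the sampled token id is even.
    return (y % 2 == 0).float()

dist = torch.distributions.Categorical(logits=logits)
y = dist.sample()                     # y ~ p_theta(.|x)
log_p = dist.log_prob(y)              # log p_theta(y|x)
r = reward_fn(y)                      # scalar reward per sample

# Monte Carlo estimate of -V(theta); its gradient equals -E[r * grad log p_theta(y|x)]
loss = -(r * log_p).mean()
loss.backward()                       # logits.grad now holds the sampled gradient estimate
```

In a full LLM setting the same estimate is accumulated over every token of a sampled sequence, typically with a baseline subtracted from the reward to reduce variance.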

For example, PPO uses the clipped objective
$$\mathcal{L}_\text{policy}(\theta) = -\mathbb{E}\Big[\min\Big(\tfrac{\pi_\theta(a|s)}{\pi_{\theta_\text{old}}(a|s)}\, \hat{A},\ \mathrm{clip}\Big(\tfrac{\pi_\theta(a|s)}{\pi_{\theta_\text{old}}(a|s)},\, 1-\epsilon,\, 1+\epsilon\Big)\hat{A}\Big)\Big],$$
where $\hat{A}$ is an advantage estimate (commonly computed via Generalized Advantage Estimation), and a KL penalty $\beta\, \mathrm{KL}\big[\pi_\theta(\cdot|s)\,\|\,\pi_\text{ref}(\cdot|s)\big]$ keeps the updated policy close to a reference policy.
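
The clipped surrogate and KL penalty translate almost directly into code. The sketch below is a simplified, illustrative PyTorch implementation; the function name, tensor shapes, and the sample-based KL approximation are assumptions rather than the exact recipe of any specific library.

```python
import torch

def ppo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.01):
    """Clipped PPO surrogate plus a KL penalty toward a reference policy.

    All inputs are per-token log-probabilities of the sampled actions
    (shape [batch]); `advantages` are precomputed, e.g. via GAE.
    """
    ratio = torch.exp(logp_new - logp_old)                     # importance weight
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    policy_loss = -torch.mean(torch.min(unclipped, clipped))   # clipped surrogate

    # Sample-based estimate of KL(pi_theta || pi_ref) using only sampled actions,
    # a common practical approximation.
    kl = torch.mean(logp_new - logp_ref)
    return policy_loss + beta * kl
```

In practice this term is combined with a value-function loss and often an entropy bonus, and advantages are usually normalized per batch.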

2. Vanishing Gradient Phenomenon in RFT

A critical finding is that, in many settings, the RFT gradient norm vanishes unless the model's output distribution exhibits sufficient diversity in reward values. The magnitude of the gradient for a given input $x$ is provably upper-bounded by the reward standard deviation:
$$\|\nabla_\theta V(x; \theta)\| \leq 6\, L_\text{out}\, \gamma(x; \theta)\, \big[\sigma_{p_\theta(\cdot|x)}[r(x,\cdot)]\big]^{2/3},$$
where $L_\text{out}$ is the output sequence length, $\gamma(x; \theta)$ bounds the logit Jacobian norm, and $\sigma_{p_\theta(\cdot|x)}[r(x,\cdot)]$ is the standard deviation of the reward under the model's output distribution (Razin et al., 2023). If the most probable outputs $y$ receive nearly identical rewards, even when the mean reward is sub-optimal, the expected gradient is negligible and optimization stalls.

Experimental evidence on language generation (e.g., NarrativeQA, ToTTo) confirms that many inputs from pretrained models produce outputs with low reward standard deviation, and these "stuck" inputs show minimal improvement under RFT. In contrast, SFT is more effective in such cases because it provides direct supervision and yields nonvanishing gradient steps regardless of the output distribution's initial entropy.
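
A practical consequence is that each input's reward standard deviation can be estimated directly from rollouts before (or during) RFT. The sketch below is a hypothetical diagnostic: `sample_rewards` is a stand-in for sampling outputs from the model and scoring them with the reward function, and the threshold is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_rewards(input_id, n_samples=64):
    # Stand-in for: sample n outputs y ~ p_theta(.|x) and score them with r(x, y).
    # Here, even-numbered toy inputs yield near-constant rewards ("stuck" inputs).
    if input_id % 2 == 0:
        return np.full(n_samples, 0.3) + rng.normal(0, 1e-4, n_samples)
    return rng.uniform(0.0, 1.0, n_samples)

threshold = 0.01
for input_id in range(6):
    rewards = sample_rewards(input_id)
    sigma = rewards.std()
    flag = "STUCK (expect vanishing RFT gradient)" if sigma < threshold else "ok"
    print(f"input {input_id}: reward std = {sigma:.4f}  -> {flag}")
```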

3. Empirical Evidence and Controlled Experiments

Comprehensive experiments on the GRUE language generation benchmark and classification tasks (MNIST, CIFAR10, STS-B) demonstrate the prevalence and detrimental impact of vanishing gradients:

  • GRUE Tasks: Inputs with low reward standard deviation, even with low mean reward, demonstrate almost no reward improvement after RFT, while SFT produces clear and uniform gains.
  • Controlled-classification: With a finite set of possible outputs, exact computation of gradients is possible. RFT fails to improve for inputs lacking reward variability, whereas SFT achieves maximum accuracy efficiently.
  • Optimization-time separation: Theoretically, RFT requires on the order of $\Omega(1/\sigma^2)$ steps to converge, where $\sigma$ is the reward standard deviation under the pretrained model, compared to time logarithmic in $1/\sigma$ for SFT (a minimal numerical illustration of this separation is sketched below).
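
To make the separation concrete, the following toy computation (an illustration constructed here, not the paper's experimental setup) compares the exact RFT and SFT gradients for a single input with a softmax policy over ten outputs, initialized so that nearly all probability mass sits on one incorrect output.

```python
import numpy as np

n_outputs, correct = 10, 0
logits = np.zeros(n_outputs)
logits[3] = 12.0                                   # peaked on a wrong output

p = np.exp(logits - logits.max()); p /= p.sum()    # softmax policy p_theta(.|x)
one_hot = np.eye(n_outputs)[correct]               # reward: 1 for correct output, else 0

# Exact RFT gradient of E[r] = p[correct] w.r.t. the logits:
# dE[r]/dz_k = p[correct] * (1[k == correct] - p[k])
rft_grad = p[correct] * (one_hot - p)

# Exact SFT gradient of -log p[correct] w.r.t. the logits: p - one_hot
sft_grad = p - one_hot

print("reward std     :", np.sqrt(p[correct] * (1 - p[correct])))  # ~2.5e-3
print("|RFT gradient| :", np.linalg.norm(rft_grad))                # ~1e-5, vanishing
print("|SFT gradient| :", np.linalg.norm(sft_grad))                # ~1.4, healthy
```

Lowering the peak logit, i.e. increasing the reward variance, restores a usable RFT gradient, which is exactly what the supervised preconditioning discussed below achieves.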

These results indicate that the optimization bottleneck is not merely due to sample inefficiency but is structurally inherent in the RL-based update.

4. Pipeline Solutions: Supervised Preconditioning and Remedies

Attempts to boost gradient signal with ad-hoc methods—higher learning rates, softmax temperature scaling, and entropy regularization—are shown to be largely ineffective or destabilizing for inputs with low reward variance. The dominant, effective remedy is a supervised fine-tuning (SFT) "preconditioning" phase:

  • SFT Phase: Brief application of cross-entropy supervised training on labeled data shifts the model's distribution, increasing reward standard deviation on many inputs and "unlocking" the RL gradient for subsequent RFT steps.
  • Minimal Label Demand: Experiments show that as little as 1% of the data and a small number of optimization steps suffice for this effect—a full-scale, expensive SFT phase is unnecessary.
  • Combined Pipeline: The recommended strategy is a two-stage approach: a diagnostic phase to identify low-variance inputs, partial SFT to move the output distributions out of degenerate regions, then RFT for alignment.

This approach allows practitioners to efficiently bypass the vanishing gradient barrier and maximize the impact of RL-based alignment.
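
The following end-to-end sketch illustrates the pipeline just described on a toy softmax "model" with a 0/1 reward for matching a labeled output; all quantities (number of inputs, reward, thresholds, learning rates) are illustrative assumptions rather than recommended settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_outputs = 8, 10
labels = rng.integers(0, n_outputs, size=n_inputs)    # stand-in ground-truth outputs

# Toy per-input policies: even inputs start near-uniform (healthy reward variance),
# odd inputs start sharply peaked on a wrong output (near-zero reward variance).
logits = np.zeros((n_inputs, n_outputs))
for i in range(n_inputs):
    if i % 2 == 1:
        logits[i, (labels[i] + 1) % n_outputs] = 12.0

def probs(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stage 0 (diagnosis): with a 0/1 reward for matching the label, the reward is
# Bernoulli(q) with q = p[label], so its standard deviation is sqrt(q * (1 - q)).
p = probs(logits)
q = p[np.arange(n_inputs), labels]
stuck = np.flatnonzero(np.sqrt(q * (1 - q)) < 1e-2)

# Stage 1 (partial SFT): a few cross-entropy steps on the stuck inputs only.
for _ in range(10):
    p = probs(logits)
    for i in stuck:
        grad = p[i].copy(); grad[labels[i]] -= 1.0     # d(-log p[y*]) / d logits
        logits[i] -= 2.0 * grad

# Stage 2 (RFT): sampled policy-gradient (REINFORCE) updates on all inputs.
for _ in range(300):
    p = probs(logits)
    for i in range(n_inputs):
        y = rng.choice(n_outputs, p=p[i])              # y ~ p_theta(.|x_i)
        r = 1.0 if y == labels[i] else 0.0             # programmatic 0/1 reward
        grad_logp = -p[i]; grad_logp[y] += 1.0         # d log p[y] / d logits
        logits[i] += 1.0 * r * grad_logp               # ascent on expected reward

print("stuck inputs found:", stuck.tolist())
print("final accuracy    :", (probs(logits).argmax(-1) == labels).mean())
```

Without Stage 1, the peaked inputs would almost never sample the rewarded output and would remain effectively frozen under the RFT updates.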

5. Practical Deployment Considerations

The vanishing gradient phenomenon carries important practical considerations for any real-world RFT deployment:

  • Input Diagnostics: Implement reward variance estimation during the initial rollout sampling; identify and track inputs for which the variance is close to zero.
  • Computational Savings: By using minimal partial SFT, one can efficiently shift problematic inputs with negligible extra data labeling or compute.
  • Optimization Targeting: Focus RL updates on examples where a reward signal is present; consider skipping or flagging examples whose gradients are known to vanish, or adaptively schedule further SFT cycles (see the sketch after this list).
  • Scalability: The two-stage pipeline is compatible with massive pre-trained models and scales to industrial data volumes due to the low cost of preconditioning.
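
One way to operationalize the diagnostics and targeting points above is to mask out, within each batch, the prompts whose sampled rewards have near-zero standard deviation, in the spirit of group-relative baselines such as GRPO. The function below is an illustrative sketch with assumed tensor shapes, not the API of any particular framework.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-4):
    """rewards: [n_prompts, n_samples] rewards of sampled completions per prompt.

    Returns group-normalized advantages and a keep-mask that zeroes out prompts
    whose reward std is (near) zero, since their policy-gradient signal vanishes.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    keep = (std > eps).float()                   # 1 for prompts with reward signal
    adv = (rewards - mean) / (std + eps)         # group-relative advantage
    return adv * keep, keep.squeeze(1)

rewards = torch.tensor([[0.3, 0.3, 0.3, 0.3],    # "stuck" prompt: zero variance
                        [0.0, 1.0, 0.0, 1.0]])   # informative prompt
adv, keep = group_advantages(rewards)
print(keep)   # tensor([0., 1.]) -> the first prompt contributes no RL update
print(adv)
```

Prompts that are masked out can be routed to a queue for additional SFT preconditioning rather than silently dropped.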

The table below summarizes the two pipeline stages and their key properties:

Phase                  | Objective                                     | Key property
Preconditioning (SFT)  | Minimize cross-entropy on labeled data        | Increases reward std. dev., unlocking the RL signal
Main RFT               | Maximize expected reward via policy gradient  | Effective only if reward std. dev. is nonzero

6. Broader Implications and Future Research

The challenge of vanishing gradients due to low reward variance is fundamental to RFT approaches for LLM alignment, including RLHF. Because the effect is intrinsic to the reward-maximization objective and the softmax parameterization, analogous phenomena are expected in other RL-based structured prediction settings whenever the model's policy is highly peaked and the reward assignment is coarse. This points toward:

  • Diagnosis and monitoring: Systematic reward variance analysis during RL for all alignment pipelines.
  • Algorithm development: Exploration of alternative objectives, e.g., reward standard deviation-boosting regularizers or off-policy approaches less sensitive to initial reward collapse.
  • Generalization: The insight that initial output diversity is crucial for RL progress applies to other domains, including vision-language models and multimodal alignment.
  • Efficient pipeline design: The result that SFT preconditioning can be lightweight stands to reduce resource requirements and accelerate development cycles for large model alignment.

In summary, the vanishing gradient problem in RFT highlights a precise link between the model's output diversity, reward landscape, and optimization efficacy. The recommended practical approach—diagnosis, partial SFT, and then RFT—forms a robust pipeline for scaling reinforcement learning–based alignment in neural sequence models (Razin et al., 2023).

References

Razin et al. (2023). Vanishing Gradients in Reinforcement Finetuning of Language Models.
