GVM-RAFT: Gradient Variance Minimization in RAFT

Updated 7 October 2025
  • GVM-RAFT is a framework that dynamically allocates sampling budgets based on prompt difficulty to optimize chain-of-thought reasoning.
  • It computes per-prompt sample allocations using acceptance rates and gradient norms to minimize stochastic gradient variance.
  • Empirical evaluations show up to 4× speedup and improved test accuracy in LLM fine-tuning across various reasoning tasks.

GVM-RAFT (Gradient Variance Minimization for RAFT) is a framework for optimizing chain-of-thought (CoT) reasoners—primarily LLMs trained via rejection sampling and reinforcement learning. It introduces a dynamic, prompt-level sample allocation strategy to minimize stochastic gradient variance when fine-tuning these models. This approach departs from traditional RAFT methods, which allocate inference budgets uniformly across prompts, instead adapting the number of samples per prompt based on observed difficulty metrics such as acceptance rate and gradient norm. The GVM-RAFT framework is theoretically supported by convergence analyses and demonstrates substantial empirical improvements in efficiency and accuracy.

1. Motivation and Conceptual Foundations

Traditional RAFT (Reward-Ranked Fine-Tuning) for chain-of-thought reasoning employs a uniform sample allocation across prompts during rejection sampling, where a fixed number of samples $n$ is drawn per prompt to approximate the latent rationale posterior. This practice does not account for the prompt-specific variation in difficulty (i.e., acceptance rate, gradient variability), resulting in inefficient estimation of gradients and unnecessary computational overhead. The core inefficiency is rooted in static budget allocation, which neglects the intractability and variability of sampling latent CoT traces across prompts.
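
For reference, the uniform-budget collection step that GVM-RAFT replaces can be sketched as follows; `sample_response`, `reward`, and the acceptance threshold are hypothetical stand-ins rather than components taken from the paper.

```python
# Minimal sketch of the uniform-budget collection step in vanilla RAFT.
# `sample_response` and `reward` are hypothetical stand-ins for the model's
# decoding routine and the task-specific verifier; the threshold is illustrative.

def raft_collect_uniform(prompts, n, sample_response, reward, threshold=1.0):
    """Draw the same number of samples n for every prompt and keep only the
    completions whose reward clears the acceptance threshold."""
    accepted = {}
    for x in prompts:
        candidates = [sample_response(x) for _ in range(n)]
        accepted[x] = [y for y in candidates if reward(x, y) >= threshold]
    return accepted
```

Because `n` is fixed, easy prompts waste samples while hard prompts (low acceptance rate) yield few accepted traces, which is the inefficiency GVM-RAFT targets.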

GVM-RAFT is motivated by the need to reduce stochastic gradient estimator variance under a fixed total sampling budget, thereby accelerating EM-style optimization or RL fine-tuning for CoT reasoning tasks.

2. Dynamic Sample Allocation: Technical Formulation

GVM-RAFT computes per-prompt sample allocations to minimize the variance of the ELBO (Evidence Lower Bound) gradient estimator. For prompt $x_i$ at iteration $t$, the number of samples $n_i^t$ is dynamically determined. Let $p_i^t$ be the acceptance rate (probability a sampled $y \sim P(y|x_i,\theta)$ is accepted under a reward threshold), and $G_i$ the estimated gradient norm:

Gradient estimator: $-\sum_{i=1}^{m} \frac{1}{n_i^t p_i^t} \sum_{y_j \in D_i^t} \nabla_{\theta} \ln P(y_j, z_i | x_i, \theta)$, where $D_i^t$ denotes the set of accepted samples for prompt $x_i$ at iteration $t$

Variance upper bound: $\sum_{i=1}^{m} \frac{G_i^2}{n_i^t p_i^t}$

To prevent excessive sampling on low-acceptance-rate prompts, a regularization term $1/[1 + \alpha/(p_i^t)^\beta]$ is introduced, resulting in the modified objective:

$\text{minimize} \quad \sum_{i=1}^{m} \frac{1}{1 + \alpha/(p_i^t)^\beta} \frac{G_i^2}{p_i^t n_i^t} \quad \text{subject to} \quad \sum_{i=1}^{m} n_i^t = N$
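
This constrained problem admits a standard Lagrange-multiplier solution; the following derivation sketch is a reconstruction from the stated objective rather than a quotation from the paper. Writing $w_i = 1/[1 + \alpha/(p_i^t)^\beta]$ for the regularization weight, the Lagrangian is

$\mathcal{L} = \sum_{i=1}^{m} \frac{w_i G_i^2}{p_i^t n_i^t} + \lambda \Big( \sum_{i=1}^{m} n_i^t - N \Big)$

and stationarity in $n_i^t$ gives $-\frac{w_i G_i^2}{p_i^t (n_i^t)^2} + \lambda = 0$, i.e. $n_i^t \propto G_i \sqrt{w_i / p_i^t} = G_i \big/ \sqrt{p_i^t + \alpha/(p_i^t)^{\beta-1}}$. Normalizing so that the allocations sum to $N$ yields the closed form below.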

Closed-form optimal allocation: $n_i^t = N \cdot \frac{G_i}{\sqrt{p_i^t + \alpha/(p_i^t)^{\beta-1}}} \Bigg/ \sum_{l} \frac{G_l}{\sqrt{p_l^t + \alpha/(p_l^t)^{\beta-1}}}$

This ensures more samples for harder prompts (lower $p_i^t$, higher $G_i$).
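
For concreteness, the allocation rule can be computed directly from per-prompt statistics. The following NumPy sketch is illustrative only; the function name, the clipping of small acceptance rates, the integer rounding scheme, and the default hyperparameter values are assumptions, not details taken from the paper.

```python
import numpy as np

def gvm_allocation(grad_norms, accept_rates, total_budget, alpha=0.1, beta=1.0):
    """Sketch of the closed-form per-prompt sample allocation.

    grad_norms   : per-prompt gradient norm estimates G_i
    accept_rates : per-prompt acceptance rates p_i (clipped away from zero)
    total_budget : total number of samples N to distribute across prompts
    alpha, beta  : regularization hyperparameters (illustrative defaults)
    """
    G = np.asarray(grad_norms, dtype=float)
    p = np.clip(np.asarray(accept_rates, dtype=float), 1e-6, 1.0)
    # Weight for prompt i: G_i / sqrt(p_i + alpha / p_i^(beta - 1))
    weights = G / np.sqrt(p + alpha / p ** (beta - 1.0))
    raw = total_budget * weights / weights.sum()
    # Round down, then hand leftover samples to the largest fractional parts
    # so the allocations still sum exactly to the total budget.
    n = np.floor(raw).astype(int)
    remainder = int(total_budget - n.sum())
    if remainder > 0:
        n[np.argsort(raw - np.floor(raw))[-remainder:]] += 1
    return n
```

For example, `gvm_allocation([2.0, 0.5, 1.0], [0.1, 0.8, 0.4], total_budget=64)` assigns most of the 64 samples to the first prompt, which has the lowest acceptance rate and the largest gradient norm.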

3. Theoretical Analysis and Convergence Guarantees

Assuming the loss function $-\ln P(y, z|x, \theta)$ is $1/\gamma$-smooth, GVM-RAFT's convergence is bounded as:

$\mathbb{E}[L(\theta_{KT}) - L(\theta^*)] - \mathbb{E}[L(\theta_0) - L(\theta^*)] \leq -\frac{\eta}{2} \Delta_1(k, T) + \frac{\eta^2}{2\gamma} \Omega(k, T)$

Here, $\Delta_1$ sums squared gradient norms and $\Omega$ aggregates gradient variance per M-step. For sufficiently large $N$, the variance term shrinks and the loss steadily decreases. Under smoothness and convexity assumptions, the loss decrease outpaces that of fixed-sample RAFT.

This suggests that optimal prompt-specific allocation leads to faster and more stable convergence in gradient-based optimization for LLM fine-tuning.

4. Empirical Evaluation and Results

On mathematical reasoning benchmarks (MATH500, Minerva Math, OlympiadBench), GVM-RAFT achieves between $2\times$ and $4\times$ speedup in convergence relative to vanilla RAFT (i.e., fewer gradient steps are required to reach a target accuracy). Final test accuracy is also significantly improved, indicating that dynamic sampling yields higher-quality chain-of-thought completions.

GVM-RAFT is validated within both RAFT++ and Group Relative Policy Optimization (GRPO) variants. In both settings, prompt-level budget rebalancing translates directly into superior learning efficiency and end-task performance.

5. Generalization Beyond RAFT and Practical Applications

The dynamic sampling strategy of GVM-RAFT generalizes to other gradient-based reinforcement learning frameworks (e.g., GRPO), with analogous improvements in convergence and accuracy. Beyond mathematical reasoning, GVM-RAFT is applicable wherever latent variable posteriors are approximated via sampling methods—including EM frameworks and REINFORCE-style RL.
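
As a schematic of that generality, the sketch below shows where dynamic reallocation could slot into a generic rejection-sampling fine-tuning loop. All helper callables (`estimate_difficulty`, `sample_and_filter`, `gradient_update`, and the `allocate` function, e.g. the `gvm_allocation` sketch above) are hypothetical stand-ins for framework-specific components, not the paper's implementation.

```python
# Schematic loop: estimate difficulty, rebalance the budget, sample, update.
# All helpers are hypothetical; this is a sketch, not the paper's implementation.

def train_with_dynamic_budget(prompts, total_budget, num_rounds,
                              allocate, estimate_difficulty,
                              sample_and_filter, gradient_update):
    for _ in range(num_rounds):
        # Per-prompt difficulty signals: acceptance rates p_i and gradient norms G_i.
        accept_rates, grad_norms = estimate_difficulty(prompts)
        # Redistribute the fixed total budget N across prompts.
        budgets = allocate(grad_norms, accept_rates, total_budget)
        # Collect accepted chain-of-thought samples under the per-prompt budgets.
        batch = [sample_and_filter(x, n) for x, n in zip(prompts, budgets)]
        # One or more gradient steps on the accepted samples (M-step / policy update).
        gradient_update(batch)
```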

Notable use cases include:

  • RLHF (Reinforcement Learning from Human Feedback) in LLM alignment
  • Synthetic data generation (improved sample efficiency)
  • Structured prediction with instance-varying inference quality

A plausible implication is expanded adoption in tasks where computational budgets are constrained and gradient variance is a major bottleneck.

6. Limitations, Future Directions, and Potential Improvements

Experiments with GVM-RAFT were conducted with Qwen-based models on mathematical reasoning datasets; broader validation across architectures and domains remains outstanding. The framework's generality invites incorporation into PPO, REINFORCE, and future RLHF methods. Open directions include tuning how frequently sample allocations are recomputed (balancing rebalancing overhead against convergence speed) and investigating the trade-offs of more aggressive dynamic allocation schemes.

Future work is anticipated to scale GVM to larger and more heterogeneous tasks, models, and RL environments, further reducing stochastic inefficiency in latent-variable model optimization.

7. Summary

GVM-RAFT provides a theoretically grounded and empirically validated approach to variance minimization in prompt-wise rejection sampling and RL fine-tuning of chain-of-thought LLMs. By redistributing the sampling budget according to prompt-level difficulty signals, it enables accelerated convergence and improved reasoning accuracy, with demonstrated adaptability to broader reinforcement learning algorithms and training scenarios (Yao et al., 5 May 2025).
