GVM-RAFT: Gradient Variance Minimization in RAFT

Updated 7 October 2025
  • GVM-RAFT is a framework that dynamically allocates sampling budgets based on prompt difficulty to optimize chain-of-thought reasoning.
  • It computes per-prompt sample allocations using acceptance rates and gradient norms to minimize stochastic gradient variance.
  • Empirical evaluations show up to 4× speedup and improved test accuracy in LLM fine-tuning across various reasoning tasks.

GVM-RAFT (Gradient Variance Minimization for RAFT) is a framework for optimizing chain-of-thought (CoT) reasoners—primarily LLMs trained via rejection sampling and reinforcement learning. It introduces a dynamic, prompt-level sample allocation strategy to minimize stochastic gradient variance when fine-tuning these models. This approach departs from traditional RAFT methods, which allocate inference budgets uniformly across prompts, instead adapting the number of samples per prompt based on observed difficulty metrics such as acceptance rate and gradient norm. The GVM-RAFT framework is theoretically supported by convergence analyses and demonstrates substantial empirical improvements in efficiency and accuracy.

1. Motivation and Conceptual Foundations

Traditional RAFT (Reward-Ranked Fine-Tuning) for chain-of-thought reasoning employs a uniform sample allocation across prompts during rejection sampling, where a fixed number of samples $n$ is drawn per prompt to approximate the latent rationale posterior. This practice does not account for the prompt-specific variation in difficulty (i.e., acceptance rate, gradient variability), resulting in inefficient estimation of gradients and unnecessary computational overhead. The core inefficiency is rooted in static budget allocation, which neglects the intractability and variability of sampling latent CoT traces across prompts.
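
For reference, the uniform-budget collection step that GVM-RAFT replaces can be sketched as follows; `sample_response`, `reward`, and the acceptance threshold are hypothetical stand-ins rather than components taken from the paper.

```python
# Minimal sketch of the uniform-budget collection step in vanilla RAFT.
# `sample_response` and `reward` are hypothetical stand-ins for the model's
# decoding routine and the task-specific verifier; the threshold is illustrative.

def raft_collect_uniform(prompts, n, sample_response, reward, threshold=1.0):
    """Draw the same number of samples n for every prompt and keep only the
    completions whose reward clears the acceptance threshold."""
    accepted = {}
    for x in prompts:
        candidates = [sample_response(x) for _ in range(n)]
        accepted[x] = [y for y in candidates if reward(x, y) >= threshold]
    return accepted
```

Because `n` is fixed, easy prompts waste samples while hard prompts (low acceptance rate) yield few accepted traces, which is the inefficiency GVM-RAFT targets.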

GVM-RAFT is motivated by the need to reduce stochastic gradient estimator variance under a fixed total sampling budget, thereby accelerating EM-style optimization or RL fine-tuning for CoT reasoning tasks.

2. Dynamic Sample Allocation: Technical Formulation

GVM-RAFT computes per-prompt sample allocations to minimize the variance of the ELBO (Evidence Lower Bound) gradient estimator. For prompt $x_i$ at iteration $t$, the number of samples $n_i^t$ is dynamically determined. Let $p_i^t$ be the acceptance rate (probability a sampled $y \sim P(y|x_i,\theta)$ is accepted under a reward threshold), and $G_i$ the estimated gradient norm:

Gradient estimator: $-\sum_{i=1}^{m} \frac{1}{n_i^t p_i^t} \sum_{y_j \in D_i^t} \nabla_{\theta} \ln P(y_j, z_i | x_i, \theta)$, where $D_i^t$ denotes the set of accepted samples for prompt $x_i$ at iteration $t$

Variance upper bound: $\sum_{i=1}^{m} \frac{G_i^2}{n_i^t p_i^t}$

To prevent excessive sampling on low-acceptance-rate prompts, a regularization term $1/[1 + \alpha/(p_i^t)^\beta]$ is introduced, resulting in the modified objective:

$\text{minimize} \quad \sum_{i=1}^{m} \frac{1}{1 + \alpha/(p_i^t)^\beta} \frac{G_i^2}{p_i^t n_i^t} \quad \text{subject to} \quad \sum_{i=1}^{m} n_i^t = N$
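
This constrained problem admits a standard Lagrange-multiplier solution; the following derivation sketch is a reconstruction from the stated objective rather than a quotation from the paper. Writing $w_i = 1/[1 + \alpha/(p_i^t)^\beta]$ for the regularization weight, the Lagrangian is

$\mathcal{L} = \sum_{i=1}^{m} \frac{w_i G_i^2}{p_i^t n_i^t} + \lambda \Big( \sum_{i=1}^{m} n_i^t - N \Big)$

and stationarity in $n_i^t$ gives $-\frac{w_i G_i^2}{p_i^t (n_i^t)^2} + \lambda = 0$, i.e. $n_i^t \propto G_i \sqrt{w_i / p_i^t} = G_i \big/ \sqrt{p_i^t + \alpha/(p_i^t)^{\beta-1}}$. Normalizing so that the allocations sum to $N$ yields the closed form below.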

Closed-form optimal allocation: $n_i^t = N \cdot \frac{G_i}{\sqrt{p_i^t + \alpha/(p_i^t)^{\beta-1}}} \Bigg/ \sum_{l} \frac{G_l}{\sqrt{p_l^t + \alpha/(p_l^t)^{\beta-1}}}$

This ensures more samples for harder prompts (lower $p_i^t$, higher $G_i$).
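
For concreteness, the allocation rule can be computed directly from per-prompt statistics. The following NumPy sketch is illustrative only; the function name, the clipping of small acceptance rates, the integer rounding scheme, and the default hyperparameter values are assumptions, not details taken from the paper.

```python
import numpy as np

def gvm_allocation(grad_norms, accept_rates, total_budget, alpha=0.1, beta=1.0):
    """Sketch of the closed-form per-prompt sample allocation.

    grad_norms   : per-prompt gradient norm estimates G_i
    accept_rates : per-prompt acceptance rates p_i (clipped away from zero)
    total_budget : total number of samples N to distribute across prompts
    alpha, beta  : regularization hyperparameters (illustrative defaults)
    """
    G = np.asarray(grad_norms, dtype=float)
    p = np.clip(np.asarray(accept_rates, dtype=float), 1e-6, 1.0)
    # Weight for prompt i: G_i / sqrt(p_i + alpha / p_i^(beta - 1))
    weights = G / np.sqrt(p + alpha / p ** (beta - 1.0))
    raw = total_budget * weights / weights.sum()
    # Round down, then hand leftover samples to the largest fractional parts
    # so the allocations still sum exactly to the total budget.
    n = np.floor(raw).astype(int)
    remainder = int(total_budget - n.sum())
    if remainder > 0:
        n[np.argsort(raw - np.floor(raw))[-remainder:]] += 1
    return n
```

For example, `gvm_allocation([2.0, 0.5, 1.0], [0.1, 0.8, 0.4], total_budget=64)` assigns most of the 64 samples to the first prompt, which has the lowest acceptance rate and the largest gradient norm.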

3. Theoretical Analysis and Convergence Guarantees

Assuming the loss function $-\ln P(y, z|x, \theta)$ is $1/\gamma$-smooth, GVM-RAFT's convergence is bounded as:

$\mathbb{E}[L(\theta_{KT}) - L(\theta^*)] - \mathbb{E}[L(\theta_0) - L(\theta^*)] \leq -\frac{\eta}{2} \Delta_1(k, T) + \frac{\eta^2}{2\gamma} \Omega(k, T)$

Here, $\Delta_1$ sums squared gradient norms and $\Omega$ aggregates gradient variance per M-step. For sufficiently large $N$, the variance term shrinks and the loss steadily decreases. Under smoothness and convexity assumptions, the loss decrease outpaces that of fixed-sample RAFT.

This suggests that optimal prompt-specific allocation leads to faster and more stable convergence in gradient-based optimization for LLM fine-tuning.

4. Empirical Evaluation and Results

On mathematical reasoning benchmarks (MATH500, Minerva Math, OlympiadBench), GVM-RAFT achieves between $2\times$ and $4\times$ speedup in convergence relative to vanilla RAFT (i.e., fewer gradient steps are required to reach a target accuracy). Final test accuracy is also significantly improved, indicating that dynamic sampling yields higher-quality chain-of-thought completions.

GVM-RAFT is validated within both RAFT++ and Group Relative Policy Optimization (GRPO) variants. In both settings, prompt-level budget rebalancing translates directly into superior learning efficiency and end-task performance.

5. Generalization Beyond RAFT and Practical Applications

The dynamic sampling strategy of GVM-RAFT generalizes to other gradient-based reinforcement learning frameworks (e.g., GRPO), with analogous improvements in convergence and accuracy. Beyond mathematical reasoning, GVM-RAFT is applicable wherever latent variable posteriors are approximated via sampling methods—including EM frameworks and REINFORCE-style RL.
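
As a schematic of that generality, the sketch below shows where dynamic reallocation could slot into a generic rejection-sampling fine-tuning loop. All helper callables (`estimate_difficulty`, `sample_and_filter`, `gradient_update`, and the `allocate` function, e.g. the `gvm_allocation` sketch above) are hypothetical stand-ins for framework-specific components, not the paper's implementation.

```python
# Schematic loop: estimate difficulty, rebalance the budget, sample, update.
# All helpers are hypothetical; this is a sketch, not the paper's implementation.

def train_with_dynamic_budget(prompts, total_budget, num_rounds,
                              allocate, estimate_difficulty,
                              sample_and_filter, gradient_update):
    for _ in range(num_rounds):
        # Per-prompt difficulty signals: acceptance rates p_i and gradient norms G_i.
        accept_rates, grad_norms = estimate_difficulty(prompts)
        # Redistribute the fixed total budget N across prompts.
        budgets = allocate(grad_norms, accept_rates, total_budget)
        # Collect accepted chain-of-thought samples under the per-prompt budgets.
        batch = [sample_and_filter(x, n) for x, n in zip(prompts, budgets)]
        # One or more gradient steps on the accepted samples (M-step / policy update).
        gradient_update(batch)
```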

Notable use cases include:

  • RLHF (Reinforcement Learning from Human Feedback) in LLM alignment
  • Synthetic data generation (improved sample efficiency)
  • Structured prediction with instance-varying inference quality

A plausible implication is expanded adoption in tasks where computational budgets are constrained and gradient variance is a major bottleneck.

6. Limitations, Future Directions, and Potential Improvements

Experiments with GVM-RAFT were conducted with Qwen-based models on mathematical reasoning datasets; broader validation across architectures and domains remains outstanding. The framework's generality invites incorporation into PPO, REINFORCE, and future RLHF methods. Open directions include tuning how frequently sample allocations are recomputed (balancing rebalancing overhead against convergence speed) and investigating the trade-offs of more aggressive dynamic allocation schemes.

Future work is anticipated to scale GVM to larger and more heterogeneous tasks, models, and RL environments, further reducing stochastic inefficiency in latent-variable model optimization.

7. Summary

GVM-RAFT provides a theoretically grounded and empirically validated approach to variance minimization in prompt-wise rejection sampling and RL fine-tuning of chain-of-thought LLMs. By redistributing the sampling budget according to prompt-level difficulty signals, it enables accelerated convergence and improved reasoning accuracy, with demonstrated adaptability to broader reinforcement learning algorithms and training scenarios (Yao et al., 5 May 2025).
