Parameter-Level Characterization of RLVR

Updated 12 November 2025
  • Parameter-Level Characterization of RLVR is a detailed analysis that mathematically and empirically explains how reinforcement learning with verifiable rewards updates model weights.
  • The study shows that KL constraints, spectral geometry, and floating-point precision collectively steer updates into low-curvature, off-principal subspaces, preserving pretrained behavior.
  • Empirical comparisons reveal that RLVR achieves minimal spectral drift and reduced catastrophic forgetting relative to supervised fine-tuning, informing optimized fine-tuning strategies.

Reinforcement Learning with Verifiable Rewards (RLVR) is a paradigm for fine-tuning LLMs, primarily leveraging outcome-based, binary-verifiable supervision to improve model reasoning on tasks such as mathematics and programming. Parameter-level characterization of RLVR concerns the mathematical and empirical description of how the weights of these models evolve during RLVR post-training, addressing the directionality and geometry of updates, their optimization regimes, and the factors affecting convergence, stability, and sparsity. This analytic perspective both distinguishes RLVR from supervised fine-tuning (SFT) and exposes the mechanistic underpinnings of its unique learning dynamics.

1. Parameter-Level Update Formulation in RLVR

RLVR optimizes model parameters θ to maximize expected reward via reinforcement learning, typically subject to a KL constraint relative to a reference policy. The one-step RLVR update can be written as the solution to

$\theta^+ \in \arg\min_\theta D_{\mathrm{KL}}(\tilde q_\beta \,\|\, \pi_\theta)$

where

$\tilde q_\beta(y|x) \propto \pi_{\mathrm{ref}}(y|x)\, e^{R(x,y)/\beta}$

and $R(x,y)$ is a verifiable reward.
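To make the tilted target concrete, the following minimal sketch computes $\tilde q_\beta$ over a toy discrete response set; the reference probabilities, rewards, and $\beta$ are illustrative assumptions, not values from the papers:

```python
# Minimal sketch of the KL-tilted target q~_beta(y|x) ∝ pi_ref(y|x) exp(R/beta)
# over a toy discrete set of candidate responses (illustrative values only).
import numpy as np

def tilted_target(pi_ref: np.ndarray, rewards: np.ndarray, beta: float) -> np.ndarray:
    logits = np.log(pi_ref) + rewards / beta
    logits -= logits.max()          # shift for numerical stability
    q = np.exp(logits)
    return q / q.sum()

pi_ref = np.array([0.5, 0.3, 0.2])   # reference policy over 3 responses
rewards = np.array([1.0, 0.0, 0.0])  # binary verifiable reward per response
print(tilted_target(pi_ref, rewards, beta=0.5))
# Smaller beta concentrates mass on verified-correct responses;
# larger beta keeps q~ close to pi_ref (the KL anchor).
```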

At the parameter block level (e.g., for a linear layer $W$), curvature-constrained update size is bounded as

$\|\Delta W\|_F \leq \sqrt{\tfrac{2K}{\mu}}$

where $K \equiv D_{\mathrm{KL}}(\pi_{\theta^+}\,\|\,\pi_\theta)$ and $\mu$ is the smallest eigenvalue of the local Fisher information submatrix.
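As a quick numeric illustration of this bound (the Fisher block, its dimension, and the KL budget below are random stand-ins, not values from the papers):

```python
# Toy check of ||Delta W||_F <= sqrt(2K / mu); the Fisher block is a random
# symmetric positive-definite stand-in and K is an assumed per-step KL budget.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
fisher = A @ A.T + 0.1 * np.eye(8)     # SPD stand-in for a local Fisher block
mu = np.linalg.eigvalsh(fisher).min()  # smallest curvature eigenvalue
K = 1e-3                               # per-step KL budget

print(np.sqrt(2 * K / mu))             # admissible update-norm bound
# Blocks whose smallest Fisher eigenvalue is tiny admit much larger moves
# under the same KL leash -- the low-curvature directions RLVR favors.
```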

Within RLVR frameworks such as PACS, the policy model parameterizes both the policy $\pi_\theta$ and a score function $s_\theta(x,y)$, mapping to a probability of correctness. PACS reformulates RLVR’s objective via a supervised cross-entropy loss over $(x, y, r(y))$ triples:

$L(\theta) = -\mathbb{E}_{x \sim P,\, y \sim \pi_\theta(\cdot|x)}\big[ r(y)\log\sigma(s_\theta(x,y)) + (1-r(y))\log(1-\sigma(s_\theta(x,y))) \big]$

with the gradient decomposition

$\nabla_\theta L(\theta) = \mathbb{E}\big[ \ell(x, y; \theta)\, \nabla_\theta \log \pi_\theta(y|x) \big] + \mathbb{E}\big[ (r(y)-\sigma(s_\theta))\,\nabla_\theta s_\theta \big]$

resulting in an implicit actor–critic structure operating within a single parameter set.
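A schematic PyTorch sketch of the PACS-style loss follows. The scalar score tensors and verifier outcomes are illustrative assumptions, and the sketch exhibits only the critic-like gradient term; the policy-gradient term arises from differentiating through the sampling of $y$, which a toy with fixed samples cannot show:

```python
# Schematic PACS-style objective: binary cross-entropy between the verifier
# outcome r(y) and sigma(s_theta(x, y)). Interface and values are illustrative.
import torch
import torch.nn.functional as F

def pacs_loss(scores: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """L = -E[ r log sigma(s) + (1 - r) log(1 - sigma(s)) ]."""
    return F.binary_cross_entropy_with_logits(scores, rewards)

scores = torch.tensor([2.0, -1.0], requires_grad=True)  # s_theta(x, y_i)
rewards = torch.tensor([1.0, 0.0])                      # verifier outcomes r(y_i)
loss = pacs_loss(scores, rewards)
loss.backward()
print(scores.grad)  # proportional to sigma(s) - r: the critic-like term above
```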

2. Three-Gate Theory of RLVR Parameter Dynamics

The evolution of parameters under RLVR is explained mechanistically by the Three-Gate Theory:

  1. KL Anchor (Gate I): Each policy update is strictly KL-constrained, bounding weight change in parameter space. For sufficiently small steps,

$D_{\mathrm{KL}}(\pi_{\theta^+}\,\|\,\pi_\theta) \leq K$

which in turn controls the Fisher-weighted $\ell_2$ norm of $\Delta\theta$.

  2. Model Geometry (Gate II): The geometry of the pretrained model, embodied in its spectral decomposition, dictates that—under the KL leash—updates preferentially avoid principal (high-curvature) subspaces. Specifically, for a pretrained block $W_0$:

$W_0 = \sum_i \sigma_i u_i v_i^\top$

Wedin’s $\sin\Theta$ theorem and Weyl’s bounds guarantee that for sufficiently small $\|\Delta W\|_F$, subspace rotation ($\Theta$) and spectral drift ($\|\sigma(W_+)-\sigma(W_0)\|_F$) are minimal, keeping updates in spectrum-preserving, low-curvature directions.

  3. Precision (Gate III): Given the floating-point format (e.g., bfloat16), micro-updates below a rounding threshold (relative ULP $\approx$ 0.2–0.4%) are “invisible,” so updates in non-preferred subspaces appear sparse even though the update energy is distributed predominantly off-principal (see the sketch after this list).

These gates cooperate to steer RLVR updates into subspaces unlikely to disrupt core pretrained behavior, resulting in improved stability and reduced catastrophic forgetting.
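Gate III is easy to demonstrate directly. The sketch below uses assumed toy weights and step sizes to show a sub-half-ULP bfloat16 update rounding away entirely:

```python
# Gate III in miniature: bfloat16 carries 8 significand bits, so a relative
# perturbation below about half a ULP (~0.2-0.4%) rounds back to the original
# weight and the update looks exactly sparse. Values here are illustrative.
import torch

w = torch.tensor([1.0, 1.0], dtype=torch.bfloat16)
delta = torch.tensor([1e-3, 5e-3])              # a 0.1% vs. a 0.5% update
w_new = (w.float() + delta).to(torch.bfloat16)  # apply in fp32, store in bf16

print(w_new - w)  # tensor([0.0000, 0.0078], dtype=torch.bfloat16)
# The 0.1% micro-update vanishes; the 0.5% update survives but snaps to the
# bfloat16 grid -- the apparent sparsity is a precision artifact, not inactivity.
```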

3. Metrics for Parameter-Space Characterization

Empirical analysis of RLVR’s parameter evolution employs quantitative metrics:

  • Spectral Drift ($\Delta\Sigma$): defined as $\|\sigma(W_+)-\sigma(W_0)\|_F$; captures the magnitude of the singular-value shift post-update.
  • Principal-Subspace Rotation ($\Theta$): defined as $\arccos\|U_{0,k}^\top U_{+,k}\|_*$; measures the maximum rotation angle of the top-$k$ invariant subspaces.
  • Off-Principal Update Alignment ($\alpha$): defined as $1-\sum_{i=1}^k (u_i^\top \Delta w)^2/\|\Delta w\|^2$; the proportion of update norm lying outside the top-$k$ directions.
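These diagnostics are straightforward to compute for a dense layer. The sketch below uses random stand-in matrices and an assumed top-$k$ of 8, reads the rotation as the largest principal angle between top-$k$ left subspaces, and also sanity-checks the Weyl bound invoked in Gate II:

```python
# Spectral drift, principal-subspace rotation, and off-principal alignment for
# a weight matrix W0 updated to W_plus; dw is one update direction in the
# column space. All inputs below are random stand-ins.
import numpy as np

def rlvr_metrics(W0, W_plus, dw, k=8):
    U0, s0, _ = np.linalg.svd(W0)
    U1, s1, _ = np.linalg.svd(W_plus)
    drift = np.linalg.norm(s1 - s0)                       # Delta Sigma
    overlap = np.linalg.svd(U0[:, :k].T @ U1[:, :k], compute_uv=False)
    theta = np.degrees(np.arccos(np.clip(overlap.min(), -1.0, 1.0)))  # largest angle
    proj = U0[:, :k].T @ dw
    alpha = 1.0 - (proj @ proj) / (dw @ dw)               # off-principal fraction
    return drift, theta, alpha

rng = np.random.default_rng(0)
W0 = rng.standard_normal((64, 64))
dW = 1e-3 * rng.standard_normal((64, 64))                 # small isotropic update
print(rlvr_metrics(W0, W0 + dW, dW[:, 0], k=8))

# Weyl sanity check (Gate II): singular values move by at most ||dW||_2.
s0 = np.linalg.svd(W0, compute_uv=False)
s1 = np.linalg.svd(W0 + dW, compute_uv=False)
assert np.abs(s1 - s0).max() <= np.linalg.norm(dW, 2) + 1e-12
```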

RLVR consistently exhibits $\Delta\Sigma/\|\sigma_0\|_2 < 0.01$ and subspace rotation $< 5^\circ$ compared to SFT’s drift $> 0.1$ and rotations $> 15^\circ$ on models such as DS-Qwen-1.5B and Qwen3-8B.

4. Gradient Gap, Alignment, and Convergence Thresholds

A central theoretical quantity for RLVR optimization is the parameter-level (token-level) Gradient Gap. For an autoregressive policy $\pi_\theta$ and a fixed prompt $q$,

$\Delta\mu(\theta) := \mathbb{E}_{y\sim\pi^+_\theta}\big[\nabla_\theta \log\pi_\theta(y|q)\big] - \mathbb{E}_{y\sim\pi^-_\theta}\big[\nabla_\theta \log\pi_\theta(y|q)\big]$

where $\pi^+_\theta$ and $\pi^-_\theta$ are the conditional distributions over successful and failing trajectories, respectively. At iteration $k$, the update direction $g_k$ yields gap alignment $A_k = \langle g_k, \Delta\mu(\theta_k)\rangle$.
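For intuition, the Gradient Gap can be computed in closed form for a tiny softmax policy over a handful of candidate responses; the setup below is entirely illustrative and not taken from the papers:

```python
# Gradient Gap for a toy categorical policy pi_theta = softmax(theta) over four
# responses with known verifier outcomes. All values are illustrative.
import torch

theta = torch.zeros(4, requires_grad=True)    # logits over 4 responses
success = torch.tensor([1.0, 0.0, 0.0, 1.0])  # verifier outcome r(y) per response

def grad_logprob(i: int) -> torch.Tensor:
    """grad_theta log pi_theta(y_i) for the softmax policy."""
    logp = torch.log_softmax(theta, dim=0)[i]
    (g,) = torch.autograd.grad(logp, theta)
    return g

pi = torch.softmax(theta, dim=0).detach()
pos = success * pi / (success * pi).sum()            # pi_theta^+ (successes)
neg = (1 - success) * pi / ((1 - success) * pi).sum()  # pi_theta^- (failures)

delta_mu = sum(pos[i] * grad_logprob(i) for i in range(4)) \
         - sum(neg[i] * grad_logprob(i) for i in range(4))
A_k = torch.dot(delta_mu, delta_mu)   # alignment <g_k, Delta mu> when g_k = Delta mu
print(delta_mu, A_k)                  # [0.5, -0.5, -0.5, 0.5], 1.0
```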

Convergence of the RLVR procedure is controlled by the gap alignment $A_k$ and the learning rate $\eta_k$, via the main update bound:

$\mathrm{logit}(p_{k+1}) - \mathrm{logit}(p_k) \geq A_k\eta_k - C\eta_k^2$

where $C$ incorporates response length $T$ and token score norms. A critical threshold emerges, $\eta_k \lesssim \frac{A_k}{L T + G^2 T/(1-p_k)}$, requiring the learning rate to shrink with increasing response length $T$ and as success probability $p_k \to 1$. A fixed $\eta$ will eventually violate this constraint, resulting in stagnation below perfect accuracy.

This theory mathematically justifies practical heuristics such as length normalization, scaling $\eta \propto 1/T$, and performance-aware scaling $\eta \propto (1-p_k)$, and it explains the limitations of naïve fixed-rate or fixed-step-size approaches.
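The threshold translates directly into a schedule. The helper below makes the $1/T$ shrinkage and the collapse as $p_k \to 1$ visible; the constants $L$ and $G$ are assumed placeholders, not estimated values:

```python
# Largest admissible step size implied by eta_k <~ A_k / (L*T + G^2*T/(1 - p_k)).
# L and G are assumed placeholder constants for illustration.
def eta_max(A_k: float, T: int, p_k: float, L: float = 1.0, G: float = 1.0) -> float:
    return A_k / (L * T + G ** 2 * T / (1.0 - p_k))

for p_k in (0.5, 0.9, 0.99):
    print(p_k, eta_max(A_k=1.0, T=512, p_k=p_k))
# The admissible eta shrinks as 1/T and collapses as p_k -> 1; a fixed learning
# rate therefore eventually violates the bound and accuracy stagnates.
```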

5. Empirical Findings and Comparisons with Supervised Fine-Tuning

Empirical studies substantiating the above theory show:

  • Spectrum Preservation: RLVR’s trajectory maintains spectral drift and subspace rotation substantially below SFT, indicating preservation of core model representations.
  • Off-Principal Alignment: RLVR updates are consistently aligned with low-curvature, non-principal directions ($\alpha \approx 0.9$), and overlap with principal weight masks falls below random, confirming avoidance of the “principal” spectrum.
  • Invariance and Consistency: The parameter regions updated under RLVR are highly consistent across RNG seeds, datasets, and RL variants, indicating strong model-intrinsic bias.
  • Sparsity Artifact: The apparent sparsity of RLVR-induced parameter changes is explained by micro-updates masked by bfloat16 ULP (Gate III), not true inactivity.
  • Intervention Validation: Destructive interventions (rotating principal subspaces, head permutations) collapse RLVR’s parameter-localization effect, demonstrating causality for Gate II.

Comparatively, SFT distorts principal directions, increases spectral drift, and tends to induce greater catastrophic forgetting. RLVR achieves equivalent or superior reasoning improvements with milder parameter adjustments, emphasizing its regime of minimal-disruption fine-tuning.

6. Implications for Fine-Tuning Strategies and Architecture Design

The parameter-level characterization of RLVR challenges the direct adaptation of SFT-era fine-tuning heuristics and PEFT methods. For instance, LoRA targeting low-rank, spectrum-complement subspaces matches full RLVR post-training dynamics, while PiSSA or masking based on principal components impedes or collapses RLVR effectiveness. Freezing principal subspaces slows learning and degrades accuracy, whereas freezing the complement preserves the RLVR trajectory, further emphasizing the protocol’s off-principal orientation (a minimal sketch of such an off-principal projection follows below).
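A minimal sketch of the off-principal intervention, assuming a dense layer and using the top-$k$ left singular vectors of the pretrained weight (all shapes and values are illustrative):

```python
# Project a candidate update off the top-k principal (left singular) subspace
# of the pretrained weight W0 before applying it; this mirrors the
# freeze-the-principal-subspace intervention discussed above.
import numpy as np

def off_principal_step(W0: np.ndarray, grad: np.ndarray, k: int) -> np.ndarray:
    U, _, _ = np.linalg.svd(W0)
    Uk = U[:, :k]                     # top-k principal directions
    return grad - Uk @ (Uk.T @ grad)  # (I - Uk Uk^T) grad

rng = np.random.default_rng(0)
W0 = rng.standard_normal((64, 32))
grad = rng.standard_normal((64, 32))
g_off = off_principal_step(W0, grad, k=8)

Uk = np.linalg.svd(W0)[0][:, :8]
print(np.linalg.norm(Uk.T @ g_off))   # ~1e-15: no energy left in the principal span
```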

A plausible implication is that new PEFT strategies for RLVR should explicitly exploit this geometric and spectral bias (“geometry-aware, RLVR-native learning algorithms”), moving away from principal-aligned SFT heuristics.

7. Summary and Outlook

Parameter-level analysis of RLVR establishes that policy optimization is strongly KL-anchored (Gate I), geometry-preserving (Gate II), and arguably ULP-masked for micro-updates (Gate III), resulting in an off-principal, low-curvature parameter trajectory. Theoretical constructs such as the Gradient Gap, along with sharp step-size constraints, mathematically explain convergence behavior, stability advantages, and deployment heuristics unique to RLVR. Empirical results confirm that RLVR delivers robust, minimally disruptive post-training, contrasting starkly with SFT. These findings provide a substrate for further refinement of RLVR methods and optimization-aware model adaptation strategies (Li et al., 2 Sep 2025, Zhu et al., 11 Nov 2025, Suk et al., 9 Oct 2025).
