SR-GRPO: Stable Rank LLM Alignment

Updated 9 December 2025
  • The paper introduces SR-GRPO, a reinforcement learning method that uses the stable rank of hidden activations as an intrinsic reward for robust LLM alignment.
  • SR-GRPO employs group relative policy optimization to standardize advantages across response candidates, improving structured reasoning and dialogue quality.
  • Extensive empirical results show that SR-GRPO outperforms conventional baselines and alternative metrics, achieving high zero-shot reward prediction accuracy.

Stable Rank Group Relative Policy Optimization (SR-GRPO) is a reinforcement learning framework for LLM alignment that uses the stable rank of model hidden states as a dense, annotation-free intrinsic reward. SR-GRPO leverages the geometry of neural activations to measure response quality without external supervision, human-preference labels, or learned reward models, offering a pathway to scalable alignment. Empirically, stable rank works both as a zero-shot reward proxy and as the core reward for RL alignment, outperforming conventional baselines on structured reasoning and open-ended dialogue tasks (Tang et al., 2 Dec 2025).

1. Stable Rank: Definition and Theoretical Foundation

Let $H \in \mathbb{R}^{T \times d}$ denote the matrix of final-layer hidden activations for a response of $T$ tokens and hidden dimension $d$. The singular values $\{\sigma_i\}_{i=1}^{\min(T, d)}$ of $H$ (ordered $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$) define its stable rank as:

$$\operatorname{SR}(H) = \frac{\|H\|_F^2}{\|H\|_2^2} = \frac{\sum_{i=1}^{\min(T, d)} \sigma_i^2}{\sigma_1^2}$$

Stable rank quantifies the "effective dimensionality" of activations: $\operatorname{SR}(H) \approx 1$ signals near-total variance in one direction (representation collapse), while $\operatorname{SR}(H) \rightarrow \operatorname{rank}(H)$ as variance disperses evenly. This measure is strongly motivated by Softmax Bottleneck theory, which posits that natural language modeling requires navigation of a high-rank semantic manifold; reduced or collapsed rank impairs output expressiveness. High-quality responses maintain coherent, information-rich latent trajectories, while error states, hallucinations, or repetition are linked to reduced effective dimensionality (Tang et al., 2 Dec 2025).
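The definition translates into a few lines of PyTorch. The sketch below is not taken from the paper's code: the `stable_rank` helper, the placeholder prompt/response string, and the way the model is loaded are illustrative assumptions.

```python
# Minimal sketch: stable rank of final-layer hidden activations.
# Assumes a Hugging Face causal LM; names here are illustrative, not the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def stable_rank(hidden: torch.Tensor) -> float:
    """SR(H) = ||H||_F^2 / ||H||_2^2 for a T x d activation matrix."""
    h = hidden.float()
    fro_sq = h.pow(2).sum()                              # sum of squared singular values
    spec_sq = torch.linalg.matrix_norm(h, ord=2).pow(2)  # largest squared singular value
    return (fro_sq / spec_sq).item()

model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # base policy reported in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

text = "Prompt text ...\nResponse text ..."  # placeholder prompt/response concatenation
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
H = out.hidden_states[-1][0]                 # final-layer activations, shape (T, d)
print(f"SR(H) = {stable_rank(H):.2f}")
```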

2. Rationale for Stable Rank as an Alignment Signal

Stable rank provides an intrinsic, annotation-free reward that correlates with response quality. Unlike human annotation or reward models, which are susceptible to subjectivity, reward hacking, and limited by data scarcity, stable rank is derived solely from hidden activations. Empirical analysis reveals that stable rank:

  • Achieves 84.04% zero-shot preference prediction accuracy on RewardBench (Qwen3-8B), matching or exceeding learned reward and self-evaluation baselines (Pointwise 83.70%, IPO 78.02%).
  • On Qwen2.5-1.5B, stable rank retains 75.95% accuracy versus 65.85% for IPO.
  • Maintains over 70% accuracy when computed at the final transformer layer, with substantial drop-off (∼50%) from earlier layers.

Ablation studies confirm the robustness of stable rank: its performance is resilient to prompt concatenation format (variation ≤3 accuracy points), only modestly impacted by context window truncation above 512 tokens (84.0% vs 84.04% at 4096), and significantly superior to intrinsic-dimension alternatives (condition number 36.0%, PCA-95% 61.9%, effective rank 54.5%) (Tang et al., 2 Dec 2025).
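As a concrete illustration of the zero-shot protocol, the sketch below reuses the hypothetical `stable_rank`, `model`, and `tok` objects from the previous sketch: each candidate response is scored by the stable rank of its final-layer activations, and the higher-scoring candidate is predicted as preferred. The exact prompt/response concatenation is an assumption, which the format ablation above suggests has little effect.

```python
import torch

def response_stable_rank(prompt: str, response: str) -> float:
    """Score a (prompt, response) pair by the final-layer stable rank."""
    inputs = tok(prompt + "\n" + response, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    H = out.hidden_states[-1][0]  # (T, d) final-layer activations
    return stable_rank(H)

def predicts_chosen(prompt: str, chosen: str, rejected: str) -> bool:
    """True when stable rank agrees with the human preference label."""
    return response_stable_rank(prompt, chosen) > response_stable_rank(prompt, rejected)
```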

3. SR-GRPO Reinforcement Learning Objective

SR-GRPO trains a policy $\pi_\phi$ (parameterized via LoRA adapters) against a frozen reference policy $\pi_{\text{ref}}$, using stable rank as the reward. For each prompt $x$, $K$ candidate completions $y_1, \ldots, y_K \sim \pi_\phi(\cdot \mid x)$ are generated, and the stable rank reward $r_k = \operatorname{SR}(H_k)$ is computed from the hidden states of $\pi_{\text{ref}}$. Within each prompt group, rewards are standardized:

  • $\mu = \frac{1}{K} \sum_k r_k$
  • $\sigma = \sqrt{\frac{1}{K} \sum_k (r_k - \mu)^2}$
  • $A_k = \frac{r_k - \mu}{\sigma + \epsilon}$ (advantage, with $\epsilon = 10^{-8}$)

The RL objective is:

$$J(\phi) = \mathbb{E}_{x \sim D} \left[ \frac{1}{K} \sum_{k=1}^K \rho_k A_k - \beta\, D_{\mathrm{KL}}\!\left[ \pi_\phi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \right] \right]$$

where $\rho_k = \frac{\pi_\phi(y_k \mid x)}{\pi_{\phi_{\text{old}}}(y_k \mid x)}$ is the importance weight and $\beta$ is the KL penalty coefficient (default $\beta = 0.04$). This group-centric approach emphasizes relative advantages among contemporaneously sampled responses (Tang et al., 2 Dec 2025).
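A minimal sketch of the group standardization and the per-prompt surrogate objective follows. It assumes per-candidate log-probabilities summed over response tokens and a precomputed KL estimate; the names and tensor shapes are illustrative rather than the paper's implementation.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """A_k = (r_k - mu) / (sigma + eps), standardized within one prompt group."""
    mu = rewards.mean()
    sigma = rewards.std(unbiased=False)  # population std, matching the 1/K formula above
    return (rewards - mu) / (sigma + eps)

def sr_grpo_objective(logp_new: torch.Tensor,   # log pi_phi(y_k | x), shape (K,)
                      logp_old: torch.Tensor,   # log pi_phi_old(y_k | x), shape (K,)
                      rewards: torch.Tensor,    # r_k = SR(H_k), shape (K,)
                      kl_to_ref: torch.Tensor,  # estimate of KL(pi_phi || pi_ref) at prompt x
                      beta: float = 0.04) -> torch.Tensor:
    """Per-prompt objective (1/K) sum_k rho_k * A_k - beta * KL; maximize (negate for a loss)."""
    A = group_advantages(rewards)
    rho = (logp_new - logp_old.detach()).exp()  # importance weights rho_k
    return (rho * A).mean() - beta * kl_to_ref
```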

4. SR-GRPO Algorithmic Procedure

SR-GRPO employs "Group Relative Policy Optimization" to maximize the stable rank-derived objective on batches of prompts. The principal workflow:

  1. Initialization: Adapter parameters $\phi$ are initialized from the reference policy ($\phi_{\text{ref}}$).
  2. Sampling: For each batch, prompts $\{x_i\}$ are drawn.
  3. Candidate Generation: For each $x_i$, sample $K$ responses $\{y_{i,k}\}$ from $\pi_\phi$.
  4. Reward Calculation: Evaluate hidden-state matrices $H_{i,k}$ using $\pi_{\text{ref}}$ and compute $r_{i,k} = \operatorname{SR}(H_{i,k})$.
  5. Advantage Computation: Standardize within each prompt group: compute mean $\mu_i$, standard deviation $\sigma_i$, and advantages $A_{i,k}$.
  6. Objective Evaluation: Compute the policy gradient using group-wise advantages and the KL penalty.
  7. Update: Apply AdamW to the LoRA adapter parameters.

Experimental hyperparameters include LoRA rank 16, $\alpha = 32$, learning rate $1 \times 10^{-6}$, group size $K = 8$, batch size 128, and up to 400 training steps (Qwen2.5-1.5B) or 300 (DeepSeek-R1). Evaluation is carried out on a range of STEM, mathematical, and open-ended chat benchmarks (Tang et al., 2 Dec 2025).
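The procedure can be condensed into a schematic training loop. In the sketch below, the policy objects, the prompt loader, and the `sample_responses`, `response_log_probs`, and `kl_estimate` helpers are caller-supplied assumptions (not defined here); `response_stable_rank` and `sr_grpo_objective` refer to the earlier sketches. It also simplifies to a single update per sampled group, so the old-policy log-probabilities are the detached current ones and every $\rho_k$ equals 1 at the update step.

```python
import torch

def train_sr_grpo(policy, ref_policy, prompt_loader,
                  sample_responses, response_log_probs, kl_estimate,
                  k=8, lr=1e-6, beta=0.04, max_steps=400):
    """Schematic SR-GRPO loop using the reported defaults; helpers are placeholders."""
    trainable = [p for p in policy.parameters() if p.requires_grad]  # LoRA adapters only
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    for step, prompts in enumerate(prompt_loader):
        if step >= max_steps:
            break
        losses = []
        for x in prompts:
            candidates = sample_responses(policy, x, k)          # K completions from pi_phi
            # Stable-rank rewards are computed under the frozen reference policy.
            rewards = torch.tensor([response_stable_rank(x, y) for y in candidates])
            logp = response_log_probs(policy, x, candidates)     # log pi_phi(y_k | x), shape (K,)
            kl = kl_estimate(policy, ref_policy, x, candidates)  # KL(pi_phi || pi_ref) estimate
            # One update per group: old policy == current policy (detached), so rho_k = 1.
            losses.append(-sr_grpo_objective(logp, logp.detach(), rewards, kl, beta=beta))
        loss = torch.stack(losses).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```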

Algorithm Summary Table

| Step | Description | Default Values |
| --- | --- | --- |
| Sampling | Draw batch of prompts $\{x_i\}$ | Batch size = 128 |
| Completions | For each $x_i$, sample $K$ responses from $\pi_\phi$ | $K = 8$ |
| Reward | Compute $r_{i,k} = \operatorname{SR}(H_{i,k})$ on $\pi_{\text{ref}}$ | LoRA rank = 16, $\alpha = 32$ |
| Policy Update | Compute group-relative advantages and KL penalty, apply AdamW optimizer | $\beta = 0.04$ |

5. Experimental Setup and Baselines

SR-GRPO evaluations utilize:

  • Models: Qwen2.5-1.5B-Instruct as the base policy; DeepSeek-R1-Distill-Qwen-1.5B for reasoning.
  • Training Data: SmolTalk2 prompts without preference labels.
  • Benchmarks: RewardBench for zero-shot reward accuracy (2,985 pairwise preferences), STEM QA (GPQA, MMLU-redux), mathematical reasoning (MATH500, AIME25, OlympiadBench, AMC23), and WildBench for chat quality (GPT-4o-mini Elo).
  • Baselines: Learned Reward Model (Skywork-Reward-V2-Qwen3-1.7B), Self-Reward (pointwise), Perplexity (-NLL), IPO (Yes/No classifier).

6. Empirical Results and Ablation Studies

SR-GRPO demonstrates the following key outcomes:

  • Zero-shot Reward Accuracy: Stable rank achieves 84.04% on RewardBench (Qwen3-8B), 75.95% (Qwen2.5-1.5B), exceeding learned reward and IPO baselines.
  • Decoding Gains: Best-of-16 sampling plus SR selection yields +11.3 points average accuracy over greedy decoding for STEM/math tasks, outperforming random@N by 5–34% relative.
  • Alignment Improvements (Label-free RL):
    • Qwen2.5-1.5B-Instruct: STEM average +1.2 pts (33.3→34.5), Math average +4.4 pts (28.0→32.4; +19% relative), WildBench Elo +26.2.
    • DeepSeek-R1-1.5B: STEM average +2.6 pts (35.8→38.4), Math average +6.2 pts (58.5→64.7; +10.6%), WildBench Elo +19.0.
  • Ablations:
    • Cross-layer: Final-layer SR is critical; middle-layer SR is uninformative.
    • Context window: Windows ≥512 tokens preserve performance, but 128 tokens causes drop (62.6% accuracy).
    • Prompt concatenation: Format choice has minimal impact (≤3 pts variation).
    • Alternative metrics: Condition number, effective rank, and PCA-based intrinsic dimension are all inferior to stable rank as a reward signal.

A plausible implication is that stable rank provides a uniquely relevant geometric signal for response quality in LLMs, distinct from existing intrinsic-dimension metrics and robust across prompt and window configurations (Tang et al., 2 Dec 2025).
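For intuition on the Best-of-16 decoding result above, stable-rank selection amounts to ranking sampled candidates and keeping the top-scoring one, as in the brief sketch below (reusing the hypothetical `response_stable_rank` helper from the Section 2 sketch).

```python
def best_of_n(prompt: str, candidates: list[str]) -> str:
    """Stable-rank Best-of-N: return the sampled candidate whose final-layer
    activations have the highest stable rank."""
    return max(candidates, key=lambda y: response_stable_rank(prompt, y))
```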

7. Significance and Limitations

SR-GRPO establishes a geometric, annotation-free basis for LLM alignment, bypassing external preference models and human-labeled data. This framework demonstrates practical improvements in both structured and open-ended benchmarks, suggesting the viability of internal representation geometry as a proxy for desirable model behavior. The empirical efficacy of stable rank further highlights the limitations of competing internal metrics and positions model-layer geometry as a focal area for future alignment strategies.

A key limitation is the reliance on final-layer activations of a large reference policy; performance drops significantly when deriving stable rank from earlier layers or reduced context size. While stable rank outperforms other geometry-based alternatives, further exploration is warranted for scaling, interpretability, and potential integration with multi-signal reward frameworks.
