SR-GRPO: Stable Rank LLM Alignment
- The paper introduces SR-GRPO, a reinforcement learning method that uses the stable rank of hidden activations as an intrinsic reward for robust LLM alignment.
- SR-GRPO employs group relative policy optimization to standardize advantages across response candidates, improving structured reasoning and dialogue quality.
- Extensive empirical results show that SR-GRPO outperforms conventional baselines and alternative metrics, achieving high zero-shot reward prediction accuracy.
Stable Rank Group Relative Policy Optimization (SR-GRPO) is a reinforcement learning framework for LLM alignment that uses the stable rank of model hidden states as a dense, annotation-free intrinsic reward. By leveraging geometric properties of neural activations, SR-GRPO measures response quality without external supervision, human-preference labels, or a learned reward model, offering a pathway to scalable alignment. The approach demonstrates strong empirical validity both as a zero-shot reward proxy and as the core reward for RL alignment, outperforming conventional baselines on structured reasoning and open-ended dialogue tasks (Tang et al., 2 Dec 2025).
1. Stable Rank: Definition and Theoretical Foundation
Let $H \in \mathbb{R}^{T \times d}$ denote the matrix of final-layer hidden activations for a response of $T$ tokens and hidden dimension $d$. The singular values of $H$ (ordered $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(T,d)} \ge 0$) define its stable rank as:

$$\operatorname{srank}(H) = \frac{\|H\|_F^2}{\|H\|_2^2} = \frac{\sum_i \sigma_i^2}{\sigma_1^2}.$$
Stable rank quantifies the "effective dimensionality" of activations: $\operatorname{srank}(H) \approx 1$ signals near-total variance in one direction (representation collapse), while $\operatorname{srank}(H)$ approaches $\min(T, d)$ as variance disperses evenly across directions. This measure is strongly motivated by Softmax Bottleneck theory, which posits that natural language modeling requires navigating a high-rank semantic manifold; reduced or collapsed rank impairs output expressiveness. High-quality responses maintain coherent, information-rich latent trajectories, while error states, hallucinations, or repetition are linked to reduced effective dimensionality (Tang et al., 2 Dec 2025).
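As a concrete illustration of this definition, the following minimal sketch computes the stable rank of a response's final-layer activations with a Hugging Face causal LM. The model name, prompt/response concatenation, and response-token slicing are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of the stable-rank computation defined above; model choice
# and response-token slicing are illustrative assumptions.
import torch

def stable_rank(H: torch.Tensor, eps: float = 1e-8) -> float:
    """srank(H) = ||H||_F^2 / ||H||_2^2 = (sum_i sigma_i^2) / sigma_1^2."""
    s = torch.linalg.svdvals(H.float())   # singular values, descending order
    return float(s.square().sum() / (s[0].square() + eps))

@torch.no_grad()
def response_stable_rank(model, tokenizer, prompt: str, response: str) -> float:
    """Stable rank of the final-layer hidden states over the response tokens."""
    ids = tokenizer(prompt + response, return_tensors="pt").to(model.device)
    out = model(**ids, output_hidden_states=True)
    H = out.hidden_states[-1][0]                      # (T, d) final-layer states
    resp_len = len(tokenizer(response)["input_ids"])  # rough response boundary
    return stable_rank(H[-resp_len:])

# Usage (hypothetical model choice):
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
#   lm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
#   print(response_stable_rank(lm, tok, "Q: ...\n", "A: ..."))
```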
2. Rationale for Stable Rank as an Alignment Signal
Stable rank provides an intrinsic, annotation-free reward that correlates with response quality. Unlike human annotation or reward models, which are susceptible to subjectivity, reward hacking, and limited by data scarcity, stable rank is derived solely from hidden activations. Empirical analysis reveals that stable rank:
- Achieves 84.04% zero-shot preference prediction accuracy on RewardBench (Qwen3-8B), matching or exceeding learned reward and self-evaluation baselines (Pointwise 83.70%, IPO 78.02%).
- On Qwen2.5-1.5B, stable rank retains 75.95% accuracy versus 65.85% for IPO.
- Maintains over 70% accuracy when computed at the final transformer layer, with a substantial drop-off (to roughly 50%) when computed from earlier layers.
Ablation studies confirm the robustness of stable rank: its performance is resilient to the prompt concatenation format (variation ≤3 accuracy points), only modestly impacted by context-window truncation as long as at least 512 tokens are retained (84.0% at 512 vs 84.04% at 4096 tokens), and significantly superior to intrinsic-dimension alternatives (condition number 36.0%, PCA-95% 61.9%, effective rank 54.5%) (Tang et al., 2 Dec 2025).
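Under the same assumptions as the earlier sketch, zero-shot preference prediction amounts to comparing the stable ranks of two candidate responses to the same prompt. The sketch below reuses the hypothetical `response_stable_rank` helper from Section 1 and is an illustrative reconstruction, not the paper's evaluation code.

```python
# Hedged sketch: zero-shot pairwise preference prediction with stable rank,
# reusing the hypothetical `response_stable_rank` helper defined earlier.
def predict_preference(model, tokenizer, prompt, response_a, response_b) -> str:
    """Return 'a' if response_a is predicted preferred, else 'b'."""
    sr_a = response_stable_rank(model, tokenizer, prompt, response_a)
    sr_b = response_stable_rank(model, tokenizer, prompt, response_b)
    return "a" if sr_a >= sr_b else "b"   # higher stable rank => preferred

def pairwise_accuracy(model, tokenizer, pairs) -> float:
    """Fraction of (prompt, chosen, rejected) triples where the chosen
    response receives the higher stable rank (RewardBench-style accuracy)."""
    hits = sum(predict_preference(model, tokenizer, p, chosen, rejected) == "a"
               for p, chosen, rejected in pairs)
    return hits / max(len(pairs), 1)
```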
3. SR-GRPO Reinforcement Learning Objective
SR-GRPO trains a policy $\pi_\theta$ (parameterized via LoRA adapters) against a frozen reference policy $\pi_{\text{ref}}$, using stable rank as the reward. For each prompt $x$, a group of $G$ candidate completions $\{y_1, \dots, y_G\}$ is generated. The stable rank reward $r_i = \operatorname{srank}(H_i)$ is computed from the final-layer hidden states $H_i$ of each candidate. Within each prompt group, rewards are standardized into advantages:

$$A_i = \frac{r_i - \mu}{\sigma + \epsilon},$$

where $\mu$ and $\sigma$ are the group mean and standard deviation of the rewards. The RL objective is:

$$J(\theta) = \mathbb{E}_{x}\left[\frac{1}{G}\sum_{i=1}^{G} w_i\, A_i\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right),$$

where $w_i = \pi_\theta(y_i \mid x) / \pi_{\text{ref}}(y_i \mid x)$ is the importance weight and $\beta$ is the KL penalty coefficient. This group-centric approach emphasizes relative advantages among contemporaneously sampled responses (Tang et al., 2 Dec 2025).
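The objective above can be expressed as a short loss function. The following is a minimal sketch assuming per-response summed log-probabilities under $\pi_\theta$ and $\pi_{\text{ref}}$ and a simple sequence-level KL estimate; the paper's exact implementation (e.g., token-level averaging or clipping) may differ.

```python
# Hedged sketch of the per-group SR-GRPO loss as reconstructed above.
import torch

def sr_grpo_loss(logp_policy: torch.Tensor,  # (G,) log pi_theta(y_i | x)
                 logp_ref: torch.Tensor,     # (G,) log pi_ref(y_i | x), no grad
                 rewards: torch.Tensor,      # (G,) stable-rank rewards r_i
                 beta: float,
                 eps: float = 1e-8) -> torch.Tensor:
    # Group-standardized advantages A_i = (r_i - mu) / (sigma + eps).
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Importance weights w_i = pi_theta(y_i|x) / pi_ref(y_i|x).
    w = torch.exp(logp_policy - logp_ref)
    # Simple sequence-level estimate of KL(pi_theta || pi_ref).
    kl = (logp_policy - logp_ref).mean()
    objective = (w * adv).mean() - beta * kl
    return -objective  # minimize the negative of the group-relative objective
```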
4. SR-GRPO Algorithmic Procedure
SR-GRPO employs "Group Relative Policy Optimization" to maximize the stable rank-derived objective on batches of prompts. The principal workflow:
- Initialization: Adapter parameters are initialized so that the policy coincides with the reference policy ($\pi_\theta = \pi_{\text{ref}}$).
- Sampling: For each batch, prompts are drawn.
- Candidate Generation: For each prompt $x$, sample $G$ responses $\{y_1, \dots, y_G\}$ from $\pi_\theta$.
- Reward Calculation: Evaluate the final-layer hidden-state matrices $H_i$ with the frozen reference model and compute $r_i = \operatorname{srank}(H_i)$.
- Advantage Computation: Standardize rewards within each prompt group: compute the mean $\mu$, standard deviation $\sigma$, and advantages $A_i = (r_i - \mu)/(\sigma + \epsilon)$.
- Objective Evaluation: Compute policy gradient using group-wise advantages and KL penalty.
- Update: Apply AdamW to LoRA adapter parameters.
Experimental hyperparameters include LoRA rank = 16 and batch size 128, with the KL coefficient $\beta$, learning rate, and group size $G$ as specified in the paper, and training for up to 400 steps (Qwen2.5-1.5B) or 300 steps (DeepSeek-R1). Evaluation is carried out on a range of STEM, mathematical, and open-ended chat benchmarks (Tang et al., 2 Dec 2025).
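To make the workflow concrete, here is a hedged single-step sketch that strings together the earlier helpers (`response_stable_rank`, `sr_grpo_loss`). The group size, sampling settings, KL coefficient, and response-token log-probability slicing below are illustrative placeholders and assumptions, not the paper's reported values.

```python
# Hedged sketch of one SR-GRPO update for a single prompt, reusing the
# earlier `response_stable_rank` and `sr_grpo_loss` sketches. Group size,
# sampling settings, and beta are placeholders, not the paper's values.
import torch

@torch.no_grad()
def sample_group(policy, tokenizer, prompt, G=8, max_new_tokens=256):
    ids = tokenizer(prompt, return_tensors="pt").to(policy.device)
    out = policy.generate(**ids, do_sample=True, num_return_sequences=G,
                          max_new_tokens=max_new_tokens)
    prompt_len = ids["input_ids"].shape[1]
    return [tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)
            for seq in out]

def response_logprob(model, tokenizer, prompt, response):
    """Sum of token log-probabilities assigned to the response continuation."""
    ids = tokenizer(prompt + response, return_tensors="pt").to(model.device)
    logits = model(**ids).logits[:, :-1]                 # next-token predictions
    logp = torch.log_softmax(logits, dim=-1)
    targets = ids["input_ids"][:, 1:]
    tok_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    resp_len = len(tokenizer(response)["input_ids"])     # rough boundary
    return tok_logp[0, -resp_len:].sum()

def sr_grpo_step(policy, ref_model, tokenizer, prompt, optimizer, beta=0.04):
    responses = sample_group(policy, tokenizer, prompt)
    # Stable-rank rewards from the frozen reference model's final layer.
    rewards = torch.tensor([response_stable_rank(ref_model, tokenizer, prompt, y)
                            for y in responses])
    logp_pi = torch.stack([response_logprob(policy, tokenizer, prompt, y)
                           for y in responses])
    with torch.no_grad():
        logp_ref = torch.stack([response_logprob(ref_model, tokenizer, prompt, y)
                                for y in responses])
    loss = sr_grpo_loss(logp_pi, logp_ref, rewards, beta)
    loss.backward()
    optimizer.step()          # AdamW over the LoRA adapter parameters
    optimizer.zero_grad()
    return loss.item()
```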
Algorithm Summary Table
| Step | Description | Default Values |
|---|---|---|
| Sampling | Draw a batch of prompts | Batch size = 128 |
| Completions | For each prompt $x$, sample $G$ responses from $\pi_\theta$ | Group size $G$ |
| Reward | Compute $r_i = \operatorname{srank}(H_i)$ on final-layer hidden states $H_i$ | |
| Policy Update | Compute group-relative advantages, apply KL penalty $\beta$, update LoRA adapters with AdamW | LoRA rank = 16 |
5. Experimental Setup and Baselines
SR-GRPO evaluations utilize:
- Models: Qwen2.5-1.5B-Instruct as the base policy; DeepSeek-R1-Distill-Qwen-1.5B for reasoning.
- Training Data: SmolTalk2 prompts without preference labels.
- Benchmarks: RewardBench for zero-shot reward accuracy (2,985 pairwise preferences), STEM QA (GPQA, MMLU-redux), mathematical reasoning (MATH500, AIME25, OlympiadBench, AMC23), and WildBench for chat quality (GPT-4o-mini Elo).
- Baselines: Learned Reward Model (Skywork-Reward-V2-Qwen3-1.7B), Self-Reward (pointwise), Perplexity (-NLL), IPO (Yes/No classifier).
6. Empirical Results and Ablation Studies
SR-GRPO demonstrates the following key outcomes:
- Zero-shot Reward Accuracy: Stable rank achieves 84.04% on RewardBench (Qwen3-8B), 75.95% (Qwen2.5-1.5B), exceeding learned reward and IPO baselines.
- Decoding Gains: Best-of-16 sampling plus SR selection yields +11.3 points average accuracy over greedy decoding on STEM/math tasks, outperforming random@N by 5–34% relative (see the selection sketch at the end of this section).
- Alignment Improvements (Label-free RL):
- Qwen2.5-1.5B-Instruct: STEM average +1.2 pts (33.3→34.5), Math average +4.4 pts (28.0→32.4; +19% relative), WildBench Elo +26.2.
- DeepSeek-R1-1.5B: STEM average +2.6 pts (35.8→38.4), Math average +6.2 pts (58.5→64.7; +10.6%), WildBench Elo +19.0.
- Ablations:
- Cross-layer: Final-layer SR is critical; middle-layer SR is uninformative.
- Context window: Windows ≥512 tokens preserve performance, but 128 tokens causes drop (62.6% accuracy).
- Prompt concatenation: Format choice has minimal impact (≤3 pts variation).
- Alternative metrics: Condition number, effective rank, and PCA-based intrinsic dimension are all inferior to stable rank as a reward signal.
A plausible implication is that the stable rank provides a uniquely relevant geometric signal for response quality in LLMs, distinct from existing intrinsic-dimension metrics and robust across prompt and window configurations (Tang et al., 2 Dec 2025).
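The best-of-N decoding gain reported above corresponds to sampling N candidates and keeping the one with the highest stable rank. The minimal sketch below reuses the hypothetical `sample_group` and `response_stable_rank` helpers from the earlier sections and is illustrative rather than the paper's implementation.

```python
# Hedged sketch: best-of-N selection by stable rank, matching the Best-of-16
# decoding experiments above; `sample_group` and `response_stable_rank` are
# the hypothetical helpers from the earlier sketches.
def best_of_n(model, tokenizer, prompt, n=16):
    candidates = sample_group(model, tokenizer, prompt, G=n)
    scores = [response_stable_rank(model, tokenizer, prompt, y)
              for y in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]
```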
7. Significance and Limitations
SR-GRPO establishes a geometric, annotation-free basis for LLM alignment, bypassing external preference models and human-labeled data. This framework demonstrates practical improvements in both structured and open-ended benchmarks, suggesting the viability of internal representation geometry as a proxy for desirable model behavior. The empirical efficacy of stable rank further highlights the limitations of competing internal metrics and positions model-layer geometry as a focal area for future alignment strategies.
A key limitation is the reliance on final-layer activations of a large reference policy; performance drops significantly when stable rank is derived from earlier layers or from substantially truncated context windows. While stable rank outperforms other geometry-based alternatives, further exploration is warranted for scaling, interpretability, and potential integration with multi-signal reward frameworks.