- The paper introduces GAIN-RL, a framework that leverages intrinsic angle signals to enhance training efficiency in RL fine-tuning.
- It exploits angle concentration in token hidden states to strategically reorder training data, achieving over 2.5× training acceleration.
- The study demonstrates GAIN-RL's wide applicability across diverse model sizes and tasks, potentially reducing computational costs in LLM fine-tuning.
An Analysis of Efficient RL Fine-Tuning in LLMs
The research paper titled "Angles Don’t Lie: Unlocking Training-Efficient RL Through the Model's Own Signals" introduces a framework for improving the efficiency of Reinforcement Fine-Tuning (RFT) in LLMs. The work addresses the sample inefficiency endemic to existing RFT paradigms by introducing a model-intrinsic signal, termed "angle concentration," that reveals how readily the model can learn from specific data.
Overview
Current approaches to RFT suffer from high computational costs and low sample efficiency, largely because uniform sampling exposes the model to the same data repeatedly. While curriculum-learning strategies have been employed, they typically rely on heuristic difficulty metrics that overlook signals generated by the model itself. To address these shortcomings, the paper examines the angular distribution of token hidden-state vectors and establishes its correlation with gradient magnitudes, allowing the model's preference for high-angle-concentration data to inform the training process.
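To make the signal concrete, here is a minimal sketch, assuming angle concentration can be summarized as the mean pairwise cosine similarity (equivalently, small pairwise angles) among a prompt's token hidden-state vectors at one layer; the paper's exact metric may differ, and `angle_concentration` is an illustrative name, not the paper's API.

```python
import torch

def angle_concentration(hidden: torch.Tensor) -> float:
    """Score one layer's token hidden states, shape (num_tokens, hidden_dim).

    Proxy metric: mean pairwise cosine similarity; higher means the token
    vectors point in more similar directions (smaller pairwise angles).
    """
    h = torch.nn.functional.normalize(hidden, dim=-1)  # unit vectors
    cos = h @ h.T                                      # pairwise cosines
    mask = ~torch.eye(h.shape[0], dtype=torch.bool)    # drop self-similarity
    return cos[mask].mean().item()

# Toy check: states clustered around a shared direction score far higher
# than isotropic random states.
torch.manual_seed(0)
random_states = torch.randn(32, 768)
clustered = torch.randn(1, 768) + 0.3 * torch.randn(32, 768)
print(angle_concentration(random_states))  # ~0.0
print(angle_concentration(clustered))      # close to 1.0
```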
The core contribution of this research is the Gradient-driven Angle-Informed Navigated RL framework (GAIN-RL), which strategically selects training data using the angle-concentration signal to maximize the impact of each gradient update. Empirical evaluations report over 2.5× training acceleration across various tasks and model sizes, illustrating that leveraging intrinsic model signals for data selection can substantially reduce training time and improve performance.
Key Findings and Methodology
The paper introduces several theoretical insights into angle concentration in model training:
- Layer-wise Angle Concentration Pattern: Early layers induce intra-segment angle concentration, while later layers promote inter-segment concentration, facilitating effective information flow (see the probe sketch after this list).
- Epoch-wise Angle Concentration Pattern: Throughout training, both intra-segment and inter-segment angle concentration intensify, suggesting a curriculum-like progression in how the model absorbs data.
- Data-wise Angle Concentration Pattern: High angle concentration samples are learned before low angle concentration samples, demonstrating a model-preferred ordering that optimizes gradient updates.
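As a concrete illustration of the layer-wise pattern, the following toy probe scores fabricated per-layer hidden states; in practice the list would come from a Hugging Face model called with `output_hidden_states=True`. The synthetic layers simply mix in more of a shared direction with depth, mimicking the concentration trend the paper reports; none of this reproduces the paper's actual measurement code.

```python
import torch

def concentration(h: torch.Tensor) -> float:
    """Mean pairwise cosine similarity across token hidden states (proxy metric)."""
    h = torch.nn.functional.normalize(h, dim=-1)
    cos = h @ h.T
    return cos[~torch.eye(h.shape[0], dtype=torch.bool)].mean().item()

torch.manual_seed(0)
num_layers, num_tokens, dim = 12, 32, 768
shared = torch.randn(1, dim)  # direction all tokens drift toward with depth

# Stand-in for model(..., output_hidden_states=True).hidden_states: deeper
# layers mix in more of the shared direction, shrinking pairwise angles.
hidden_states = [
    (layer / num_layers) * shared + torch.randn(num_tokens, dim)
    for layer in range(num_layers + 1)
]

for layer, h in enumerate(hidden_states):
    print(f"layer {layer:2d}: angle concentration = {concentration(h):.3f}")
```

On a real model, the same loop over the returned hidden-state tuple would show whether concentration in fact grows with depth for a given prompt.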
Based on these insights, GAIN-RL combines three primary components: data reordering, dynamic Gaussian sampling, and a probability update driven by real-time angle-concentration and accuracy signals. The framework is plug-and-play across diverse model architectures and datasets, as the experimental results indicate; a minimal sketch of the loop follows.
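Here is how these three components might fit together, under stated assumptions: examples are sorted by angle concentration in descending order (the model-preferred ordering above), each batch is drawn from a Gaussian over the sorted positions, and the Gaussian's mean advances as accuracy rises. The paper's actual distribution parameters and update rule are not reproduced here, and `train_on` is a hypothetical placeholder for one RL update (e.g., a GRPO step).

```python
import numpy as np

rng = np.random.default_rng(0)

def train_on(batch: np.ndarray) -> float:
    """Hypothetical placeholder for one RL fine-tuning step; returns batch accuracy."""
    return float(rng.random())

scores = rng.random(1000)          # angle-concentration score per training example
order = np.argsort(-scores)        # (1) data reordering: high concentration first

mean, std = 0.0, 150.0             # Gaussian over positions in the sorted order
for step in range(100):
    # (2) dynamic Gaussian sampling: draw batch positions, clipped to the dataset
    pos = np.clip(rng.normal(mean, std, size=32).astype(int), 0, len(order) - 1)
    accuracy = train_on(order[pos])
    # (3) probability update: higher accuracy drifts the sampling window toward
    # later, lower-concentration (and presumably harder) examples
    mean = min(mean + 20.0 * accuracy, float(len(order) - 1))
```

The Gaussian window keeps each batch focused on data the model is currently positioned to learn from, while the accuracy-driven drift supplies the curriculum-like progression observed in the epoch-wise pattern.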
Implications and Future Directions
The implications of this work are significant for RL and LLMs, where training costs remain a barrier to widespread application. Using model-centric signals rather than external difficulty metrics offers a more tailored and efficient approach to training, with potential impact on both theoretical studies of model dynamics and practical AI development.
Future research could explore the angle-concentration signal in other contexts, such as pre-training or inference, to extend model-centric optimization. Moreover, given the paper's successful adaptation of GAIN-RL to PPO, its integration with RL algorithms beyond GRPO and its effect on broader AI tasks warrant further exploration.
In conclusion, the paper makes a compelling case for letting internal model signals dictate efficient RL training regimes, underscoring a shift toward more nuanced methodologies in AI training.