- The paper demonstrates that a carefully selected subset of 1,389 RL training samples can achieve performance comparable to the full dataset of 8,523 samples.
- It introduces LIM, a method that evaluates training samples via normalized alignment scores based on model learning trajectories.
- LIMR outperforms baseline methods, achieving over 100% improvement on AIME24 and significant gains on AMC23 and MATH500 benchmarks.
The paper "LIMR: Less is More for RL Scaling" explores the effectiveness of reinforcement learning (RL) training data for enhancing LLMs' (LLM) reasoning capabilities. It addresses the lack of transparency regarding training data requirements in prior work, which has impeded systematic progress in the field. The authors challenge the assumption that scaling up RL training data inherently improves performance and demonstrate that a strategically selected subset of training samples can outperform the full dataset. They introduce Learning Impact Measurement (LIM), an automated method to evaluate and prioritize training samples based on their alignment with model learning trajectories.
The paper makes the following claims:
- A carefully selected subset of RL training samples (1,389) can achieve comparable or superior performance compared to training with the full dataset (8,523), demonstrating that the quality and relevance of training samples matter more than their quantity.
- LIM effectively predicts which samples will contribute most significantly to model improvement, eliminating the need for manual sample curation and making the methodology easily scalable.
- Recent data-efficient approaches like LIMO and s1, which show promise with 32B-scale models via supervised fine-tuning (SFT), significantly underperform at the 7B scale. In contrast, the RL-based LIMR achieves 16.7\% higher accuracy on AIME24 and surpasses LIMO and s1 by 13.0\% and 22.2\% on MATH500, respectively, suggesting that RL may be more effective for enhancing reasoning capabilities in data-sparse scenarios.
To quantify and optimize the value of training data in RL, the authors present LIM, which systematically analyzes learning dynamics to identify the most effective training samples. They conducted an extensive analysis using the MATH-FULL dataset, which contains 8,523 mathematical problems of varying difficulty levels (3-5), revealing that different training samples contribute unequally to model learning. At its core, LIM performs a model-aligned trajectory analysis that evaluates training samples by their contribution to model learning. Given that neural network learning typically follows a logarithmic growth pattern, the model's average reward curve is used as a reference for measuring sample effectiveness; it is defined as:
$r_{\text{avg}}^k = \frac{1}{N}\sum_{i=1}^{N} r_i^k, \quad k = 1, \dots, K$
where:
- $r_i^k$ represents the reward of sample $i$ at epoch $k$
- $N$ is the total number of samples
- $K$ is the total number of epochs
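To make the notation concrete, a minimal NumPy sketch of the average reward curve might look as follows; the reward matrix here is randomly generated for illustration and is not the authors' data or code:

```python
import numpy as np

# Hypothetical reward matrix: rewards[i, k] is the reward r_i^k of sample i at
# epoch k, for N samples tracked over K epochs (sizes chosen for illustration).
N, K = 8523, 8
rng = np.random.default_rng(0)
rewards = rng.choice([-1.0, -0.5, 1.0], size=(N, K))

# Average reward curve r_avg^k = (1/N) * sum_i r_i^k, one value per epoch.
r_avg = rewards.mean(axis=0)  # shape (K,)
```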
For each sample, LIM computes a normalized alignment score:
$s_i = 1 - \frac{\sum_{k=1}^{K} \left(r_i^k - r_{\text{avg}}^k\right)^2}{\sum_{k=1}^{K} \left(1 - r_{\text{avg}}^k\right)^2}, \quad i = 1, \dots, N$
where:
- $s_i$ is the normalized alignment score for sample $i$
- $r_i^k$ represents the reward of sample $i$ at epoch $k$
- $r_{\text{avg}}^k$ is the average reward at epoch $k$
- $K$ is the total number of epochs
- $N$ is the total number of samples
Based on the alignment scores, LIM implements a selective sampling strategy that retains samples with $s_i > \theta$, where $\theta$ serves as a quality threshold. In the experiments, setting $\theta = 0.6$ yielded an optimized dataset (LIMR) of 1,389 high-value samples from the original dataset.
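A minimal sketch of the alignment score and threshold-based selection, assuming rewards have already been collected into an (N, K) matrix as above; the function name and toy data are illustrative, not the paper's released implementation:

```python
import numpy as np

def lim_select(rewards: np.ndarray, theta: float = 0.6) -> np.ndarray:
    """Return indices of samples whose alignment score s_i exceeds theta.

    rewards: array of shape (N, K) with rewards[i, k] = r_i^k.
    """
    r_avg = rewards.mean(axis=0)                # r_avg^k, shape (K,)
    num = ((rewards - r_avg) ** 2).sum(axis=1)  # sum_k (r_i^k - r_avg^k)^2 per sample
    den = ((1.0 - r_avg) ** 2).sum()            # sum_k (1 - r_avg^k)^2
    scores = 1.0 - num / den                    # s_i, shape (N,)
    return np.where(scores > theta)[0]

# Example: select high-value samples from a toy reward matrix.
rng = np.random.default_rng(0)
toy_rewards = rng.choice([-1.0, -0.5, 1.0], size=(100, 8))
selected = lim_select(toy_rewards, theta=0.6)
```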
The paper compares LIM with several baseline data selection methods:
- RAND: Randomly selects 1,389 samples from MATH-FULL to match the size of LIMR, serving as a reference point for evaluating selective sampling effectiveness.
- LINEAR: Evaluates samples based on their consistency in showing steady improvements across training epochs. Using a threshold of $\theta = 0.7$, this method yields 1,189 samples.
Similar to DeepSeek-R1, a rule-based reward function is used. Specifically, for a correct answer, the reward is 1; for an incorrect but properly formatted answer, the reward is -0.5; and for an answer with formatting errors, the reward is -1. Formally, this can be expressed as:
$R(\text{answer}) = \begin{cases} 1 & \text{if the answer is correct,} \\ -0.5 & \text{if the answer is incorrect but well-formatted,} \\ -1 & \text{if the answer has formatting errors.} \end{cases}$
where:
- $R(\text{answer})$ is the reward for a given answer.
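A direct translation of this rule into code might look like the sketch below; the `is_well_formatted` and `is_correct` predicates are hypothetical placeholders, since the paper does not specify the verifier at this level of detail:

```python
def is_well_formatted(answer: str) -> bool:
    # Placeholder format check: e.g., require a final boxed answer.
    # The paper's actual formatting criterion is not described in detail.
    return "\\boxed{" in answer

def is_correct(answer: str, ground_truth: str) -> bool:
    # Placeholder correctness check: substring match against the reference answer.
    # A real rule-based verifier would parse and compare the extracted answer.
    return ground_truth in answer

def compute_reward(answer: str, ground_truth: str) -> float:
    """Rule-based reward: +1 if correct, -0.5 if incorrect but well-formatted,
    -1 if the answer has formatting errors."""
    if not is_well_formatted(answer):
        return -1.0
    return 1.0 if is_correct(answer, ground_truth) else -0.5
```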
For the training setup, RL training is conducted with the Proximal Policy Optimization (PPO) algorithm as implemented in the OpenRLHF framework, using Qwen2.5-Math-7B as the initial policy model. The rollout batch size is 1,024, and 8 samples per prompt are generated with a temperature of 1.2 during exploration. The training process uses a batch size of 256, with learning rates of 5e-7 and 9e-6 for the actor and critic models, respectively, and a KL coefficient of 0.01. Evaluations were conducted on MATH500, AIME2024, and AMC2023, with the vLLM framework used to accelerate inference.
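For reference, the reported hyperparameters can be collected into a simple configuration mapping; this is merely a restatement of the numbers above, not an actual OpenRLHF launch command:

```python
# Reported PPO training hyperparameters (illustrative summary, not a launch script).
ppo_config = {
    "policy_model": "Qwen2.5-Math-7B",  # initial policy model
    "algorithm": "PPO",                 # via the OpenRLHF framework
    "rollout_batch_size": 1024,
    "samples_per_prompt": 8,
    "rollout_temperature": 1.2,
    "train_batch_size": 256,
    "actor_learning_rate": 5e-7,
    "critic_learning_rate": 9e-6,
    "kl_coefficient": 0.01,
}
```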
The main results reported are as follows:
- Directly applying RL to Qwen2.5-Math-7B using the MATH-FULL dataset resulted in a significant performance improvement.
- Training with the MATH-RAND dataset results in an average accuracy drop of 8.1\% compared to using the full dataset, whereas MATH-LINEAR incurs only a ~2\% loss.
- LIMR, despite an 80\% reduction in dataset size, performs nearly on par with MATH-FULL.
- LIMR achieves performance comparable to MATH-FULL on all three benchmarks (AIME24, MATH500, AMC23), while significantly outperforming the RAND baseline.
- Compared to LIMO and s1, LIMR achieves a relative improvement of over 100\% on AIME24, and at least a 10\% accuracy increase on AMC23 and MATH500.
The authors conclude that the path to better reasoning capabilities may lie in optimizing sample quality rather than increasing data quantity and that RL, when combined with efficient data selection, can be particularly effective for smaller models with limited data.