LIMR: Less is More for RL Scaling (2502.11886v1)

Published 17 Feb 2025 in cs.LG, cs.AI, and cs.CL

Abstract: In this paper, we ask: what truly determines the effectiveness of RL training data for enhancing LLMs' reasoning capabilities? While recent advances like o1, DeepSeek R1, and Kimi k1.5 demonstrate RL's potential, the lack of transparency about training data requirements has hindered systematic progress. Starting directly from base models without distillation, we challenge the assumption that scaling up RL training data inherently improves performance. We demonstrate that a strategically selected subset of just 1,389 samples can outperform the full 8,523-sample dataset. We introduce Learning Impact Measurement (LIM), an automated method to evaluate and prioritize training samples based on their alignment with model learning trajectories, enabling efficient resource utilization and scalable implementation. Our method achieves comparable or even superior performance using only 1,389 samples versus the full 8,523-sample dataset. Notably, while recent data-efficient approaches (e.g., LIMO and s1) show promise with 32B-scale models, we find they significantly underperform at the 7B scale through supervised fine-tuning (SFT). In contrast, our RL-based LIMR achieves 16.7% higher accuracy on AIME24 and outperforms LIMO and s1 by 13.0% and 22.2% on MATH500. These results fundamentally reshape our understanding of RL scaling in LLMs, demonstrating that precise sample selection, rather than data scale, may be the key to unlocking enhanced reasoning capabilities. For reproducible research and future innovation, we are open-sourcing LIMR, including implementation of LIM, training and evaluation code, curated datasets, and trained models at https://github.com/GAIR-NLP/LIMR.

Summary

  • The paper demonstrates that a carefully selected subset of 1,389 RL training samples can achieve performance comparable to the full dataset of 8,523 samples.
  • It introduces LIM, a method that evaluates training samples via normalized alignment scores based on model learning trajectories.
  • LIMR outperforms baseline methods, achieving over 100% improvement on AIME24 and significant gains on AMC23 and MATH500 benchmarks.

The paper "LIMR: Less is More for RL Scaling" explores the effectiveness of reinforcement learning (RL) training data for enhancing LLMs' (LLM) reasoning capabilities. It addresses the lack of transparency regarding training data requirements in prior work, which has impeded systematic progress in the field. The authors challenge the assumption that scaling up RL training data inherently improves performance and demonstrate that a strategically selected subset of training samples can outperform the full dataset. They introduce Learning Impact Measurement (LIM), an automated method to evaluate and prioritize training samples based on their alignment with model learning trajectories.

The paper makes the following claims:

  • A carefully selected subset of RL training samples (1,389) can achieve comparable or superior performance compared to training with the full dataset (8,523), demonstrating that the quality and relevance of training samples matter more than their quantity.
  • LIM effectively predicts which samples will contribute most significantly to model improvement, eliminating the need for manual sample curation and making the methodology easily scalable.
  • Recent data-efficient approaches like LIMO and s1, which show promise with 32B-scale models via supervised fine-tuning (SFT), significantly underperform at the 7B scale. However, the RL-based LIMR achieves 16.7% higher accuracy on AIME24 and surpasses LIMO and s1 by 13.0% and 22.2% on MATH500, suggesting that RL may be more effective for enhancing reasoning capabilities in data-sparse scenarios.

To quantify and optimize the value of training data in RL, the authors present LIM, which systematically analyzes learning dynamics to identify the most effective training samples. They conducted an extensive analysis using the MATH-FULL dataset, which contains 8,523 mathematical problems of varying difficulty levels (3-5), revealing that different training samples contribute unequally to model learning. The core of LIM centers on a model-aligned trajectory analysis that evaluates training samples based on their contribution to model learning. Given that neural network learning typically follows a logarithmic growth pattern, the model's average reward curve is used as a reference for measuring sample effectiveness. The model's average reward curve is defined as:

$r_{\text{avg}}^k = \frac{1}{N} \sum_{i=1}^{N} r_i^k, \quad k = 1, \ldots, K$

where:

  • $r_i^k$ represents the reward of sample $i$ at epoch $k$
  • $N$ is the total number of samples
  • $K$ is the total number of epochs
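
The computation itself is simple; as a minimal sketch (not the authors' released code), assuming the per-sample, per-epoch rewards are stored in an (N, K) NumPy array:

```python
# Minimal sketch: epoch-wise average reward curve r_avg^k.
# Assumes `rewards` is an (N, K) array with rewards[i, k] = reward of sample i at epoch k.
import numpy as np

def average_reward_curve(rewards: np.ndarray) -> np.ndarray:
    """Return r_avg^k for k = 1, ..., K, averaged over the N samples."""
    return rewards.mean(axis=0)  # shape (K,)
```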

For each sample, LIM computes a normalized alignment score:

$s_i = 1 - \frac{\sum_{k=1}^{K} (r_i^k - r_{\text{avg}}^k)^2}{\sum_{k=1}^{K} (1 - r_{\text{avg}}^k)^2}, \quad i = 1, \ldots, N$

where:

  • $s_i$ is the normalized alignment score for sample $i$
  • $r_i^k$ represents the reward of sample $i$ at epoch $k$
  • $r_{\text{avg}}^k$ is the average reward at epoch $k$
  • $K$ is the total number of epochs
  • $N$ is the total number of samples

Based on the alignment scores, LIM implements a selective sampling strategy: a sample is retained if $s_i > \theta$, where $\theta$ serves as a quality threshold. In the experiments, setting $\theta = 0.6$ yielded an optimized dataset (LIMR) of 1,389 high-value samples from the original dataset.
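
A minimal sketch of the alignment score and the threshold-based selection, under the same assumption of an (N, K) reward matrix (names and shapes are illustrative, not taken from the released implementation):

```python
# Minimal sketch of the LIM alignment score s_i and threshold selection.
import numpy as np

def lim_alignment_scores(rewards: np.ndarray) -> np.ndarray:
    """s_i = 1 - sum_k (r_i^k - r_avg^k)^2 / sum_k (1 - r_avg^k)^2."""
    r_avg = rewards.mean(axis=0)                 # (K,) average reward per epoch
    num = ((rewards - r_avg) ** 2).sum(axis=1)   # (N,) deviation from the average curve
    den = ((1.0 - r_avg) ** 2).sum()             # scalar normalizer
    return 1.0 - num / den

def select_samples(rewards: np.ndarray, theta: float = 0.6) -> np.ndarray:
    """Indices of samples whose alignment score exceeds the quality threshold."""
    return np.where(lim_alignment_scores(rewards) > theta)[0]
```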

The paper compares LIM with several baseline data selection methods:

  • RAND: Randomly selects 1,389 samples from MATH-FULL to match the size of LIMR, serving as a reference point for evaluating selective sampling effectiveness.
  • LINEAR: Evaluates samples based on their consistency in showing steady improvements across training epochs. Using a threshold of $\theta = 0.7$, this method yields 1,189 samples (one possible formulation is sketched after this list).
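
The summary does not spell out how LINEAR scores "steady improvement"; the sketch below is only one plausible interpretation, scoring each sample by the correlation between its reward trajectory and the epoch index and keeping samples above the threshold:

```python
# Hypothetical reading of the LINEAR baseline, not the paper's exact definition:
# score each sample by how linearly its reward grows with the epoch index.
import numpy as np

def linear_progress_scores(rewards: np.ndarray) -> np.ndarray:
    epochs = np.arange(rewards.shape[1])
    scores = []
    for traj in rewards:
        if traj.std() == 0:                      # flat trajectory: no measurable progress
            scores.append(0.0)
        else:
            scores.append(float(np.corrcoef(traj, epochs)[0, 1]))
    return np.array(scores)

def select_linear(rewards: np.ndarray, theta: float = 0.7) -> np.ndarray:
    return np.where(linear_progress_scores(rewards) > theta)[0]
```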

Similar to DeepSeek R1, a rule-based reward function is used. Specifically, a correct answer receives a reward of 1; an incorrect but properly formatted answer receives -0.5; and an answer with formatting errors receives -1. Formally, this can be expressed as:

$R(\text{answer}) = \begin{cases} 1 & \text{if the answer is correct,} \\ -0.5 & \text{if the answer is incorrect but well-formatted,} \\ -1 & \text{if the answer has formatting errors.} \end{cases}$

where:

  • $R(\text{answer})$ is the reward for a given answer.
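
As an illustration, the reward rule maps directly onto a small function; the correctness and formatting checks are left as hypothetical predicates, since their implementations are not described here:

```python
# Minimal sketch of the rule-based reward. `is_correct` and `is_well_formatted`
# are hypothetical caller-supplied predicates (e.g., answer extraction plus
# exact-match checking); they are not specified in the summary above.
def rule_based_reward(answer: str, is_correct, is_well_formatted) -> float:
    if is_correct(answer):
        return 1.0
    if is_well_formatted(answer):
        return -0.5
    return -1.0
```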

For the training setup, RL training is conducted with the Proximal Policy Optimization (PPO) algorithm as implemented in the OpenRLHF framework, using Qwen2.5-Math-7B as the initial policy model. The rollout batch size is 1,024, and 8 samples per prompt are generated with a temperature of 1.2 during exploration. The training process uses a batch size of 256, with learning rates of 5e-7 and 9e-6 for the actor and critic models respectively, and a KL coefficient of 0.01. Evaluations were conducted on MATH500, AIME2024, and AMC2023. To accelerate evaluation, the vLLM framework was used.
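
On the evaluation side, a minimal vLLM generation sketch is shown below; the checkpoint name, prompt, and sampling settings are illustrative placeholders rather than the paper's exact evaluation harness:

```python
# Minimal sketch of batched generation with vLLM for benchmark evaluation.
from vllm import LLM, SamplingParams

llm = LLM(model="GAIR/LIMR")  # hypothetical checkpoint name; substitute the actual trained model path
params = SamplingParams(temperature=0.0, max_tokens=2048)

prompts = ["Solve: what is the sum of the first 100 positive integers?"]  # placeholder benchmark prompt
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```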

The main results reported are as follows:

  • Directly applying RL to Qwen-Math-7B using the MATH-FULL dataset resulted in a significant performance improvement.
  • Training with the MATH-RAND dataset results in an average accuracy drop of 8.1% compared to using the full dataset, whereas MATH-LINEAR incurs only a roughly 2% loss.
  • LIMR, despite an 80% reduction in dataset size, performs nearly on par with MATH-FULL.
  • LIMR achieves performance comparable to MATH-FULL on all three benchmarks (AIME24, MATH500, AMC23), while significantly outperforming the RAND baseline.
  • Compared to LIMO and s1, LIMR achieves a relative improvement of over 100% on AIME24 and at least a 10% accuracy increase on AMC23 and MATH500.

The authors conclude that the path to better reasoning capabilities may lie in optimizing sample quality rather than increasing data quantity and that RL, when combined with efficient data selection, can be particularly effective for smaller models with limited data.
