- The paper introduces a novel two-stage data selection pipeline combining offline PageRank-weighted DPP and online explorability metrics to boost data efficiency.
- The paper demonstrates substantial speed-ups of 1.85x on AIME24 and 1.66x on AIME25 while maintaining performance.
- The paper validates its approach through ablation studies, confirming that both dynamic replay and careful sample selection are critical to reducing computational load.
Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward
Introduction
The paper addresses the challenge of improving data efficiency in Reinforcement Learning with Verifiable Rewards (RLVR), which is crucial for optimizing the training of LLMs on reasoning tasks. As compute demands and dataset sizes grow, traditional methods incur significant computational costs. The proposed approach selects high-quality subsets of data, both offline and online, to streamline the training pipeline.

Figure 1: AIME24 benchmark results showing significant speed-up with the proposed method using 20% training data.
Methodology
The methodology is centered around a two-stage data selection pipeline designed to reduce redundancy and computational overhead while maintaining model performance.
Offline Data Selection
The offline strategy constructs a sample graph from feature representations and applies a PageRank-weighted Determinantal Point Process (DPP) to it, ensuring that the retained samples are both diverse and influential. After pruning, the subset is further refined so that the difficulty levels of the selected samples follow a normal distribution.
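A minimal sketch of what this offline stage could look like, assuming cosine-similarity features, power-iteration PageRank, and greedy MAP inference for the DPP; the function names, the kernel form `L = diag(q) S diag(q)`, and the Gaussian difficulty weights (`mu=0.5`, `sigma=0.15`) are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def pagerank_weighted_dpp(features, k, damping=0.85, iters=100):
    """Select k diverse, influential samples from (n, d) feature embeddings."""
    # 1. Sample graph from cosine similarity of the feature representations.
    X = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    S = X @ X.T                                  # PSD similarity matrix
    W = np.clip(S, 0.0, None)                    # non-negative edge weights for the graph
    np.fill_diagonal(W, 0.0)

    # 2. PageRank over the sample graph via power iteration -> influence scores.
    n = W.shape[0]
    P = W / (W.sum(axis=1, keepdims=True) + 1e-8)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1.0 - damping) / n + damping * (P.T @ r)

    # 3. DPP kernel re-weighted by PageRank: L = diag(q) S diag(q) (assumed form).
    q = r / r.max()
    L = q[:, None] * S * q[None, :] + 1e-6 * np.eye(n)

    # 4. Greedy MAP inference: add the item giving the largest log-determinant.
    #    Brute force for clarity; fast greedy DPP algorithms exist for large n.
    selected, remaining = [], list(range(n))
    for _ in range(k):
        best, best_ld = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            _, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if logdet > best_ld:
                best, best_ld = i, logdet
        selected.append(best)
        remaining.remove(best)
    return np.array(selected)

def difficulty_filter(indices, difficulty, k, mu=0.5, sigma=0.15, seed=0):
    """Refine the pruned set so retained difficulties roughly follow N(mu, sigma).

    difficulty: per-sample difficulty over the full dataset, e.g. a failure
    rate in [0, 1]. mu and sigma are illustrative values, not the paper's.
    """
    rng = np.random.default_rng(seed)
    d = difficulty[indices]
    weights = np.exp(-0.5 * ((d - mu) / sigma) ** 2)
    keep = rng.choice(len(indices), size=min(k, len(indices)),
                      replace=False, p=weights / weights.sum())
    return indices[keep]
```

Re-weighting the DPP kernel by PageRank scores biases the determinant toward subsets that are both spread out in feature space and centrally located in the sample graph, which matches the stated goal of keeping samples that are diverse and influential.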
Figure 2: Overview of the proposed approach: an efficient pipeline combining offline and online data selection.
Online Data Selection
During RLVR training, a sample-level explorability metric guides rollout pruning, prioritizing samples with high exploration potential. In addition, a dynamic replay mechanism revisits under-explored samples, ensuring balanced training progress across the dataset.
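A sketch of what this online stage might look like, assuming binary verifiable rewards; the reward-variance explorability proxy, the `replay_frac` and `min_visits` hyperparameters, and the helper names are assumptions for illustration, not the paper's exact mechanism:

```python
import heapq
import numpy as np

def explorability(rewards):
    """Explorability proxy: variance of binary rollout rewards for one sample.
    All-pass or all-fail samples carry little learning signal; mixed outcomes
    suggest room to explore. (Assumed proxy, not the paper's exact metric.)"""
    p = float(np.mean(rewards))
    return p * (1.0 - p)

def select_rollout_batch(sample_ids, reward_history, budget,
                         replay_buffer, replay_frac=0.25, min_visits=4):
    """Choose `budget` samples for the next rollout step (illustrative sketch).

    reward_history: dict sample_id -> list of 0/1 rewards from past rollouts.
    replay_buffer:  min-heap of (visit_count, sample_id) for under-explored samples.
    """
    scored = []
    for sid in sample_ids:
        hist = reward_history.get(sid, [])
        score = 1.0 if not hist else explorability(hist)   # unseen samples score highest
        scored.append((score, sid))
        if len(hist) < min_visits:                          # track under-explored samples
            heapq.heappush(replay_buffer, (len(hist), sid))

    # Rollout pruning: most of the batch goes to the most explorable samples ...
    scored.sort(key=lambda t: t[0], reverse=True)
    n_fresh = budget - int(replay_frac * budget)
    batch = [sid for _, sid in scored[:n_fresh]]

    # ... and dynamic replay fills the rest with under-explored samples.
    while len(batch) < budget and replay_buffer:
        _, sid = heapq.heappop(replay_buffer)
        if sid not in batch:
            batch.append(sid)

    # Fall back to the next-best scored samples if the replay buffer runs dry.
    for _, sid in scored[n_fresh:]:
        if len(batch) >= budget:
            break
        if sid not in batch:
            batch.append(sid)
    return batch
```

The key design choice in this sketch is that pruning and replay share one signal: samples whose rollouts are all correct or all incorrect are pruned because they yield little learning signal, while samples that simply have not been rolled out often enough are queued for replay rather than discarded.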
Experimental Results
Experiments on several reasoning benchmarks underscore the effectiveness of this approach. The model matches the performance of traditional methods while drastically reducing the amount of required data and rollout computation, with speed-up factors of 1.85 on AIME24 and 1.66 on AIME25.
Figure 3: Entropy dynamics during training, showcasing enhanced sampling efficiency.
Figure 4: Offline selection using PageRank-weighted determinantal point processes.
Figure 5: Comparison between different sampling strategies.
Ablation Studies
Ablation studies confirm the critical role of each component, from the determinantal point process in offline selection to the explorability metric used for rollout pruning. The method's robustness across varying dataset sizes and sampling ratios illustrates its adaptability and efficiency.
Conclusion
The proposed data-efficient policy optimization pipeline reduces computational burden while accelerating RLVR training, and experiments confirm its effectiveness across benchmarks and model configurations, offering a practical foundation for data-efficient training in large-scale ML systems.
These insights are likely to inform future research on data efficiency in reinforcement learning with LLMs as the demand for reasoning capabilities in real-world applications grows.