- The paper introduces CPPO, a method that accelerates GRPO-based reasoning model training by pruning completions with low absolute advantages, reducing computational overhead.
- Experiments show CPPO achieves up to an 8.32x speedup on GSM8K and 3.51x on the MATH dataset compared to GRPO, while maintaining or improving accuracy.
- Practically, CPPO offers an efficient approach for scaling large reasoning models by reducing training costs and computational resources.
Completion Pruning Policy Optimization for Efficient Reasoning Model Training
The paper introduces Completion Pruning Policy Optimization (CPPO), a method designed to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO is effective for training reasoning models with reinforcement learning because it estimates each completion's advantage relative to a group of sampled completions rather than a separate value model. However, its reliance on sampling many completions per question imposes significant computational overhead. CPPO addresses this by pruning the group before the policy update, making training more efficient without compromising model accuracy.
Core Contributions
- Completion Pruning Strategy: CPPO prunes completions whose advantages have small absolute value before the gradient computation and parameter update. This reduces the number of completions the model must process in the forward and backward passes, shortening training time. The paper's analysis shows that completions do not contribute equally to policy training: the magnitude of a completion's advantage largely determines its contribution to the gradient, and CPPO exploits this to trim the training pipeline (see the sketch after this list).
- Dynamic Completion Allocation: CPPO pairs pruning with a dynamic allocation strategy that keeps GPUs busy. Because pruning frees capacity within a batch, CPPO packs in additional questions and their retained completions, so computational resources are not left idle after pruning and training throughput improves further.
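To make the two ideas above concrete, here is a minimal Python sketch. It assumes a GRPO-style group-normalized advantage, a fixed keep ratio, and toy scalar rewards standing in for a real reward signal; the helper names `group_advantages`, `prune_completions`, and `allocate_batch` are illustrative and not taken from the paper or its code.

```python
# Minimal sketch of CPPO-style completion pruning and batch refilling,
# assuming a GRPO-like setup where each question gets G sampled completions
# scored by a reward function. Toy rewards are used for illustration only.
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantage: normalize each reward against its own group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def prune_completions(rewards: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Keep the completions whose advantages have the largest magnitude.

    Low-|A| completions contribute little to the policy gradient, so they
    are dropped before the forward/backward pass of the policy update.
    """
    adv = group_advantages(rewards)
    k = max(1, int(len(rewards) * keep_ratio))
    return np.argsort(-np.abs(adv))[:k]          # indices of retained completions

def allocate_batch(per_question_rewards: list[np.ndarray], slots: int,
                   keep_ratio: float = 0.5) -> list[tuple[int, int]]:
    """Fill a fixed number of GPU slots with retained completions.

    Pruning frees slots, so additional questions can be packed into the same
    batch until the budget is exhausted (the dynamic-allocation idea).
    """
    batch: list[tuple[int, int]] = []            # (question index, completion index)
    for q, rewards in enumerate(per_question_rewards):
        for c in prune_completions(rewards, keep_ratio):
            if len(batch) == slots:
                return batch
            batch.append((q, int(c)))
    return batch

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy rewards for 4 questions with 8 sampled completions each.
    rewards = [rng.random(8) for _ in range(4)]
    print(allocate_batch(rewards, slots=16, keep_ratio=0.5))
```

With a keep ratio of 0.5, half of each group is discarded, so a batch budgeted for two questions' worth of completions can now hold four questions; the retained high-|A| completions are then fed to the usual GRPO update.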
Experimental Results
The experimental validation of CPPO demonstrates substantial improvements in training efficiency. On the GSM8K dataset, CPPO achieves up to an 8.32× speedup compared to GRPO, while on the MATH dataset the speedup is up to 3.51×. Crucially, these gains come without sacrificing accuracy; in some cases, CPPO even improves it. This suggests that pruning does not discard genuinely useful data but instead concentrates training on the most informative completions.
Theoretical and Practical Implications
Theoretically, CPPO makes explicit that completions contribute unequally to policy learning and that selecting samples by their contribution to the gradient is a viable lever for efficiency. It shows how reinforcement learning pipelines for reasoning models can be adapted to large-scale training without changing the underlying objective.
Practically, CPPO offers an efficient approach to scaling reasoning models. In a landscape where computational resources and training costs are crucial constraints, CPPO provides a framework that reduces overhead while maintaining or enhancing model performance. This has direct implications for deploying large-scale models in real-world applications where both accuracy and efficiency are paramount.
Future Directions
CPPO opens several avenues for future research. Its principles may transfer to other areas of AI that face similar trade-offs between sample utilization and computational cost. Scaling CPPO to larger models and a broader variety of tasks is a promising direction, and integrating it with other reinforcement learning algorithms such as Proximal Policy Optimization (PPO) or REINFORCE++ could yield further insight into efficient training for complex AI tasks.
Overall, CPPO stands as a significant advancement in optimizing reasoning model training, balancing computational efficiency with model accuracy, and paving the way for more resource-efficient AI models.