- The paper presents a pipeline that synthesizes AceCode-89K, a dataset of 89K coding questions paired with reliable test cases, to provide effective reward signals for RL in code generation.
- Reward models trained on the curated dataset improve best-of-N sampling by up to 10 points (Llama-3.1-8B-Ins) and 5 points (Qwen2.5-Coder-7B-Ins) on benchmarks such as HumanEval and MBPP.
- By automating test-case synthesis, the approach removes a key obstacle to RL for code, the lack of verifiable reward signals, and paves the way for scalable improvements in coder model performance.
AceCoder: Acing Coder RL via Automated Test-Case Synthesis
The paper "AceCoder: Acing Coder RL via Automated Test-Case Synthesis" by Huaye Zeng et al., introduces a novel approach to enhancing the capabilities of coder models through the deployment of reinforcement learning (RL). This is accomplished by addressing a significant challenge within the domain: the scarcity of reliable reward signals necessary for effective RL training in code generation tasks. Traditional methods have predominantly utilized supervised fine-tuning (SFT), achieving notable success in code generation models. However, the potential of RL remains underexplored due to difficulties in obtaining consistent and accurate reward data.
Methodology and Contributions
The key contribution of this paper is a comprehensive pipeline for synthesizing extensive test-case datasets designed to improve RL training for coder models. Specifically, the authors build AceCode-89K, a large-scale dataset comprising 89,000 coding questions paired with reliable test cases. The dataset was constructed from existing code generation datasets, which an LLM (GPT-4o-mini) rewrote into LeetCode-style questions with accompanying candidate test cases. Noisy test cases were then filtered out using Qwen2.5-Coder-32B-Ins, keeping the final set both accurate and scalable, as sketched below.
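A rough sketch of this synthesize-then-filter loop is shown below. The `call_llm` helper, the prompt wording, and the keep-the-test-if-the-strong-model's-solution-passes rule are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Illustrative synthesize-and-filter sketch (not the paper's code).
import os
import subprocess
import tempfile

def call_llm(model: str, prompt: str) -> str:
    """Hypothetical wrapper around whatever chat-completion client is available."""
    raise NotImplementedError("plug in your own LLM client here")

def synthesize(seed_snippet: str) -> tuple[str, list[str]]:
    """Rewrite a seed example into a LeetCode-style question plus candidate asserts."""
    question = call_llm("gpt-4o-mini",
                        f"Rewrite this as a self-contained LeetCode-style problem:\n{seed_snippet}")
    tests_blob = call_llm("gpt-4o-mini",
                          f"Write ~20 standalone `assert` test cases for this problem:\n{question}")
    tests = [line for line in tests_blob.splitlines() if line.strip().startswith("assert")]
    return question, tests

def passes(solution_code: str, test: str, timeout: float = 5.0) -> bool:
    """Run one assert against a candidate solution in a subprocess sandbox."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n" + test + "\n")
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def filter_tests(question: str, tests: list[str]) -> list[str]:
    """Keep only tests that a strong coder model's solution passes, as a simplified
    stand-in for the paper's Qwen2.5-Coder-32B-Ins based filtering."""
    solution = call_llm("qwen2.5-coder-32b-instruct", f"Solve this in Python:\n{question}")
    return [t for t in tests if passes(solution, t)]
```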
The authors train reward models on this curated dataset and evaluate them with best-of-N sampling, demonstrating significant improvements (10 points for Llama-3.1-8B-Ins and 5 points for Qwen2.5-Coder-7B-Ins) over the baseline models. Reinforcement learning was then performed using these reward models together with test-case pass rewards, yielding consistent gains across key benchmarks such as HumanEval and MBPP.
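Concretely, the best-of-N step amounts to reranking sampled programs by reward-model score. A minimal sketch follows, assuming placeholder `generate` and `score` callables rather than the released AceCode-RM interface.

```python
# Minimal best-of-N reranking sketch; `generate` and `score` are placeholders.
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str, int], list[str]],   # samples n candidate programs
              score: Callable[[str, str], float],          # reward-model score for (prompt, program)
              n: int = 16) -> str:
    """Sample n candidate solutions and return the one the reward model ranks highest."""
    candidates = generate(prompt, n)
    return max(candidates, key=lambda program: score(prompt, program))
```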
Results and Implications
The results underscore the efficacy of the AceCode-RM reward models, which deliver significant benchmark gains when used for best-of-N sampling. The RL experiments further show that the synthesized test cases improve model training, overcoming one of the major barriers to effective RL in code generation: the absence of verifiable reward signals.
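In its simplest form, that verifiable signal is just the fraction of a question's filtered test cases that a rollout passes. The sketch below assumes this plain pass-rate form (the paper's exact reward shaping may differ), with `run_test` standing in for any sandboxed executor such as the `passes` helper sketched earlier.

```python
# Pass-rate reward sketch: the scalar RL reward for one rollout.
from typing import Callable

def pass_rate_reward(solution_code: str,
                     tests: list[str],
                     run_test: Callable[[str, str], bool]) -> float:
    """Fraction of curated test cases the generated program passes (0.0 if none exist)."""
    if not tests:
        return 0.0
    return sum(run_test(solution_code, t) for t in tests) / len(tests)
```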
This work lays the groundwork for further progress in applying RL to coder models by establishing a methodology for creating the necessary reward signals. The gains shown by the RL-trained models point to broad applicability, particularly in domains where, as in mathematical reasoning, results can be automatically verified.
Future Directions
The research opens several avenues for future exploration. First, since RL from scratch (R1-style training) showed promising initial results with minimal optimization steps, future work could investigate scaling these methods. Improving the robustness of reward models to prevent reward hacking remains another vital area. Lastly, adapting similar test-synthesis and RL strategies to other programming languages or task domains, such as security, code optimization, and compliance, could significantly expand the utility of such models.
In conclusion, this paper advances the conversation on coding model improvement by unlocking the previously underexplored potential of reinforcement learning. It establishes a concrete pathway through automated test-case synthesis, addressing one of the longstanding barriers in coder model refinement: the generation and use of reliable, scalable reward signals.