Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning (2506.03136v1)

Published 3 Jun 2025 in cs.CL

Abstract: We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes, without any ground-truth code as supervision. This approach enables flexible and scalable training and allows the unit tester to learn directly from the coder's mistakes. Our derived ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and Best-of-N accuracy by 9.0% after optimization on Qwen2.5-Instruct models, outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They naturally extend to downstream tasks such as test-time scaling and agentic coding, achieving an 8.1% improvement over the base model. For the long-CoT model, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. Notably, we also find that our model can serve as an effective reward model for reinforcement learning on base models. Project: https://github.com/Gen-Verse/CURE

Summary

  • The paper presents CURE, a reinforcement learning framework that co-evolves code and unit test generation without relying on ground-truth solutions, achieving notable performance gains.
  • The paper introduces a reward precision metric that optimizes test discrimination by rewarding tests which pass correct code and fail incorrect code.
  • The paper demonstrates significant improvements across benchmarks, including up to a 37.8% gain in unit test accuracy and a 64.8% reduction in response length for more efficient inference.

This paper introduces CURE (Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning), a novel reinforcement learning framework designed to improve the coding abilities of LLMs by simultaneously enhancing their code generation and unit test generation capabilities. A key aspect of CURE is that it achieves this co-evolution without relying on ground-truth code solutions for supervision, instead leveraging the interaction outcomes between generated code and generated unit tests.

The core motivation is that unit tests are valuable for verifying code correctness and can serve as a reliable reward signal for training and test-time scaling. Traditional methods often require ground-truth code to train unit test generators, which is expensive. CURE proposes a framework where the code generator and unit test generator can mutually improve, with the unit tester learning from the coder's mistakes.

The paper formulates the objective for optimizing the unit test generator as maximizing "reward precision": the probability that generated unit tests assign a higher reward to correct solutions than to incorrect ones. Based on a theoretical analysis modeling the execution outcomes of generated code against generated tests, the paper derives a specific individual-level reward function ($\mathcal{R}_{u_k}^{\star}$) for each generated unit test. This reward is designed to be positive when a unit test passes correct code and fails incorrect code, and negative otherwise. Simultaneously, code solutions are rewarded ($\mathcal{R}_{s_j}^{\star}$) based on their ability to pass available ground-truth unit tests.
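
In symbols, and purely as an illustrative restatement (the notation $s^{+}$, $s^{-}$, and $r(u,s)$ is ours, not the paper's exact formalism), the objective and the sign structure of the test reward look roughly like this:

```latex
% Hedged restatement of the reward-precision objective; notation is illustrative.
% s^{+}: a correct candidate solution, s^{-}: an incorrect one,
% r(u, s): the pass/fail reward that unit test u assigns to solution s.
\[
  \mathrm{RewardPrecision}(\pi_u)
  \;=\;
  \Pr_{u \sim \pi_u}\!\left[\, r(u, s^{+}) \;>\; r(u, s^{-}) \,\right],
\]
\[
  \mathcal{R}_{u_k}^{\star} > 0
  \quad\text{if } u_k \text{ passes correct code and fails incorrect code},
  \qquad
  \mathcal{R}_{u_k}^{\star} < 0
  \quad\text{otherwise}.
\]
```

Maximizing this precision is what makes the generated tests usable both as a ranking signal at test time and as an RL reward.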

The CURE framework implements this co-evolution using a reinforcement learning loop (Algorithm 1):

  1. For a given coding task, the policy LLM generates a batch of $n$ candidate code solutions and $m$ additional unit tests.
  2. An execution matrix is constructed by running the generated codes against both the generated unit tests and any available ground-truth unit tests.
  3. Rewards $\mathcal{R}_{s_j}^{\star}$ are calculated for each code solution based on ground-truth test results.
  4. Rewards $\mathcal{R}_{u_k}^{\star}$ are calculated for each generated unit test using the derived formula, leveraging the execution results against both correct and incorrect generated code solutions (as determined by ground-truth tests). This reward encourages unit tests that are accurate (pass correct code) and discriminative (fail incorrect code).
  5. For long-Chain-of-Thought (CoT) models, a response-length-guided transformation is applied to the unit test reward to encourage generating shorter, more efficient tests during inference.
  6. The policy LLM is then optimized iteratively using a reinforcement learning objective (such as a PPO-style clipped objective) based on these collected rewards for both code and unit test generation trajectories (a minimal sketch of one iteration appears after this list).
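
Putting the loop together, here is a minimal Python sketch of one iteration under stated assumptions: `runs_ok(code, test)` is a hypothetical execution helper, the test-reward weighting and the length-shaping factor are illustrative placeholders (the paper derives its exact reward theoretically), and the policy update itself is omitted.

```python
def cure_iteration(codes, gen_tests, gt_tests, runs_ok, max_len=2048):
    """One illustrative CURE-style iteration (a sketch, not the paper's implementation)."""
    # 1) Execution matrices: every generated code against generated and ground-truth tests.
    exec_gen = [[runs_ok(c, t) for t in gen_tests] for c in codes]
    exec_gt = [[runs_ok(c, t) for t in gt_tests] for c in codes]

    # 2) Code rewards: fraction of ground-truth tests each candidate passes.
    code_rewards = [sum(row) / len(gt_tests) for row in exec_gt]

    # Candidates passing every ground-truth test count as "correct", the rest as "incorrect".
    correct = [i for i, row in enumerate(exec_gt) if all(row)]
    incorrect = [i for i in range(len(codes)) if i not in correct]

    # 3) Unit-test rewards: reward agreement with the correct/incorrect split
    #    (passing correct code, failing incorrect code) and penalize disagreement.
    #    Only the sign structure follows the paper; the exact weighting is derived there.
    test_rewards = []
    for k, test in enumerate(gen_tests):
        agree = sum(exec_gen[i][k] for i in correct) + \
                sum(not exec_gen[i][k] for i in incorrect)
        disagree = len(codes) - agree
        reward = float(agree - disagree)

        # 4) Optional length-guided shaping for long-CoT models: favor shorter
        #    test-generation responses (illustrative form, not the paper's formula).
        if reward > 0:
            reward *= max(0.0, 1.0 - len(test) / max_len)
        test_rewards.append(reward)

    # 5) code_rewards and test_rewards would then drive a PPO-style clipped update
    #    of the single policy LLM on both trajectory types (update omitted here).
    return code_rewards, test_rewards
```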

The authors implement CURE using Qwen2.5-7B/14B-Instruct models as standard base models and Qwen3-4B as a Long-CoT base model, training on 4.5k coding problems from CodeContests. The resulting models, named ReasonFlux-Coder, are evaluated on five benchmarks: LiveBench, MBPP, LiveCodeBench, CodeContests, and CodeForces.

Experimental results demonstrate significant improvements:

  • ReasonFlux-Coder 7B and 14B models show substantial gains over their base Qwen2.5-Instruct models and existing coding-specific SFT models (Qwen2.5-Coder-Instruct, DeepSeek-Coder, Seed-Coder) in unit test accuracy (up to a 37.8% average improvement), one-shot code generation accuracy (up to a 5.3% average improvement), and Best-of-N (BoN) accuracy (up to a 9.0% average improvement when using 16 generated codes and 16 generated tests; see the BoN selection sketch after this list).
  • The ReasonFlux-Coder-4B (Long-CoT) model consistently outperforms Qwen3-4B and achieves a notable 64.8% reduction in the average response length for unit test generation, leading to improved inference efficiency.
  • The trained ReasonFlux-Coder-4B model, when used as a unit tester for API models like GPT-4o-mini and GPT-4.1-mini, boosts their BoN accuracy and significantly reduces the API cost compared to scaling the API model alone.
  • The paper shows that the unit tests generated by ReasonFlux-Coder-4B can serve as an effective reward signal for RL training on a base model (Qwen2.5-14B-Instruct), achieving performance comparable to training with ground-truth unit tests, suggesting a path towards label-free RL for coding.
  • ReasonFlux-Coder-14B shows robust improvements (average 8.1%) when integrated into other test-time scaling and agentic coding pipelines (MPSC, AlphaCodium, S*) beyond simple BoN. It also improves performance on agentic unit test generation tasks (average 25.1%).
  • Ablation studies validate that the co-evolving approach and the specific theoretically-derived reward for unit tests are crucial for achieving the observed performance gains.
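
For reference, Best-of-N selection with generated unit tests can be as simple as the following sketch, assuming the same hypothetical `runs_ok` execution helper; the paper's exact scoring rule may differ in detail.

```python
def best_of_n(candidate_codes, generated_tests, runs_ok):
    """Return the candidate that passes the most generated unit tests (illustrative)."""
    return max(candidate_codes,
               key=lambda code: sum(runs_ok(code, t) for t in generated_tests))

# Example (sampling helpers are assumed, not part of the paper's released code):
# best = best_of_n(sample_codes(task, n=16), sample_tests(task, m=16), runs_ok)
```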

In summary, CURE offers a practical and scalable method for training LLMs to be proficient in both code and unit test generation by leveraging their interaction without relying on expensive ground-truth code supervision. The optimized models demonstrate state-of-the-art performance across various coding tasks and applications, including test-time scaling, agentic methods, and even label-free RL. The efficiency improvements for Long-CoT models further enhance their practical utility.
