- The paper presents the rStar-Coder dataset, featuring 418K verified competitive problems and 580K reasoning solutions for training advanced code models.
- It introduces a novel methodology with mutual verification and a three-step input generation process to ensure diverse and reliable test cases.
- Experimental results demonstrate that fine-tuning on rStar-Coder significantly improves performance on code reasoning benchmarks, with the resulting models outperforming much larger baselines.
The paper "rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset" (2505.21297) introduces a novel approach and dataset to improve the code reasoning capabilities of LLMs. The core challenge it addresses is the scarcity of high-difficulty, large-scale datasets with verifiable input-output test cases necessary for rigorous training and evaluation of LLMs on competitive programming problems.
The paper's main contribution is the creation of the rStar-Coder dataset, which comprises 418K unique competitive-level code problems and 580K long-reasoning solutions, each validated with diverse synthetic test cases across varying difficulty levels. This is achieved through three key components:
- Curating and Synthesizing Problems: The authors curate 37.7K high-quality, expert-written problems with oracle solutions from competitive programming platforms such as IOI, Codeforces, and USACO; these serve as seed problems. Because directly prompting LLMs to generate new problems from seeds often yields unsolvable or invalid problems, they instead design structured prompts that incorporate both the seed problem and its oracle solution. This guides the LLM (GPT-4o) to understand the core algorithmic concepts and generate new, solvable problems that test similar skills (an illustrative prompt skeleton follows this list). This process yields an additional 1.56 million candidate problems, which are filtered in later stages.
- Reliable Input-Output Test Case Synthesis: A major challenge is generating reliable and diverse test cases, especially for synthesized problems that lack ground truth solutions. The paper decouples this process into two stages:
- Valid Test Input Generation: A three-step approach is proposed. First, GPT-4o is prompted to generate two utility functions per problem: `generate_test_input` (which uses the CYaRon library for structured input generation based on exposed scale parameters) and `validate_test_input` (which checks whether a generated input satisfies the problem constraints). Second, input scale ranges (e.g., 10^0 to 10^5) are defined for the parameters exposed by the generation function. Third, the utility functions are executed with instantiated scale values to produce diverse, constraint-satisfying test inputs. This method is shown to generate inputs covering a much wider range of scales and complexities than direct LLM prompting (see the sketch after this list).
- Mutual Verification for Output Labeling: For synthetic problems without oracle solutions, a mutual verification mechanism is introduced. Multiple long-reasoning solutions (16 candidates from QWQ-32B) are sampled and executed on the same set of diverse test inputs. If a majority of these candidates produce identical outputs across all inputs, both that consistent set of outputs and the agreeing solutions are accepted as correct. This works because incorrect solutions tend to diverge in their errors, while correct ones converge on the same outputs (a sketch of this voting step also follows the list). For seed problems, the oracle solution is executed on the diverse test inputs to produce ground-truth outputs.
- Augmentation with Verified Long-Reasoning Solutions: Expert-written seed problems often lack detailed reasoning steps. Using the generated diverse test cases for verification, the authors prompt QWQ-32B to generate 16 long Chain-of-Thought (CoT) solutions per seed problem and filter for those that pass all tests. For synthetic problems, which have no oracle solutions, the solutions accepted by the mutual verification mechanism serve as the verified long-reasoning solutions.
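To make the problem-synthesis step more concrete: the paper does not reproduce its exact prompt, so the skeleton below is only a hedged illustration of conditioning the generator on both the seed problem and its oracle solution; the wording and the `build_synthesis_prompt` helper are assumptions, not the authors' prompt.

```python
# Illustrative prompt skeleton, not the paper's exact wording. Showing the model
# both the seed problem and its oracle solution grounds new problems in the
# underlying algorithm rather than in surface rewording.
SYNTHESIS_PROMPT = """You are given a competitive programming problem and an
oracle solution known to be correct.

[Seed problem]
{seed_problem}

[Oracle solution]
{oracle_solution}

First identify the core algorithmic concepts the oracle solution relies on.
Then write ONE new, self-contained problem that tests the same concepts but
differs in scenario, input format, and constraints. The new problem must be
solvable, and its constraints must be stated explicitly."""

def build_synthesis_prompt(seed_problem: str, oracle_solution: str) -> str:
    return SYNTHESIS_PROMPT.format(seed_problem=seed_problem,
                                   oracle_solution=oracle_solution)
```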
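The three-step input generation can be sketched as follows for a hypothetical problem ("read n integers with 1 ≤ n ≤ 10^5 and 1 ≤ a_i ≤ 10^9"). In the paper, GPT-4o writes the two utilities per problem using the CYaRon library; this sketch substitutes Python's standard `random` module to stay self-contained, and the constraints and scale range are illustrative.

```python
import random

# Hypothetical problem: read n, then n integers a_i,
# with 1 <= n <= 10**5 and 1 <= a_i <= 10**9.

def generate_test_input(n_scale: int) -> str:
    """Step 1a: produce one raw test input at the requested scale."""
    n = random.randint(max(1, n_scale // 10), n_scale)
    values = [random.randint(1, 10**9) for _ in range(n)]
    return f"{n}\n{' '.join(map(str, values))}\n"

def validate_test_input(text: str) -> bool:
    """Step 1b: check that a generated input satisfies the problem constraints."""
    lines = text.splitlines()
    n = int(lines[0])
    values = list(map(int, lines[1].split()))
    return 1 <= n <= 10**5 and len(values) == n and all(1 <= v <= 10**9 for v in values)

# Steps 2-3: instantiate scale parameters across a range of magnitudes
# and keep only the constraint-satisfying inputs.
test_inputs = []
for scale in (10**k for k in range(0, 6)):   # 10^0 .. 10^5
    candidate = generate_test_input(scale)
    if validate_test_input(candidate):
        test_inputs.append(candidate)
```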
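And a minimal sketch of the mutual-verification vote, assuming each candidate solution is a standalone Python script on disk; the `run_solution` helper and the `min_agree` cutoff are illustrative rather than the paper's exact implementation.

```python
import subprocess
import sys
from collections import Counter

def run_solution(solution_path: str, test_input: str, timeout: float = 10.0) -> str:
    """Run one candidate solution script on one test input (illustrative helper).
    A real pipeline would also handle timeouts and crashed runs."""
    result = subprocess.run(
        [sys.executable, solution_path],
        input=test_input, capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()

def mutual_verify(solution_paths, test_inputs, min_agree=9):
    """Accept the outputs of the largest group of candidates that agree on
    every test input, provided the group reaches the majority threshold."""
    # Summarize each candidate by the tuple of its outputs on all inputs,
    # so "agreement" becomes an exact tuple match.
    signatures = {
        path: tuple(run_solution(path, t) for t in test_inputs)
        for path in solution_paths
    }
    counts = Counter(signatures.values())
    if not counts:
        return None, []
    best_signature, votes = counts.most_common(1)[0]
    if votes < min_agree:
        return None, []   # no sufficient consensus: discard the problem
    agreeing = [p for p, sig in signatures.items() if sig == best_signature]
    return list(best_signature), agreeing   # accepted outputs + verified solutions
```

Matching full output vectors rather than per-input majorities is what lets incorrect solutions, whose errors tend to diverge, fall out of the consensus group.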
The dataset is post-processed by removing unsolvable/overly difficult synthetic problems (where majority agreement for mutual verification is below a threshold), selecting the fastest verified solution per synthetic problem for efficiency, and performing decontamination against standard evaluation benchmarks (HumanEval, MBPP, LiveCodeBench, USACO 2025). The final dataset contains 418K verified problems (37.7K expert, 380K synthetic) and 580K question-solution pairs.
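A rough sketch of this post-processing filter, assuming each problem record carries an `is_synthetic` flag, an `agreement_ratio` from mutual verification, and per-solution runtimes; the field names and the 0.6 threshold are assumptions, not values from the paper.

```python
def postprocess(problems, agreement_threshold=0.6):
    """Drop low-consensus synthetic problems and keep only the fastest
    verified solution per synthetic problem (illustrative field names).
    Deduplication and benchmark decontamination are omitted from this sketch."""
    kept = []
    for problem in problems:
        if problem["is_synthetic"]:
            # Low mutual-verification agreement suggests an unsolvable or
            # overly difficult synthetic problem: discard it.
            if problem["agreement_ratio"] < agreement_threshold:
                continue
            # Retain only the fastest verified solution for efficiency.
            problem["solutions"] = [min(problem["solutions"],
                                        key=lambda s: s["runtime_s"])]
        kept.append(problem)
    return kept
```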
The effectiveness of rStar-Coder is demonstrated by fine-tuning Qwen2.5-Coder models (1.5B, 7B, 14B) and evaluating them on various code reasoning and generation benchmarks. Key findings include:
- rStar-Coder significantly improves the code reasoning capabilities of the base models. On LiveCodeBench, rStar-Coder-7B achieves 57.3% (+39.9 points over the base Qwen2.5-Coder-7B), surpassing much larger models such as R1-Distill-Qwen-32B (57.2%). rStar-Coder-14B reaches 62.5% (+39.2 points), outperforming all listed open-source baselines and even o3-mini (low) at 59.4%. The 1.5B model reaches 40.1%, outperforming GPT-4o (30.0%) and R1-Distill-Qwen-1.5B (16.9%).
- On the challenging USACO 2025 benchmark, rStar-Coder-7B (16.15%) and rStar-Coder-14B (17.19%) outperform the frontier model QWQ-32B (15.62%), which was used to generate the long-reasoning solutions in the dataset. This highlights the value of the diverse, verified dataset in enabling smaller models to achieve strong reasoning performance.
- rStar-Coder also generalizes well to standard code generation tasks (HumanEval, MBPP), consistently improving base model performance, with rStar-Coder-7B achieving performance comparable to Claude3.5-Sonnet on these benchmarks.
Ablation studies confirm the contribution of both curated and synthetic data sources and the effectiveness of the mutual verification (96.8% accuracy in output labeling) and three-step input generation methods (generating more diverse and higher-scale inputs). An analysis of scaling dimensions shows that increasing problem diversity (as in rStar-Coder) is more effective and efficient for improving reasoning performance than merely increasing the number of solutions per problem on a fixed set of problems.
In conclusion, rStar-Coder addresses a critical data bottleneck in training advanced code reasoning LLMs by providing a large-scale, verified dataset derived from expert-curated and synthetically generated competitive programming problems. The proposed methods for reliable test case synthesis, particularly the mutual verification mechanism, are key to creating this high-quality data. The results demonstrate that training on this dataset enables smaller LLMs to achieve state-of-the-art performance on challenging code reasoning tasks. Future work includes expanding the dataset further.