
rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset (2505.21297v1)

Published 27 May 2025 in cs.CL

Abstract: Advancing code reasoning in LLMs is fundamentally limited by the scarcity of high-difficulty datasets, especially those with verifiable input-output test cases necessary for rigorous solution validation at scale. We introduce rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems, 580K long-reasoning solutions along with rich test cases of varying difficulty. This is achieved through three core contributions: (1) we curate competitive programming code problems and oracle solutions to synthesize new, solvable problems; (2) we introduce a reliable input-output test case synthesis pipeline that decouples the generation into a three-step input generation method and a mutual verification mechanism for effective output labeling; (3) we augment problems with high-quality, test-case-verified long-reasoning solutions. Extensive experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate the superiority of rStar-Coder dataset, achieving leading performance comparable to frontier reasoning LLMs with much smaller model sizes. On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%. On the more challenging USA Computing Olympiad, our 7B model achieves an average pass@1 accuracy of 16.15%, outperforming the frontier-level QWQ-32B. Code and the dataset will be released at https://github.com/microsoft/rStar.

Summary

  • The paper presents the rStar-Coder dataset, featuring 418K verified competitive problems and 580K reasoning solutions for training advanced code models.
  • It introduces a novel methodology with mutual verification and a three-step input generation process to ensure diverse and reliable test cases.
  • Experimental results demonstrate that fine-tuning on rStar-Coder significantly improves code reasoning benchmarks, outperforming larger baseline models.

The paper "rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset" (2505.21297) introduces a novel approach and dataset to improve the code reasoning capabilities of LLMs. The core challenge it addresses is the scarcity of high-difficulty, large-scale datasets with verifiable input-output test cases necessary for rigorous training and evaluation of LLMs on competitive programming problems.

The paper's main contribution is the creation of the rStar-Coder dataset, which comprises 418K unique competitive-level code problems and 580K long-reasoning solutions, each validated with diverse synthetic test cases across varying difficulty levels. This is achieved through three key components:

  1. Curating and Synthesizing Problems: The authors curate 37.7K high-quality, expert-written problems with oracle solutions from competitive programming platforms such as IOI, Codeforces, and USACO; these serve as seed problems. Because directly prompting LLMs to generate new problems from seed problems often yields unsolvable or invalid problems, they design structured prompts that incorporate both the seed problem and its oracle solution. This guides the LLM (GPT-4o) to understand the core algorithmic concepts and generate new, solvable problems that test similar skills. The process yields an additional 1.56 million candidate problems, which are later filtered.
  2. Reliable Input-Output Test Case Synthesis: A major challenge is generating reliable and diverse test cases, especially for synthesized problems that lack ground truth solutions. The paper decouples this process into two stages:
    • Valid Test Input Generation: A three-step approach is proposed. First, GPT-4o is prompted to generate two utility functions per problem: generate_test_input (which uses the CYaRon library for structured input generation based on scale parameters) and validate_test_input (to check whether the generated input satisfies the problem constraints). Second, input scale ranges (e.g., 10^0 to 10^5) are defined for the parameters exposed by the generation function. Third, the utility functions are executed with instantiated scale values to produce diverse, constraint-satisfying test inputs. This method is shown to generate inputs covering a much wider range of scales and complexities than direct LLM prompting; a sketch of this pipeline appears after the list.
    • Mutual Verification for Output Labeling: For synthetic problems without oracle solutions, a mutual verification mechanism is introduced (sketched after the list). Multiple long-reasoning solutions (16 candidates from QWQ-32B) are sampled and executed on the same set of diverse test inputs. If a majority of these candidate solutions produce identical outputs across all inputs, both the consistent set of outputs and the agreeing solutions are accepted as correct. This works because incorrect solutions tend to diverge in their errors, while correct ones converge. For seed problems, the oracle solution is used to generate ground-truth outputs on the diverse test inputs.
  3. Augmentation with Verified Long-Reasoning Solutions: Expert-written seed problems often lack detailed reasoning steps. Using the generated diverse test cases for verification, the authors prompt QWQ-32B to generate 16 long Chain-of-Thought (CoT) solutions per seed problem and filter for those that pass all tests. For synthetic problems, which have no oracle solutions, the solutions accepted by the mutual verification mechanism serve as the verified long-reasoning solutions.
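To make the three-step input generation concrete, here is a minimal Python sketch. It assumes a hypothetical problem ("given an array of n integers with 1 <= n <= 10^5 and |a_i| <= 10^9, ..."), and it substitutes the standard random module for the CYaRon library that the paper's GPT-4o-written utilities rely on; only the function names generate_test_input and validate_test_input come from the paper, everything else is illustrative.

```python
import random

# Hypothetical problem: "Given an array of n integers (1 <= n <= 1e5,
# |a_i| <= 1e9), ...". The paper has GPT-4o write one such pair of
# utilities per problem using CYaRon; plain `random` is used here.

def generate_test_input(n_scale: int, value_scale: int) -> str:
    """Step 1: produce a candidate input whose size is controlled by scale parameters."""
    n = random.randint(1, 10 ** n_scale)
    values = [random.randint(-10 ** value_scale, 10 ** value_scale) for _ in range(n)]
    return f"{n}\n{' '.join(map(str, values))}\n"

def validate_test_input(raw: str) -> bool:
    """Step 1 (cont.): check that a generated input satisfies the problem constraints."""
    lines = raw.split("\n")
    n = int(lines[0])
    values = list(map(int, lines[1].split()))
    return 1 <= n <= 10 ** 5 and len(values) == n and all(abs(v) <= 10 ** 9 for v in values)

# Steps 2-3: sweep the exposed scale parameters (e.g., 10^0 .. 10^5 for n)
# and keep only constraint-satisfying inputs.
test_inputs = []
for n_scale in range(0, 6):             # step 2: instantiate scale ranges
    candidate = generate_test_input(n_scale=n_scale, value_scale=9)
    if validate_test_input(candidate):  # step 3: execute and filter
        test_inputs.append(candidate)

print(f"kept {len(test_inputs)} inputs across scales 10^0..10^5")
```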

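The mutual verification step can likewise be sketched as a majority vote over output "fingerprints". In the sketch below, run_solution (which would compile and execute one candidate program on one input under time and memory limits) and the 0.6 agreement threshold are assumptions made for illustration; the paper samples 16 QWQ-32B candidates per problem but does not prescribe this exact interface.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# `run_solution(code, test_input) -> str` is assumed to exist: it executes one
# candidate program on one test input (under resource limits) and returns its stdout.

def mutual_verify(
    candidates: List[str],                     # e.g., 16 long-CoT solutions sampled from QWQ-32B
    test_inputs: List[str],
    run_solution: Callable[[str, str], str],
    majority_ratio: float = 0.6,               # illustrative threshold, not from the paper
) -> Optional[Tuple[List[str], List[str]]]:
    """Accept the outputs (and the agreeing solutions) only if a majority of
    candidates produce identical outputs on *all* test inputs."""
    # Fingerprint each candidate by the tuple of its outputs across all inputs.
    fingerprints = [
        tuple(run_solution(code, x) for x in test_inputs) for code in candidates
    ]
    best_fp, votes = Counter(fingerprints).most_common(1)[0]
    if votes / len(candidates) < majority_ratio:
        return None  # no consensus: treat the problem as unsolvable/too hard and drop it
    agreeing = [c for c, fp in zip(candidates, fingerprints) if fp == best_fp]
    return list(best_fp), agreeing
```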
The dataset is post-processed by removing unsolvable/overly difficult synthetic problems (where majority agreement for mutual verification is below a threshold), selecting the fastest verified solution per synthetic problem for efficiency, and performing decontamination against standard evaluation benchmarks (HumanEval, MBPP, LiveCodeBench, USACO 2025). The final dataset contains 418K verified problems (37.7K expert, 380K synthetic) and 580K question-solution pairs.
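A minimal sketch of the "fastest verified solution" selection step described above, assuming a hypothetical measure_runtime helper that times one solution on a problem's full test-input set:

```python
from typing import Callable, Dict, List

# `measure_runtime(code, test_inputs) -> float` is assumed: it runs one verified
# solution on all of a problem's test inputs and returns total wall-clock seconds.

def select_fastest_solutions(
    verified: Dict[str, List[str]],            # problem id -> solutions that passed mutual verification
    test_inputs: Dict[str, List[str]],
    measure_runtime: Callable[[str, List[str]], float],
) -> Dict[str, str]:
    """Keep a single solution per synthetic problem: the verified one with the
    lowest measured runtime on that problem's test inputs."""
    return {
        pid: min(solutions, key=lambda code: measure_runtime(code, test_inputs[pid]))
        for pid, solutions in verified.items()
        if solutions
    }
```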

The effectiveness of rStar-Coder is demonstrated by fine-tuning Qwen2.5-Coder models (1.5B, 7B, 14B) and evaluating them on various code reasoning and generation benchmarks. Key findings include:

  • rStar-Coder significantly improves the code reasoning capabilities of base models. On LiveCodeBench, rStar-Coder-7B achieves 57.3% (a +39.9 point improvement over the base Qwen2.5-Coder-7B), surpassing much larger models like R1-Distill-Qwen-32B (57.2%). rStar-Coder-14B reaches 62.5% (a +39.2 point improvement), outperforming all open-source baselines listed and even o3-mini (low) (59.4%). The 1.5B model reaches 40.1%, outperforming GPT-4o (30.0%) and R1-Distill-Qwen-1.5B (16.9%).
  • On the challenging USACO 2025 benchmark, rStar-Coder-7B (16.15%) and rStar-Coder-14B (17.19%) outperform the frontier model QWQ-32B (15.62%), which was used to generate the long-reasoning solutions in the dataset. This highlights the value of the diverse, verified dataset in enabling smaller models to achieve strong reasoning performance.
  • rStar-Coder also generalizes well to standard code generation tasks (HumanEval, MBPP), consistently improving base model performance, with rStar-Coder-7B achieving performance comparable to Claude 3.5 Sonnet on these benchmarks.

Ablation studies confirm the contribution of both curated and synthetic data sources and the effectiveness of the mutual verification (96.8% accuracy in output labeling) and three-step input generation methods (generating more diverse and higher-scale inputs). An analysis of scaling dimensions shows that increasing problem diversity (as in rStar-Coder) is more effective and efficient for improving reasoning performance than merely increasing the number of solutions per problem on a fixed set of problems.

In conclusion, rStar-Coder addresses a critical data bottleneck in training advanced code reasoning LLMs by providing a large-scale, verified dataset derived from expert-curated and synthetically generated competitive programming problems. The proposed methods for reliable test case synthesis, particularly the mutual verification mechanism, are key to creating this high-quality data. The results demonstrate that training on this dataset enables smaller LLMs to achieve state-of-the-art performance on challenging code reasoning tasks. Future work includes expanding the dataset further.
