
CodeReasoner: Enhancing the Code Reasoning Ability with Reinforcement Learning (2507.17548v1)

Published 23 Jul 2025 in cs.SE

Abstract: Code reasoning is a fundamental capability for LLMs in the code domain. It involves understanding and predicting a program's execution behavior, such as determining the output for a given input or whether a specific statement will be executed. This capability is essential for downstream tasks like debugging, code generation, and program repair. Prior approaches mainly rely on supervised fine-tuning to improve performance in code reasoning tasks. However, they often show limited gains and fail to generalize across diverse scenarios. We argue this is due to two core issues: the low quality of training data and the limitations of supervised fine-tuning, which struggles to teach general reasoning skills. To address these challenges, we propose CodeReasoner, a framework that spans both dataset construction and a two-stage training process. First, we introduce a method to construct datasets that focus on the core execution logic of Python programs. Next, we apply instruction tuning to inject execution-specific knowledge distilled from a powerful teacher model. We then enhance reasoning and generalization through GRPO reinforcement learning on top of the fine-tuned model. Experiments on three widely-used code reasoning benchmarks show that CodeReasoner improves performance by 27.1% to 40.2% over prior methods using a 7B model. Notably, the 7B model matches GPT-4o on key tasks like input/output and coverage prediction. When scaled to 14B, CodeReasoner outperforms GPT-4o across all benchmarks. Ablation studies confirm the effectiveness of each training stage and highlight the importance of reasoning chains.

Summary

  • The paper presents CodeReasoner, which integrates a novel dataset construction technique with a two-stage training pipeline of instruction tuning and GRPO-based reinforcement learning.
  • It demonstrates significant performance gains on code execution simulation tasks, outperforming both open-source and proprietary baselines in small and medium models.
  • The approach reduces overthinking and verbosity while enhancing model generalization across diverse architectures, making it effective for debugging, code generation, and program repair.

CodeReasoner: Enhancing Code Reasoning in LLMs with Reinforcement Learning

Introduction

The paper introduces CodeReasoner, a framework designed to improve the code reasoning capabilities of LLMs, particularly in small and medium parameter regimes. Code reasoning, defined as the ability to understand and predict program execution behavior, is critical for downstream tasks such as debugging, code generation, and program repair. The authors identify two primary limitations in existing approaches: (1) the prevalence of low-quality, boilerplate-heavy training data, and (2) the limited generalization of models trained solely via supervised fine-tuning (SFT). To address these, CodeReasoner combines a novel dataset construction methodology with a two-stage training pipeline: instruction tuning followed by reinforcement learning using Group Relative Policy Optimization (GRPO).

Analysis of LLM Failures in Code Reasoning

The authors conduct a detailed error analysis on the Qwen2.5-7B model using the CRUXEval benchmark. They find that most failures occur during the execution simulation phase, not code comprehension. LLMs often mismanage state tracking, especially with complex data structures or control flow, due to their reliance on statistical patterns rather than explicit execution traces (Figure 1).

Figure 1: A motivating example from CRUXEval where the LLM comprehends code but fails in execution simulation, mispredicting the output due to incorrect state tracking.

This analysis underscores the necessity for training data and objectives that directly target execution simulation, rather than merely code comprehension or generation.
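
To make the failure mode concrete, the following hypothetical task (in the spirit of the benchmark items, not drawn from CRUXEval) shows why output prediction hinges on state tracking: the correct answer requires noticing that a dictionary entry written early in the loop is overwritten later.

```python
# Hypothetical illustration (not taken from CRUXEval) of the output-prediction
# task format: the model sees f and an input and must predict the return value.
def f(xs):
    acc = {}
    for i, x in enumerate(xs):
        acc[x % 3] = i   # later iterations silently overwrite earlier keys
    return sorted(acc.items())

# Correct simulation must notice that key 1 is written at i=0 (x=4) and then
# overwritten at i=2 (x=7), so the answer is [(1, 2), (2, 1)], not [(1, 0), (2, 1)].
assert f([4, 5, 7]) == [(1, 2), (2, 1)]
```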

Limitations of Existing Datasets

Current code reasoning datasets, such as those used in SEMCODER and CODEI/O, are often constructed by expanding real-world code snippets into full programs, resulting in excessive boilerplate and superficial complexity (Figure 2).

Figure 2: Example from PYX-R showing that boilerplate code (in red) adds unnecessary complexity, obscuring the core logic relevant for reasoning.

Such datasets fail to provide strong training signals for execution simulation, as the core reasoning challenge is often trivialized by the surrounding scaffolding.

CodeReasoner Framework

Dataset Construction

The dataset construction process is algorithmically controlled to ensure conciseness, comprehensiveness, controllability, and varied difficulty. The process iterates over Python built-in types and methods, applying constraints on method calls, control structures, and input mutations. Each generated test case is validated by execution, and only concise, valid cases are retained (Figure 3).

Figure 3: Example test case from the constructed dataset, demonstrating improved conciseness and focus on execution logic compared to prior work.

This approach yields a dataset that is both diverse and directly targeted at the core challenges of code execution reasoning.
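
As a rough illustration of this kind of pipeline, the sketch below generates small Python programs from a tiny template inventory, executes each candidate to validate it and record the ground-truth output, and discards anything that fails or exceeds a conciseness bound. The templates, input sampling, and length limit are assumptions for illustration, not the authors' implementation.

```python
import random

# Illustrative templates exercising Python built-in methods on str/list inputs.
TEMPLATES = [
    ("s.replace({a!r}, {b!r}).upper()", "str"),
    ("sorted(s)[::{step}]", "list"),
]

def make_case(rng: random.Random):
    """Generate one candidate program, validate it by execution, or reject it."""
    expr, _kind = rng.choice(TEMPLATES)
    expr = expr.format(a=rng.choice("abc"), b=rng.choice("xyz"),
                       step=rng.choice([1, 2]))
    src = f"def f(s):\n    return {expr}\n"
    inp = "".join(rng.choice("abcab") for _ in range(rng.randint(3, 6)))
    ns = {}
    try:
        exec(src, ns)            # validate: the generated program must run
        out = ns["f"](inp)       # and produce a concrete ground-truth output
    except Exception:
        return None              # discard invalid cases
    if len(src) > 200:           # keep only concise programs
        return None
    return {"code": src, "input": inp, "output": out}

rng = random.Random(0)
cases = [c for c in (make_case(rng) for _ in range(20)) if c is not None]
print(cases[0])
```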

Two-Stage Training: Instruction Tuning and Reinforcement Learning

Instruction Tuning

Instruction tuning is performed using chain-of-thought (CoT) traces distilled from a strong teacher model (QwQ-32B). Both forward (input-to-output) and backward (output-to-input) reasoning tasks are included. Rejection sampling ensures only valid, executable reasoning traces are used. This stage injects domain-specific reasoning patterns into the student model, which is essential for subsequent RL.
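
A minimal sketch of such a rejection-sampling filter, under stated assumptions, is shown below: a teacher trace is kept only if the answer it commits to matches the result of actually running the program. The [ANSWER]...[/ANSWER] convention and the entry-point name `f` are assumptions, not the paper's exact trace format.

```python
import ast
import re

def keep_trace(code: str, test_input, teacher_trace: str) -> bool:
    """Keep a teacher CoT trace only if its final answer matches real execution."""
    m = re.search(r"\[ANSWER\](.*?)\[/ANSWER\]", teacher_trace, re.S)
    if m is None:
        return False                      # trace never committed to an answer
    ns = {}
    exec(code, ns)                        # run the program to get ground truth
    actual = ns["f"](test_input)
    try:
        predicted = ast.literal_eval(m.group(1).strip())
    except (ValueError, SyntaxError):
        return False                      # answer is not a parseable literal
    return predicted == actual

code = "def f(xs):\n    return [x * 2 for x in xs]\n"
trace = "Each element is doubled, so f([1, 2, 3]) gives [ANSWER][2, 4, 6][/ANSWER]"
print(keep_trace(code, [1, 2, 3], trace))  # True: this trace survives filtering
```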

Reinforcement Learning with GRPO

The second stage applies GRPO, a memory-efficient RL algorithm that optimizes for concise, correct reasoning chains. The reward function is binary, granting positive reward only for correct, non-verbose answers. GRPO operates by generating multiple candidate responses per prompt, scoring them, and updating the policy based on group-relative advantages, with KL regularization to stabilize training (Figure 4).

Figure 4: Case study demonstrating the effectiveness of the two-stage training pipeline, with clear improvements in both forward and backward reasoning after RL.

This RL stage addresses two key issues: (1) it mitigates the tendency of instruction-tuned models to produce overly long or repetitive reasoning chains, and (2) it enhances generalization to unseen reasoning tasks.
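
To ground the mechanics described above, the sketch below pairs a binary reward (positive only for a correct, sufficiently concise answer) with the group-relative advantage computation that replaces a learned value function. The exact-match check and length cap are illustrative assumptions rather than the paper's reward implementation.

```python
import torch

def binary_reward(answer: str, ground_truth: str, max_chars: int = 2048) -> float:
    """Assumed reward: 1.0 only for a correct, not-overly-long answer, else 0.0."""
    correct = answer.strip() == ground_truth.strip()
    concise = len(answer) <= max_chars
    return float(correct and concise)

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: shape (G,), one scalar per sampled completion of the same prompt.
    # No value network is needed: each completion is scored relative to its
    # group's mean and standard deviation. KL regularization against the
    # reference policy is applied separately in the full GRPO objective.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Five completions sampled for one prompt; two produced the correct answer.
rewards = torch.tensor([binary_reward(a, "[2, 4, 6]")
                        for a in ["[2, 4, 6]", "[2, 4]", "[1, 2, 3]", "[2, 4, 6]", "6"]])
print(group_relative_advantages(rewards))
```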

Experimental Results

Benchmarks and Baselines

CodeReasoner is evaluated on CRUXEval, LiveCodeBench, and REval, covering both high-level and fine-grained code reasoning tasks. Baselines include state-of-the-art closed-source models (GPT-4o, GPT-4o-mini), large open-source models (Qwen2.5-72B, Llama3-70B), and specialized code reasoning models (SEMCODER, CODEI/O).

Performance

  • On input/output and output/input prediction tasks, CodeReasoner-7B matches or exceeds GPT-4o-mini and outperforms all open-source baselines, with average improvements of 16.4% over GPT-4o-mini and 32.8% over specialized 7B baselines.
  • On fine-grained reasoning tasks (REval), CodeReasoner-7B outperforms similarly sized models by 21.9%–39.6% and even surpasses GPT-4o in several metrics. The 14B variant outperforms GPT-4o on average.
  • Ablation studies confirm that both instruction tuning and RL are necessary; RL alone is ineffective without prior domain knowledge, and omitting reasoning chains leads to substantial performance drops (Figure 5).

Figure 5: Training dynamics during RL: mean reward increases, average response length decreases, and the clipping ratio (overly long responses) drops, indicating improved conciseness and correctness.

Model Generalization

The framework is shown to be effective across different model architectures (Qwen2.5, Llama3) and sizes (7B, 8B, 14B), with relative improvements of 17.8%–31.0%.

Discussion

The RL stage, particularly with GRPO, is effective at reducing overthinking and verbosity, a problem exacerbated by instruction tuning. Unlike RL in mathematical domains, where response length often increases, here the average response length decreases as the model learns to avoid unnecessary reasoning steps. This is attributed to the task-specific reward structure and the initial tendency of instruction-tuned models to overgenerate.

Implications and Future Directions

Practical Implications

  • For practitioners, CodeReasoner demonstrates that small and medium LLMs can achieve code reasoning performance on par with much larger or proprietary models, provided that training data and objectives are carefully designed.
  • For tool developers, the approach is directly applicable to building more reliable AI-powered debugging, code review, and program repair assistants, especially in resource-constrained environments.

Theoretical Implications

  • The results highlight the limitations of SFT for generalization in code reasoning and the necessity of RL for robust, concise reasoning.
  • The findings suggest that dataset construction methodologies that focus on execution logic, rather than code structure, are critical for advancing LLM reasoning capabilities.

Future Work

  • Extending the dataset and training pipeline to additional programming languages and multilingual benchmarks.
  • Scaling the approach to larger models and more diverse architectures.
  • Integrating CodeReasoner into end-to-end developer tools for debugging and automated repair.

Conclusion

CodeReasoner presents a comprehensive solution to the challenges of code reasoning in LLMs, combining principled dataset construction with a two-stage training pipeline. The approach yields substantial improvements over both open-source and proprietary baselines, particularly in small and medium model regimes. The work establishes a new standard for code reasoning evaluation and provides a blueprint for future research in LLM-based software engineering tools.
