Qwen3-Coder-Max: Advanced Code Agent

Updated 29 December 2025
  • Qwen3-Coder-Max is a flagship coding agent from the Qwen3 series that serves as the trajectory-generating policy in recent work on execution-free reward modeling for realistic software engineering tasks.
  • It achieves a baseline pass@1 rate of 67.0% on SWE-Bench Verified, which improves to 74.6% when its candidate solutions are reranked with the execution-free reward model SWE-RM.
  • Continuous score feedback from SWE-RM enables fine-grained calibration and effective handling of the model’s long-context, multi-turn code patch generation.

Qwen3-Coder-Max is a flagship coding agent from the Qwen3 model series, designed as an advanced LLM for realistic software engineering tasks. Distinguished primarily by its role and empirical results in open-source coding agent benchmarks, Qwen3-Coder-Max serves as a black-box policy for trajectory generation and as a data source in the development of state-of-the-art reward models for software engineering agents (Shum et al., 26 Dec 2025).

1. Position and Baseline Performance

Qwen3-Coder-Max is identified as one of the highest-performing open-source 30B-scale code agents. In controlled experiments on the SWE-Bench Verified benchmark, which comprises 500 human-verified GitHub issue resolution tasks, the model demonstrates a baseline pass@1 rate of 67.0% under a rigorous test-time scaling (TTS) protocol (k = 32 sampled runs per instance, with temperature = 1.0 and top_p = 0.95). This performance places it among the strongest models in its class prior to the application of auxiliary reward modeling (Shum et al., 26 Dec 2025).
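
As a rough sketch of this evaluation protocol (not the authors’ actual harness), the following Python snippet estimates pass@1 under k = 32 sampled runs per instance; `agent.generate_patch` and `run_tests` are hypothetical stand-ins for the agent scaffold and the execution-based checker.

```python
import statistics

K = 32  # sampled runs per SWE-Bench Verified instance under the TTS protocol

def estimate_pass_at_1(instances, agent, run_tests):
    """Estimate pass@1 under test-time scaling: for each instance, sample K
    independent trajectories (temperature=1.0, top_p=0.95) and record the
    fraction of runs whose patch resolves the issue; pass@1 is the mean of
    these per-instance success rates."""
    per_instance_rates = []
    for inst in instances:
        successes = 0
        for _ in range(K):
            patch = agent.generate_patch(inst, temperature=1.0, top_p=0.95)
            if run_tests(inst, patch):  # execution-based resolved/not-resolved check
                successes += 1
        per_instance_rates.append(successes / K)
    return statistics.mean(per_instance_rates)
```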

2. Application of SWE-RM: Execution-Free Reward Modeling

SWE-RM is a Mixture-of-Experts (MoE) reward model (30B total parameters, 3B active per forward pass) introduced to give fine-grained, execution-free feedback for software engineering agents. In TTS, Qwen3-Coder-Max produces 32 candidate solutions per instance, which are then scored by SWE-RM; the highest-scoring patch is selected. This reranking procedure increases Qwen3-Coder-Max’s pass@1 on SWE-Bench Verified from 67.0% to 74.6%, a gain of 7.6 percentage points. The reward model’s score for each trajectory is continuous in [0, 1], obtained by renormalizing the probabilities it assigns to the <YES> and <NO> tokens:

$$r_{EF}(T) = \frac{\exp(\ell_{\text{yes}})}{\exp(\ell_{\text{yes}}) + \exp(\ell_{\text{no}})}$$

where $\ell_{\text{yes}} = \log P(\langle\text{YES}\rangle \mid T)$, $\ell_{\text{no}} = \log P(\langle\text{NO}\rangle \mid T)$, and $T$ is the code-fix trajectory.
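A minimal sketch of this scoring and reranking step, assuming a hypothetical `yes_no_logits(trajectory)` helper that returns the reward model’s log-probabilities (or logits) for the <YES> and <NO> tokens:

```python
import math

def execution_free_reward(yes_logit: float, no_logit: float) -> float:
    """r_EF(T): renormalized probability of <YES> over {<YES>, <NO>},
    continuous in [0, 1]; higher means the trajectory is judged more
    likely to resolve the issue."""
    m = max(yes_logit, no_logit)        # subtract max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

def rerank_best_of_n(trajectories, yes_no_logits):
    """RM@N selection: score every candidate trajectory with SWE-RM and
    return the highest-scoring one; its patch is the submitted solution."""
    scored = [(execution_free_reward(*yes_no_logits(t)), t) for t in trajectories]
    return max(scored, key=lambda pair: pair[0])[1]
```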

| Model | Baseline pass@1 | RM@32 with SWE-RM | AUC | ECE |
| --- | --- | --- | --- | --- |
| Qwen3-Coder-Max | 67.0% | 74.6% | 0.752 → 0.768 | 0.283 → 0.047 |

AUC (area under the ROC curve) and ECE (expected calibration error) further improve after applying SWE-RM, indicating better ranking and calibration of success probabilities.
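
Both diagnostics can be reproduced in outline as follows; the snippet assumes arrays of per-trajectory SWE-RM scores and binary resolution labels, and uses scikit-learn only for the ROC AUC (the equal-width-bin ECE shown here is a common convention, not necessarily the exact scheme used in the paper).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_and_ece(rewards, resolved, n_bins: int = 10):
    """AUC: probability that a resolved trajectory is scored above an
    unresolved one. ECE: size-weighted average gap, over equal-width bins,
    between mean predicted reward and empirical resolution rate."""
    rewards = np.asarray(rewards, dtype=float)
    resolved = np.asarray(resolved, dtype=int)

    auc = roc_auc_score(resolved, rewards)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        last = i == n_bins - 1
        in_bin = (rewards >= lo) & ((rewards <= hi) if last else (rewards < hi))
        if in_bin.any():
            gap = abs(rewards[in_bin].mean() - resolved[in_bin].mean())
            ece += in_bin.sum() / len(rewards) * gap
    return auc, ece
```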

3. Evaluation Protocol and Metrics

Experiments use the SWE-Bench Verified dataset, comprising 500 real-world GitHub issues, each evaluated with 32 independent samples per model instance, totaling 16,000 trajectories. The reward model training leverages approximately 100,000 trajectories collected from multiple policy models (including Qwen3-Coder-Max and external models) deployed within the OpenHands and SWE-Agent scaffolding systems, and sources data from four major datasets (SWE-Gym, SWE-rebench, SWE-smith, R2E-Gym). Data mixture (on-policy and off-policy) and a positive:negative ratio of 2:1 in training yield optimal calibration and ranking performance. Model context lengths are extended up to 256k tokens, ensuring that complex, long multi-turn patches are retained without truncation.
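
A simplified sketch of how such a mixture might be assembled; the 2:1 positive:negative ratio comes from the paper, but the function and sampling logic below are illustrative assumptions rather than the authors’ pipeline.

```python
import random

def build_training_mix(positive_trajs, negative_trajs, pos_per_neg=2, seed=0):
    """Assemble a reward-model training set with a 2:1 positive:negative
    ratio, pooling on-policy and off-policy trajectories. Positives are
    resolved trajectories, negatives unresolved ones."""
    rng = random.Random(seed)
    n_neg = min(len(negative_trajs), len(positive_trajs) // pos_per_neg)
    n_pos = n_neg * pos_per_neg
    mix = rng.sample(positive_trajs, n_pos) + rng.sample(negative_trajs, n_neg)
    rng.shuffle(mix)
    return mix
```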

4. Theoretical Rationale for Performance Uplift

Traditional test-case-based (execution-based) verifiers only provide binary signals—pass or fail—which result in sparse or ambiguous feedback during both TTS and reinforcement learning (RL). SWE-RM addresses this by generating a continuous reward for every trajectory, thus enabling:

  • Fine-grained feedback: Scores that distinguish otherwise indistinguishable successes or failures among candidate trajectories.
  • Strong discrimination: High AUC values ensure that resolutions consistently rank above non-resolutions, yielding more reliable gradient signals during RL.
  • Excellent calibration: Low ECE aligns reward magnitudes with the true probability of task resolution, preventing training instabilities due to over- or under-confidence.
  • Enhanced generalization: Large-scale, mixed-policy/source, and long-context training data reduce out-of-distribution (OOD) errors, especially for Qwen3-Coder-Max’s multi-file, multi-turn patches (Shum et al., 26 Dec 2025).

5. Reinforcement Learning with Hybrid Reward

Qwen3-Coder-Max itself is not RL-trained in the original experiments; however, a similar 30B policy is used to demonstrate hybrid reward efficacy. The reward combines unit-test pass/fail signals with the execution-free SWE-RM score:

$$r_{\text{hybrid}}(q, T, \text{patch}) = \begin{cases} 1 + r_{EF}, & \text{if the unit tests resolve the issue} \\ 0.5 + r_{EF}, & \text{if the interaction is unfinished} \\ 0 + r_{EF}, & \text{otherwise} \end{cases}$$

On SWE-Bench Verified, this approach increases pass@1 from 51.8% (execution-based only) to 54.8% for the alternative model, illustrating SWE-RM’s utility in RL contexts.
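
A direct transcription of the hybrid reward above; the `test_verdict` argument is a hypothetical three-valued flag (True if the unit tests resolve the issue, None if the interaction is unfinished, False otherwise).

```python
def hybrid_reward(r_ef: float, test_verdict) -> float:
    """Combine the execution-based verdict with the execution-free SWE-RM
    score r_EF in [0, 1], following the case split above."""
    if test_verdict is True:    # patch resolves the issue under unit tests
        return 1.0 + r_ef
    if test_verdict is None:    # agent interaction unfinished, no verdict
        return 0.5 + r_ef
    return 0.0 + r_ef           # patch fails the unit tests
```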

6. Limitations and Documentation Status

The “Qwen3 Technical Report” (Yang et al., 14 May 2025) does not document Qwen3-Coder-Max in its architecture, training protocols, or evaluation sections. No additional pretraining, instruction tuning, or implementation specifics are available for Qwen3-Coder-Max in current open-source technical reports beyond its use as a black-box policy and benchmark agent in SWE-RM evaluations. All claims regarding Qwen3-Coder-Max’s behavior, benchmark status, and performance derive specifically from SWE-RM experimental results (Shum et al., 26 Dec 2025). Any further architectural or methodological details would require separate documentation.

7. Context and Significance

Qwen3-Coder-Max, as evaluated with SWE-RM, establishes a new state-of-the-art among open-source agents in the 30B parameter regime for realistic software engineering tasks. Its performance showcases the benefits of execution-free, calibrated reward modeling for scaling coding agents to challenging, long-context, multi-turn settings, and motivates ongoing research into the synergy of large code LLMs and advanced reward modeling frameworks.
