Qwen3-Coder-Max: Advanced Code Agent
- Qwen3-Coder-Max is a flagship coding agent from the Qwen3 series that generates code trajectories for realistic software engineering tasks and serves as a policy and training-data source for reward-model research.
- It achieves a baseline pass@1 rate of 67.0% on SWE-Bench Verified, which improves to 74.6% when its candidate patches are reranked by the execution-free reward model SWE-RM.
- SWE-RM's continuous scores in [0, 1] provide fine-grained, well-calibrated feedback and enable effective selection among long-context, multi-turn code patches.
Qwen3-Coder-Max is a flagship coding agent from the Qwen3 model series, designed as an advanced LLM for realistic software engineering tasks. Distinguished primarily by its role and empirical results in open-source coding agent benchmarks, Qwen3-Coder-Max serves as a black-box policy for trajectory generation and as a data source in the development of state-of-the-art reward models for software engineering agents (Shum et al., 26 Dec 2025).
1. Position and Baseline Performance
Qwen3-Coder-Max is identified as one of the highest-performing open-source 30B-scale code agents. In controlled experiments on the SWE-Bench Verified benchmark, which comprises 500 human-verified GitHub issue resolution tasks, the model demonstrates a baseline pass@1 rate of 67.0% under a rigorous test-time scaling (TTS) protocol (32 sampled runs per instance with temperature = 1.0 and top_p = 0.95). This performance places it among the strongest models in its class prior to the application of auxiliary reward modeling (Shum et al., 26 Dec 2025).
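For concreteness, the TTS estimate can be sketched as a simple sampling loop. The snippet below is a minimal illustration under the stated sampling settings; `run_agent` and `is_resolved` are hypothetical placeholders for the scaffolded agent rollout and the execution-based check, not the paper's actual harness.

```python
N_SAMPLES = 32      # independent rollouts per SWE-Bench Verified instance
TEMPERATURE = 1.0   # sampling temperature used in the TTS protocol
TOP_P = 0.95        # nucleus-sampling threshold used in the TTS protocol

def pass_at_1_under_tts(instances, run_agent, is_resolved):
    """Estimate pass@1 under TTS: average the per-sample resolution rate over instances.

    `run_agent(instance, temperature, top_p)` and `is_resolved(instance, trajectory)`
    are placeholder callables for the scaffolded agent rollout and the execution-based
    check; neither is the paper's actual interface.
    """
    rates = []
    for instance in instances:
        trajectories = [run_agent(instance, TEMPERATURE, TOP_P) for _ in range(N_SAMPLES)]
        resolved = [is_resolved(instance, t) for t in trajectories]
        rates.append(sum(resolved) / N_SAMPLES)
    return sum(rates) / len(rates)
```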
2. Application of SWE-RM: Execution-Free Reward Modeling
SWE-RM is a Mixture-of-Experts (MoE) reward model (30B total parameters, 3B per forward pass) introduced to give fine-grained, execution-free feedback for software engineering agents. In TTS, Qwen3-Coder-Max produces 32 candidate solutions per instance, which are then scored by SWE-RM; the highest-scoring patch is selected. This reranking procedure increases Qwen3-Coder-Max’s pass@1 on SWE-Bench Verified from 67.0% to 74.6%, a gain of 7.6 percentage points. The reward model’s scoring for each trajectory is continuous in [0,1], derived from logit-transformed probabilities for <YES> or <NO> tokens:
$$r(\tau) = \frac{\exp(z_{\text{YES}})}{\exp(z_{\text{YES}}) + \exp(z_{\text{NO}})}$$

where $z_{\text{YES}}$ and $z_{\text{NO}}$ are the reward model's logits for the <YES> and <NO> tokens, and $\tau$ is the code-fix trajectory.
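A minimal sketch of this scoring and reranking step, assuming the two-way softmax form above; `yes_no_logits` is a hypothetical stand-in for the reward model's logit readout, not SWE-RM's actual interface.

```python
import math

def swe_rm_score(z_yes: float, z_no: float) -> float:
    """Two-way softmax over the <YES>/<NO> logits, written as a numerically stable sigmoid."""
    return 1.0 / (1.0 + math.exp(z_no - z_yes))

def select_best_patch(trajectories, yes_no_logits):
    """Rerank candidate trajectories and return the one with the highest SWE-RM score.

    `yes_no_logits` is a placeholder callable returning the (z_yes, z_no) pair the
    reward model assigns to a trajectory.
    """
    return max(trajectories, key=lambda t: swe_rm_score(*yes_no_logits(t)))
```

Writing the softmax as a sigmoid of the logit difference keeps the score in [0, 1] while avoiding overflow for large logits.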
| Model | Baseline pass@1 | pass@1 with SWE-RM reranking (RM@32) | AUC | ECE |
|---|---|---|---|---|
| Qwen3-Coder-Max | 67.0% | 74.6% | 0.752→0.768 | 0.283→0.047 |
AUC (area under the ROC curve) and ECE (expected calibration error) further improve after applying SWE-RM, indicating better ranking and calibration of success probabilities.
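Both metrics can be computed directly from per-trajectory scores and binary resolution labels. The sketch below uses scikit-learn's `roc_auc_score` for AUC and a standard equal-width binning estimator for ECE; the 10-bin choice is an assumption, not a setting reported in the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(scores, labels, n_bins: int = 10) -> float:
    """ECE: bin predictions by score and average |mean score - empirical success rate|, weighted by bin size."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (scores >= lo) & (scores < hi) if hi < 1.0 else (scores >= lo)
        if in_bin.any():
            ece += in_bin.mean() * abs(scores[in_bin].mean() - labels[in_bin].mean())
    return ece

def ranking_and_calibration(scores, labels):
    """AUC measures how well resolved patches rank above unresolved ones; ECE measures score calibration."""
    return roc_auc_score(labels, scores), expected_calibration_error(scores, labels)
```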
3. Evaluation Protocol and Metrics
Experiments use the SWE-Bench Verified dataset, comprising 500 real-world GitHub issues, each evaluated with 32 independent samples per instance, totaling 16,000 trajectories. The reward model training leverages approximately 100,000 trajectories collected from multiple policy models (including Qwen3-Coder-Max and external models) deployed within the OpenHands and SWE-Agent scaffolding systems, and sources data from four major datasets (SWE-Gym, SWE-rebench, SWE-smith, R2E-Gym). Mixing on-policy and off-policy data with a 2:1 positive-to-negative ratio yields the best calibration and ranking performance. Model context lengths are extended up to 256k tokens, ensuring that complex, long multi-turn patches are retained without truncation.
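The 2:1 positive-to-negative ratio can be realized by downsampling whichever class is over-represented. The snippet below is one illustrative way to build such a mixture; the `resolved` field and the sampling scheme are assumptions, not the paper's data pipeline.

```python
import random

def mix_trajectories(trajectories, pos_per_neg: int = 2, seed: int = 0):
    """Subsample so resolved (positive) trajectories outnumber unresolved ones roughly 2:1.

    Each trajectory is assumed to carry a boolean `resolved` flag; which class is
    downsampled depends on the raw counts in the collected pool.
    """
    rng = random.Random(seed)
    pos = [t for t in trajectories if t["resolved"]]
    neg = [t for t in trajectories if not t["resolved"]]
    if len(pos) >= pos_per_neg * len(neg):
        pos = rng.sample(pos, pos_per_neg * len(neg))   # too many positives: trim them
    else:
        neg = rng.sample(neg, len(pos) // pos_per_neg)  # too many negatives: trim them
    mixed = pos + neg
    rng.shuffle(mixed)
    return mixed
```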
4. Theoretical Rationale for Performance Uplift
Traditional test-case-based (execution-based) verifiers only provide binary signals—pass or fail—which result in sparse or ambiguous feedback during both TTS and reinforcement learning (RL). SWE-RM addresses this by generating a continuous reward for every trajectory, thus enabling:
- Fine-grained feedback: Continuous scores differentiate candidate trajectories that a binary pass/fail signal would treat as identical.
- Strong discrimination: High AUC values ensure that resolutions consistently rank above non-resolutions, yielding more reliable gradient signals during RL.
- Excellent calibration: Low ECE aligns reward magnitudes with the true probability of task resolution, preventing training instabilities due to over- or under-confidence.
- Enhanced generalization: Large-scale, mixed-policy/source, and long-context training data reduce out-of-distribution (OOD) errors, especially for Qwen3-Coder-Max’s multi-file, multi-turn patches (Shum et al., 26 Dec 2025).
5. Reinforcement Learning with Hybrid Reward
Qwen3-Coder-Max itself is not RL-trained in the original experiments; instead, a comparable 30B policy is used to demonstrate the efficacy of a hybrid reward that combines binary unit-test pass/fail signals with the continuous, execution-free SWE-RM score (one possible combination is sketched below).
On SWE-Bench Verified, this approach increases pass@1 from 51.8% (execution-based only) to 54.8% for the alternative model, illustrating SWE-RM’s utility in RL contexts.
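The exact combination rule is not reproduced in the available description; the sketch below shows one plausible hybrid that interpolates the binary execution signal with the SWE-RM score, where the weight `alpha` is an illustrative assumption rather than a reported hyperparameter.

```python
def hybrid_reward(unit_tests_passed: bool, rm_score: float, alpha: float = 0.5) -> float:
    """One plausible hybrid reward: interpolate the binary execution-based signal
    with the continuous, execution-free SWE-RM score. The weight is illustrative."""
    exec_reward = 1.0 if unit_tests_passed else 0.0
    return alpha * exec_reward + (1.0 - alpha) * rm_score
```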
6. Limitations and Documentation Status
The “Qwen3 Technical Report” (Yang et al., 14 May 2025) does not document Qwen3-Coder-Max in its architecture, training protocols, or evaluation sections. No additional pretraining, instruction tuning, or implementation specifics are available for Qwen3-Coder-Max in current open-source technical reports beyond its use as a black-box policy and benchmark agent in SWE-RM evaluations. All claims regarding Qwen3-Coder-Max’s behavior, benchmark status, and performance derive specifically from SWE-RM experimental results (Shum et al., 26 Dec 2025). Any further architectural or methodological details would require separate documentation.
7. Context and Significance
Qwen3-Coder-Max, as evaluated with SWE-RM, establishes a new state-of-the-art among open-source agents in the 30B parameter regime for realistic software engineering tasks. Its performance showcases the benefits of execution-free, calibrated reward modeling for scaling coding agents to challenging, long-context, multi-turn settings, and motivates ongoing research into the synergy of large code LLMs and advanced reward modeling frameworks.