Composer 2: Agentic Code Generation LLM

Updated 28 March 2026
  • Composer 2 is a specialized large language model designed for agentic software engineering, integrating technologies like multi-head latent attention and sparse expert activation.
  • Its dual-phase training regimen, combining extended pretraining with large-scale reinforcement learning, enables adaptive code generation and precise tool reasoning.
  • Architectural innovations such as context parallelism and decoupled expert parallelism drive efficient scaling and superior performance on both internal and public benchmarks.

Composer 2 is a specialized LLM designed for agentic software engineering, demonstrating frontier-level coding intelligence, strong long-horizon planning, and robust interactive problem-solving. Trained with domain-specific objectives and advanced infrastructural integration, Composer 2 establishes a new state-of-the-art in code generation, manipulation, and tool reasoning on both internal and public engineering benchmarks (Research et al., 25 Mar 2026).

1. Architectural Foundations and Design Innovations

Composer 2 is built upon a 1.04 trillion-parameter Mixture-of-Experts (MoE) transformer, employing approximately 32 billion active parameters per token. The architecture fundamentally features:

  • Multi-Head Latent Attention (MLA): Efficient handling of long contexts, supporting window sizes up to 256k tokens.
  • Sparsely Gated Experts: Enabling extensive parameter scaling with sublinear compute growth by activating a subset of experts per input (see the gating sketch after this list).
  • Multi-Token Prediction (MTP) Heads: Integration of MTP for speculative decoding, trained via self-distillation to ensure production-aligned outputs.
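
The sparse routing described above can be made concrete with a minimal top-k gating sketch. The dimensions, expert count, and router design below are illustrative assumptions, not Composer 2's published configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Minimal top-k gated mixture-of-experts feed-forward layer (illustrative)."""

    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        gate_logits = self.router(x)                         # (tokens, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)  # pick top_k experts
        weights = F.softmax(weights, dim=-1)                 # renormalize gates
        out = torch.zeros_like(x)
        # Only the selected experts run for each token, so compute grows
        # sublinearly in the total parameter count.
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```

Production MoE layers replace the Python loop with grouped GEMMs, which is what the decoupled expert parallelism described below targets.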

Key system-level enhancements over prior Composer models include:

  • Context Parallelism (CP): Replaces traditional tensor parallelism in attention layers, reducing interprocess communication and supporting extremely long contexts.
  • Decoupled Expert Parallelism (EP): Allows expert weights to be sharded independently, enabling larger MoE scaling and more efficient dense matrix multiplications (GEMMs).
  • Self-Summarization: Chains local summary segments, maintaining long-term contextual awareness without exceeding the context window or resource constraints.
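
Self-summarization can be pictured as a simple loop: when the transcript nears the context budget, the agent replaces it with a generated summary and continues. The `generate` helper and the thresholds below are hypothetical stand-ins, not Cursor's actual API:

```python
CONTEXT_BUDGET = 256_000   # context window in tokens
SUMMARY_TRIGGER = 0.9      # summarize once 90% of the budget is used (assumed)

def count_tokens(text: str) -> int:
    return len(text.split())  # crude placeholder for a real tokenizer

def generate(prompt: str) -> str:
    raise NotImplementedError("stand-in for an LLM completion call")

def run_episode(task: str, max_steps: int = 100) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        if count_tokens(transcript) > SUMMARY_TRIGGER * CONTEXT_BUDGET:
            # Chain a local summary segment: the summary replaces the raw
            # transcript, preserving long-term awareness within the window.
            transcript = "Summary of progress so far:\n" + generate(
                transcript + "\nSummarize the work above, keeping open TODOs."
            )
        step = generate(transcript + "\nNext action:")
        transcript += step + "\n"
        if "DONE" in step:
            break
    return transcript
```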

Compared to Composer 1.5, these innovations supply both the architectural capacity and the algorithmic flexibility required for coordinated, extended agent workflows (Research et al., 25 Mar 2026).

2. Domain-Specialized Training Regimen

Composer 2’s training is staged in two distinct phases to transform a general LLM into a proficient agentic software engineering model:

2.1 Continued Pretraining

  • Base Model: Kimi K2.5, preselected via metrics on coding knowledge, state-tracking, and codebase perplexity.
  • Data Mix: Internal monorepos, open-source repositories, technical documentation, and targeted Q&A; subjected to quality filtering and staged presentation.
  • Objective: Standard cross-entropy loss,

$$L_{\text{CE}}(\theta) = -\,\mathbb{E}_{x\sim\mathcal{D}} \sum_{t} \log p_{\theta}(x_t \mid x_{<t})$$

  • Sequence Curriculum: Initial long-sequence pretraining at 32k tokens, followed by context extension to 256k tokens and targeted supervised fine-tuning (SFT).
  • MTP Distillation: Auxiliary Kullback-Leibler loss aligns logits from the main head and the MTP head,

$$L_{\text{MTP}}(\phi) = \mathbb{E}_{x}\sum_{t}\mathrm{KL}\!\left(p_{\theta}(\cdot \mid x_{<t})\,\big\|\,q_{\phi}(\cdot \mid x_{<t})\right)$$
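
Both objectives are compact to express in code. The PyTorch sketch below assumes per-position distillation onto a single MTP head (real MTP heads predict offset future tokens) and illustrative shapes:

```python
import torch.nn.functional as F

def pretrain_losses(main_logits, mtp_logits, targets):
    """main_logits, mtp_logits: (batch, seq, vocab); targets: (batch, seq)."""
    vocab = main_logits.size(-1)
    # L_CE: next-token cross-entropy on the main head.
    l_ce = F.cross_entropy(main_logits.reshape(-1, vocab), targets.reshape(-1))
    # L_MTP: KL(p_theta || q_phi). The teacher (main head) is detached so
    # only the MTP head receives distillation gradients (a design choice
    # assumed here, not stated in the source).
    teacher = F.log_softmax(main_logits.detach(), dim=-1)
    student = F.log_softmax(mtp_logits, dim=-1)
    l_mtp = F.kl_div(student, teacher, log_target=True, reduction="batchmean")
    return l_ce, l_mtp
```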

Empirical observations confirm that reducing post-SFT cross-entropy loss robustly predicts downstream RL reward.

2.2 Large-Scale Reinforcement Learning

  • Environment: Each RL episode is executed in a functionally complete Cursor-style harness instantiated on Anyrun Firecracker VMs, matching deployed tools and user environment. Available tools include file-system access, shell, semantic search, build/test tools, and web integration.
  • Policy Optimization: An on-policy, asynchronous grouped policy-gradient method (related to GRPO / Dr. GRPO).
  • Gradient Objective:

$$\nabla_{\theta}J = \mathbb{E}_{\tau\sim\pi_\theta}\big[A(\tau)\,\nabla_{\theta}\log\pi_\theta(\tau)\big]$$

where $A(\tau)$ is the group-wide advantage for a trajectory $\tau$.

  • KL Regularization: Penalty term for the KL divergence from a reference policy, estimated with $k_1 = -\log r$, where $r$ is the likelihood ratio of the reference policy to the current policy.
  • Reward Structure: Sum of correctness (test/functionality pass), code style (quality, diff cleanliness), and a nonlinear length penalty given by

$$C_{\mathrm{length},k,q}(x) = \frac{(1 + kx)^{1-q} - 1}{k(1-q)}$$

where $x$ aggregates thinking tokens, tool calls, and conversational turns. This reward design penalizes excessive verbosity on easy tasks while allowing deep reasoning for genuinely hard problems (see the sketch after this list).

  • Self-Summarization in RL: Parallel prompt–response chains are credited for informative summaries, which functionally recap long trajectories without context overflows.
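
The reward and update terms above can be sketched in a few lines of NumPy; the coefficients, weighting, and group size are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np

def length_penalty(x, k=0.001, q=2.0):
    """C_{length,k,q}(x) = ((1 + k*x)**(1 - q) - 1) / (k * (1 - q)).
    With q=2 this saturates as x / (1 + k*x); k and q here are assumed."""
    x = np.asarray(x, dtype=float)
    return ((1.0 + k * x) ** (1.0 - q) - 1.0) / (k * (1.0 - q))

def k1_kl_penalty(logp_policy, logp_ref):
    """k1 = -log r with r = pi_ref / pi_theta: the estimator of
    KL(pi_theta || pi_ref) the ablations settle on (Section 6)."""
    return logp_policy - logp_ref

def grouped_advantages(rewards):
    """Group-wide advantage: center each rollout's reward on its group mean,
    with no length normalization (per the ablation in Section 6)."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

# Example: one group of four rollouts for the same prompt.
correctness = np.array([1.0, 1.0, 0.0, 1.0])    # tests/functionality pass
style       = np.array([0.2, 0.1, 0.0, 0.2])    # quality, diff cleanliness
usage       = np.array([800, 3000, 500, 1200])  # thinking + tools + turns
rewards = correctness + style - 0.001 * length_penalty(usage)  # assumed weight
adv = grouped_advantages(rewards)  # A(tau) weighting grad log pi_theta(tau)
```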

The staged procedure—domain-adapted pretraining followed by goal-driven RL—enables the emergence of high-level planning, long-horizon coherence, and adaptive code-editing strategies (Research et al., 25 Mar 2026).

3. Unified Agentic Infrastructure

Composer 2’s training, validation, and deployment are all integrated into a unified infrastructure stack, centered on:

  • Ray + PyTorch Orchestration: Underpins distributed training and microservice separation (training, environment, inference, online evaluation).
  • End-to-End Cursor Harness: Ensures no tool drift between the agent's learning and user deployment environments; every RL rollout leverages the same toolset and APIs used in production.
  • Firecracker VMs with Dev Toolchains: Serve as fully featured, forkable sandboxes.
  • Production Parity and Shadow Deployment: Hot-syncing of inference weights ensures live settings reflect the experimental state, and backend feature parity is strictly maintained.
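
The microservice separation can be illustrated with a minimal Ray sketch; the actor names and methods are hypothetical stand-ins for the real training, environment, and inference services:

```python
import ray

ray.init()

@ray.remote
class InferenceService:
    """Serves rollouts; weights are hot-synced from the trainer."""
    def __init__(self):
        self.weights_version = 0

    def sync_weights(self, version: int) -> None:
        self.weights_version = version  # placeholder for a real weight load

    def rollout(self, prompt: str) -> str:
        return f"trajectory for {prompt!r} @ v{self.weights_version}"

@ray.remote
class EnvironmentService:
    """Wraps a sandboxed VM; here it just returns a mock reward."""
    def score(self, trajectory: str) -> float:
        return 1.0 if "trajectory" in trajectory else 0.0

# Independently scalable actors for inference and environments.
infer = InferenceService.remote()
env = EnvironmentService.remote()
traj = ray.get(infer.rollout.remote("fix the failing test"))
reward = ray.get(env.score.remote(traj))
```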

This infrastructure enables scalable, efficient, and reproducible iterative training, facilitating rapid deployment and evaluation cycles (Research et al., 25 Mar 2026).

4. Benchmarking and Empirical Performance

Composer 2 undergoes comprehensive evaluation against internal and public large-scale software engineering benchmarks:

Internal: CursorBench-3

  • Task Scope: Terse bug reports, major refactorings (median of 181 lines changed), and ambiguous specifications (median prompt length of 390 characters).
  • Evaluation Metric: Accuracy defined by end-state equivalence to curated gold diffs, including code quality checks.
  • Results: 61.3% accuracy on CursorBench-3, a 37% relative improvement over Composer 1.5 (44.2%) and a 61% gain over Composer 1 (38.0%).

Public: SWE-bench Multilingual and Terminal-Bench

  • SWE-bench Multilingual: The agent is prompted to resolve real GitHub issues in multiple languages, with autonomous test execution. Composer 2 scores 73.7%, exceeding Composer 1.5 by 7.8% and Composer 1 by 16.8%.
  • Terminal-Bench: On realistic command-line tasks, Composer 2 achieves 61.7%, competitive with GPT-5.4 (best-reported 66.5%) and Anthropic's Claude Code models, at substantially reduced inference cost.

Composer 2 is positioned on the Pareto frontier for cost-accuracy tradeoffs among frontier generalist LLMs (Research et al., 25 Mar 2026).

5. Long-Horizon Planning and Multi-Step Execution

A core competency of Composer 2 is robust, coherent, multi-step reasoning across extended workflows, including tool invocation, code editing, and iterative verification:

  • Build-Tool Transpilation Bug: Composer 2 synthesizes and applies a block-scope fix after tracing variable lifetime issues with retry loops, demonstrating self-summarization and cross-file reasoning.
  • Streaming Prefix-Detection: The model collates evidence across 954 log files, generates bespoke prefix-detection logic in Python, tunes hyperparameters, and designs test cases, managing both chain-of-thought and explicit tool interactions.

These examples illustrate scalable agentic behaviors that couple LLM abilities with real-world environment operations (Research et al., 25 Mar 2026).

6. Ablation Studies and Component Analysis

Experiments elucidate which factors contribute to Composer 2’s performance:

  • Pretraining Size vs. RL Reward: Larger, more code-specialized models yield nearly linear improvements in RL reward, confirming the primacy of domain-adaptive pretraining.
  • KL Divergence Estimator Choice: The $k_3 = (r - 1) - \log r$ estimator incurs unacceptable variance for large divergences, so $k_1 = -\log r$ is used exclusively to stabilize RL (see the numerical comparison after this list).
  • Rollout Length Handling: Over-long rollout masking does not improve, and possibly degrades, learning; thus, full-length unmasked trajectories are used, with self-summarization to avoid context overflow.
  • Length Penalty and Advantage Normalization: Retaining a nonlinear penalty and eschewing length normalization achieves superior average and best-of-K outcomes, preserving solution diversity.
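
The estimator comparison is straightforward to reproduce numerically: both $k_1$ and $k_3$ are unbiased for the KL divergence, but the $(r - 1)$ term in $k_3$ has heavy tails once the policies diverge. The Gaussian example below is illustrative:

```python
import numpy as np

def kl_estimators(logp_q, logp_p):
    """Samples drawn from q; r = p/q. k1 = -log r, k3 = (r - 1) - log r."""
    r = np.exp(logp_p - logp_q)
    return -np.log(r), (r - 1.0) - np.log(r)

# q = N(0, 1), p = N(3, 1): true KL(q || p) = 3**2 / 2 = 4.5.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)  # x ~ q
logp_q = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)
logp_p = -0.5 * (x - 3.0) ** 2 - 0.5 * np.log(2 * np.pi)
k1, k3 = kl_estimators(logp_q, logp_p)
print(f"k1: mean={k1.mean():.2f}  std={k1.std():.2f}")  # std ~ 3
print(f"k3: mean={k3.mean():.2f}  std={k3.std():.2f}")  # std blows up with r
```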

These findings collectively support the design choices underlying Composer 2’s training and optimization pipeline (Research et al., 25 Mar 2026).


Composer 2 exemplifies the synergistic combination of scale, specialized pretraining, and agentic reinforcement learning, tightly integrated within a bespoke engineering infrastructure and validated on challenging real-world software development tasks. Its leading benchmark performance and robust agentic behavior establish a template for future domain-specialized LLM development in software engineering (Research et al., 25 Mar 2026).
