Kimi K2: Open Agentic Intelligence Model
- Kimi K2 is an open agentic intelligence model built on a trillion-parameter MoE transformer architecture that enables dynamic tool use and efficient computational performance.
- It employs the innovative MuonClip optimizer with QK-clip for stable, efficient pre-training over 15.5 trillion tokens, significantly reducing loss spikes.
- The model achieves state-of-the-art results in agentic, coding, and reasoning benchmarks through a comprehensive post-training regime combining supervised tuning and joint RL with self-critique.
Kimi K2 is an open agentic intelligence model built on a Mixture-of-Experts (MoE) transformer foundation, designed to advance the capabilities of open-source LLMs with a focus on dynamic tool-use, complex reasoning, and software engineering. Developed with a total parameter count of approximately 1 trillion and employing advanced optimization and post-training strategies, Kimi K2 demonstrates state-of-the-art performance on multiple standardized agentic, coding, and reasoning benchmarks among non-thinking LLMs (Team et al., 28 Jul 2025).
1. Model Architecture and MoE Design
Kimi K2 is architected as a trillion-parameter (≈1.04T) MoE transformer, with only 32 billion parameters activated on each forward pass. The model uses Multi-head Latent Attention (MLA), with a model-wide hidden size of 7168 and a per-expert hidden size of 2048.
- The MoE routing employs 384 total experts, with 8 experts selected per token for each forward computation (a sparsity of 48, i.e., 384/8).
- Attention layers are configured with 64 heads, which is half the number used in prior models like DeepSeek-V3, optimizing memory and computational requirements for long sequences.
- Only a small, dynamically selected portion of the network is involved per inference step, conferring both token and computational efficiency while preserving the representational capacity of a trillion-parameter model.
This architecture exploits the empirical sparsity scaling law: with the number of activated parameters held fixed, increasing the total expert count improves model quality, so a target loss can be reached with fewer FLOPs per token, provided the activation remains sufficiently sparse. A minimal sketch of the routing step follows the table below.
| Architectural Aspect | Kimi K2 Setting | Remarks |
|---|---|---|
| Total parameters | ~1 trillion | MoE with 384 experts |
| Activated parameters/token | 32 billion | 8 out of 384 experts per step |
| Hidden size | 7168 | Model-wide |
| MoE expert hidden size | 2048 | Per expert |
| Attention heads | 64 | Reduced for long-context efficiency |
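The routing behavior summarized above can be made concrete with a short sketch. The code below is a generic top-k softmax router written against the reported configuration (384 experts, 8 active per token, hidden size 7168); the router parameterization and normalization are illustrative assumptions, not Kimi K2's actual implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes mirroring the reported Kimi K2 configuration:
# 384 experts in total, 8 selected per token, hidden size 7168.
NUM_EXPERTS, TOP_K, HIDDEN = 384, 8, 7168

def route_tokens(hidden_states: torch.Tensor, router_weight: torch.Tensor):
    """Select top-k experts per token and return normalized gate weights.

    hidden_states: (num_tokens, HIDDEN)
    router_weight: (HIDDEN, NUM_EXPERTS) -- a learned linear router (assumed).
    """
    logits = hidden_states @ router_weight                      # (num_tokens, NUM_EXPERTS)
    probs = F.softmax(logits, dim=-1)
    gate_vals, expert_ids = torch.topk(probs, TOP_K, dim=-1)    # keep 8 of 384 experts
    gate_vals = gate_vals / gate_vals.sum(dim=-1, keepdim=True) # renormalize the gates
    return expert_ids, gate_vals  # only these 8 experts run per token

# Example: 4 tokens routed through a randomly initialized router.
tokens = torch.randn(4, HIDDEN)
router = torch.randn(HIDDEN, NUM_EXPERTS) * 0.02
ids, gates = route_tokens(tokens, router)
print(ids.shape, gates.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```

Because only the 8 selected experts execute for each token, the per-token compute stays close to that of a 32B dense model even though the full parameter pool is roughly 1T.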
2. Training Algorithms and Optimization
The model's pre-training uses the MuonClip optimizer, which extends the Muon optimizer with enhanced stability and efficiency:
- Muon already delivers strong token efficiency and is extended with weight decay and AdamW-consistent RMS scaling, but at scale it suffers from training instability in the attention modules (exploding attention logits).
- MuonClip introduces the "QK-clip" mechanism: each attention head's maximum pre-softmax logit, $S^h_{\max} = \max_{X \in B} \max_{i,j} \frac{1}{\sqrt{d}}\, q_i^{h\top} k_j^h$, is monitored per batch $B$ and clipped if it exceeds a threshold ($\tau = 100$).
- The clipping update per head is
  $$W_q^h \leftarrow \sqrt{\gamma_h}\, W_q^h, \qquad W_k^h \leftarrow \sqrt{\gamma_h}\, W_k^h,$$
  where $\gamma_h = \min\!\left(1,\ \tau / S^h_{\max}\right)$ and $d$ is the per-head query/key dimension.
- For MLA layers, QK-clip acts only on the head-specific parameters, preserving other weight components.
MuonClip prevents loss spikes entirely, enabling stable pre-training on 15.5 trillion tokens.
| Optimizer | Logit Stability | Token Efficiency | Unique Features |
|---|---|---|---|
| Muon | No | Yes | AdamW-style, RMS scaling |
| MuonClip | Yes | Yes | QK-clip, no loss spikes |
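A minimal sketch of the QK-clip step described above, applied to a single attention head, is shown below. Tensor names and shapes are assumptions for illustration; the rescaling is written here as a standalone function, whereas MuonClip applies it as part of the weight update and, for MLA, only to the head-specific components.

```python
import torch

TAU = 100.0  # clipping threshold reported for Kimi K2

def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor,
             queries: torch.Tensor, keys: torch.Tensor) -> None:
    """Rescale one head's query/key projections in place if its max logit exceeds TAU.

    queries, keys: (seq_len, head_dim) activations for this head on the current batch.
    w_q, w_k:      this head's query/key projection weights (shapes are illustrative).
    """
    head_dim = queries.shape[-1]
    # Maximum pre-softmax attention logit observed for this head on the batch.
    s_max = (queries @ keys.T / head_dim ** 0.5).max().item()
    if s_max > TAU:
        gamma = TAU / s_max
        # Split the correction evenly between the query and key projections.
        w_q.mul_(gamma ** 0.5)
        w_k.mul_(gamma ** 0.5)

# Toy example: a head with deliberately large activations gets clipped.
q = torch.randn(16, 128) * 50
k = torch.randn(16, 128) * 50
wq, wk = torch.randn(7168, 128), torch.randn(7168, 128)
qk_clip_(wq, wk, q, k)
```

Splitting the factor as $\sqrt{\gamma_h}$ on both projections caps the logit at $\tau$ while leaving the relative scale of queries and keys unchanged.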
3. Post-Training Regime and Agentic Data Synthesis
Kimi K2's post-training is designed for agentic capability, proceeding through several stages:
- Supervised instruction fine-tuning: Utilizes a large, diversified dataset spanning general instruction, coding, mathematics, and tool usage.
- Agentic data synthesis pipeline: Automatically creates multi-turn tool-use trajectories. The pipeline encompasses:
- Tool specification (drawn from real-world and synthetic tool pools).
- Rubric-driven agent and task generation, with varied system messaging.
- Multi-turn demonstrations within simulated (and occasionally real) environment contexts.
- Joint RL with self-critique: The RL stage employs verifiable rewards (e.g., code test passes, objective math grades) plus a self-critique rubric reward, in which the model scores its answers on clarity, engagement, and factuality, with token-length constraints and temperature decay (a rough reward-combination sketch follows below).
This strategy produces tens of thousands of complex, high-quality tool-use exemplars for post-training.
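As a rough illustration of how a verifiable reward might be blended with a self-critique rubric reward in such an RL stage, consider the sketch below. The rubric dimensions mirror those named above (clarity, engagement, factuality); the blending weight, length penalty, and function names are hypothetical, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class RubricScores:
    clarity: float      # each scored in [0, 1] by a self-critique pass
    engagement: float
    factuality: float

def verifiable_reward(tests_passed: int, tests_total: int) -> float:
    """Objective reward, e.g. fraction of unit tests a generated patch passes."""
    return tests_passed / max(tests_total, 1)

def self_critique_reward(scores: RubricScores, num_tokens: int,
                         max_tokens: int = 2048) -> float:
    """Rubric-based reward with a simple token-length penalty (illustrative)."""
    rubric = (scores.clarity + scores.engagement + scores.factuality) / 3.0
    length_penalty = min(1.0, max_tokens / max(num_tokens, 1))
    return rubric * length_penalty

def joint_reward(tests_passed: int, tests_total: int,
                 scores: RubricScores, num_tokens: int,
                 critique_weight: float = 0.3) -> float:
    """Blend objective and self-critique signals (the weight is an assumption)."""
    return ((1 - critique_weight) * verifiable_reward(tests_passed, tests_total)
            + critique_weight * self_critique_reward(scores, num_tokens))

# Example: a coding rollout that passes 7/10 tests and reads reasonably well.
print(joint_reward(7, 10, RubricScores(0.9, 0.6, 0.8), num_tokens=900))
```

The key idea is that objective, verifiable signals anchor the policy on correctness, while the rubric reward shapes open-ended qualities that have no automatic checker.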
4. Benchmark Performance
Kimi K2 demonstrates state-of-the-art non-thinking performance in both agentic and technical domains, as summarized in the following table.
| Benchmark | Kimi K2 Score | Category |
|---|---|---|
| τ²-Bench | 66.1 | Agentic |
| ACEBench (En) | 76.5 | Agentic |
| SWE-Bench Verified | 65.8 | Software Engineering |
| SWE-Bench Multilingual | 47.3 | Multilingual Coding |
| LiveCodeBench v6 | 53.7 | Coding |
| AIME 2025 | 49.5 | Mathematics |
| GPQA-Diamond | 75.1 | Advanced Reasoning |
| OJBench | 27.1 | Coding |
On τ²-Bench and ACEBench (English), Kimi K2 surpasses most open and closed-source models in non-thinking settings. Its STEM and code benchmark scores, including LiveCodeBench, AIME, GPQA-Diamond, and OJBench, indicate very strong performance across software engineering, math, and logical reasoning tasks.
5. Core Capabilities and Application Domains
Kimi K2's training and post-training pipeline endow it with advanced capabilities:
- Coding and Software Engineering: Excels at competitive code generation, multi-turn code correction, and multilingual software tasks.
- Mathematical and Logical Reasoning: Solves advanced competition mathematics and open-ended STEM questions.
- Autonomous and Agentic Tool Use: Leverages joint RL and agentic data synthesis for planning, tool orchestration, and complex multi-step task execution, directly targeting software-agent and orchestration use cases.
- General Instruction Following: Extensive supervised tuning confers robust general-purpose language modeling suitable for long-context, code-mixed, and technical domains.
These characteristics make Kimi K2 suitable for applications in code assistance, large-scale software refactoring, technical support agents, mathematical research aids, and autonomous system orchestration.
6. Release Strategy and Future Research Directions
The release of both base and post-trained checkpoints is intended to catalyze research into agentic intelligence, allowing the broader community to experiment with and extend:
- More sophisticated agentic RL, especially through self-critique and hierarchical reward functions.
- Improved synthetic data generation for tool-use and planning.
- Scalability and efficiency tradeoffs by varying MoE sparsity and activation.
- Real-world tests on large-scale, persistent, agent-based decision systems.
Potential applications include autonomous software development, multi-stage scientific problem solving, and collaborative human–AI systems requiring reliable tool interaction and execution monitoring.
7. Significance and Comparison with Existing Models
Kimi K2 achieves leading performance among open non-thinking models, often outperforming alternatives such as DeepSeek-V3 and Qwen3 on a majority of standardized agentic and coding benchmarks. Its core innovation lies in combining a token- and memory-efficient, ultra-sparse MoE transformer architecture; robust, logit-stabilized optimizer (MuonClip with QK-clip); and a post-training approach that fuses large-scale agentic synthesis with RL and self-assessment. The model architecture, optimizer, and open release position Kimi K2 as a central resource for researchers pursuing the frontier of agentic LLMs (Team et al., 28 Jul 2025).