Kimi K2: Open Agentic Intelligence Model

Updated 2 August 2025
  • Kimi K2 is an open agentic intelligence model built on a trillion-parameter MoE transformer architecture that enables dynamic tool use with high computational efficiency.
  • It employs the MuonClip optimizer with QK-clip for stable, efficient pre-training over 15.5 trillion tokens, eliminating loss spikes.
  • The model achieves state-of-the-art results in agentic, coding, and reasoning benchmarks through a comprehensive post-training regime combining supervised tuning and joint RL with self-critique.

Kimi K2 is an open agentic intelligence model built on a Mixture-of-Experts (MoE) transformer foundation, designed to advance the capabilities of open-source LLMs with a focus on dynamic tool-use, complex reasoning, and software engineering. Developed with a total parameter count of approximately 1 trillion and employing advanced optimization and post-training strategies, Kimi K2 demonstrates state-of-the-art performance on multiple standardized agentic, coding, and reasoning benchmarks among non-thinking LLMs (Team et al., 28 Jul 2025).

1. Model Architecture and MoE Design

Kimi K2 is architected as a trillion-parameter (≈1.04T) MoE transformer, with only 32 billion parameters activated on each forward pass. The model uses Multi-head Latent Attention (MLA); the model-wide hidden size is 7168 and each expert's hidden size is 2048.

  • The MoE routing employs 384 total experts, with 8 experts selected per token for each forward computation (sparsity setting: 48).
  • Attention layers are configured with 64 heads, which is half the number used in prior models like DeepSeek-V3, optimizing memory and computational requirements for long sequences.
  • Only a small, dynamically selected portion of the network is involved per inference step, conferring both token and computational efficiency while preserving the representational capacity of a trillion-parameter model.

This architecture capitalizes on the empirical scaling law that increasing expert count (with fixed activation) enhances performance while reducing FLOPs per token, provided the activation is sufficiently sparse.
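
As a rough illustration using the figures above, only a small fraction of the network is active for any token:

$$\frac{8}{384} = \frac{1}{48} \approx 2.1\%\ \text{of experts}, \qquad \frac{32\,\mathrm{B}}{1040\,\mathrm{B}} \approx 3.1\%\ \text{of parameters},$$

so per-token compute scales with the 32 billion activated parameters rather than the full trillion-parameter capacity (the parameter fraction exceeds the expert fraction because dense components such as attention layers and embeddings are always active).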

| Architectural Aspect | Kimi K2 Setting | Remarks |
|---|---|---|
| Total parameters | ~1 trillion | MoE with 384 experts |
| Activated parameters/token | 32 billion | 8 of 384 experts per step |
| Hidden size | 7168 | Model-wide |
| MoE expert hidden size | 2048 | Per expert |
| Attention heads | 64 | Reduced for long-context efficiency |
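
To make the routing concrete, here is a minimal PyTorch sketch of top-k expert routing using the dimensions above. The gating scheme (softmax over the selected experts' scores) and the two-layer expert MLP are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative top-k MoE layer. Kimi K2's reported settings are
    d_model=7168, d_expert=2048, n_experts=384, top_k=8."""

    def __init__(self, d_model, d_expert, n_experts, top_k):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_expert),
                nn.SiLU(),
                nn.Linear(d_expert, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x)                  # (n_tokens, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        gates = F.softmax(top_scores, dim=-1)    # weights over chosen experts
        out = torch.zeros_like(x)
        # Loop form for clarity; real systems batch tokens by expert.
        for t in range(x.size(0)):
            for k in range(self.top_k):
                expert = self.experts[int(top_idx[t, k])]
                out[t] += gates[t, k] * expert(x[t])
        return out

# Toy sizes for a runnable example; the full K2 configuration
# (384 experts at d_model=7168) would not fit in typical memory.
layer = SparseMoELayer(d_model=64, d_expert=32, n_experts=16, top_k=2)
y = layer(torch.randn(5, 64))                    # y: (5, 64)
```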

2. Training Algorithms and Optimization

The model's pre-training uses the MuonClip optimizer, which extends the Muon optimizer with enhanced stability and efficiency:

  • Muon already provides strong token efficiency and incorporates weight decay with AdamW-style RMS scaling of updates, but at scale it suffers from instability in the attention modules (exploding attention logits).
  • MuonClip introduces the "QK-clip" mechanism: each attention head's maximum logit, $S_{\max}^{(h)} = \frac{1}{\sqrt{d}} \max_{i,j} \left( Q_i^{(h)} {K_j^{(h)}}^{\top} \right)$, is monitored and clipped if it exceeds a threshold ($\tau = 100$).
  • The clipping update per head is:

$$W_q^{(h)} \leftarrow \gamma_h^{\alpha}\, W_q^{(h)}, \qquad W_k^{(h)} \leftarrow \gamma_h^{1-\alpha}\, W_k^{(h)}$$

where $\gamma_h = \min\left(1,\ \tau / S_{\max}^{(h)}\right)$ and $\alpha = 0.5$.

  • For MLA layers, QK-clip acts only on the head-specific parameters, preserving other weight components.

MuonClip prevents loss spikes entirely, enabling stable pre-training on 15.5 trillion tokens.

| Optimizer | Logit Stability | Token Efficiency | Unique Features |
|---|---|---|---|
| Muon | No | Yes | AdamW-style weight decay, RMS scaling |
| MuonClip | Yes | Yes | QK-clip; no loss spikes |
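
The sketch below illustrates the QK-clip step as a post-optimizer-update hook. The per-head weight layout of shape (n_heads, d_model, d_head) and the use of batch activations to estimate $S_{\max}^{(h)}$ are assumptions for illustration.

```python
import torch

def qk_clip_(W_q, W_k, X, tau=100.0, alpha=0.5):
    """Rescale per-head query/key weights in place when the maximum
    attention logit exceeds tau (applied after each optimizer step).

    W_q, W_k : (n_heads, d_model, d_head) per-head projection weights
    X        : (n_tokens, d_model) activations from the current batch
    """
    n_heads, _, d_head = W_q.shape
    with torch.no_grad():                    # weight surgery, not a gradient op
        for h in range(n_heads):
            Q = X @ W_q[h]                   # (n_tokens, d_head)
            K = X @ W_k[h]
            s_max = (Q @ K.T).max().item() / d_head ** 0.5   # S_max^(h)
            if s_max > tau:                  # only clip the offending heads
                gamma = tau / s_max          # gamma_h = min(1, tau / S_max^(h))
                W_q[h] *= gamma ** alpha
                W_k[h] *= gamma ** (1.0 - alpha)
```

With $\alpha = 0.5$, both projections are scaled by $\sqrt{\gamma_h}$, so a clipped head's maximum logit lands exactly at $\tau$ while unclipped heads are left untouched.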

3. Post-Training Regime and Agentic Data Synthesis

Kimi K2's post-training is designed for agentic capability, proceeding through several stages:

  • Supervised instruction fine-tuning: Utilizes a large, diversified dataset spanning general instruction, coding, mathematics, and tool usage.
  • Agentic data synthesis pipeline: Automatically creates multi-turn tool-use trajectories (a hypothetical schema sketch follows this list). The pipeline encompasses:

    1. Tool specification (drawn from real-world and synthetic tool pools).
    2. Rubric-driven agent and task generation, with varied system messaging.
    3. Multi-turn demonstrations within simulated (and occasionally real) environment contexts.
  • Joint RL with self-critique: The RL stage combines verifiable rewards (e.g., passing code tests, objectively graded math answers) with a self-critique rubric reward, in which the model scores its own answers on clarity, engagement, and factuality, subject to token-length constraints and temperature decay; a combined-reward sketch appears below.
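
The following is a hypothetical schema for the trajectories this pipeline produces; all class and field names are invented for illustration, as the paper does not publish its data format.

```python
from dataclasses import dataclass, field

@dataclass
class ToolSpec:
    name: str
    description: str
    parameters: dict                  # JSON-schema-style argument spec (step 1)

@dataclass
class AgentTask:
    system_message: str               # varied per agent (step 2)
    rubric: str                       # grading criteria for the trajectory
    tools: list                       # available ToolSpec entries

@dataclass
class Turn:
    role: str                         # "user", "assistant", or "tool"
    content: str
    tool_call: dict | None = None     # set when the assistant invokes a tool

@dataclass
class Trajectory:                     # one multi-turn demonstration (step 3)
    task: AgentTask
    turns: list = field(default_factory=list)
    passed_rubric: bool = False       # only rubric-passing demos are kept
```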

This strategy produces tens of thousands of complex, high-quality tool-use exemplars for post-training.
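
As a hypothetical illustration of the joint reward, the sketch below combines a verifiable signal (e.g., unit-test pass rate) with a self-critique rubric score under a token-length constraint. The weights, rubric fields, and 0-to-1 scoring interface are invented for illustration and are not the paper's exact recipe.

```python
from dataclasses import dataclass

@dataclass
class RubricScores:
    clarity: float       # each in [0, 1], produced by the model's self-critique
    engagement: float
    factuality: float

def verifiable_reward(tests_passed: int, tests_total: int) -> float:
    """Objective signal, e.g. the fraction of unit tests an answer passes."""
    return tests_passed / max(tests_total, 1)

def self_critique_reward(scores: RubricScores, answer_tokens: int,
                         token_budget: int = 2048) -> float:
    """Average rubric score, penalized when the answer exceeds the budget."""
    rubric = (scores.clarity + scores.engagement + scores.factuality) / 3.0
    length_penalty = min(1.0, token_budget / max(answer_tokens, 1))
    return rubric * length_penalty

def total_reward(tests_passed, tests_total, scores, answer_tokens,
                 w_verify=0.7):       # hypothetical weighting
    return (w_verify * verifiable_reward(tests_passed, tests_total)
            + (1.0 - w_verify) * self_critique_reward(scores, answer_tokens))

# Example: an answer passing 9/10 tests with decent rubric scores.
r = total_reward(9, 10, RubricScores(0.8, 0.6, 0.9), answer_tokens=900)
```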

4. Benchmark Performance

Kimi K2 demonstrates state-of-the-art non-thinking performance in both agentic and technical domains, as summarized in the following table.

| Benchmark | Kimi K2 Score | Category |
|---|---|---|
| τ²-Bench | 66.1 | Agentic |
| ACEBench (En) | 76.5 | Agentic |
| SWE-Bench Verified | 65.8 | Software Engineering |
| SWE-Bench Multilingual | 47.3 | Multilingual Coding |
| LiveCodeBench v6 | 53.7 | Coding |
| AIME 2025 | 49.5 | Mathematics |
| GPQA-Diamond | 75.1 | Advanced Reasoning |
| OJBench | 27.1 | Coding |

On τ²-Bench and ACEBench (English), Kimi K2 surpasses most open and closed-source models in non-thinking settings. Its STEM and code benchmark scores, including LiveCodeBench, AIME, GPQA-Diamond, and OJBench, indicate very strong performance across software engineering, math, and logical reasoning tasks.

5. Core Capabilities and Application Domains

Kimi K2's training and post-training pipeline endow it with advanced capabilities:

  • Coding and Software Engineering: Excels at competitive code generation, multi-turn code correction, and multilingual software tasks.
  • Mathematical and Logical Reasoning: Solves advanced competition mathematics and open-ended STEM questions.
  • Autonomous and Agentic Tool Use: Leverages joint RL and agentic data for planning, tool orchestration, and complex multi-step task execution, directly targeting software-agent and orchestration use cases.
  • General Instruction Following: Extensive supervised tuning confers robust general-purpose language modeling suitable for long-context, code-mixed, and technical domains.

These characteristics make Kimi K2 suitable for applications in code assistance, large-scale software refactoring, technical support agents, mathematical research aids, and autonomous system orchestration.

6. Release Strategy and Future Research Directions

The release of both base and post-trained checkpoints is intended to catalyze research into agentic intelligence, allowing the broader community to experiment with and extend:

  • More sophisticated agentic RL, especially through self-critique and hierarchical reward functions.
  • Improved synthetic data generation for tool-use and planning.
  • Scalability and efficiency tradeoffs by varying MoE sparsity and activation.
  • Real-world tests on large-scale, persistent, agent-based decision systems.

Potential applications include autonomous software development, multi-stage scientific problem solving, and collaborative human–AI systems requiring reliable tool interaction and execution monitoring.

7. Significance and Comparison with Existing Models

Kimi K2 achieves leading performance among open non-thinking models, often outperforming alternatives such as DeepSeek-V3 and Qwen3 on a majority of standardized agentic and coding benchmarks. Its core innovation lies in combining a token- and memory-efficient, ultra-sparse MoE transformer architecture; robust, logit-stabilized optimizer (MuonClip with QK-clip); and a post-training approach that fuses large-scale agentic synthesis with RL and self-assessment. The model architecture, optimizer, and open release position Kimi K2 as a central resource for researchers pursuing the frontier of agentic LLMs (Team et al., 28 Jul 2025).

References

  1. Kimi Team et al. (28 Jul 2025). "Kimi K2: Open Agentic Intelligence."