Kimi K2: Open-Source MoE Transformer

Updated 4 August 2025
  • Kimi K2 is a state-of-the-art, open-source Mixture-of-Experts transformer featuring 1.04 trillion parameters with 32 billion activated per token for efficient specialization.
  • Its innovative MuonClip optimizer and QK-clip stabilization enable training on 15.5 trillion tokens without loss spikes, ensuring robust convergence.
  • The model excels in agentic intelligence, coding, mathematics, and reasoning, with open checkpoints supporting further research and real-world applications.

Kimi K2 is a large-scale, open-source Mixture-of-Experts (MoE) transformer model designed for advanced agentic intelligence, software engineering, and reasoning tasks. Developed with 1.04 trillion parameters (of which 32 billion are activated per token via a sparse expert architecture), Kimi K2 embodies state-of-the-art techniques in LLM construction, optimization stability, and post-training. The architecture features significant innovations including the MuonClip optimizer (incorporating QK-clip to control attention instability), extensive data-efficient pre-training on 15.5 trillion tokens, and a comprehensive multi-stage post-training pipeline. Kimi K2's benchmark results in agentic, coding, mathematics, and reasoning domains position it as one of the most capable open-source models, with released checkpoints available for further research exploration and deployment (Team et al., 28 Jul 2025).

1. Model Architecture and Parameterization

Kimi K2 is constructed as a 1.04 trillion-parameter Mixture-of-Experts (MoE) transformer incorporating Multi-head Latent Attention (MLA), reflecting design similarities to DeepSeek-V3. Inference and training utilize sparse expert selection, so that only 32 billion model parameters are “activated” (i.e., engaged in forward and backward computations) per input token. The selection of experts per token leverages the capacity of the MoE architecture for conditional computation, promoting specialization by routing input to sub-networks best suited for particular information or modalities.

The distinction between activated and total parameter count is central to Kimi K2’s efficiency. The overparameterized 1T MoE structure provides capacity for broad domain generalization and robust redundancy, while the 32B activation per token keeps the inference and training computational cost similar to that of a dense model with only 32B parameters.

| Parameter Count | Activation Regime | Role |
| --- | --- | --- |
| 1.04 trillion | All experts (global) | Overparameterized model for generalization |
| 32 billion | Sparse per token | Efficient per-token compute and specialization |

This architecture enables highly scalable training and inference while supporting increased specialization among sub-network experts.
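The conditional-computation idea behind this split can be illustrated with a minimal top-k routing layer. This is a generic sketch assuming an ordinary softmax-over-top-k gate; the dimensions, expert count, and k below are illustrative and do not reflect Kimi K2's actual router or configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoELayer(nn.Module):
    """Illustrative sparse MoE feed-forward layer: only k experts run per token."""

    def __init__(self, d_model: int = 1024, d_ff: int = 4096, n_experts: int = 64, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: [tokens, d_model]
        gate_logits = self.router(x)                       # [tokens, n_experts]
        weights, idx = gate_logits.topk(self.k, dim=-1)    # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)               # normalize gates over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e                   # tokens whose slot-th choice is expert e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out
```

Only the k selected experts participate in each token's forward and backward pass, which is why the activated parameter count (32B for Kimi K2) governs per-token compute while the total count (1.04T) governs overall capacity.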

2. Optimization: MuonClip and QK-clip Stabilization

Kimi K2 training employs MuonClip, an optimizer built on the Muon family for token-efficient pre-training. MuonClip introduces weight decay and consistent RMS matching, optimizing per-token updates for convergence and parameter smoothness. Central to MuonClip is the QK-clip technique, designed to tackle instability arising from the potential explosion of attention logits in high-capacity transformer models.

In the transformer block, attention is formulated:

$$Q^h = XW_q^h, \quad K^h = XW_k^h, \quad O^h = \text{softmax}\!\left(\frac{1}{\sqrt{d}}\, Q^h {K^h}^\top \right) V^h$$

The mechanism monitors, for each head $h$:

$$S_{\max}^{h} = \frac{1}{\sqrt{d}} \max_{i,j} \left(Q_i^h \cdot K_j^h\right)$$

If $S_{\max}^h$ exceeds a fixed threshold $\tau$, it rescales the query/key projection weights:

$$W_q^h \gets \gamma^{\alpha} W_q^h, \quad W_k^h \gets \gamma^{1-\alpha} W_k^h$$

with $\gamma = \min(1, \tau / S_{\max}^h)$ and typically $\alpha = 0.5$. This adaptive clipping bounds the magnitude of attention logits, directly preventing loss spikes and training divergence at extreme scale.

MuonClip, combined with QK-clip, underpins the stable convergence of Kimi K2, sustaining pre-training across 15.5 trillion tokens without any reported loss spikes.
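The clipping step can be sketched as a per-head weight rescaling applied during training. The layout of the projection matrices, the monitoring batch, and the threshold value below are assumptions for illustration, not the paper's exact implementation:

```python
import torch


@torch.no_grad()
def qk_clip_head(W_q: torch.Tensor, W_k: torch.Tensor, X: torch.Tensor,
                 tau: float = 100.0, alpha: float = 0.5) -> float:
    """Rescale one head's query/key projections if its max attention logit exceeds tau.

    W_q, W_k: [d_model, d] per-head projection weights (illustrative layout)
    X:        [seq_len, d_model] activations from a monitoring batch
    Returns the applied scaling factor gamma (1.0 means no clipping was needed).
    """
    d = W_q.shape[1]
    Q, K = X @ W_q, X @ W_k                       # [seq_len, d]
    s_max = (Q @ K.T).max().item() / d ** 0.5     # S_max^h = max_{i,j} (Q_i . K_j) / sqrt(d)
    if s_max <= tau:
        return 1.0
    gamma = tau / s_max                           # gamma = min(1, tau / S_max^h)
    W_q.mul_(gamma ** alpha)                      # W_q <- gamma^alpha * W_q
    W_k.mul_(gamma ** (1 - alpha))                # W_k <- gamma^(1-alpha) * W_k
    return gamma
```

Because the rescaling acts on the projection weights rather than the logits themselves, the bound persists into subsequent steps instead of being a one-off correction.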

3. Pre-Training Regimen and Data Synthesis

Pre-training Kimi K2 leverages 15.5 trillion high-quality tokens distributed across knowledge-intensive, mathematical, and synthetic corpora. Targeted synthetic data creation includes extensive use of rephrasing to maximize the diversity and utility of pre-training tokens, particularly focusing on mathematics and knowledge-rich text domains to increase downstream task robustness and reduce overfitting.

The use of MuonClip ensures that the prolonged and large-scale training does not encounter instability or catastrophic divergence, as evidenced by the reported absence of loss spikes across the complete training trajectory. This suggests that the optimization innovations are a significant enabling factor for long-context, high-capacity model pre-training.

4. Post-Training: Instruction Tuning and Reinforcement Learning

Following pre-training, Kimi K2 undergoes a multi-stage post-training protocol. The first stage consists of supervised fine-tuning with a comprehensive instruction-tuning corpus encompassing general, domain-specific, and tool-use instructions. A large synthetic dataset for agentic behaviors is incorporated—demonstrating capabilities in tool use and multi-step planning.

A subsequent joint reinforcement learning (RL) stage leverages a hybrid reward pipeline: (1) external verifiable rewards on tasks such as coding, mathematics, and STEM, and (2) a self-critique rubric reward that guides the model’s improvement via its own evaluations. This combination enhances both objective and heuristic learning, supporting generalization and higher-order reasoning required for agentic intelligence.
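The shape of such a hybrid reward can be sketched as a weighted blend of the two signals. The verifier, self-critique scorer, and weighting below are hypothetical placeholders, not the paper's actual pipeline:

```python
from typing import Callable


def hybrid_reward(
    prompt: str,
    response: str,
    verifier: Callable[[str, str], float],       # objective check: unit tests, exact-answer match, etc.
    self_critique: Callable[[str, str], float],  # model-graded rubric score in [0, 1]
    w_verifiable: float = 0.7,                   # illustrative weighting, not from the paper
) -> float:
    """Blend a verifiable reward with a rubric-based self-critique score for RL training."""
    r_objective = verifier(prompt, response)
    r_rubric = self_critique(prompt, response)
    return w_verifiable * r_objective + (1.0 - w_verifiable) * r_rubric
```

Tasks with a checkable ground truth lean on the verifier term, while open-ended instructions fall back on the rubric term, which is the division of labor the section above describes.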

5. Performance on Benchmarks

Kimi K2 attains leading results across a broad set of competitive evaluation benchmarks for agentic intelligence, coding, reasoning, and mathematics:

| Benchmark | Domain | Kimi K2 Score |
| --- | --- | --- |
| Tau2-Bench | Agentic/tool use | 66.1 |
| ACEBench (English) | Agentic/general | 76.5 |
| SWE-Bench Verified | Software engineering | 65.8 |
| SWE-Bench Multilingual | Software engineering (multilingual) | 47.3 |
| LiveCodeBench v6 | Coding | 53.7 |
| AIME 2025 | Mathematics | 49.5 |
| GPQA-Diamond | Reasoning | 75.1 |
| OJBench | Coding (online judge) | 27.1 |

These scores place Kimi K2 at the forefront among all publicly released "non-thinking" (non-chain-of-thought augmented) LLMs, and in several cases rival closed-source models, particularly in software engineering and agentic domains. The results reflect model capacity in tool use, planning, multi-turn reasoning, and competitive programming.

6. Application Domains and Agentic Capabilities

Kimi K2 is positioned for deployment in real-world scenarios that demand autonomous agentic behavior—defined as the model’s ability to perceive, plan, reason, and act adaptively in dynamic contexts. Example domains include automated software development, where coordinated tool use and interaction with code editors or compilers are integral, and in complex mathematical or problem-solving environments.

The superior performance on agentic, coding, and reasoning tasks is facilitated by both the sparse expert MoE design (for specialization) and the specific post-training synthesis of agentic data. This suggests broad applicability in domains where both adaptability and efficiency are critical.

7. Open Model Availability and Future Research

The Kimi K2 project adopts open research practices, making both base and post-trained checkpoints publicly accessible at https://huggingface.co/moonshotai/Kimi-K2-Instruct. This initiative enables the research community to further explore, fine-tune, and operationalize agentic LLM technologies across diverse domains.
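As a sketch of typical usage, the instruct checkpoint can be loaded through the Hugging Face transformers library; the exact dtype, sharding, and trust_remote_code requirements depend on the released configuration and available hardware, so treat the options below as assumptions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Kimi-K2-Instruct"  # checkpoint referenced above

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",       # take precision from the released config
    device_map="auto",        # shard across available accelerators; a 1T-parameter MoE needs many
    trust_remote_code=True,   # custom modeling code may be required (assumption)
)

messages = [{"role": "user", "content": "Plan the steps to refactor a failing unit test."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output_ids = model.generate(input_ids.to(model.device), max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```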

A plausible implication is an acceleration of research in autonomous agent design, advanced software engineering automation, and reasoning-intensive domains. The open model structure will likely facilitate benchmarking, ablation studies, and the development of new RL and instruction-tuning pipelines using the Kimi K2 foundation.


Kimi K2 epitomizes the confluence of overparameterized MoE architectures, stabilized large-scale optimization, data-efficient representation learning, and multi-stage post-training for agentic intelligence. Collectively, these elements underlie its demonstrated strengths in software engineering, mathematics, reasoning, and tool use, and provide an extensible resource for ongoing progress in open LLMs (Team et al., 28 Jul 2025).
