
Kimi Linear: Hybrid Linear Attention

Updated 31 October 2025
  • Kimi Linear is a hybrid linear attention architecture that combines Kimi Delta Attention (KDA) and Multi-Head Latent Attention (MLA) for scalable, efficient modeling.
  • It employs fine-grained channelwise gating and a chunkwise DPLR algorithm to reduce key-value cache usage by up to 75% and improve decoding throughput by up to 6× at long context lengths.
  • Extensive experiments show that Kimi Linear delivers robust performance across short-context, long-context, and reinforcement learning regimes with superior generalization.

Kimi Linear is a hybrid linear attention architecture that presents a paradigm shift in scalable LLM design by matching or surpassing the performance and expressiveness of full attention mechanisms, while delivering significantly higher efficiency in computational speed and memory usage. At its core lies the Kimi Delta Attention (KDA) module, which utilizes fine-grained channelwise gating and a highly optimized chunkwise algorithm based on Diagonal-Plus-Low-Rank (DPLR) transition matrices. The entire architecture is implemented as a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA), supporting broad drop-in adoption for both pretraining and downstream tasks. Extensive experiments demonstrate that Kimi Linear achieves superior results across short-context, long-context, and reinforcement learning regimes with substantial reductions in key-value cache and up to sixfold improvements in decoding throughput at long-range contexts (Team et al., 30 Oct 2025).

1. Architectural Principles and Layer Design

Kimi Linear's architecture interleaves expressive linear attention with selective full-attention layers, specifically employing a 3:1 KDA:MLA stacking ratio. Every block comprises three KDA layers followed by one MLA layer. The MLA layers operate without position encoding (NoPE), with positional modeling responsibilities assigned to the KDA stack.

  • Activated parameters: 3B per token, out of 48B total parameters.
  • Hybrid structure: Combines the computational benefits of linear attention with the rich relational modeling of full MLA, allowing direct comparison and transition across attention mechanisms.
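
The 3:1 interleaving described above can be summarized in a few lines of code. The sketch below is illustrative only: KDALayer and MLALayer are hypothetical stand-ins (a residual linear mixer and a standard PyTorch multi-head attention layer without positional encoding), not the paper's actual modules; only the stacking ratio is taken from the description above.

```python
import torch
from torch import nn

class KDALayer(nn.Module):
    """Placeholder for a Kimi Delta Attention layer (hypothetical stand-in)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.mix = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + self.mix(x)

class MLALayer(nn.Module):
    """Placeholder full-attention layer with no position encoding (NoPE)."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out

def build_hybrid_stack(d_model: int, n_blocks: int) -> nn.Sequential:
    """Each block interleaves three KDA layers with one MLA layer (3:1 ratio)."""
    layers = []
    for _ in range(n_blocks):
        layers += [KDALayer(d_model) for _ in range(3)]
        layers.append(MLALayer(d_model))
    return nn.Sequential(*layers)

stack = build_hybrid_stack(d_model=512, n_blocks=2)
y = stack(torch.randn(1, 16, 512))  # (batch, seq_len, d_model)
```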

KDA (Kimi Delta Attention) extends Gated DeltaNet (GDN) via channelwise gating (per-dimension control) as opposed to headwise gating in prior designs (Mamba2, RetNet). This refinement enables data-dependent, feature-level modulation of memory decay, dynamically supporting complex long-range retrieval.

2. Kimi Delta Attention (KDA): Gating and Memory Update Mechanism

The central innovation of Kimi Linear is its channelwise-gated fast-weight memory update:

$$\mathbf{S}_t = \left(\mathbf{I} - \beta_t \bm{k}_t \bm{k}_t^{\top}\right) \operatorname{Diag}(\bm{\alpha}_t)\, \mathbf{S}_{t-1} + \beta_t \bm{k}_t \bm{v}_t^{\top}$$

$$\bm{o}_t = \mathbf{S}_t^{\top} \bm{q}_t$$

  • $\bm{k}_t, \bm{q}_t, \bm{v}_t$ are the key, query, and value vectors at timestep $t$.
  • $\operatorname{Diag}(\bm{\alpha}_t)$ is the channelwise forget gate, computed as a sigmoid of a low-rank projection of the input: $\bm{\alpha}_t = \sigma(\mathbf{W}_\alpha^{\uparrow} \mathbf{W}_\alpha^{\downarrow} \bm{x}_t)$.
  • $\beta_t$ is a per-timestep, data-dependent learning rate: $\beta_t = \sigma(\mathbf{W}_\beta \bm{x}_t)$.

Post-attention, an additional output gating module applies a low-rank, data-dependent mask:

$$\bm{o}_t = \mathbf{W}_o\left(\sigma(\mathbf{W}_g^{\uparrow} \mathbf{W}_g^{\downarrow} \bm{x}_t) \odot \operatorname{RMSNorm}(\text{KDA output})\right)$$

This configuration allows KDA to perform online gradient descent on an implicit reconstruction loss, with each channel's decay parameter serving as a dynamic positional encoding, thereby preserving both expressiveness and compute efficiency.
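
To make the recurrence concrete, the following is a minimal sequential reference for the update and readout above, written for a single head, assuming equal key and value dimensions and omitting the post-attention output gate. It is a clarity-oriented sketch, not the chunkwise hardware-efficient form described in Section 3.

```python
import torch

def kda_recurrence(q, k, v, alpha, beta):
    """Sequential reference for the KDA fast-weight update (single head).

    q, k, v : (T, d) queries, keys, values
    alpha   : (T, d) channelwise forget gates in (0, 1)
    beta    : (T,)   data-dependent learning rates in (0, 1)
    Returns the outputs o_t = S_t^T q_t, stacked to shape (T, d).
    """
    T, d = q.shape
    S = torch.zeros(d, d, dtype=q.dtype)               # fast-weight memory state
    outputs = []
    for t in range(T):
        decayed = alpha[t].unsqueeze(1) * S             # Diag(alpha_t) S_{t-1}
        # (I - beta_t k_t k_t^T) Diag(alpha_t) S_{t-1} + beta_t k_t v_t^T
        S = decayed \
            - beta[t] * torch.outer(k[t], k[t] @ decayed) \
            + beta[t] * torch.outer(k[t], v[t])
        outputs.append(S.T @ q[t])                      # o_t = S_t^T q_t
    return torch.stack(outputs)

# Illustrative usage with random inputs
T, d = 8, 16
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
alpha = torch.sigmoid(torch.randn(T, d))   # channelwise forget gate
beta = torch.sigmoid(torch.randn(T))       # per-timestep learning rate
o = kda_recurrence(q, k, v, alpha, beta)   # (T, d)
```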

3. Chunkwise Parallel Algorithm and DPLR Transition Matrix

To maximize hardware efficiency, Kimi Linear deploys a chunkwise processing algorithm. Sequences are partitioned into chunks of size $C$, each processed in parallel using a Householder/WY representation for stable prefix-scan-like updates on the fast-weight memory state.

  • Series of rank-1 updates within each chunk are compressed, minimizing computational overhead and instability.
  • Specialized DPLR matrices optimize both the number of matrix multiplications and chunking steps:
    • By parameter tying (low-rank parameters set equal to keys), KDA reduces the operational complexity of general DPLR approaches.
    • Achieves nearly 2x operator-level speedup compared to unconstrained DPLR kernels at scale (see Fig. 7 in (Team et al., 30 Oct 2025)).

The chunkwise algorithm eliminates costly matrix inversions found in conventional SSMs, facilitating prefix-parallel fast-weight update compatible with tensor core accelerators and allowing efficient batch-based inference and training.
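
The sketch below illustrates only the structure of this chunked computation: the fast-weight state is the sole quantity carried across chunk boundaries, which is what permits a prefix-scan-like schedule. The intra-chunk pass is left as a naive loop mirroring the recurrence in Section 2, whereas the actual kernel replaces it with the WY-compressed, matmul-oriented DPLR form.

```python
import torch

def kda_chunk(q_c, k_c, v_c, alpha_c, beta_c, S):
    """Process one chunk sequentially and return (outputs, updated state).

    Stand-in for the parallel, WY-compressed intra-chunk pass of the real
    kernel; it computes the same recurrence step by step.
    """
    outs = []
    for t in range(q_c.shape[0]):
        decayed = alpha_c[t].unsqueeze(1) * S
        S = decayed \
            - beta_c[t] * torch.outer(k_c[t], k_c[t] @ decayed) \
            + beta_c[t] * torch.outer(k_c[t], v_c[t])
        outs.append(S.T @ q_c[t])
    return torch.stack(outs), S

def chunked_kda(q, k, v, alpha, beta, chunk_size: int = 64):
    """Split the sequence into chunks of size C; only the fast-weight state S
    crosses chunk boundaries."""
    T, d = q.shape
    S = torch.zeros(d, d, dtype=q.dtype)
    outputs = []
    for s in range(0, T, chunk_size):
        o_c, S = kda_chunk(q[s:s + chunk_size], k[s:s + chunk_size],
                           v[s:s + chunk_size], alpha[s:s + chunk_size],
                           beta[s:s + chunk_size], S)
        outputs.append(o_c)
    return torch.cat(outputs)
```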

4. Expressiveness and Task Generalization

A persistent limitation of prior linear attention mechanisms was their inability to match the relational modeling and in-context retrieval capabilities of full softmax attention, particularly in short-context regimes. Kimi Linear addresses these by:

  • Leveraging fine-grained, data-dependent channelwise gating, which plays a role analogous to RoPE frequencies but in a learnable, contextually dynamic fashion.
  • Achieving parity or better accuracy than MLA and GDN-H baselines in general language understanding, math, reasoning, and code tasks, as evidenced in Table 1 and Table 3 of (Team et al., 30 Oct 2025).
  • Demonstrating robust performance on long-context benchmarks (up to 1M tokens) in challenging tasks like MRCR, RULER, and RepoQA, consistently outperforming dense attention and other linear architectures.
  • Supporting reinforcement-learning fine-tuning: faster convergence and superior performance on math/reasoning RL tasks (see Fig. 6).

Ablation studies confirm the additive benefits of channelwise gating, output gates, and convolutional key-query projections.

5. Computational and Memory Efficiency

Kimi Linear achieves significant hardware resource savings:

  • KV cache usage: Up to 75% reduction compared to full attention, since only the MLA layers (one in every four) retain a growing per-token cache while KDA layers maintain a constant-size state; see the arithmetic sketch after this list.
  • Decoding throughput: Up to 6x speedup at 1M context window for generation, with 2–3x prefill acceleration (see Fig. 2).
  • Compute efficiency: Per-head FLOPs are linear in sequence length $T$, with a lean recurrence and minimal per-layer state ($O(1)$ per head).
  • Scalability: Compute-optimal scaling-law experiments show a 1.16× training-efficiency advantage over full MLA, facilitating larger batch sizes and more model replicas on the same hardware.
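
The 75% figure follows directly from the 3:1 layer ratio when only full-attention layers retain a per-token cache. A back-of-the-envelope sketch, using illustrative numbers (a 32-layer stack and 1 KiB of cache per token per MLA layer are assumptions, not figures from the paper):

```python
def kv_cache_bytes(n_layers: int, mla_every: int, seq_len: int,
                   bytes_per_token_per_layer: int) -> int:
    """KV cache if only every `mla_every`-th layer stores a growing per-token
    cache (KDA layers keep a constant-size state and are ignored here)."""
    n_mla = n_layers // mla_every
    return n_mla * seq_len * bytes_per_token_per_layer

full = kv_cache_bytes(32, mla_every=1, seq_len=1_000_000,
                      bytes_per_token_per_layer=1024)   # all-MLA baseline
hybrid = kv_cache_bytes(32, mla_every=4, seq_len=1_000_000,
                        bytes_per_token_per_layer=1024) # 3:1 KDA:MLA hybrid
print(f"cache reduction: {1 - hybrid / full:.0%}")      # -> cache reduction: 75%
```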

This enables practical deployment of long-context and batch-intensive LLM agents, making Kimi Linear suitable as a foundation for industrial LLM serving with superior utility per hardware dollar.

6. Implementation, Open Source Availability, and Practical Deployment

Kimi Linear's kernels and checkpoints are openly released:

  • CUDA kernel and vLLM integration: Efficient, production-ready code available at github.com/fla-org/flash-linear-attention/tree/main/fla/ops/kda, supporting fast training and serving.
  • Pretrained models: 48B total-parameter checkpoints (both pretrained and instruction-tuned) are accessible at huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct; a loading sketch follows this list.
  • Drop-in compatibility: No modifications needed for integration with standard LLM infrastructure (vLLM, serving frameworks, memory managers).
  • Long-context and batch efficiency: Immediate support for context windows up to 1M tokens, large batch sizes, and RL-style reasoning agents.
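
A minimal loading sketch for the checkpoint referenced above, assuming the repository follows the standard Hugging Face transformers remote-code pattern; exact generation settings and hardware requirements depend on the release and are not specified here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Kimi-Linear-48B-A3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,   # custom KDA/MLA modules ship with the repo
    torch_dtype="auto",
    device_map="auto",
)

inputs = tokenizer("Explain the delta rule in one sentence.", return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```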

7. Impact and Future Directions

Kimi Linear decisively moves linear attention architectures from the status of a computational compromise to that of a preferred solution, combining the efficiency of recurrent models with the expressiveness and generalization capacity typically reserved for full-attention transformers. The architecture opens avenues for:

  • Scaling long-context LLMs for high-throughput applications (RL agents, code generation, retrieval-augmented generation).
  • Resource allocation optimizations in distributed and cloud serving environments.
  • Further architectural research into chunkwise algorithms and fine-grained gating, as well as task-conditional hybridization with full attention.

A plausible implication is that future LLM engineering for large-scale deployment will increasingly adopt hybrid linear attention designs as the default, balancing quality, efficiency, and operational cost.


Property                 | Kimi Linear                                  | Dense Attention (MLA)
KV cache usage           | Up to 75% lower                              | Full per-token cache at every layer (high)
Max decoding throughput  | Up to 6× at 1M-token context                 | Baseline (1×)
Instruction benchmarks   | Outperforms MLA & GDN-H in most categories   | -
Long-context scaling     | Up to 1M tokens; cost linear in sequence     | Quadratic attention compute
Implementation support   | Open-source CUDA kernels, vLLM integration   | Standard transformer stack

Kimi Linear establishes a new standard for attention architectures in LLMs, offering a demonstrably efficient and expressive solution adaptable to the evolving computational demands of next-generation LLMs (Team et al., 30 Oct 2025).
