Kimi Linear: An Expressive, Efficient Attention Architecture (2510.26692v1)
Abstract: We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
Explain it Like I'm 14
1) What this paper is about
The paper introduces Kimi Linear, a new way for LLMs to “pay attention” to long texts more quickly and with less memory, without losing accuracy. It’s built around a core idea called Kimi Delta Attention (KDA). The big claim: under fair, same-training comparisons, Kimi Linear beats regular full attention models on many tasks (short texts, very long texts, and even reinforcement learning-like tasks), while using less memory and running much faster on long inputs.
2) What questions the researchers asked
In simple terms, the authors asked:
- Can we design an attention method that is as good as, or better than, standard attention on quality, but much faster and lighter, especially for very long inputs?
- Can we make linear attention (a faster style of attention) more expressive, so it doesn’t lose important details?
- Can we create a hybrid model that mixes fast linear layers with a few full attention layers to keep the best of both worlds?
3) How they did it (methods explained simply)
First, some quick background with everyday analogies:
- Attention in LLMs: Imagine you’re writing an essay and constantly flipping back through earlier pages to find relevant notes. Standard “full” attention compares each new word to all earlier words. That’s accurate but slow and memory-hungry, especially when the text is very long.
- Linear attention: This is a faster shortcut. Instead of comparing everything to everything, it keeps a clever “summary” of what came before, so each new step is quicker. The problem: older linear attention methods sometimes weren’t as accurate. (A toy comparison of the two is sketched in code right after these bullets.)
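To make the contrast concrete, here is a tiny, self-contained Python sketch (a toy illustration, not the paper's kernels; the dimensions, random data, and function names are assumptions) showing why full attention re-reads the whole history at every step while linear attention keeps a fixed-size running state:

```python
# Toy comparison: full causal attention vs. a running-state linear attention.
# This only illustrates the cost structure (O(n^2) vs. O(n)); it is not the
# paper's KDA algorithm and applies no normalization to the linear variant.
import numpy as np

d = 4                                                   # toy head dimension
np.random.seed(0)
q, k, v = (np.random.randn(8, d) for _ in range(3))     # 8 toy tokens

def full_attention(q, k, v):
    outs = []
    for t in range(len(q)):
        scores = q[t] @ k[:t + 1].T / np.sqrt(d)         # compare to ALL past tokens
        w = np.exp(scores - scores.max()); w /= w.sum()  # softmax over the history
        outs.append(w @ v[:t + 1])
    return np.stack(outs)

def linear_attention(q, k, v):
    S = np.zeros((d, d))                                 # the fixed-size "cheat sheet"
    outs = []
    for t in range(len(q)):
        S = S + np.outer(k[t], v[t])                     # constant-time state update
        outs.append(q[t] @ S)                            # constant-time readout
    return np.stack(outs)

print(full_attention(q, k, v).shape, linear_attention(q, k, v).shape)  # (8, 4) (8, 4)
```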
What Kimi Linear does:
- Kimi Delta Attention (KDA): Think of the model keeping a running “cheat sheet” (a memory matrix) it updates at each step. KDA adds fine-grained “forgetting knobs” that decide how much to keep or forget for each tiny piece of information (each feature/channel), not just one global knob. This helps the model remember what matters and let go of what doesn’t—like carefully curating your notes instead of piling everything up.
- The “delta rule” idea: Each time the model sees a new word, it slightly corrects its cheat sheet to better map “keys” (what to look for) to “values” (what information to retrieve). This keeps the memory useful and up to date. (A toy version of this update appears in the code sketch after this list.)
- Specialized math for speed (DPLR, chunking, WY, UT—made simple):
- Chunking: The model splits very long text into chunks (like chapters), processes each chunk efficiently, and connects them. This keeps the GPU busy and fast.
- Special matrix tricks (WY representation and UT transform): These are engineering methods to combine many small update steps into fewer big ones, cutting down on the number of operations so the computer runs faster without changing the result.
- A tailored version of “Diagonal-Plus-Low-Rank” (DPLR): Think of this as a compact way to update the cheat sheet. KDA uses a specialized version that removes extra steps and avoids tricky numerical issues, so it can use faster math on modern hardware.
- Hybrid architecture (3:1): For every 4 attention layers, 3 are KDA (fast) and 1 is full attention (global). The full attention layers act like occasional “global check-ins” to keep the big picture, while most layers are the speedy KDA ones. This cuts memory for the key–value cache by up to 75% while keeping quality high.
- Model size: They trained a large model with 48 billion total parameters, but only about 3 billion are “active” at a time (like only turning on the parts you need), which saves compute.
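The bullets above describe KDA's update in words. Below is a minimal Python sketch of one naive recurrent step of a per-channel gated delta-rule update in that spirit; the matrix orientation, where the decay is applied, the toy sizes, and the names (`kda_like_step`, `alpha`, `beta`) are assumptions for illustration, not the paper's chunkwise kernel:

```python
# One naive recurrent step of a per-channel gated delta-rule update (illustrative only).
import numpy as np

d_k, d_v = 4, 4
np.random.seed(1)
S = np.zeros((d_k, d_v))                 # fast-weight "cheat sheet": keys -> values

def kda_like_step(S, k, v, q, alpha, beta):
    """alpha: per-channel forget gates in (0, 1), shape (d_k,) -- the fine-grained knobs.
    beta: scalar step size for the delta-rule correction."""
    S = np.diag(alpha) @ S               # fine-grained, per-channel forgetting
    pred = S.T @ k                       # what the memory currently recalls for this key
    S = S + beta * np.outer(k, v - pred) # delta rule: nudge memory toward the true value
    out = S.T @ q                        # read out for the current query
    return S, out

k, v, q = np.random.randn(3, d_k)
alpha = 1.0 / (1.0 + np.exp(-np.random.randn(d_k)))   # sigmoid-style gates per channel
S, out = kda_like_step(S, k, v, q, alpha, beta=0.5)
print(out)
```

In the real system, many such steps are processed per chunk and packed with WY/UT-style transforms so the work becomes large matrix multiplications; the explicit per-step loop above is only for readability.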
4) What they found and why it matters
Main results:
- Better accuracy under the same training setup: With the same recipe and 1.4 trillion training tokens, Kimi Linear beat a strong full attention baseline (MLA) on short-context and long-context benchmarks.
- Much faster on very long inputs: At a 1 million token context (extremely long), Kimi Linear generates each new token up to about 6× faster than full attention (1.84 ms vs. 11.48 ms per token).
- Less memory: Up to 75% reduction in key–value cache size during generation. This lets you handle longer inputs, larger batches, or both.
- Strong long-context performance: It’s “Pareto-optimal,” meaning it gives a great trade-off between speed and quality—top scores while also being very fast.
- Works well for agent-like and RL-style workloads: These tasks often require handling long action sequences and tool-use histories at inference time. Kimi Linear’s speed and memory savings really help there.
Why this matters:
- Real-world apps (code assistants, research copilots, planning agents) often need huge contexts and quick responses. Faster, lighter attention means they can be more interactive, handle longer documents, and cost less to run.
5) What this could lead to (impact)
- Drop-in replacement: Kimi Linear is designed to plug into existing systems where full attention is used today, but with better speed and memory efficiency.
- Longer, richer interactions: Handling million-token contexts more practically opens the door to assistants that remember long conversations, multi-step plans, and big project histories.
- Cheaper and greener: Using less memory and compute lowers costs and energy use.
- Community progress: The authors open-sourced their KDA kernel, integrated it with vLLM (a popular inference engine), and released trained checkpoints. This allows researchers and developers to build on their work quickly.
In short, Kimi Linear shows that we don’t have to choose between speed and smarts for long-context LLMs. With its fine-grained memory control and efficient hybrid design, it offers both.
Knowledge Gaps
Below is a single, concise list of concrete knowledge gaps, limitations, and open questions left unresolved by the paper. Each item is framed to be actionable for future research.
- Sensitivity to hybrid ratio: No ablation on the 3:1 KDA-to-global-attention layer ratio; it is unclear how quality, memory, and speed trade off as this ratio varies or as global layers are placed non-uniformly.
- Retrieval limits of finite-state memory: The extent to which periodic global-attention layers compensate for KDA’s finite-state constraints is unquantified on adversarial long-range retrieval (e.g., needle-in-a-haystack at 1M tokens, multi-needle with distractors).
- Formal expressivity analysis: No theoretical characterization of KDA’s representational power relative to softmax attention or general DPLR, especially given the binding of both low-rank vectors to keys.
- Stability and convergence guarantees: Lack of formal proofs or bounds on stability, error accumulation, or convergence of the chunkwise WY/UT-transformed recurrences under fine-grained decay over very long sequences.
- Numerical precision under long horizons: No empirical error analysis of half-precision (and prospective FP8/INT8) accumulation with per-channel decay across 1M+ tokens, chunk boundaries, and layered compositions.
- Chunk-size effects: Missing study of the effect of chunk length C on quality, latency, and numerical stability (e.g., boundary artifacts, decay miscalibration across chunks).
- Decay gate behavior: No diagnostics on the learned per-channel decay α (e.g., distributions, saturation, time constants), nor regularization strategies to prevent pathological forgetting or memory lock-in.
- Alternative decay parameterizations: Unclear whether tying both DPLR vectors to keys reduces expressivity; no ablation against general DPLR or independent parameterizations at equal compute.
- Interaction with positional encodings: Claims about “relaxing RoPE” are not validated with comparisons to standard/learned positional schemes or hybrid combinations across short/long contexts.
- Training recipe transparency and reproducibility: Details on data mixture, tokenization, optimizer settings, regularization, schedule, and seeds are insufficient to fully replicate the 1.4T-token runs.
- Attribution of gains: No controlled study to disentangle architectural effects (KDA/hybridization) from data mixture or MoE design choices in the 48B-total/3B-activated model.
- Baseline coverage: Comparisons omit strong recent linear/hybrid baselines (e.g., RWKV-7, RetNet variants, Hyena, Jamba-like hybrids) and advanced full-attention systems with KV compression/page attention.
- Prefill vs decode phases: Speedups are reported for decoding; prefill throughput, memory footprint, and end-to-end latency (mixed prefill/decode workloads typical of agents) are not quantified.
- Small-batch and dynamic-shape regimes: It remains unknown whether speedups persist under small batch sizes, highly variable prompt lengths, tool-use interruptions, or streaming/online decoding.
- RL/test-time scaling evidence: The RL claims lack detailed tasks, protocols, and metrics (e.g., tool-use latency loops, on-policy rollouts), making it unclear where the architecture confers practical RL advantages.
- Long-context training vs extrapolation: The training context window, curriculum, and extrapolation behavior to 1M tokens are not specified; scaling curves as a function of trained context length are missing.
- Memory accounting: KV-cache reduction is claimed up to 75%, but a full, layer-by-layer memory budget (including KDA state S, buffers for WY/UT transforms) during prefill and decode is not provided.
- Quantization readiness: Compatibility with weight-only, KV, and activation quantization (FP8/INT8) is untested; impact on stability and accuracy with per-channel decay remains unknown.
- Integration with advanced decoding: Effects on speculative decoding, beam search, chunked/paged attention schedulers, and vLLM’s paging behavior are not characterized for both correctness and speed.
- Robustness and OOD generalization: No evaluation under adversarial prompts, distribution shifts, or noisy long inputs to assess whether fine-grained decay induces fragility or hallucination changes.
- Interpretability of memory usage: It is unclear which channels/heads store long vs short-term information; tools to visualize/measure what is retained/forgotten over time are absent.
- Multi-modal and encoder–decoder applicability: The method is only demonstrated for decoder-only LMs; extensions to cross-attention, encoder–decoder seq2seq, and multi-modal pipelines are unexplored.
- Scalability extremes: Generality of gains at both small scales (<=1B params) and very large scales (>=100B total) is unknown; cost–quality scaling laws specific to KDA/hybridization are not reported.
- Training stability and failures: No analysis of gradient norms, conditioning of the triangular solves, or failure modes (e.g., divergence with large chunks, extreme sequence lengths, or low precision).
- Energy and cost efficiency: Wall-clock training efficiency, GPU-hours, and energy consumption versus full attention and other linear/hybrid baselines are not reported.
- Licensing and deployment portability: Kernel portability and performance on non-NVIDIA hardware (AMD, TPU), multi-node setups, and mixed-precision toolchains are not evaluated.
Practical Applications
Immediate Applications
The following applications can be deployed now using the released KDA kernel, vLLM integration, and instruction-tuned checkpoints. They exploit Kimi Linear’s 3:1 KDA-to-global attention hybrid, up to 75% KV-cache reduction, and up to 6× decoding throughput at 1M tokens.
- LLM serving cost and latency reduction (software, cloud)
- What: Swap full attention layers for Kimi Linear in existing vLLM-serving stacks to decrease memory footprint and time-per-token (TPOT), enabling larger batches and faster interactive responses.
- Tools/products/workflows: vLLM with the open-source KDA kernel; “drop-in” model deployment using Kimi-Linear-48B-A3B-Instruct (a minimal serving sketch follows below).
- Assumptions/dependencies: GPU support for half precision/Tensor Cores; full-attention layers retained at the prescribed 3:1 ratio; quality parity depends on similar inference settings and prompt formatting.
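As a concrete illustration of the “drop-in” deployment path, here is a hedged sketch using vLLM's offline Python API. The model identifier string is an assumption for illustration (substitute the released checkpoint name or a local path), and the vLLM build must include the open-sourced KDA kernel integration:

```python
# Hedged serving sketch via vLLM's offline API; the model ID below is hypothetical.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Kimi-Linear-48B-A3B-Instruct",  # assumed checkpoint name or local path
    trust_remote_code=True,                # custom architectures typically require this
)
params = SamplingParams(temperature=0.7, max_tokens=512)

outputs = llm.generate(
    ["Summarize the attached 10-K filing in five bullet points."],
    params,
)
print(outputs[0].outputs[0].text)
```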
- High-context retrieval-augmented generation (RAG) for long documents (finance, legal, healthcare, policy)
- What: Process 10-K filings, contracts, EMRs, and legislation in single passes (hundreds of thousands to 1M tokens), reducing chunking and cross-chunk stitching.
- Tools/products/workflows: Enterprise RAG pipelines that feed longer contexts to LLMs; fewer retrieval hops; simplified orchestration for eDiscovery or regulatory analysis.
- Assumptions/dependencies: High-quality retrieval still needed for precision; token costs and throughput must be balanced; numerical stability at extreme lengths depends on tested kernels and careful prompt management.
- Agentic workflows and tool-calling loops with faster test-time scaling (software, operations, customer support)
- What: Speed up planning, tool-use, and multi-step decision sequences at inference (e.g., chain-of-thought, scratchpads, tool outputs), enabling more steps under the same latency budget.
- Tools/products/workflows: Orchestrators (LangChain/LlamaIndex-style) with longer intermediate traces; faster iterative planners; improved tree-of-thought (ToT) and self-reflection loops.
- Assumptions/dependencies: Alignment/safety guardrails; deterministic sampling configurations; observability for trace-length growth.
- Batch serving efficiency for high-throughput inference (SaaS platforms, model APIs)
- What: Increase batch sizes for long-context jobs due to lower TPOT and KV memory, improving throughput and reducing per-request cost.
- Tools/products/workflows: Autoscaling policies optimized for Kimi Linear’s lower memory footprint; queueing/server scheduling tuned to sustained long-sequence workloads.
- Assumptions/dependencies: Network I/O and storage pipelines don’t become bottlenecks; attention layer mix remains compatible with downstream features.
- On-device or edge LLMs with larger context windows (mobile, embedded, privacy-first apps)
- What: Run significantly longer contexts on limited-memory GPUs (e.g., 24–48GB) using reduced KV cache, enabling private long-form assistants and document analysis.
- Tools/products/workflows: Quantization-aware deployment; edge inference with longer transcripts or offline knowledge bases.
- Assumptions/dependencies: Quantization compatibility with KDA kernels; thermal/power constraints; sufficient disk bandwidth for large contexts.
- Long-transcript speech and translation (contact centers, media)
- What: Translate and summarize multi-hour meeting or call recordings in single-session contexts without aggressive chunking.
- Tools/products/workflows: Streaming ASR→LLM pipelines that accumulate long histories; meeting minute generators with better topic coherence.
- Assumptions/dependencies: ASR quality; careful prompt design to avoid attention dilution; token budget and latency constraints.
- Code assistants for repository-scale reasoning (software engineering)
- What: Navigate and reason over large codebases (monorepos) with fewer roundtrips, improving refactoring/summarization across many files.
- Tools/products/workflows: IDE plugins leveraging extended contexts; build scripts attaching logs, configs, and diffs in single passes.
- Assumptions/dependencies: Accurate code indexing remains critical; prompt engineering for repository navigation; memory overheads for very large repos.
- Academic benchmarking and long-sequence methods research (academia)
- What: Evaluate new long-context training and inference strategies using released kernels and checkpoints while controlling for token budgets.
- Tools/products/workflows: vLLM-based experiments; custom kernel profiling; reproducible comparisons on MMLU/RULER-style benchmarks.
- Assumptions/dependencies: Comparable data and training recipes if retraining; awareness of numerical precision considerations in fine-grained decay.
- Faster RL-style post-training loops (software agents, simulation)
- What: Speed up inference-bound RL fine-tuning and test-time rollouts (e.g., self-play, environment interaction) by reducing per-step compute cost.
- Tools/products/workflows: RLHF/RLAIF setups with longer trajectories; simulated environments with extended logs and plans.
- Assumptions/dependencies: Stable environment interfaces; reward model throughput; safety constraints in exploration.
- Compliance, audit, and eDiscovery at scale (policy, legal, enterprise risk)
- What: Scan long regulatory artifacts or archives in fewer passes to produce audit trails and redaction-aware summaries.
- Tools/products/workflows: Audit pipelines feeding entire policy documents; hashing/traceability frameworks for audit-ready outputs.
- Assumptions/dependencies: Governance for sensitive data; human-in-the-loop verification; reproducible inference settings.
Long-Term Applications
These applications are promising but need further research, scaling, or engineering—especially in multimodal integration, safety, and domain validation.
- Real-time embodied robotics and autonomous systems (robotics)
- What: Leverage faster, stateful linear attention for long-horizon planning and control policies at inference time.
- Tools/products/workflows: On-robot planners with large temporal contexts; hybrid perception→language→action stacks.
- Assumptions/dependencies: Multimodal extensions of KDA to vision/audio; hardened real-time kernels; safety certification.
- Longitudinal patient modeling and clinical decision support (healthcare)
- What: Model multi-year EMR histories and physician notes in single contexts to improve care recommendations.
- Tools/products/workflows: Clinical summarizers with longitudinal memory; pre-visit and post-visit assistants aggregating long narratives.
- Assumptions/dependencies: Rigorous clinical validation; HIPAA-compliant deployments; bias and safety audits; domain-tuned checkpoints.
- Persistent personal tutors with multi-year memory (education, consumer apps)
- What: Maintain large, persistent context across sessions to deliver deeply personalized tutoring and study planning.
- Tools/products/workflows: Memory-augmented tutoring agents; curriculum-aware reasoning traces spanning months.
- Assumptions/dependencies: Privacy-preserving memory stores; consent and retention policies; model alignment for pedagogy.
- Green AI and energy efficiency in data centers (energy, cloud, sustainability policy)
- What: Reduce energy per generated token by adopting linear/hybrid attention across fleets serving long-context workloads.
- Tools/products/workflows: Carbon-aware schedulers; energy telemetry benchmarking Kimi Linear vs. softmax attention in production.
- Assumptions/dependencies: Vendor support; standardized energy reporting; workload composition with many long sequences.
- Standardized linear-attention-friendly serving interfaces (software infrastructure)
- What: Define common caching/scheduling APIs across vLLM, TGI, Triton, and custom stacks to ease hybrid attention adoption.
- Tools/products/workflows: Cross-framework adapters; operator registries and kernel test suites.
- Assumptions/dependencies: Community/specification consensus; robust CI for numerical stability across precisions.
- Financial analysis and simulation with longer horizons (finance)
- What: Long-horizon forecasting and RL-style policy simulation using extended textual/market contexts.
- Tools/products/workflows: Scenario simulators with extended narratives; compliance-integrated agent loops.
- Assumptions/dependencies: Guardrails for risk and hallucination; regulatory acceptance; domain-specific tuning.
- Massive multi-agent simulation and planning (research, operations)
- What: Scale agent counts and trajectory lengths via lower TPOT and KV memory, enabling richer emergent behaviors.
- Tools/products/workflows: Multi-agent orchestration platforms with shared long contexts; collective reasoning experiments.
- Assumptions/dependencies: Coordination protocols; resource isolation; evaluation of robustness under large-scale interaction.
- Repository-wide code refactoring and synthesis (software)
- What: Generate coherent refactors or design docs across very large codebases with fewer context breaks.
- Tools/products/workflows: “Repo-scale refactor” assistants; synthesis of architecture decisions over long histories.
- Assumptions/dependencies: High-fidelity code understanding; safeguards against unsafe changes; developer acceptance.
- Secure, on-device assistants with very long windows (consumer devices)
- What: Private assistants that keep large personal knowledge bases locally and operate offline.
- Tools/products/workflows: Distilled/quantized Kimi Linear variants; local RAG over personal archives.
- Assumptions/dependencies: Distillation to smaller models without quality loss; hardware constraints; local indexing.
- Legal drafting, policy codification, and autoformalization (policy, legal)
- What: Turn extensive legislative records and case law into structured analyses and draft proposals in one reasoning pass.
- Tools/products/workflows: Policy drafting copilots; legislative codification assistants managing large contextual evidence.
- Assumptions/dependencies: Human oversight; provenance tracking; jurisdiction-specific validation; ethical and bias controls.
Notes on feasibility across applications:
- Quality results in the paper rely on matched training recipes (e.g., 1.4T tokens). Deployments using released checkpoints can still benefit immediately, but task parity may vary without domain fine-tuning.
- Performance depends on hardware (GPU type, precision) and careful kernel configuration; extreme context lengths require vigilance around numerical stability and prompt design.
- The 3:1 KDA-to-global attention ratio preserves global information flow; altering it may affect quality/efficiency trade-offs.
- Safety, alignment, and privacy considerations are critical for domain-sensitive sectors (healthcare, finance, policy).
Glossary
- Associative memory: A memory structure that stores key–value mappings for rapid retrieval within sequence models. "serves as an associative memory storing transient mappings from keys to values."
- Chunkwise algorithm: An approach that processes sequences in fixed-length blocks to improve parallelism and hardware efficiency. "Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices"
- Comba: A formulation used to efficiently pack and compute series of rank‑1 updates without extra matrix inversions. "We follow the formulation in Comba [hu2025comba] to reduce the need for an additional matrix inversion"
- Delta rule: A classical online update rule that adjusts fast weights to minimize reconstruction error of key–value mappings. "the classical delta rule"
- DeltaNet: A linear attention variant that performs online gradient descent on a reconstruction objective to stabilize updates. "DeltaNet [schlag-2021-deltanet] reinterprets this recurrence as online gradient descent on a reconstruction objective:"
- Diagonal-Plus-Low-Rank (DPLR): A matrix parameterization combining diagonal and low-rank components to model transitions efficiently. "Diagonal-Plus-Low-Rank (DPLR) transition matrices"
- Diagonalized gate: A fine-grained, per-channel decay mechanism implemented as a diagonal matrix to control memory forgetting. "by introducing a fine-grained diagonalized gate"
- Fast-weight perspective: A viewpoint treating the attention state as rapidly updated parameters storing temporary associations. "From the fast-weight perspective"
- Finite-state RNN memory: The limited memory capacity inherent to recurrent models that constrains long-sequence expressivity. "limited finite-state RNN memory."
- Gated DeltaNet (GDN): An extension of DeltaNet that adds a forget gate to implement controlled decay of fast weights (schematic recurrences for these variants are collected after the glossary). "Gated DeltaNet (GDN) [yang-2025-gdn] introduces a scalar forget gate"
- Gated Linear Attention (GLA): A linear attention variant with channel-wise gating for finer control over decay and memory. "akin to Gated Linear Attention (GLA) [yang-etal-2024-gla]"
- Householder transformation: A structured rank‑1 update used to represent reflections, here relating to efficient fast-weight updates. "equivalent to a generalized Householder transformation"
- Inter-block recurrent and intra-block parallel strategy: A decoding strategy that updates states across blocks recurrently while computing within blocks in parallel for throughput. "we adopt an inter-block recurrent and intra-block parallel strategy to maximize matrix multiplication throughput"
- Kimi Delta Attention (KDA): The paper’s proposed linear attention mechanism with fine-grained gating and efficient chunkwise computation. "We propose Kimi Delta Attention (KDA), a new gated linear attention variant"
- KV cache: The stored keys and values from prior tokens used to speed up autoregressive decoding in attention models. "reducing KV cache usage by up to 75%"
- Linear attention: An attention mechanism that replaces quadratic softmax attention with linear-time updates to a recurrent state. "Linear attention [katharopoulos-2020-transformers] maintains a matrix-valued recurrent state"
- Lower-triangular mask: A masking scheme that enforces causal structure by allowing attention only to past positions. "denote lower-triangular masks with and without diagonal elements"
- Multi-Head Latent Attention (MLA): A full-attention baseline architecture used for comparison in the paper. "based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA)."
- Multiplicative positional encoding: A position representation technique applied multiplicatively to states or queries/keys, enabling relative positioning. "GDN can be interpreted as a form of multiplicative positional encoding"
- Online gradient descent: A sequential optimization method updating parameters step-by-step as new data arrives. "reinterprets this recurrence as online gradient descent"
- Pareto-optimal: A performance–efficiency tradeoff point where improving one metric would worsen the other. "it is Pareto-optimal, achieving top performance (84.3) and acceleration."
- Rank-1 update: An efficient matrix update formed by the outer product of two vectors, used to modify fast weights. "The rank-1 update structure, equivalent to a generalized Householder transformation, supports hardware-efficient chunkwise parallelization"
- RoPE (Rotary Positional Embedding): A mechanism that encodes relative positions via rotations, often assumed to preserve orthogonality. "relaxing the orthogonality constraint of RoPE"
- StrictTril: A strict lower-triangular masking operator that excludes the diagonal elements.
- Tensor Cores: Specialized GPU units optimized for high-throughput matrix operations, crucial for efficient training/inference. "thereby fully utilizing the computational potential of Tensor Cores."
- Test-time scaling: Increasing model capabilities at inference by processing longer trajectories or complex interactions. "RL test-time scaling"
- Time per output token (TPOT): A decoding efficiency metric measuring average time spent generating each token. "Time per output token (TPOT) vs. decoding length."
- UT transform: A triangular-matrix transformation used to reduce non-matrix-multiplication FLOPs in the chunkwise algorithm. "We apply the UT transform [joffrain-2006-ut] to reduce non-matmul FLOPs"
- vLLM: A high-throughput LLM inference framework used for integrating and serving the proposed kernels. "open-source KDA kernels with vLLM integration"
- WY Representation: A compact representation that packs multiple rank‑1 updates into a single structured form. "WY Representation is typically employed to pack a series of rank-1 updates into a single compact representation"
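For convenience, the recurrences referenced in the entries above can be collected in one schematic LaTeX block. These follow one common convention from the linear-attention literature; the paper's exact formulation (orderings, normalizations, and the DPLR specialization behind KDA) may differ:

```latex
% Schematic recurrences (one common convention; not the paper's exact notation).
\begin{align*}
\text{Linear attention:} \quad
  & S_t = S_{t-1} + k_t v_t^{\top}, \qquad o_t = S_t^{\top} q_t \\
\text{DeltaNet (delta rule):} \quad
  & S_t = S_{t-1} + \beta_t\, k_t \bigl(v_t - S_{t-1}^{\top} k_t\bigr)^{\top} \\
\text{Gated DeltaNet (scalar gate $\alpha_t$):} \quad
  & S_t = \alpha_t S_{t-1} + \beta_t\, k_t \bigl(v_t - \alpha_t S_{t-1}^{\top} k_t\bigr)^{\top} \\
\text{KDA-style (per-channel gate $\boldsymbol{\alpha}_t$):} \quad
  & S_t = \mathrm{Diag}(\boldsymbol{\alpha}_t)\, S_{t-1}
        + \beta_t\, k_t \bigl(v_t - (\mathrm{Diag}(\boldsymbol{\alpha}_t)\, S_{t-1})^{\top} k_t\bigr)^{\top}
\end{align*}
```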