Multi-Latent Attention Architecture

Updated 2 January 2026
  • Multi-Latent Attention (MLA) is a low-rank, projection-driven mechanism that compresses key/value caches to dramatically reduce memory overhead in Transformer models.
  • It leverages learned projections and kernel-level optimizations such as naive, absorb, and TyphoonMLA formulations to shift from memory-bound to compute-balanced inference.
  • MLA integrates with tensor- and data-parallel strategies, enabling scalable long-context and high-batch decoding in state-of-the-art systems like DeepSeek-v3 and Kimi K2.

Multi-Latent Attention (MLA) architecture is a family of low-rank, projection-driven attention mechanisms for Transformer-based models, in which key, value, and occasionally query tensors are projected into a shared latent space. This architectural shift enables radical reductions in per-token KV-cache memory, achieves substantial gains in compute and memory bandwidth efficiency, and is now integral to state-of-the-art LLMs such as DeepSeek-v3 and Kimi K2. Attending over the compressed latents is numerically equivalent to materializing the full keys and values and running standard attention, and the factored form exposes a new set of kernel-level and system-level optimizations, including naive, absorb, and mixed formulations, as well as tensor-parallel extensions. This has shifted LLM inference from a bandwidth-dominated paradigm toward compute-balanced regimes, enabling practical long-context, high-batch decoding on modern AI accelerators.

1. Foundations and Mathematical Formulation

MLA generalizes standard Multi-Head Attention (MHA) by compressing the key/value (KV) representations through learned low-rank projections. The canonical block, instantiated in DeepSeek-v3 and related models, operates as follows:

Given input $x_t \in \mathbb{R}^d$ (the token embedding at time $t$), with projection operators:

  • Down-/up-projections: $W_{Qa}, W_{Qb}, W_{KVa}, W_{KVb}$
  • Latent dimension (LoRA rank): $D_l \ll d$
  • RoPE-encoded dimension: $D_r$
  • Number of heads: $H$; head dimensions: $d_{qk}$, $d_v$

Naive (decompress-then-attend) MLA:

$$q_t = x_t W_{Qa} \in \mathbb{R}^{D_l}, \qquad kv_t = x_t W_{KVa} \in \mathbb{R}^{D_l}$$

$$Q_t = q_t W_{Qb} \in \mathbb{R}^{H \times d_{qk}}, \qquad \tilde{K} = [kv_1; \ldots; kv_{t-1}]\, W_{KVb} \in \mathbb{R}^{(t-1) \times H \times d_{qk}}$$

$$A = \mathrm{Softmax}\!\left(\frac{Q_t \tilde{K}^{\top}}{\sqrt{d_{qk}}}\right) \tilde{V}$$

with $\tilde{V}$ obtained from the same latent cache via the value portion of the up-projection $W_{KVb}$.

Absorb (commute up-projection into Q/O) MLA:

$$W_{KVb} = W_{KVb1} \cdot W_{KVb2}$$

$$Q'_t = (x_t W_{Qa} \odot \mathrm{RoPE})\, W_{Qb} \in \mathbb{R}^{H \times D_r}$$

$$\mathrm{Output:}\quad Z = \mathrm{Softmax}\!\left(\frac{Q'_t\, (\mathrm{PE} + \mathrm{noPE})^{\top}}{\sqrt{D_r}}\right) \cdot \mathrm{noPE}$$

Then expand $Z$ with $W_{KVb2}$ to obtain the $d_v$-dimensional output.

MLA stores only the latent tensor(s) for past tokens ($D_l \ll d$) rather than full-dimension KV pairs, shrinking per-token inference memory and bandwidth to an $O(r/d)$ fraction of the full-dimension baseline (Yüzügüler et al., 25 Sep 2025, Mehta et al., 11 Jun 2025, Ji et al., 20 Feb 2025).
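
The two formulations above can be sketched in a few lines of PyTorch. The snippet below is a minimal single-token-decode illustration, not the DeepSeek-v3 kernel: the tensor names, the single latent cache without the separate RoPE channel, and the shapes are simplifying assumptions. It only demonstrates that attending over up-projected keys/values (naive) and attending in the latent space with the up-projection absorbed into the query and output (absorb) produce the same result.

```python
# Minimal single-token-decode sketch of the naive vs. absorb MLA formulations.
# Names (W_q, W_kv_a, W_k_b, W_v_b) and the absence of the separate RoPE channel
# are simplifying assumptions for illustration; this is not the DeepSeek-v3 layout.
import torch

torch.manual_seed(0)
d, D_l, H, d_h, T = 256, 64, 8, 32, 128       # model dim, latent dim, heads, head dim, cached tokens

x_t    = torch.randn(d)                        # current token embedding
W_q    = torch.randn(d, H * d_h) / d**0.5      # query projection (no query compression here)
W_kv_a = torch.randn(d, D_l) / d**0.5          # down-projection to the latent KV cache
W_k_b  = torch.randn(D_l, H * d_h) / D_l**0.5  # key up-projection
W_v_b  = torch.randn(D_l, H * d_h) / D_l**0.5  # value up-projection

cache = torch.randn(T, D_l)                    # latent cache of past tokens (all that MLA stores)
q = (x_t @ W_q).view(H, d_h)                   # per-head queries

# --- Naive: decompress the cache to full K/V, then attend -------------------
K = (cache @ W_k_b).view(T, H, d_h)
V = (cache @ W_v_b).view(T, H, d_h)
scores = torch.einsum('hd,thd->ht', q, K) / d_h**0.5
probs  = scores.softmax(dim=-1)
out_naive = torch.einsum('ht,thd->hd', probs, V)           # (H, d_h)

# --- Absorb: fold the up-projections into the query and the output ----------
W_k_b_h = W_k_b.view(D_l, H, d_h)
q_abs   = torch.einsum('hd,lhd->hl', q, W_k_b_h)           # absorbed query, (H, D_l)
scores  = torch.einsum('hl,tl->ht', q_abs, cache) / d_h**0.5
probs   = scores.softmax(dim=-1)
z       = probs @ cache                                    # latent context, (H, D_l)
W_v_b_h = W_v_b.view(D_l, H, d_h)
out_absorb = torch.einsum('hl,lhd->hd', z, W_v_b_h)

print(torch.allclose(out_naive, out_absorb, atol=1e-4))    # True: the two modes agree
```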

2. Cache Compression and Arithmetic Intensity

MLA replaces the $O(Ld)$ per-sequence storage of standard MHA with $O(L D_l)$, reducing memory costs by factors of up to 10–15× in large-scale deployments. The core mechanism is a learned low-rank factorization of the key and value projections (and occasionally queries), with separate RoPE and NoPE channels to maintain positional fidelity and content expressivity (Ji et al., 20 Feb 2025, Meng et al., 11 Feb 2025, Li et al., 14 Mar 2025).

| Model | Standard MHA KV Cache | MLA KV Cache | Typical Reduction |
|---|---|---|---|
| Qwen-2.5-7B | $2Ld$ | $2Lr$ (e.g., $r = 512 \ll d$) | 85.7% reduction |
| DeepSeek-v3 | $2Ld$ | $L D_l$ ($D_l = 512 \ll d$) | $14\times$ |

MLA increases the arithmetic intensity (Op/B) of core attention from $\approx 1$ (MHA, memory bound) to $100$–$200$ (MLA, compute bound), saturating modern GPU compute and mitigating bandwidth bottlenecks during long-context decoding (Yun et al., 21 Jul 2025, Geens et al., 3 Jun 2025).
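
To make the table concrete, the short estimate below plugs in the Qwen-2.5-7B figures cited later in this article; the fp16 storage format and the layer count are assumptions for illustration, not measured values.

```python
# Back-of-the-envelope per-sequence KV-cache estimate using the Qwen-2.5-7B figures
# from the text (d = 3584, r = 512). fp16 storage and the layer count are assumptions,
# and the separate RoPE channel is ignored.
BYTES = 2                                    # fp16
L, d, r, n_layers = 32_000, 3584, 512, 28    # context length; n_layers assumed for illustration

mha_cache = 2 * L * d * n_layers * BYTES     # full K and V per token per layer
mla_cache = 2 * L * r * n_layers * BYTES     # compressed latents per token per layer

print(f"MHA: {mha_cache / 2**30:.1f} GiB")
print(f"MLA: {mla_cache / 2**30:.1f} GiB ({100 * (1 - r / d):.1f}% reduction)")   # ~85.7%
# In absorb mode each cached latent element is reused by every head (and, with shared
# prefixes, by every request in the batch), which is what lifts arithmetic intensity
# from ~1 Op/B toward the compute-bound regime described above.
```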

3. Kernel Implementations: Naive, Absorb, and TyphoonMLA Hybrid

The MLA kernel admits two numerically equivalent but computationally distinct modes:

  • Naive (decompress-then-attend): decompresses the full K/V per token via up-projection before attention. Preferred in prefill and for shared context due to high compute reuse.
  • Absorb (commute-up projection): delays KV expansion, applying up-projection to the post-softmax context vector. Minimizes HBM bandwidth by operating directly on the compact latent.
  • TyphoonMLA Hybrid: partitions the sequence into a shared prefix and a non-shared suffix, running the naive kernel on the highly reused shared prefix (efficient at large batch size $B$) and the absorb kernel on the unique suffix. This hybrid achieves significant throughput improvements (up to $3.24\times$ on GPUs, $3\times$ on NPUs) with only $3\%$ additional HBM usage compared to pure absorb (Yüzügüler et al., 25 Sep 2025).
| Kernel | Roofline Regime | HBM Traffic | Strength |
|---|---|---|---|
| Naive | Memory-bound | $O((L_s + B L_n) d)$ | Training, shared-prefix decode |
| Absorb | Compute-bound | $O((L_s + B L_n) r)$ | Bandwidth-constrained decode |
| TyphoonMLA | Mixed, dynamic cutover | $L_s$ naive, $L_n$ absorb | Harnesses both compute and bandwidth |

Here $L_s$ is the shared-prefix length, $L_n$ the per-request suffix length, and $B$ the batch size.

The TyphoonMLA algorithm dynamically selects the optimal execution regime based on batch size and shared prefix length, effectively exploiting hardware rooflines without sacrificing numerical equivalence or model expressivity (Yüzügüler et al., 25 Sep 2025).
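
The dispatch idea can be illustrated with a toy roofline cost model. The sketch below is not the published TyphoonMLA heuristic: the cost formulas (key-side FLOPs only, a single HBM read per operand), the helper name choose_kernel, and the peak-hardware numbers are all assumptions. It only shows how the preferred kernel flips from absorb to naive as the batch attending over a shared prefix grows.

```python
# Illustrative roofline-style dispatch between naive and absorb for a shared prefix
# (assumed cost formulas and constants; not the published TyphoonMLA heuristic).
def choose_kernel(B, L_shared, D_l, H, d_qk, peak_flops, peak_bw, bytes_per_el=2):
    """Pick 'naive' or 'absorb' for B requests attending over a shared prefix of L_shared tokens."""
    # Absorb: scores are taken directly against the D_l-wide latent cache for every request.
    absorb_flops = 2 * B * H * L_shared * D_l
    absorb_bytes = L_shared * D_l * bytes_per_el             # small latent cache read from HBM

    # Naive: decompress keys once (amortized over the batch), then score against d_qk-wide keys.
    naive_flops = 2 * L_shared * D_l * H * d_qk + 2 * B * H * L_shared * d_qk
    naive_bytes = L_shared * H * d_qk * bytes_per_el         # decompressed keys read from HBM

    t_absorb = max(absorb_flops / peak_flops, absorb_bytes / peak_bw)
    t_naive  = max(naive_flops / peak_flops, naive_bytes / peak_bw)
    return "naive" if t_naive < t_absorb else "absorb"

# Example: a DeepSeek-v3-like layer on an accelerator with assumed peak numbers.
for B in (1, 16, 256):
    mode = choose_kernel(B, L_shared=20_000, D_l=512, H=128, d_qk=128,
                         peak_flops=1e15, peak_bw=3e12)
    print(B, mode)   # small batches pick absorb; a large batch over the shared prefix flips to naive
```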

4. Integration with Parallelism and Systems

MLA is compatible with advanced tensor parallel (TP) strategies and system-level optimizations:

  • Tensor-Parallel Latent Attention (TPLA): splits both the latent representations and the head dimension across devices, preserving the low-rank cache per device and enabling efficient all-reduce communication. Orthogonal or PCA-based transforms are applied to minimize slicing distortion. TPLA delivers up to $1.9\times$ throughput improvement at $32$K context with minimal (<0.2 pp) accuracy impact (Tang et al., 21 Aug 2025); a conceptual sketch of the underlying sharding algebra follows this list.
  • Data-Parallel (DP) Preference: batched GEMMs on shared latents maximize arithmetic intensity, favoring DP over head-wise TP in maintaining high utilization.
  • Interactions with PagedAttention, FlashAttention, RadixAttention: MLA kernels (naive, absorb, TPLA) drop into these high-throughput batching and cache-extension strategies, retaining cache savings and compute intensity (Yüzügüler et al., 25 Sep 2025).
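
The TPLA bullet above rests on the following algebraic fact: when the latent dimension is sharded across devices, attention logits and outputs decompose into per-shard partial sums that an all-reduce can combine. The single-process check below verifies only this decomposition under assumed shapes; it is not the TPLA algorithm itself, which additionally shards heads and applies the orthogonal or PCA transforms described above.

```python
# Single-process check that latent-dimension sharding decomposes MLA attention into
# per-shard partial sums combinable by all-reduce. Only the underlying algebra is shown,
# not the TPLA algorithm; shapes and names are assumptions.
import torch

torch.manual_seed(0)
T, D_l, H, d_h, P = 64, 512, 8, 32, 4          # tokens, latent dim, heads, head dim, "devices"
cache  = torch.randn(T, D_l)                   # latent KV cache
q_abs  = torch.randn(H, D_l)                   # absorbed per-head queries (see Section 1)
W_o_up = torch.randn(D_l, H, d_h) / D_l**0.5   # value/output up-projection

# Reference: full-latent computation on one device.
logits = q_abs @ cache.T / d_h**0.5            # (H, T)
ctx    = logits.softmax(-1) @ cache            # latent context, (H, D_l)
out    = torch.einsum('hl,lhd->hd', ctx, W_o_up)

# Sharded: each "device" holds a D_l/P slice of the cache, query, and up-projection.
shards = torch.arange(D_l).chunk(P)
partial_logits = sum(q_abs[:, s] @ cache[:, s].T for s in shards) / d_h**0.5   # all-reduce #1
probs = partial_logits.softmax(-1)
out_sharded = sum(torch.einsum('hl,lhd->hd', probs @ cache[:, s], W_o_up[s])
                  for s in shards)                                             # all-reduce #2
print(torch.allclose(out, out_sharded, atol=1e-4))   # True
```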

5. Empirical Results and Hardware Implications

Multiple evaluations across GPU (NVIDIA B200, H20, A100) and NPU (Ascend) clusters demonstrate the performance benefits:

  • DeepSeek-v3 / Kimi K2, TyphoonMLA: up to $3.24\times$ throughput gain at prompt lengths $>20$k, with batch sizes up to $1024$ (Yüzügüler et al., 25 Sep 2025).
  • FlashMLA-ETAP: on H20 GPUs, achieves up to $2.78\times$ over previous kernels, and $5.2\times$ over FlashAttention-3 at $64$k context, with $15.2\times$ lower RMSE due to favorable numerical transpositions of BMMA tiles (Dege et al., 13 May 2025).
  • Energy and efficiency modeling: MLA enables nearly flat throughput scaling with cache length, and on compute-rich accelerators can achieve $2\times$ throughput and $30\%$ lower energy per token relative to MHA (Geens et al., 3 Jun 2025).
  • Cache compression at scale: in Qwen-2.5-7B with $D = 3584$, $r = 512$, MLA compresses the cache to $14.3\%$ of baseline (an $85.7\%$ reduction). With additional quantization, $>92\%$ total cache reduction is feasible (Meng et al., 11 Feb 2025, Ji et al., 20 Feb 2025, Cai et al., 20 Sep 2025).

6. Model Conversion and Fine-Tuning Pipelines

MLA can be adopted post-hoc via:

  • Joint SVD Decomposition: given pre-trained MHA or GQA projections, perform rank-truncated SVD to factor the full-rank weights into down-/up-projections, including joint K/V compression (Li et al., 14 Mar 2025, Ji et al., 20 Feb 2025); a minimal sketch follows the table below.
  • Partial-RoPE Integration: Select the most positionally informative sub-bands for explicit RoPE, compress the remaining (NoPE) dimensions jointly for content (Ji et al., 20 Feb 2025).
  • Minimal Fine-Tuning: models migrated via SVD require only $0.3$–$0.6\%$ of the original corpus for fine-tuning, regaining $>99\%$ of task performance and yielding KV-cache reductions of $80$–$95\%$. Compounded with quantization, reductions reach $97\%$ (Ji et al., 20 Feb 2025, Li et al., 14 Mar 2025).
| Method | Data for Recovery | Cache Reduction | Performance Delta |
|---|---|---|---|
| SVD+FT (1B) | $\sim$3–7B tokens | $6.4\times$ (to 15.6% of baseline) | <1 pt |
| Joint SVD + Partial-RoPE | $\sim$0.6% of corpus | 92% (Llama2-7B) | −0.5% LongBench |
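
As referenced in the Joint SVD bullet above, the core factorization step can be sketched on stand-in weights. The shapes, names, and joint K/V stacking below are illustrative assumptions rather than the exact recipe of the cited pipelines, which add partial-RoPE handling and the brief fine-tuning stage described above.

```python
# Rank-truncated joint SVD of key/value projections into an MLA-style down-/up-projection
# pair (generic sketch with assumed shapes and names, not the cited conversion recipe).
import torch

torch.manual_seed(0)
d_model, H, d_h, r = 1024, 16, 64, 256

W_k = torch.randn(d_model, H * d_h) / d_model**0.5   # random stand-ins for pre-trained weights
W_v = torch.randn(d_model, H * d_h) / d_model**0.5

W_kv = torch.cat([W_k, W_v], dim=1)                  # joint K/V factorization shares one latent
U, S, Vh = torch.linalg.svd(W_kv, full_matrices=False)

W_down = U[:, :r] * S[:r]        # d_model -> r     (produces the cached latent)
W_up   = Vh[:r]                  # r -> 2 * H * d_h (reconstructs K and V from the latent)

# The latent is what gets cached; K and V are recovered (approximately) on the fly.
x = torch.randn(8, d_model)                          # a few token embeddings
latent = x @ W_down                                  # (8, r) -- this replaces the KV cache
k_hat, v_hat = (latent @ W_up).split(H * d_h, dim=1)

rel_err = (k_hat - x @ W_k).norm() / (x @ W_k).norm()
print(f"rank-{r} reconstruction error: {rel_err:.3f}")
# Random stand-ins have a flat spectrum, so the error here is large; trained projections
# are far closer to low-rank, and the residual is what the brief fine-tuning recovers.
```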

7. Theoretical and Practical Considerations

  • Spectral Capacity: random matrix theory analyses reveal spectral spikes and rank collapse in naive MLA and PreRoPE variants, addressed by "Decoupled MLA", which splits the latent space into a content subspace and a RoPE subspace shared across heads (Jha et al., 12 Jul 2025). This avoids the expressivity loss common in naive low-rank compression; a minimal sketch of the decoupled score computation follows this list.
  • Architectural Tuning: choosing the latent dimension $r$ is critical. For small models, $r = d_k/2$ gives a $45\%$ memory reduction and $1.4\times$ speedup with minimal quality loss; reducing $r$ below $d_k/4$ yields rapid degradation (Mehta et al., 11 Jun 2025).
  • Extensions: Embedding-gated MLA (EG-MLA) injects per-token embedding gating into the latent up-projection, enabling quadratic feature expansion and further compressing the cache by up to $59.9\%$ over baseline MLA, while maintaining or improving accuracy (Cai et al., 20 Sep 2025).
  • Hybrid Attention in Other Domains: PointLAMA leverages MLA blocks (point-wise latent attention) in combination with state-space Mamba layers for efficient global-local information aggregation in point cloud modeling, highlighting the generality of the low-rank MLA framework (Lin et al., 23 Jul 2025).
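
As a concrete picture of the decoupled layout mentioned above, the sketch below (with assumed dimensions and names) scores each head against its own content (NoPE) keys plus a single RoPE key shared across heads, so the positional channel adds only $D_r$ values per cached token rather than $H \cdot D_r$.

```python
# Minimal sketch of a decoupled-MLA attention score: per-head content (NoPE) keys plus a
# small RoPE key shared by all heads. Dimensions and names are assumptions for illustration.
import torch

torch.manual_seed(0)
T, H, d_c, D_r = 64, 8, 32, 16      # cached tokens, heads, content head dim, RoPE dim

# Content keys are shown already expanded for readability; in practice only the latent
# (which reconstructs them) and the single shared RoPE key per token are stored.
k_content = torch.randn(T, H, d_c)  # per-head content (NoPE) keys, from the latent cache
k_rope    = torch.randn(T, D_r)     # ONE RoPE key per token, shared by every head

q_content = torch.randn(H, d_c)     # per-head content queries
q_rope    = torch.randn(H, D_r)     # per-head RoPE queries

scores = (torch.einsum('hd,thd->ht', q_content, k_content) +   # content term, per head
          q_rope @ k_rope.T) / (d_c + D_r)**0.5                # positional term, shared key
probs = scores.softmax(dim=-1)
print(probs.shape)                  # (H, T): one attention distribution per head
```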

8. Best Practices and System-Level Integration

  • Utilize TyphoonMLA or absorb-mode kernels for high-batch, long-prompt inference to maximize hardware utilization under memory bandwidth constraints.
  • Prefer TPLA or DP-centric scheduling for high throughput multi-accelerator deployments.
  • Tune the latent rank to the hardware's operational-intensity corner; use prefill-decode specialization and hybrid quantization for best practical efficiency.
  • For migration, apply joint-SVD and partial-RoPE to pre-trained checkpoints, and retrain only the low-rank factors for 1–3 epochs.
  • Combine MLA with existing inference accelerators (paged attention, quantization, multi-token prediction) for compound efficiency gains, without altering model weights or training workflow (Yüzügüler et al., 25 Sep 2025, Geens et al., 3 Jun 2025, Ji et al., 20 Feb 2025).

In summary, Multi-Latent Attention architectures transform the computational profile of Transformer inference by projecting high-dimensional context into a compressed latent space, supporting aggressive KV-cache compression, compute-friendly attention kernels, and seamless integration with contemporary system-level optimizations. Recent advances—including TyphoonMLA, tensor-parallel extensions, and embedding gating—demonstrate that MLA is central to enabling scalable, efficient, and expressive LLM deployments at both the kernel and end-to-end system level (Yüzügüler et al., 25 Sep 2025, Meng et al., 11 Feb 2025, Yun et al., 21 Jul 2025, Li et al., 14 Mar 2025, Tang et al., 21 Aug 2025, Geens et al., 3 Jun 2025, Cai et al., 20 Sep 2025, Lin et al., 23 Jul 2025).
