FlashMLA: Optimized Kernels for MLA

Updated 28 September 2025
  • FlashMLA is a family of kernel implementations that accelerates Multi-Head Latent Attention (MLA) by optimizing compute and memory bandwidth during transformer inference.
  • TyphoonMLA, a derivative, combines the naive and absorb formulations in a hybrid approach that addresses both compute and memory bottlenecks, achieving up to 3× throughput improvements.
  • By keeping the KV cache compressed, FlashMLA significantly reduces HBM reads, making it highly effective for LLM serving pipelines and scalable autoregressive decoding.

FlashMLA refers to the family of kernel implementations and software frameworks designed to accelerate Multi-Head Latent Attention (MLA) in transformer architectures, especially for LLMs such as DeepSeek-v3 and Kimi K2. MLA is a generalized attention mechanism that enables more scalable weight sharing and parallelization than standard Multi-Head Attention (MHA). FlashMLA’s primary objective is to address the bandwidth and compute bottlenecks in high-throughput, low-latency transformer inference, particularly during the autoregressive decode stage with long KV context and short query sequences. Successive improvements in FlashMLA and its derivatives focus on combining the benefits of the naive and absorb kernel formulations, optimizing hardware utilization, and reducing memory bandwidth requirements.

1. Multi-Head Latent Attention and Kernel Formulations

Multi-Head Latent Attention (MLA) generalizes attention by factorizing projection matrices, enabling both compact storage and flexible representation learning. Kernel implementations for MLA typically fall into two categories:

  • Naive Formulation: The key–value (KV) cache is stored in an uncompressed format. Rotary positional embeddings (RoPE) and RMS normalization are applied, followed by an up-projection from the compressed latents to the full per-head key and value dimensions before attention computation. This approach is compute-efficient, as the per-token computation resembles standard MHA. However, it is memory-bound due to frequent high-bandwidth memory (HBM) reads of the uncompressed cache.
  • Absorb Formulation: By exploiting the associativity of matrix multiplication, the up-projections are split and absorbed into the adjacent query and output projections. The KV cache is maintained in the compressed latent space, substantially reducing the bandwidth required for HBM reads. The absorb implementation is compute-bound and does not fully exploit opportunities for data reuse when the context is shared across multiple queries.

FlashMLA predominantly uses the absorb kernel in the decode stage to minimize HBM bandwidth, which becomes the primary bottleneck at scale.
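
To make the contrast concrete, the following minimal sketch (single head, NumPy, hypothetical dimensions, with RoPE and RMS normalization omitted) shows the two decode-time formulations computing the same output: the naive path up-projects the cached latents to full keys and values before attention, while the absorb path folds the up-projections into the query and output sides so attention runs directly over the compressed cache.

```python
# Minimal single-head sketch of the naive vs. absorb MLA decode paths.
# Dimensions are illustrative placeholders; RoPE and RMSNorm are omitted.
import numpy as np

rng = np.random.default_rng(0)
L, d_l, d_qk, d_v = 16, 8, 4, 4           # context length, latent, query/key, value dims

c_kv = rng.standard_normal((L, d_l))      # compressed latent KV cache (one row per token)
W_uk = rng.standard_normal((d_l, d_qk))   # up-projection: latent -> key space
W_uv = rng.standard_normal((d_l, d_v))    # up-projection: latent -> value space
q    = rng.standard_normal(d_qk)          # query for the current decode token

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Naive: up-project the whole cache to full keys/values, then standard attention.
# Memory-bound at scale: the uncompressed K/V are read from HBM every decode step.
K, V = c_kv @ W_uk, c_kv @ W_uv
out_naive = softmax(q @ K.T) @ V

# Absorb: fold W_uk into the query (and defer W_uv to the output side), so the
# attention scores and the weighted sum are computed directly on the latents.
q_lat = q @ W_uk.T                        # query mapped into the latent space
attn = softmax(q_lat @ c_kv.T)            # scores against the compressed cache
out_absorb = (attn @ c_kv) @ W_uv         # one up-projection per query, not per cached token

assert np.allclose(out_naive, out_absorb)  # both formulations agree
```

The two paths agree because matrix multiplication is associative; the trade-off is fewer HBM reads for more multiply-accumulates per token in the latent dimension.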

2. TyphoonMLA: Hybrid Naive-Absorb Approach

TyphoonMLA introduces a mixed kernel approach, selectively combining the naive and absorb formulations to exploit diverse hardware bottlenecks and data reuse patterns:

  • For shared-prefix KV cache regions (common in scenarios with long system prompts or speculative decoding), TyphoonMLA uses the naive formulation to maximize throughput where data reuse is prevalent and batch sizes are large. Because the naive formulation requires fewer floating-point operations per token, it outperforms the absorb kernel in these compute-bound regions once the context is sufficiently reused.
  • For the non-shared regions of KV cache (unique to each query and hence memory-bound), TyphoonMLA applies the absorb kernel to minimize bandwidth usage.
  • The hybrid multiply-accumulate (MAC) count is:

\text{MACs}_{\text{TyphoonMLA}} = B\,L_s\,H\,(D_{qk} + D_v) + B\,L_n\,H\,(2 D_l + D_r)

where B is the batch size, L_s and L_n are the shared and non-shared context lengths, H is the number of heads, and D_qk, D_v, D_l, and D_r are the dimensions of the query/key, value, latent, and RoPE spaces, respectively.

At small batch sizes or without substantial shared context, TyphoonMLA defaults to the absorb-only implementation to avoid overhead.
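
The following sketch expresses the MAC model above together with a fallback rule of this kind; the threshold values, function names, and example dimensions are hypothetical placeholders rather than values taken from TyphoonMLA.

```python
# Sketch of the hybrid MAC model and a hypothetical kernel-selection rule
# mirroring the fallback to absorb-only described above. Thresholds and the
# example dimensions are illustrative assumptions, not published values.
def macs_hybrid(B, L_s, L_n, H, D_qk, D_v, D_l, D_r):
    shared = B * L_s * H * (D_qk + D_v)      # naive kernel over the shared prefix
    unique = B * L_n * H * (2 * D_l + D_r)   # absorb kernel over the non-shared context
    return shared + unique

def macs_absorb_only(B, L_s, L_n, H, D_l, D_r):
    return B * (L_s + L_n) * H * (2 * D_l + D_r)

def choose_kernel(B, L_s, min_batch=16, min_shared=512):
    # Hypothetical thresholds: below them, splitting the cache is not worth the overhead.
    return "hybrid" if B >= min_batch and L_s >= min_shared else "absorb_only"

# Example with illustrative MLA dimensions (192/128 per-head query-key/value,
# 512 latent, 64 RoPE) and a 4K-token shared system prompt.
args = dict(B=64, L_s=4096, L_n=512, H=128, D_qk=192, D_v=128, D_l=512, D_r=64)
hyb = macs_hybrid(**args)
abs_only = macs_absorb_only(args["B"], args["L_s"], args["L_n"], args["H"],
                            args["D_l"], args["D_r"])
print(f"hybrid needs {hyb / abs_only:.0%} of the absorb-only MACs here")
print(choose_kernel(B=64, L_s=4096))         # -> "hybrid"
```

In this assumed regime the hybrid split needs roughly a third of the multiply-accumulates of running absorb everywhere, which is the compute saving the naive formulation contributes over the shared prefix.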

3. Bandwidth Optimization and Hardware Implications

The critical practical requirement in LLM inference is minimizing HBM bandwidth, especially during sequential token decoding. FlashMLA’s absorb kernels store the KV cache in compressed form, reducing memory reads by up to two orders of magnitude compared to the naive formulation. TyphoonMLA further refines this by maintaining the shared prefix region in uncompressed form (to enable efficient compute-bound reuse) while storing the non-shared remainder in compressed format (to minimize HBM traffic).

  • Experimental findings indicate that TyphoonMLA reduces HBM data reads for the non-shared context by approximately 70× relative to naive, with just ~3% additional memory footprint even in large-scale deployments.

The ability to move between compute-bound and memory-bound kernel executions depending on context reuse allows TyphoonMLA to maximize throughput while keeping memory consumption low.
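
A back-of-the-envelope estimate makes the bandwidth argument concrete; the dimensions and the FP16 element size below are assumptions chosen for illustration, and the ratio H(D_qk + D_v)/(D_l + D_r) is what drives the reduction in non-shared reads.

```python
# Rough per-decode-step HBM read estimate for the non-shared context region,
# contrasting an uncompressed (naive-format) cache with the compressed latent
# cache read by the absorb kernel. Dimensions and the 2-byte (FP16) element
# size are illustrative assumptions, not values from the paper.
def nonshared_reads_bytes(L_n, H, D_qk, D_v, D_l, D_r, bytes_per_elem=2):
    uncompressed = L_n * H * (D_qk + D_v) * bytes_per_elem  # per-head full keys and values
    compressed = L_n * (D_l + D_r) * bytes_per_elem         # one latent (plus RoPE key) shared by all heads
    return uncompressed, compressed

u, c = nonshared_reads_bytes(L_n=1024, H=128, D_qk=192, D_v=128, D_l=512, D_r=64)
print(f"uncompressed: {u / 2**20:.1f} MiB, compressed: {c / 2**20:.2f} MiB, "
      f"reduction: {u / c:.0f}x")   # the ratio H*(D_qk+D_v)/(D_l+D_r) drives the saving
```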

4. Performance Gains

Benchmarks and roofline analysis demonstrate:

  • On Ascend NPUs, TyphoonMLA delivers up to 3× throughput improvement over baseline MLA kernels.
  • On modern GPUs, TyphoonMLA achieves up to 3.24× higher throughput compared to FlashMLA and absorb-only implementations.
  • Performance scales upwards (up to 3.4× improvement) as shared context grows, with the naive formulation more effectively exploiting data reuse.
  • No accuracy degradation is observed; TyphoonMLA is functionally equivalent to absorb-only kernels at the model output level.

The method thus simultaneously addresses both compute and memory bottlenecks, depending on input profile and hardware characteristics.

5. Real-World Applications and Integration

TyphoonMLA is particularly relevant for:

  • LLM serving pipelines with tree-of-thought or graph-of-thought decoding, speculative decoding, or long system prompts.
  • Frameworks such as vLLM and SGLang, which manage large batch inference with substantial shared prefixes.
  • Scenarios that benefit from parallelization strategies (tensor parallelism, sequence parallelism) where context reuse is high.

The design is operationally compatible with current autoregressive inference systems and can be integrated with custom hardware schedulers for further throughput improvement.
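
As a rough illustration of the data layout such pipelines expose, a serving layer could identify the longest common prefix across a batch and route the two regions to different kernels; the helper below is a hypothetical sketch, not an API of vLLM, SGLang, or TyphoonMLA.

```python
# Hypothetical sketch: split each request's context into a shared-prefix region
# (dispatched to the naive kernel) and a per-request remainder (dispatched to
# the absorb kernel). The function name and structure are assumptions made for
# illustration only.
from typing import List, Tuple

def split_shared_prefix(token_ids: List[List[int]]) -> Tuple[int, List[int]]:
    """Return the longest common prefix length across the batch and each
    sequence's non-shared suffix length."""
    shared = 0
    min_len = min(len(s) for s in token_ids)
    while shared < min_len and all(s[shared] == token_ids[0][shared] for s in token_ids):
        shared += 1
    return shared, [len(s) - shared for s in token_ids]

# Example: a batch sharing one long system prompt (ids stand in for tokens).
prompt = list(range(1000))
batch = [prompt + [2001, 2002], prompt + [3001], prompt + [4001, 4002, 4003]]
L_s, L_n_per_seq = split_shared_prefix(batch)
print(L_s, L_n_per_seq)   # -> 1000 [2, 1, 3]
```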

6. Future Directions and Broader Impact

The success of TyphoonMLA underscores the need for hybrid and context-aware kernel strategies tailored to workload and hardware characteristics. Prospective research avenues include:

  • Dynamic kernel switching or probabilistic context partitioning for fine-grained control over compute/memory utilization.
  • Extension to custom or ASIC hardware, adapting kernel implementations for advanced attention mechanisms beyond LLMs.
  • Co-designing inference frameworks that delegate prefix and suffix regions of context to specialized compute clusters.
  • Reducing operational costs and energy consumption in data center deployments by optimizing memory-bound steps and leveraging compute-bound operations where possible.

The hybrid naive-absorb approach exemplified by TyphoonMLA may drive a new class of hardware-aware kernel optimizations essential for next-generation scalable LLM serving.
