InfLLM v2: Efficient Sparse Attention for LLMs
- InfLLM v2 is a dynamic, trainable sparse attention mechanism that partitions the key-value cache into blocks and lets each query attend only to semantically relevant blocks, enabling efficient long-context processing.
- It employs a two-stage process—block retrieval and localized attention—to achieve up to 7x decoding speedup while maintaining high accuracy.
- The design integrates differentiable kernel means without extra parameter overhead, making it ideal for scalable LLM deployment on end devices.
InfLLM v2 refers to a trainable sparse attention mechanism developed and deployed within the MiniCPM4 framework for ultra-efficient long-context LLM inference on end devices (Team et al., 9 Jun 2025). It replaces conventional dense self-attention with dynamically selected block-based attention, enabling efficient processing across both the prefilling and autoregressive decoding phases. InfLLM v2 is distinct both architecturally and algorithmically, targeting scalable long-context handling with minimal computational overhead and high practical throughput.
1. Trainable Sparse Attention Architecture
InfLLM v2 divides the traditional key–value cache of a transformer decoder into fixed-size blocks, allowing each query token to attend only to a subset of semantically relevant blocks rather than the entire sequence. This subdivision is formally described as follows: given an input sequence of length $n$ with cached keys $K = [k_1, \dots, k_n]$ and values $V = [v_1, \dots, v_n]$, the KV cache is split into blocks of size $b$:

$$B_j = \{k_{(j-1)b+1}, \dots, k_{jb}\}, \qquad j = 1, \dots, \lceil n/b \rceil.$$
For finer semantic control, the key sequence is partitioned into overlapping semantic kernels of size $l$ and stride $s$:

$$\mathcal{K}_p = \{k_{(p-1)s+1}, \dots, k_{(p-1)s+l}\}, \qquad p = 1, 2, \dots$$
The semantic content of each kernel is summarized using mean pooling, and each query is scored against the resulting kernel means:

$$\hat{k}_p = \frac{1}{l} \sum_{k_t \in \mathcal{K}_p} k_t, \qquad s_{i,p} = \frac{q_i^\top \hat{k}_p}{\sqrt{d}},$$

where $q_i$ denotes the query vector for token $i$ and $d$ is the head dimension.
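The kernel-mean construction and query scoring can be sketched in a few lines of PyTorch. This is only an illustrative approximation of the mechanism described above, not the fused kernels shipped with MiniCPM4; the function and parameter names (`kernel_means`, `kernel_size`, `stride`) are assumptions for exposition.

```python
import torch

def kernel_means(keys: torch.Tensor, kernel_size: int, stride: int) -> torch.Tensor:
    """Mean-pool overlapping windows of the key cache: [n, d] -> [num_kernels, d]."""
    # unfold yields one window of `kernel_size` keys every `stride` tokens
    windows = keys.unfold(0, kernel_size, stride)  # [num_kernels, d, kernel_size]
    return windows.mean(dim=-1)

def kernel_relevance(queries: torch.Tensor, k_hat: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product score of every query against every kernel mean."""
    return (queries @ k_hat.T) / queries.shape[-1] ** 0.5  # [num_queries, num_kernels]
```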
2. Dynamic Block Selection via Relevance Scoring
Each query computes relevance scores with all semantic kernels and aggregates the scores of the kernels overlapping each block to obtain block-level scores:

$$r_{i,j} = \sum_{p \,:\, \mathcal{K}_p \cap B_j \neq \emptyset} s_{i,p}.$$
For each query, the top-$k$ blocks are chosen based on $r_{i,j}$, focusing computation on the blocks most relevant to the semantic needs of the token.
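A minimal sketch of the aggregation and selection step, continuing the tensors above. The rule for mapping kernels to blocks (by the block containing each kernel's starting token) and the summation are assumptions consistent with the description, not the exact MiniCPM4 implementation.

```python
import torch

def select_blocks(kernel_scores: torch.Tensor, stride: int, block_size: int,
                  n_tokens: int, top_k: int) -> torch.Tensor:
    """kernel_scores: [num_queries, num_kernels] -> [num_queries, top_k] block indices."""
    num_queries, num_kernels = kernel_scores.shape
    num_blocks = (n_tokens + block_size - 1) // block_size
    # assign each kernel to the block containing its starting token (an assumption)
    kernel_block = (torch.arange(num_kernels, device=kernel_scores.device) * stride) // block_size
    block_scores = torch.zeros(num_queries, num_blocks,
                               dtype=kernel_scores.dtype, device=kernel_scores.device)
    block_scores.index_add_(1, kernel_block, kernel_scores)  # r[i, j]: sum of kernel scores in block j
    return block_scores.topk(k=min(top_k, num_blocks), dim=-1).indices
```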
3. Two-Stage Sparse Attention Process
Stage 1: Block Retrieval
For each query $q_i$, relevance scores $s_{i,p}$ are computed against the kernel means $\hat{k}_p$, which provides an estimated semantic landscape for efficient block selection. Since there are far fewer kernel means than tokens, this drastically reduces the computational requirements compared to dense attention, whose cost grows quadratically with sequence length ($O(n^2)$).
Stage 2: Local Attention within Selected Blocks
Attention weights for $q_i$ are calculated only against tokens within the selected blocks, with $\mathcal{S}_i$ denoting the union of its top-$k$ blocks:

$$\mathrm{Attn}(q_i) = \mathrm{softmax}\!\left(\frac{q_i K_{\mathcal{S}_i}^\top}{\sqrt{d}}\right) V_{\mathcal{S}_i}.$$
This yields a sparse attention map, with each query token typically attending to only 19% of the available tokens (81% sparsity), as reported.
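For concreteness, the localized attention for a single decoding query can be sketched as follows, assuming the block indices returned by the retrieval stage; a practical implementation would fuse this gather-and-attend step into a single kernel rather than materializing the selected keys.

```python
import torch
import torch.nn.functional as F

def local_block_attention(q: torch.Tensor, keys: torch.Tensor, values: torch.Tensor,
                          block_ids: torch.Tensor, block_size: int) -> torch.Tensor:
    """q: [d]; keys/values: [n, d]; block_ids: [top_k] -> attended output [d]."""
    # expand each selected block index into the token positions it covers
    token_ids = (block_ids[:, None] * block_size +
                 torch.arange(block_size, device=keys.device)).flatten()
    token_ids = token_ids[token_ids < keys.shape[0]]   # guard a possibly partial last block
    k_sel, v_sel = keys[token_ids], values[token_ids]  # gather only tokens in selected blocks
    attn = F.softmax((k_sel @ q) / q.shape[-1] ** 0.5, dim=0)
    return attn @ v_sel
```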
4. Efficiency in Long-Context Prefilling and Decoding
By restricting attention to a limited number of blocks per query:
- Prefilling phase: the entire sequence is processed in parallel, but each token's attention computation is restricted to its selected blocks.
- Decoding phase: each new token computes attention over its top-$k$ selected blocks (each of $b$ tokens) rather than over all previous tokens.

These properties enable efficient scaling to very long contexts (e.g., 32K or 128K tokens), with per-token attention cost essentially independent of sequence length. In practice, MiniCPM4 with InfLLM v2 achieves up to a 7× decoding speedup on edge GPUs compared to Qwen3-8B, while maintaining perfect accuracy on needle-in-a-haystack tests even at 128K context length (Team et al., 9 Jun 2025).
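The scaling behavior can be illustrated with a back-of-the-envelope cost model; the hyperparameter values below (head dimension, kernel stride, block size, top-k) are placeholders rather than the MiniCPM4 configuration. The localized-attention term is constant in context length, while the retrieval term grows only with the much smaller number of kernel means.

```python
def dense_decode_cost(n: int, d: int) -> int:
    """Dot-product operations per decoded token with dense attention."""
    return n * d

def infllm_v2_decode_cost(n: int, d: int, stride: int, top_k: int, block_size: int) -> int:
    retrieval = (n // stride) * d          # score the kernel means (block retrieval)
    attention = top_k * block_size * d     # attend only within the selected blocks
    return retrieval + attention

for n in (32_768, 131_072):
    print(n, dense_decode_cost(n, d=128),
          infllm_v2_decode_cost(n, d=128, stride=32, top_k=16, block_size=64))
```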
5. Algorithmic Differentiability and Integration
Unlike fixed sparse attention schemes or approaches that introduce additional selection parameters (e.g., NSA), InfLLM v2 uses an indirectly trainable mechanism: the kernel means are differentiable with respect to the underlying KV projections, allowing end-to-end training alongside the base model. It employs:
- Dynamic query head grouping: query heads sharing the same top-$k$ blocks are processed collectively, reducing memory footprint.
- Efficient top-$k$ approximations: log-sum-exp strategies avoid full memory passes during selection.
No extra parameters are needed for this dynamic block selection, and no explicit parameterization of sparsity patterns is introduced beyond the base transformer.
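A minimal sketch of the grouped selection idea under assumed shapes: heads in the same group pool their block scores with log-sum-exp so the whole group shares a single top-k block set, which allows their selected KV blocks to be loaded once.

```python
import torch

def grouped_top_k_blocks(block_scores: torch.Tensor, top_k: int) -> torch.Tensor:
    """block_scores: [num_groups, heads_per_group, num_blocks] -> [num_groups, top_k] indices."""
    # log-sum-exp acts as a smooth maximum across the heads of each group, so a block
    # strongly favored by any head can still be selected for the whole group
    pooled = torch.logsumexp(block_scores, dim=1)    # [num_groups, num_blocks]
    return pooled.topk(k=top_k, dim=-1).indices      # one shared block set per group
```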
6. Comparison with Prior and Alternative Sparse Attention Methods
InfLLM v2 differs from prior sparse methods in several key aspects:

| Mechanism | Trainable? | Block Selection | Parameter Overhead | Efficiency |
|--------------------|------------|------------------------------------|--------------------|------------|
| Fixed block sparse | No | Static | None | Medium |
| NSA | Yes | Learned, extra selector parameters | High | Variable |
| InfLLM v2 | Yes | Semantically dynamic | None | High |
InfLLM v2 dispenses with static sparsity patterns and parameter-heavy block-selector networks, instead using query–key semantics for block decisions. This mechanism yields scalable, trainable, and efficient attention without compromising long-context understanding.
7. Benchmark Results and Impact
Within MiniCPM4, which combines InfLLM v2 with the CPM.cu system for quantization and speculative sampling, substantial empirical improvements are documented:
- Comparable long-sequence modeling ability to dense attention at 81% sparsity.
- Up to 7× decoding speedup on Jetson AGX Orin devices relative to contemporary models.
- 100% accuracy retention on semantic retrieval tasks over lengthy contexts.

This positions InfLLM v2 as a core enabler for ultra-efficient LLM deployment on edge devices and for diverse long-context AI applications.
8. Contextual Significance and Future Developments
InfLLM v2 redefines scalable attention computation through algorithmic differentiability, semantic block selection, and dynamic head grouping, providing a computationally lightweight yet highly accurate long-context LLM backbone. Its architecture supports integration with advanced quantization schemes, pre-training strategy search, and data-efficient post-training methods. The demonstrated efficiency and effectiveness indicate that trainable sparse attention will likely be central to future advances in computationally constrained and real-time LLM deployments.