Log-Linear Attention (2506.04761v2)

Published 5 Jun 2025 in cs.LG

Abstract: The attention mechanism in Transformers is an important primitive for accurate and scalable sequence modeling. Its quadratic-compute and linear-memory complexity however remain significant bottlenecks. Linear attention and state-space models enable linear-time, constant-memory sequence modeling and can moreover be trained efficiently through matmul-rich parallelization across sequence length. However, at their core these models are still RNNs, and thus their use of a fixed-size hidden state to model the context is a fundamental limitation. This paper develops log-linear attention, an attention mechanism that balances linear attention's efficiency and the expressiveness of softmax attention. Log-linear attention replaces the fixed-size hidden state with a logarithmically growing set of hidden states. We show that with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants. As case studies, we instantiate log-linear variants of two recent architectures -- Mamba-2 and Gated DeltaNet -- and find they perform well compared to their linear-time variants.

Summary

  • The paper introduces a log-linear attention mechanism that extends the hidden state logarithmically using Fenwick tree-based hierarchical partitioning.
  • It achieves efficient O(log T) inference with a chunkwise parallel algorithm that decomposes intra- and inter-chunk computations.
  • Experimental results demonstrate improved long-context performance and training throughput compared to standard linear and Transformer models.

The paper "Log-Linear Attention" (2506.04761) introduces a novel attention mechanism designed to address the limitations of existing efficient attention variants, such as linear attention and state-space models (SSMs), particularly their struggle with long contexts due to a fixed-size hidden state. Log-linear attention aims to strike a balance between the quadratic complexity and high expressiveness of standard softmax attention and the linear complexity and limited expressiveness of fixed-state models.

The core idea is to replace the single fixed-size hidden state used in linear attention/SSMs with a set of hidden states whose size grows logarithmically with the sequence length ($O(\log T)$). This allows the model to maintain context information at multiple temporal scales.

The paper frames efficient attention variants under a unified equation: $Y = (Q K^\top \odot M) V$, where $M$ is a lower-triangular causal masking matrix. Different efficient attention models (Linear Attention, RetNet, Mamba-2, DeltaNet, Gated DeltaNet, Hyena) are shown to correspond to different structures imposed on the matrix $M$ and the base interaction $Q K^\top$ (or its generalized form). The structure of $M$ is crucial for enabling efficient training and inference algorithms.
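
To make the unified form concrete, here is a minimal quadratic-time reference implementation of $Y = (Q K^\top \odot M) V$ in NumPy. The function name and shapes are illustrative, and the default all-ones causal mask corresponds to plain (unnormalized) linear attention rather than any specific model from the paper.

```python
import numpy as np

def masked_parallel_form(Q, K, V, M=None):
    """Reference computation of Y = (Q K^T * M) V with a lower-triangular mask M."""
    T = Q.shape[0]
    if M is None:
        M = np.tril(np.ones((T, T)))   # all-ones causal mask: attend to s <= t
    return (Q @ K.T * M) @ V           # O(T^2) time; used here only for clarity

# Example usage with random single-head inputs of shape (T, d).
rng = np.random.default_rng(0)
T, d = 8, 4
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
Y = masked_parallel_form(Q, K, V)      # Y has shape (8, 4)
```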

Log-linear attention modifies the matrix $M$ to have a hierarchical structure, specifically based on a Fenwick tree partitioning of the sequence. This partitioning scheme divides the prefix $[0, t)$ for a given time step $t$ into disjoint segments, where recent segments are smaller (finer granularity) and older segments are larger (coarser granularity). For a query at position $t$, the model attends to information from up to $O(\log t)$ segments. Each segment maintains its own recurrent memory, and the contributions from these memories are weighted by data-dependent scalars $\lambda_t^{(\ell)}$ (one for each level $\ell$), allowing the model to adaptively focus on different temporal scales.
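
One simple way to realize such a partition is the standard Fenwick (binary indexed tree) decomposition, which splits a prefix into power-of-two segments by repeatedly stripping the lowest set bit. The sketch below is illustrative; the paper's exact indexing and level-numbering conventions may differ.

```python
def fenwick_segments(t: int):
    """Disjoint (start, end, level) segments covering the prefix [0, t)."""
    segments = []
    while t > 0:
        low = t & (-t)                              # lowest set bit = segment size
        segments.append((t - low, t, low.bit_length() - 1))
        t -= low
    return segments[::-1]                           # oldest (largest) segment first

print(fenwick_segments(13))
# [(0, 8, 3), (8, 12, 2), (12, 13, 0)] -- at most O(log t) segments,
# with older segments coarser and recent segments finer.
```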

The recurrent form for inference computes an output $y_t$ as a weighted sum of contributions from $O(\log t)$ hidden states, where each hidden state $S_t^{(\ell)}$ summarizes information from a specific segment defined by the Fenwick tree partitioning: $y_t = \sum_{\ell = 0}^{L-1} \lambda_t^{(\ell)} q_t^\top S_t^{(\ell)}$. The hidden states $S_t^{(\ell)}$ are updated based on the current token and previous states, following a recurrence inspired by Fenwick tree updates. This structure ensures that decoding can be performed with $O(\log T)$ time and $O(\log T)$ memory per step.
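
A hedged sketch of the per-step readout is shown below. It assumes the per-level states and the weights $\lambda_t^{(\ell)}$ are maintained elsewhere by the Fenwick-style update rule, and only illustrates the weighted sum $y_t = \sum_{\ell} \lambda_t^{(\ell)} q_t^\top S_t^{(\ell)}$; the variable names are illustrative, not the paper's API.

```python
import numpy as np

def log_linear_readout(q_t, states, lam_t):
    """q_t: (d_k,); states: list of (d_k, d_v) level states; lam_t: (num_levels,)."""
    y_t = np.zeros(states[0].shape[1])
    for lam, S in zip(lam_t, states):   # at most O(log t) active levels
        y_t += lam * (q_t @ S)          # one weighted contribution per temporal scale
    return y_t
```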

For training, the paper shows that the log-linear attention computation can be reformulated into a parallel form involving a structured matrix $M^\mathcal{H}$, where $M^\mathcal{H}_{ts} = \lambda_t^{(\ell(t,s))}$ if $s \le t$ and $0$ otherwise. Here, $\ell(t, s)$ is the level of the segment containing token $s$ for the query at time $t$ under the Fenwick partitioning. This matrix $M^\mathcal{H}$ is identified as a lower-triangular instance of a quasi-hierarchical ($\mathcal{H}$) matrix.
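
Under the segment convention sketched earlier (reusing `fenwick_segments`), the hierarchical mask can be materialized explicitly for small sequence lengths, which is useful as a correctness reference even though an efficient implementation never forms it densely. The treatment of the current token and the exact level assignment are assumptions of this sketch.

```python
import numpy as np

def build_hierarchical_mask(lam):
    """lam: (T, L) per-level weights -> lower-triangular M^H of shape (T, T)."""
    T, _ = lam.shape
    M = np.zeros((T, T))
    for t in range(T):
        # Partition [0, t] (query position included) into Fenwick segments;
        # every key position in a segment shares that segment's level weight.
        for start, end, level in fenwick_segments(t + 1):
            M[t, start:end] = lam[t, level]
    return M
```

Setting all $\lambda$ weights to one collapses $M^\mathcal{H}$ to the plain causal mask, and passing this matrix to the earlier `masked_parallel_form` gives a quadratic-time reference for log-linear attention.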

An efficient chunkwise parallel algorithm is developed to compute $Y = (Q K^\top \odot M^\mathcal{H}) V$ during training. The algorithm decomposes the computation based on the hierarchical structure of $M^\mathcal{H}$:

  1. Intra-chunk computations: Handles interactions within predefined chunks of length $C$. This involves the block-diagonal parts of $M^\mathcal{H}$ and can be computed efficiently using standard matrix multiplications within each chunk, resulting in $O(TC)$ complexity.
  2. Inter-chunk computations: Handles interactions between chunks. This is achieved by viewing the hierarchical structure as multiple levels of dependencies between chunks, where each level corresponds to a computation involving a sequentially semi-separable (SSS) matrix. The algorithm performs a chunkwise parallel scan, which requires $O(\log(T/C))$ invocations of a linear-time state-passing primitive, resulting in a total cost of $O(T \log T)$ for inter-chunk computations. The overall training algorithm thus achieves $O(T \log T)$ time complexity and $O(T)$ memory complexity; a simplified reference decomposition along these lines is sketched after this list.
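
The sketch below illustrates the intra-/inter-chunk decomposition on top of the dense reference above (reusing `fenwick_segments` and `build_hierarchical_mask`): each earlier chunk is summarized by a single state $K_j^\top V_j$ and weighted by the level of the Fenwick segment containing it. For readability it loops over earlier chunks sequentially, so it is not the paper's $O(T \log T)$ level-wise scan; the chunk length is assumed to be a power of two.

```python
import numpy as np

def segment_level(t, s):
    """Level of the Fenwick segment of [0, t] that contains position s <= t."""
    for start, end, level in fenwick_segments(t + 1):
        if start <= s < end:
            return level
    raise ValueError("requires 0 <= s <= t")

def chunkwise_log_linear(Q, K, V, lam, C=4):
    T, d_v = Q.shape[0], V.shape[1]
    Y = np.zeros((T, d_v))
    states = []                                   # one summary K_j^T V_j per chunk
    for j in range(0, T, C):
        q, k, v = Q[j:j+C], K[j:j+C], V[j:j+C]
        n = q.shape[0]
        # Intra-chunk: exact block-diagonal part of M^H within the chunk.
        M_blk = np.zeros((n, n))
        for a in range(n):
            for b in range(a + 1):
                M_blk[a, b] = lam[j + a, segment_level(j + a, j + b)]
        Y[j:j+n] += (q @ k.T * M_blk) @ v
        # Inter-chunk: earlier chunks align with Fenwick segments (C a power of
        # two), so each contributes through one state, weighted by its level.
        for idx, S in enumerate(states):
            for t in range(j, j + n):
                Y[t] += lam[t, segment_level(t, idx * C)] * (Q[t] @ S)
        states.append(k.T @ v)
    return Y

# Consistency check against the dense reference (small sizes only).
rng = np.random.default_rng(0)
T, d, L = 16, 4, 5
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
lam = rng.uniform(size=(T, L))
ref = (Q @ K.T * build_hierarchical_mask(lam)) @ V
assert np.allclose(chunkwise_log_linear(Q, K, V, lam, C=4), ref)
```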

The log-linear attention framework is presented as general and applicable to existing linear attention models. The paper demonstrates this by creating log-linear variants of Mamba-2 and Gated DeltaNet. These variants compose the original model's attention mask structure ($M^\mathcal{S}$ for Mamba-2/Gated DeltaNet) with the log-linear hierarchical mask $M^\mathcal{H}$, resulting in an effective mask $M = M^\mathcal{S} \odot M^\mathcal{H}$. The parallel forms for these log-linear variants are:

  • Log-Linear Mamba-2: $Y = (Q K^\top \odot M^\mathcal{S} \odot M^\mathcal{H}) V$
  • Log-Linear Gated DeltaNet: $Y = (\mathcal{T}(Q, K) \odot M^\mathcal{S} \odot M^\mathcal{H}) V$, where $\mathcal{T}(Q, K)$ represents the DeltaNet-specific base interaction.
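
As a concrete (and deliberately naive) illustration of this composition, the sketch below builds a Mamba-2-style cumulative-decay mask and combines it elementwise with the hierarchical mask from the earlier sketch to form the quadratic reference $Y = (Q K^\top \odot M^\mathcal{S} \odot M^\mathcal{H}) V$. The `decay_mask` helper and the random decay values `a` are assumptions of this sketch, standing in for the model's data-dependent gates; this is not the paper's kernel.

```python
import numpy as np

def decay_mask(a):
    """Cumulative-decay mask: M^S[t, s] = prod_{i=s+1..t} a_i for s <= t, else 0."""
    log_cum = np.cumsum(np.log(a))
    M = np.exp(log_cum[:, None] - log_cum[None, :])
    return np.tril(M)

# Quadratic reference for a Log-Linear Mamba-2-style layer (illustrative shapes).
rng = np.random.default_rng(0)
T, d, L = 16, 4, 5
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
a = rng.uniform(0.9, 1.0, size=T)                 # placeholder for data-dependent decays
lam = rng.uniform(size=(T, L))                    # placeholder per-level weights
M = decay_mask(a) * build_hierarchical_mask(lam)  # M = M^S * M^H, elementwise
Y = (Q @ K.T * M) @ V
```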

Practical implementation considerations include the development of custom Triton kernels for the chunkwise parallel scan algorithm to optimize performance on modern hardware. The implementation fuses computations across multiple levels and optimizes the backward pass. The paper reports that a custom kernel for Log-Linear Mamba-2 outperforms FlashAttention-2 for sequence lengths beyond 8K in terms of kernel runtime, and the overall model achieves higher training throughput than Transformers at 32K sequence length.

Experimental results are presented across synthetic and language modeling tasks:

  • MQAR (Synthetic): Log-Linear DeltaNet maintains high accuracy on multi-query associative recall as sequence length increases, whereas linear DeltaNet degrades.
  • Language modeling (Pretraining): On standard short-context benchmarks (WikiText perplexity, zero-shot commonsense), log-linear variants perform comparably or slightly better than their linear counterparts.
  • Per-Position Loss: Analyzing loss across token positions in long documents (Books3) shows that log-linear variants consistently reduce loss compared to linear variants, indicating improved long-range context utilization.
  • Needle-In-A-Haystack (NIAH): Log-linear variants generally show improved performance over linear counterparts on single- and multi-needle retrieval tasks at longer sequence lengths.
  • In-Context Retrieval (BASE, LongBench): Log-Linear Gated DeltaNet shows consistent gains over its linear counterpart on several retrieval and long-context understanding tasks, while Log-Linear Mamba-2 shows improvements on roughly half of the tasks. A performance gap compared to Transformers still remains on several benchmarks.

Limitations discussed include the fact that log-linear attention does not uniformly improve performance on all tasks compared to linear baselines, potentially due to suboptimal hyperparameter choices. The engineering complexity is higher due to custom kernel development. The Fenwick tree partitioning introduces an inductive bias prioritizing recent context, which might not be optimal for all applications.

In summary, Log-Linear Attention proposes a principled way to extend linear attention/SSMs with a logarithmically growing state by leveraging hierarchical matrix structures (specifically, quasi-$\mathcal{H}$ matrices derived from Fenwick tree partitioning). This enables $O(\log T)$ inference and $O(T \log T)$ training efficiency while enhancing the model's ability to capture long-range dependencies compared to fixed-state linear models, showing promising empirical results on various tasks requiring long-context understanding and retrieval.

