Log-Linear Attention (2506.04761v2)

Published 5 Jun 2025 in cs.LG

Abstract: The attention mechanism in Transformers is an important primitive for accurate and scalable sequence modeling. Its quadratic-compute and linear-memory complexity however remain significant bottlenecks. Linear attention and state-space models enable linear-time, constant-memory sequence modeling and can moreover be trained efficiently through matmul-rich parallelization across sequence length. However, at their core these models are still RNNs, and thus their use of a fixed-size hidden state to model the context is a fundamental limitation. This paper develops log-linear attention, an attention mechanism that balances linear attention's efficiency and the expressiveness of softmax attention. Log-linear attention replaces the fixed-size hidden state with a logarithmically growing set of hidden states. We show that with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants. As case studies, we instantiate log-linear variants of two recent architectures -- Mamba-2 and Gated DeltaNet -- and find they perform well compared to their linear-time variants.

Summary

  • The paper introduces a log-linear attention mechanism that extends the hidden state logarithmically using Fenwick tree-based hierarchical partitioning.
  • It achieves efficient O(log T) inference with a chunkwise parallel algorithm that decomposes intra- and inter-chunk computations.
  • Experimental results demonstrate improved long-context performance and training throughput compared to standard linear and Transformer models.

The paper "Log-Linear Attention" (2506.04761) introduces a novel attention mechanism designed to address the limitations of existing efficient attention variants, such as linear attention and state-space models (SSMs), particularly their struggle with long contexts due to a fixed-size hidden state. Log-linear attention aims to strike a balance between the quadratic complexity and high expressiveness of standard softmax attention and the linear complexity and limited expressiveness of fixed-state models.

The core idea is to replace the single fixed-size hidden state used in linear attention/SSMs with a set of hidden states whose size grows logarithmically with the sequence length ($O(\log T)$). This allows the model to maintain context information at multiple temporal scales.

The paper frames efficient attention variants under a unified equation: $Y = (Q K^\top \odot M) V$, where $M$ is a lower-triangular causal masking matrix. Different efficient attention models (Linear Attention, RetNet, Mamba-2, DeltaNet, Gated DeltaNet, Hyena) are shown to correspond to different structures imposed on the matrix $M$ and the base interaction $Q K^\top$ (or its generalized form). The structure of $M$ is crucial for enabling efficient training and inference algorithms.
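
To make the unified form concrete, here is a minimal quadratic-time reference implementation of $Y = (Q K^\top \odot M) V$ in NumPy. The function name and shapes are illustrative, and the default all-ones causal mask corresponds to plain (unnormalized) linear attention rather than any specific model from the paper.

```python
import numpy as np

def masked_parallel_form(Q, K, V, M=None):
    """Reference computation of Y = (Q K^T * M) V with a lower-triangular mask M."""
    T = Q.shape[0]
    if M is None:
        M = np.tril(np.ones((T, T)))   # all-ones causal mask: attend to s <= t
    return (Q @ K.T * M) @ V           # O(T^2) time; used here only for clarity

# Example usage with random single-head inputs of shape (T, d).
rng = np.random.default_rng(0)
T, d = 8, 4
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
Y = masked_parallel_form(Q, K, V)      # Y has shape (8, 4)
```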

Log-linear attention modifies the matrix $M$ to have a hierarchical structure, specifically based on a Fenwick tree partitioning of the sequence. This partitioning scheme divides the prefix $[0, t)$ for a given time step $t$ into disjoint segments, where recent segments are smaller (finer granularity) and older segments are larger (coarser granularity). For a query at position $t$, the model attends to information from up to $O(\log t)$ segments. Each segment maintains its own recurrent memory, and the contributions from these memories are weighted by data-dependent scalars $\lambda_t^{(\ell)}$ (one for each level $\ell$), allowing the model to adaptively focus on different temporal scales.
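
One simple way to realize such a partition is the standard Fenwick (binary indexed tree) decomposition, which splits a prefix into power-of-two segments by repeatedly stripping the lowest set bit. The sketch below is illustrative; the paper's exact indexing and level-numbering conventions may differ.

```python
def fenwick_segments(t: int):
    """Disjoint (start, end, level) segments covering the prefix [0, t)."""
    segments = []
    while t > 0:
        low = t & (-t)                              # lowest set bit = segment size
        segments.append((t - low, t, low.bit_length() - 1))
        t -= low
    return segments[::-1]                           # oldest (largest) segment first

print(fenwick_segments(13))
# [(0, 8, 3), (8, 12, 2), (12, 13, 0)] -- at most O(log t) segments,
# with older segments coarser and recent segments finer.
```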

The recurrent form for inference computes an output $y_t$ as a weighted sum of contributions from $O(\log t)$ hidden states, where each hidden state $S_t^{(\ell)}$ summarizes information from a specific segment defined by the Fenwick tree partitioning: $y_t = \sum_{\ell = 0}^{L-1} \lambda_t^{(\ell)} q_t^\top S_t^{(\ell)}$. The hidden states $S_t^{(\ell)}$ are updated based on the current token and previous states, following a recurrence inspired by Fenwick tree updates. This structure ensures that decoding can be performed with $O(\log T)$ time and $O(\log T)$ memory per step.
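
A hedged sketch of the per-step readout is shown below. It assumes the per-level states and the weights $\lambda_t^{(\ell)}$ are maintained elsewhere by the Fenwick-style update rule, and only illustrates the weighted sum $y_t = \sum_{\ell} \lambda_t^{(\ell)} q_t^\top S_t^{(\ell)}$; the variable names are illustrative, not the paper's API.

```python
import numpy as np

def log_linear_readout(q_t, states, lam_t):
    """q_t: (d_k,); states: list of (d_k, d_v) level states; lam_t: (num_levels,)."""
    y_t = np.zeros(states[0].shape[1])
    for lam, S in zip(lam_t, states):   # at most O(log t) active levels
        y_t += lam * (q_t @ S)          # one weighted contribution per temporal scale
    return y_t
```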

For training, the paper shows that the log-linear attention computation can be reformulated into a parallel form involving a structured matrix $M^\mathcal{H}$, where $M^\mathcal{H}_{ts} = \lambda_t^{(\ell(t,s))}$ if $s \le t$ and $0$ otherwise. Here, $\ell(t, s)$ is the level of the segment containing token $s$ for the query at time $t$ under the Fenwick partitioning. This matrix $M^\mathcal{H}$ is identified as a lower-triangular instance of a quasi-hierarchical ($\mathcal{H}$) matrix.
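
Under the segment convention sketched earlier (reusing `fenwick_segments`), the hierarchical mask can be materialized explicitly for small sequence lengths, which is useful as a correctness reference even though an efficient implementation never forms it densely. The treatment of the current token and the exact level assignment are assumptions of this sketch.

```python
import numpy as np

def build_hierarchical_mask(lam):
    """lam: (T, L) per-level weights -> lower-triangular M^H of shape (T, T)."""
    T, _ = lam.shape
    M = np.zeros((T, T))
    for t in range(T):
        # Partition [0, t] (query position included) into Fenwick segments;
        # every key position in a segment shares that segment's level weight.
        for start, end, level in fenwick_segments(t + 1):
            M[t, start:end] = lam[t, level]
    return M
```

Setting all $\lambda$ weights to one collapses $M^\mathcal{H}$ to the plain causal mask, and passing this matrix to the earlier `masked_parallel_form` gives a quadratic-time reference for log-linear attention.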

An efficient chunkwise parallel algorithm is developed to compute $Y = (Q K^\top \odot M^\mathcal{H}) V$ during training. The algorithm decomposes the computation based on the hierarchical structure of $M^\mathcal{H}$:

  1. Intra-chunk computations: Handles interactions within predefined chunks of length $C$. This involves the block-diagonal parts of $M^\mathcal{H}$ and can be computed efficiently using standard matrix multiplications within each chunk, resulting in $O(TC)$ complexity.
  2. Inter-chunk computations: Handles interactions between chunks. This is achieved by viewing the hierarchical structure as multiple levels of dependencies between chunks, where each level corresponds to a computation involving a sequentially semi-separable (SSS) matrix. The algorithm performs a chunkwise parallel scan, which requires $O(\log(T/C))$ invocations of a linear-time state-passing primitive, resulting in a total cost of $O(T \log T)$ for inter-chunk computations. The overall training algorithm thus achieves $O(T \log T)$ time complexity and $O(T)$ memory complexity; a simplified reference decomposition along these lines is sketched after this list.
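
The sketch below illustrates the intra-/inter-chunk decomposition on top of the dense reference above (reusing `fenwick_segments` and `build_hierarchical_mask`): each earlier chunk is summarized by a single state $K_j^\top V_j$ and weighted by the level of the Fenwick segment containing it. For readability it loops over earlier chunks sequentially, so it is not the paper's $O(T \log T)$ level-wise scan; the chunk length is assumed to be a power of two.

```python
import numpy as np

def segment_level(t, s):
    """Level of the Fenwick segment of [0, t] that contains position s <= t."""
    for start, end, level in fenwick_segments(t + 1):
        if start <= s < end:
            return level
    raise ValueError("requires 0 <= s <= t")

def chunkwise_log_linear(Q, K, V, lam, C=4):
    T, d_v = Q.shape[0], V.shape[1]
    Y = np.zeros((T, d_v))
    states = []                                   # one summary K_j^T V_j per chunk
    for j in range(0, T, C):
        q, k, v = Q[j:j+C], K[j:j+C], V[j:j+C]
        n = q.shape[0]
        # Intra-chunk: exact block-diagonal part of M^H within the chunk.
        M_blk = np.zeros((n, n))
        for a in range(n):
            for b in range(a + 1):
                M_blk[a, b] = lam[j + a, segment_level(j + a, j + b)]
        Y[j:j+n] += (q @ k.T * M_blk) @ v
        # Inter-chunk: earlier chunks align with Fenwick segments (C a power of
        # two), so each contributes through one state, weighted by its level.
        for idx, S in enumerate(states):
            for t in range(j, j + n):
                Y[t] += lam[t, segment_level(t, idx * C)] * (Q[t] @ S)
        states.append(k.T @ v)
    return Y

# Consistency check against the dense reference (small sizes only).
rng = np.random.default_rng(0)
T, d, L = 16, 4, 5
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
lam = rng.uniform(size=(T, L))
ref = (Q @ K.T * build_hierarchical_mask(lam)) @ V
assert np.allclose(chunkwise_log_linear(Q, K, V, lam, C=4), ref)
```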

The log-linear attention framework is presented as general and applicable to existing linear attention models. The paper demonstrates this by creating log-linear variants of Mamba-2 and Gated DeltaNet. These variants compose the original model's attention mask structure ($M^\mathcal{S}$ for Mamba-2/Gated DeltaNet) with the log-linear hierarchical mask $M^\mathcal{H}$, resulting in an effective mask $M = M^\mathcal{S} \odot M^\mathcal{H}$. The parallel forms for these log-linear variants are:

  • Log-Linear Mamba-2: $Y = (Q K^\top \odot M^\mathcal{S} \odot M^\mathcal{H}) V$
  • Log-Linear Gated DeltaNet: $Y = (\mathcal{T}(Q, K) \odot M^\mathcal{S} \odot M^\mathcal{H}) V$, where $\mathcal{T}(Q, K)$ represents the DeltaNet-specific base interaction.
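
As a concrete (and deliberately naive) illustration of this composition, the sketch below builds a Mamba-2-style cumulative-decay mask and combines it elementwise with the hierarchical mask from the earlier sketch to form the quadratic reference $Y = (Q K^\top \odot M^\mathcal{S} \odot M^\mathcal{H}) V$. The `decay_mask` helper and the random decay values `a` are assumptions of this sketch, standing in for the model's data-dependent gates; this is not the paper's kernel.

```python
import numpy as np

def decay_mask(a):
    """Cumulative-decay mask: M^S[t, s] = prod_{i=s+1..t} a_i for s <= t, else 0."""
    log_cum = np.cumsum(np.log(a))
    M = np.exp(log_cum[:, None] - log_cum[None, :])
    return np.tril(M)

# Quadratic reference for a Log-Linear Mamba-2-style layer (illustrative shapes).
rng = np.random.default_rng(0)
T, d, L = 16, 4, 5
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
a = rng.uniform(0.9, 1.0, size=T)                 # placeholder for data-dependent decays
lam = rng.uniform(size=(T, L))                    # placeholder per-level weights
M = decay_mask(a) * build_hierarchical_mask(lam)  # M = M^S * M^H, elementwise
Y = (Q @ K.T * M) @ V
```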

Practical implementation considerations include the development of custom Triton kernels for the chunkwise parallel scan algorithm to optimize performance on modern hardware. The implementation fuses computations across multiple levels and optimizes the backward pass. The paper reports that a custom kernel for Log-Linear Mamba-2 outperforms FlashAttention-2 for sequence lengths beyond 8K in terms of kernel runtime, and the overall model achieves higher training throughput than Transformers at 32K sequence length.

Experimental results are presented across synthetic and language modeling tasks:

  • MQAR (Synthetic): Log-Linear DeltaNet maintains high accuracy on multi-query associative recall as sequence length increases, whereas linear DeltaNet degrades.
  • Language modeling (Pretraining): On standard short-context benchmarks (WikiText perplexity, zero-shot commonsense), log-linear variants perform comparably or slightly better than their linear counterparts.
  • Per-Position Loss: Analyzing loss across token positions in long documents (Books3) shows that log-linear variants consistently reduce loss compared to linear variants, indicating improved long-range context utilization.
  • Needle-In-A-Haystack (NIAH): Log-linear variants generally show improved performance over linear counterparts on single- and multi-needle retrieval tasks at longer sequence lengths.
  • In-Context Retrieval (BASE, LongBench): Log-Linear Gated DeltaNet shows consistent gains over its linear counterpart on several retrieval and long-context understanding tasks, while Log-Linear Mamba-2 shows improvements on roughly half of the tasks. A performance gap compared to Transformers still remains on several benchmarks.

Limitations discussed include the fact that log-linear attention does not uniformly improve performance on all tasks compared to linear baselines, potentially due to suboptimal hyperparameter choices. The engineering complexity is higher due to custom kernel development. The Fenwick tree partitioning introduces an inductive bias prioritizing recent context, which might not be optimal for all applications.

In summary, Log-Linear Attention proposes a principled way to extend linear attention/SSMs with a logarithmically growing state by leveraging hierarchical matrix structures (specifically, quasi-$\mathcal{H}$ matrices derived from Fenwick tree partitioning). This enables $O(\log T)$ inference and $O(T \log T)$ training efficiency while enhancing the model's ability to capture long-range dependencies compared to fixed-state linear models, showing promising empirical results on various tasks requiring long-context understanding and retrieval.

