NAtS-L: Hybrid Token Attention Model

Updated 10 February 2026
  • The paper introduces a hybrid attention framework that dynamically routes tokens between linear and softmax backbones based on per-chunk context.
  • It employs a gating mechanism that learns during training to assign tokens to either efficient linear attention or expressive softmax attention, and routes them adaptively at inference.
  • Experimental results demonstrate that NAtS-L achieves near-linear computational efficiency while maintaining competitive performance on long-context tasks.

Neural Attention Search Linear (NAtS-L) is a token-level hybrid attention framework designed to address the high computational cost of softmax transformers in long-context sequence modeling. By combining linear attention and softmax attention within the same layer and performing dynamic routing of tokens between the two mechanisms, NAtS-L achieves a balance between efficiency and expressivity. The approach determines at a per-chunk granularity whether token groups are best served by the efficiency of a linear attention backbone or the representational power of quadratic softmax attention. This assignment is learned directly through backpropagation, without additional search or regularization steps, and adapts contextually at inference.

1. Architectural Design

NAtS-L processes input sequences $X \in \mathbb{R}^{L \times d}$ by first partitioning them into $T = L/C$ non-overlapping chunks $X_{[t]} \in \mathbb{R}^{C \times d}$. Each NAtS-L layer consists of three principal modules, all sharing query, key, and value projections:

  • Attention Score Module: Computes a score vector $\text{score}_t \in \mathbb{R}^2$ for each chunk $t$ via mean pooling followed by a learned linear projection.
  • Parallel Attention Backbones:
    • Linear Attention ("Gated DeltaNet", GDN): Maintains a chunkwise hidden state $S_{[t]} \in \mathbb{R}^{C \times d}$, updated through a gated recurrent formulation per chunk.
    • Softmax Attention (FlashAttention): Computes standard quadratic attention $O^{nla}_{[t]} = \operatorname{softmax}(QK^\top \odot M^{nla})\,V$ restricted to "nonlinear" (softmax) tokens as indicated by the gating mechanism.
  • Output Weight Module: Generates scalar weights $w^{la}_t, w^{nla}_t$ from $Q_{[t]}$ for per-token mixing of linear and nonlinear outputs.

The final token representation per chunk arises from a weighted, RMS-normalized sum:

$O_{[t]} = w^{la}_t \cdot \operatorname{Norm}(O^{la}_{[t]}) + w^{nla}_t \cdot \operatorname{Norm}(O^{nla}_{[t]}).$

A pointwise feed-forward layer (MLP/GELU) and residual connection complete the transformer-style block (Deng et al., 3 Feb 2026).
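To make the layer structure concrete, the following PyTorch sketch mirrors the chunk partitioning, shared QKV projections, hard chunk routing, the two branches, and the weighted RMS-normalized mixing. It is a minimal illustration under stated assumptions: the linear branch uses a plain cumulative $k^\top v$ state instead of the gated DeltaNet update, the softmax branch is masked dense attention instead of FlashAttention, and all module and variable names are hypothetical rather than the authors' implementation.

```python
# Minimal sketch of a NAtS-L-style layer (illustrative; the paper's model uses gated
# DeltaNet and FlashAttention kernels, which are simplified away here).
import torch
import torch.nn as nn


class NAtSLLayerSketch(nn.Module):
    def __init__(self, d: int, chunk_size: int):
        super().__init__()
        self.C = chunk_size
        self.qkv = nn.Linear(d, 3 * d, bias=False)     # shared Q/K/V projections
        self.score = nn.Linear(d, 2, bias=False)       # per-chunk [linear, softmax] logits
        self.out_weight = nn.Linear(d, 2, bias=False)  # per-token mixing weights from Q
        self.norm_la = nn.RMSNorm(d)
        self.norm_nla = nn.RMSNorm(d)
        self.proj = nn.Linear(d, d, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, L, d = x.shape
        T = L // self.C
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Chunk-level gate: mean-pool each chunk, score, take a hard decision.
        # (Plain argmax is forward-only; Section 3 describes how gradients reach the scores.)
        pooled = x.view(B, T, self.C, d).mean(dim=2)          # (B, T, d)
        route_nla = self.score(pooled).argmax(dim=-1).bool()  # True -> softmax chunk

        # Linear branch: crude cumulative k^T v state as a stand-in for gated DeltaNet.
        S = x.new_zeros(B, d, d)
        o_la = torch.empty_like(x)
        for t in range(T):
            sl = slice(t * self.C, (t + 1) * self.C)
            S = S + k[:, sl].transpose(1, 2) @ v[:, sl]
            o_la[:, sl] = q[:, sl] @ S

        # Softmax branch: causal attention restricted to softmax-routed tokens.
        token_nla = route_nla.repeat_interleave(self.C, dim=1)  # (B, L) token mask
        causal = torch.tril(torch.ones(L, L, device=x.device, dtype=torch.bool))
        attn_mask = causal & token_nla[:, None, :] & token_nla[:, :, None]
        scores = (q @ k.transpose(1, 2)) / d ** 0.5
        scores = scores.masked_fill(~attn_mask, float("-inf"))
        p = torch.softmax(scores, dim=-1).nan_to_num(0.0)  # rows with no valid keys -> 0
        o_nla = p @ v

        # Per-token mixing of the RMS-normalized branch outputs.
        w = torch.sigmoid(self.out_weight(q))  # (B, L, 2)
        o = w[..., :1] * self.norm_la(o_la) + w[..., 1:] * self.norm_nla(o_nla)
        return self.proj(o)
```

As a quick shape check, `NAtSLLayerSketch(d=256, chunk_size=64)(torch.randn(2, 512, 256))` returns a `(2, 512, 256)` tensor.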

2. Gating Mechanism

The gating mechanism determines whether each chunk is processed via linear or softmax attention. For each chunk $t$,

$\text{score}_t = W^{score} \cdot \operatorname{Mean}(X_{[t]})$

yielding logits over $\{\text{linear}, \text{softmax}\}$. If $\text{score}_t = [s^{la}_t, s^{nla}_t]$, chunk $t$ is assigned to the softmax pathway when $s^{nla}_t \ge s^{la}_t$. This hard gating is implemented via two binary masks $M^{la}, M^{nla} \in \{0,1\}^{L \times L}$: for each chunk, the mask fully selects either the linear or nonlinear operation. This setup ensures exclusive routing of token groups per step, facilitating tractable computation.
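As one concrete reading of this mask construction (the exact semantics of $M^{la}$ are only partially specified above, so this is a hedged sketch with hypothetical names, not the paper's code):

```python
# Hedged sketch: exclusive per-chunk routing masks built from chunk scores.
import torch


def build_chunk_masks(scores: torch.Tensor, chunk_size: int):
    """scores: (T, 2) per-chunk logits over {linear, softmax}."""
    T = scores.shape[0]
    L = T * chunk_size
    is_nla = scores[:, 1] >= scores[:, 0]             # softmax pathway wins ties (s_nla >= s_la)
    token_nla = is_nla.repeat_interleave(chunk_size)  # (L,) token-level routing
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
    # Softmax mask: causal attention among softmax-routed tokens only.
    m_nla = causal & token_nla[:, None] & token_nla[None, :]
    # One plausible reading of the linear mask: the linear branch covers the remaining
    # causal positions, i.e. every query belonging to a linear-routed chunk.
    m_la = causal & ~token_nla[:, None]
    return m_la, m_nla


m_la, m_nla = build_chunk_masks(torch.randn(4, 2), chunk_size=8)
print(m_la.shape, m_nla.shape)  # torch.Size([32, 32]) torch.Size([32, 32])
```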

3. Differentiable Search and Optimization

NAtS-L’s routing assignment is incorporated directly into the forward and backward passes of training. The backward update for each score vector propagates gradients from the attention masks:

  • For softmax (nonlinear) attention:

$\partial\text{score}^{nla}_t = \sum_{i,j \in \text{chunk } t} \partial M^{nla}_{i,j}$

where $\partial M^{nla}_{i,j} = P_{i,j}\,(\partial P_{i,j} - o_i^\top o_i)$, and $P_{i,j}$ are the intermediate softmax probabilities.

  • For linear attention:

$\partial\text{score}^{la}_t = \sum_{i,j \in \text{chunk } t} \langle \partial S_{[t]}, S_{[t]} \rangle$

Inactive branches receive zero gradient, so the model learns chunk assignments under the language modeling or task loss without extra regularization or a discrete search phase. This allows the gating function (Score module) to adaptively control per-token computation based solely on the optimization objective.
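One way to realize this behavior in an autograd framework is a custom function that emits the hard mask in the forward pass and, in the backward pass, sums the mask's incoming gradient over each chunk into that chunk's softmax logit, as the equation above prescribes. The sketch below is a straight-through-style illustration under assumptions: the linear-branch gradient via $\langle \partial S_{[t]}, S_{[t]} \rangle$ is omitted, and all names are hypothetical rather than the authors' kernel.

```python
# Hedged sketch: hard chunk gating whose backward pass routes the mask gradient
# into the per-chunk softmax logit.
import torch


class ChunkGate(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scores: torch.Tensor, chunk_size: int) -> torch.Tensor:
        # scores: (T, 2) logits over {linear, softmax} per chunk.
        is_nla = (scores[:, 1] >= scores[:, 0]).float()
        token_nla = is_nla.repeat_interleave(chunk_size)  # (L,)
        ctx.chunk_size = chunk_size
        return token_nla[:, None] * token_nla[None, :]    # (L, L) hard softmax mask

    @staticmethod
    def backward(ctx, grad_mask: torch.Tensor):
        C = ctx.chunk_size
        T = grad_mask.shape[0] // C
        # d(score^nla_t) = sum of dM^nla_{i,j} over positions i, j inside chunk t.
        block = grad_mask.view(T, C, T, C).sum(dim=(1, 3))  # (T, T) per-block sums
        g_nla = block.diagonal()                            # within-chunk blocks only
        # The linear logit's gradient (via <dS_[t], S_[t]>) would come from the
        # linear branch; it is left at zero in this simplified sketch.
        g_scores = torch.stack([torch.zeros_like(g_nla), g_nla], dim=-1)  # (T, 2)
        return g_scores, None


scores = torch.randn(4, 2, requires_grad=True)
mask = ChunkGate.apply(scores, 8)
mask.sum().backward()
print(scores.grad.shape)  # torch.Size([4, 2])
```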

4. Computational Complexity

Let $L$ denote sequence length, $d$ the hidden size, $C$ the chunk size, $T$ the number of chunks, $L_{nla}$ the number of softmax-assigned tokens, and $L_{la} = L - L_{nla}$. The per-layer computational costs are:

| Model | Complexity | Notes |
| --- | --- | --- |
| Pure softmax (Transformer) | $O(L^2 d)$ | Quadratic in $L$ |
| Pure linear (Chunkwise RNN) | $O(L d^2)$ | Chunk parallel: $O((L/C)(C^2 + dC)d)$ |
| NAtS-L Hybrid | $O(L_{nla} L d + L_{la} d^2)$ | Linear if $L_{nla} \ll L$, otherwise quadratic |

When most tokens are routed through the linear backbone, overall complexity approaches $O(L d^2)$ like a linear transformer, while worst-case softmax routing matches Transformer cost. In practice, the gating mechanism enables significant efficiency gains where long-range retrieval is unnecessary, with memory and computation for softmax attention localized to a subset of tokens per layer (Deng et al., 3 Feb 2026).
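As a rough numerical illustration of the table (the token count, hidden size, and the assumed 10% softmax-routing ratio are hypothetical, not figures from the paper):

```python
# Back-of-the-envelope per-layer cost comparison using the expressions above.
L, d = 131_072, 2_048        # e.g. a 128K-token prefill, hidden size 2048 (assumed values)
L_nla = int(0.10 * L)        # suppose the gate routes 10% of tokens to softmax
L_la = L - L_nla

softmax_only = L**2 * d                   # pure transformer
linear_only = L * d**2                    # pure chunkwise linear attention
hybrid = L_nla * L * d + L_la * d**2      # NAtS-L hybrid

print(f"softmax : {softmax_only:.2e}")
print(f"linear  : {linear_only:.2e}")
print(f"hybrid  : {hybrid:.2e}  ({softmax_only / hybrid:.1f}x cheaper than pure softmax)")
```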

5. Training and Inference Regimes

Training uses standard teacher-forcing LLM objectives. The gating parameters ($W^{score}$) are updated using gradients propagated via the attention-masked pathways. No additional regularizers or entropy penalties are needed for the gate itself. At inference, only the softmax-path chunks require key/value cache storage and quadratic QK operations; linear chunks use the compact GDN state, substantially reducing cache and compute costs. This results in adaptive memory and compute allocation based on contextual requirements.
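A small sketch of this adaptive allocation during prefill: softmax-routed chunks append to a growing KV cache, while linear-routed chunks fold into a fixed-size recurrent state. The simplified state update and all names are illustrative assumptions, not the GDN recurrence.

```python
# Hedged sketch of chunkwise prefill with adaptive cache vs. recurrent-state updates.
import torch


def prefill_chunk(k_chunk, v_chunk, route_to_softmax, kv_cache, state):
    """k_chunk, v_chunk: (C, d); kv_cache: list of (K, V) pairs; state: (d, d)."""
    if route_to_softmax:
        kv_cache.append((k_chunk, v_chunk))  # cache grows only with softmax-routed chunks
    else:
        state = state + k_chunk.T @ v_chunk  # stand-in for the gated DeltaNet update
    return kv_cache, state


d, C = 64, 16
kv_cache, state = [], torch.zeros(d, d)
for t in range(8):
    k, v = torch.randn(C, d), torch.randn(C, d)
    kv_cache, state = prefill_chunk(k, v, route_to_softmax=(t % 4 == 0),
                                    kv_cache=kv_cache, state=state)
print(len(kv_cache), "chunks cached; state shape:", tuple(state.shape))
```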

6. Experimental Evaluation

Empirical benchmarks evaluate both accuracy and computational efficiency:

  • Language Modeling & Zero-Shot QA (≈800M parameters): On PG19, NarrativeQA, and CodeParrot (65K tokens), NAtS-L hybrid models match or slightly surpass both pure transformer and linear baselines for long-context extrapolation. In zero-shot QA (LAMBADA, PIQA, HellaSwag, etc.), performance is tied to within 0.5 perplexity points across models.
  • Retrieval (4K–16K Context): On RULER with L=4k/8k/16k, NAtS-L hybrid delivers mean scores of 0.49/0.32/0.21, significantly outperforming fixed-composition GDN+Transformer (which degrades sharply to 0.02/0.00 at longer contexts). On LongBench-e, NAtS-L hybrid achieves the highest accuracy in 5 out of 11 tasks (average 33.5% vs. 30–32% for baselines).
  • Latency (NVIDIA H100): Prefilling 128K tokens, NAtS-L hybrid yields a $5.4\times$ speedup over Transformer, with decoding $2.3\times$ faster at comparable quality. Relative to pure GDN, NAtS-L is $1.66\times$ slower during prefill, indicating the retained cost of softmax computation on selected tokens.
  • Context-Dependent Gating: The fraction of tokens assigned to softmax per head × layer varies by corpus (PG19, code, NarrativeQA), supporting the claim of domain-adaptive routing.

These experimental outcomes demonstrate that NAtS-L recovers high-softmax-attention expressivity on demand, while achieving near-linear efficiency on locally constrained chunks through learned token gating (Deng et al., 3 Feb 2026).
