Don't Pay Attention (2506.11305v1)

Published 12 Jun 2025 in cs.CL and cs.AI

Abstract: The Transformer has become the de facto standard for LLMs and a wide range of downstream tasks across various domains. Despite its numerous advantages like inherent training parallelism, the Transformer still faces key challenges due to its inability to effectively process sequences beyond a fixed context window and the quadratic complexity of its attention mechanism. These challenges have renewed interest in RNN-like architectures, which offer linear scaling with sequence length and improved handling of long-range dependencies, albeit with limited parallelism due to their inherently recurrent nature. In this paper, we propose Avey, a new neural foundational architecture that breaks away from both attention and recurrence. Avey comprises a ranker and an autoregressive neural processor, which collaboratively identify and contextualize only the most relevant tokens for any given token, regardless of their positions in the sequence. Specifically, Avey decouples sequence length from context width, thus enabling effective processing of arbitrarily long sequences. Experimental results show that Avey compares favorably to the Transformer across a variety of standard short-range NLP benchmarks, while notably excelling at capturing long-range dependencies.

Authors (2)
  1. Mohammad Hammoud
  2. Devang Acharya

Summary

The paper "Don't Pay Attention" (Hammoud et al., 12 Jun 2025 ) introduces Avey, a novel neural architecture for LLMing that departs from both attention and recurrence, aiming to address limitations of existing models like Transformers and RNNs in handling long sequences efficiently.

Transformers, while enabling high parallelism and achieving state-of-the-art results, suffer from quadratic computational and memory complexity with respect to sequence length due to the self-attention mechanism, which makes processing arbitrarily long sequences within a fixed context window challenging. RNN-like models, such as State Space Models (SSMs) like Mamba and linear-attention models like RWKV, offer linear scaling but face limited parallelism (RNNs) or have historically underperformed Transformers on language modeling tasks, struggling to generalize to contexts far beyond their training window.

Avey proposes to overcome these limitations by decoupling sequence length from context width. It achieves this through a weighted-selective-split interaction mechanism, which allows the model to process arbitrarily long sequences by selectively identifying and contextualizing only the most relevant tokens, regardless of their position. This mechanism relies on two main components: a ranker and an autoregressive neural processor.
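The snippet below is a minimal, hypothetical sketch of the ranking step, assuming a late-interaction-style MaxSim score and simple normalized top-k weighting; the function names, shapes, and weighting details are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def maxsim(current_split: torch.Tensor, candidate_split: torch.Tensor) -> torch.Tensor:
    """Late-interaction-style MaxSim: for each embedding in the current split, take its
    maximum cosine similarity over the candidate split, then sum over the current split."""
    q = F.normalize(current_split, dim=-1)     # (C, d)
    c = F.normalize(candidate_split, dim=-1)   # (C, d)
    return (q @ c.T).max(dim=-1).values.sum()  # scalar relevance score

def rank_splits(current_split: torch.Tensor, preceding_splits: list, k: int):
    """Score every preceding split against the current one, keep the top-k, and pair each
    selected split with its normalized MaxSim weight for use during contextualization."""
    scores = torch.stack([maxsim(current_split, s) for s in preceding_splits])
    top_scores, top_idx = scores.topk(min(k, len(preceding_splits)))
    weights = top_scores / top_scores.sum()    # normalized relevance weights
    return [(preceding_splits[i], w) for i, w in zip(top_idx.tolist(), weights)]
```

Because the selected splits may come from anywhere in the sequence, the context assembled for a given split does not grow with total sequence length, which is what decouples sequence length from context width.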

Avey Architecture Components:

  1. Ranker: This component partitions the input sequence into equal-sized "splits" of contiguous token embeddings. For a "current split" (the one being processed or used to predict the next token), the ranker identifies the top k most relevant preceding splits. Relevance is determined using the MaxSim operator, which measures similarity between the embeddings in the current split and those in each preceding split. The top k splits are then weighted by their normalized MaxSim scores, scaling their contribution during contextualization. The ranker is invoked only once per full forward/backward pass during training.
    • Practical Implication: This mechanism allows Avey to access information from tokens far outside the immediate context window, addressing the fixed context window limitation of Transformers.
    • Training Complexity: The ranker's computation involves comparing the current split against all preceding splits, resulting in a training time complexity of O(N^2 d), where N is the sequence length and d is the embedding dimension.
  2. Neural Processor: This component processes the current split along with the weighted top k relevant splits identified by the ranker (a hedged sketch of this pipeline appears after this list). It consists of three sub-units:
    • Enricher: A position-wise neural network that expands the dimensionality of token embeddings. This aims to increase the quantity of learnable features, providing richer representations for the contextualizer. The output embeddings are split into a "head" portion and a "tail" portion.
      • Practical Implication: Feature expansion helps the model capture more nuanced information within individual token representations.
      • Partial-Embedding Bypassing: The head portion is bypassed directly to the fuser, preserving original embedding characteristics and potentially mitigating issues such as entropy collapse or over-smoothing in deeper models.
      • Complexity: O(Nmd), where m is the expanded embedding dimension.
    • Contextualizer: An embedding-wise neural network that processes the "tail" portion of the enriched embeddings from the current and selected relevant splits, enabling inter-embedding, data-dependent interactions. The tail portion is further split into a gating part and a contextualized part, allowing the model to dynamically regulate information flow. The contextualizer's parameterization is dynamic (input-dependent), drawing inspiration from gMLP and selective SSMs such as Mamba.
      • Practical Implication: Dynamic parametrization makes the model selective, allowing it to focus on or disregard information based on the input, which is crucial for handling relevant vs. irrelevant tokens from long sequences.
      • Complexity: O(NkCm_t) during training, where k is the number of top splits, C is the context width, and m_t is the tail dimension.
    • Fuser: A position-wise neural network that combines the uncontextualized "head" features (bypassed from the enricher) and the contextualized features from the contextualizer. It projects the combined features back to the original embedding dimension d.
      • Practical Implication: This unit integrates information from both raw features and contextualized interactions.
      • Complexity: O(Nmd).
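Below is a hedged sketch of the neural processor's enricher → contextualizer → fuser pipeline, assuming a simple linear enricher, a sigmoid gate, and an input-conditioned causal mixing step standing in for the paper's dynamic parameterization; the expansion factor, head/tail split, and mixing form are illustrative assumptions rather than the exact design.

```python
import torch
import torch.nn as nn

class NeuralProcessorSketch(nn.Module):
    """Illustrative enricher -> contextualizer -> fuser pipeline (not the authors' exact design)."""

    def __init__(self, d: int, expansion: int = 4, head_frac: float = 0.25):
        super().__init__()
        m = d * expansion                        # expanded width produced by the enricher
        self.m_head = int(m * head_frac)         # "head" portion, bypassed straight to the fuser
        self.half = (m - self.m_head) // 2       # tail splits into a gating part and a contextualized part
        self.enricher = nn.Linear(d, m)          # position-wise feature expansion
        self.to_coeff = nn.Linear(self.half, 1)  # input-conditioned mixing coefficients (dynamic-parameterization stand-in)
        self.fuser = nn.Linear(self.m_head + self.half, d)  # project back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, d), the current split concatenated with the weighted top-k relevant splits.
        T = x.shape[0]
        z = self.enricher(x)                                              # (T, m)
        head = z[:, :self.m_head]                                         # partial-embedding bypassing
        gate = z[:, self.m_head:self.m_head + self.half]                  # gating part of the tail
        ctx = z[:, self.m_head + self.half:self.m_head + 2 * self.half]   # contextualized part of the tail
        # Causal, attention-free mixing across positions whose weights depend on the input.
        coeff = torch.sigmoid(self.to_coeff(ctx)).squeeze(-1)             # (T,)
        mix = torch.tril(torch.ones(T, T, device=x.device)) * coeff       # (T, T), causal and data-dependent
        mix = mix / mix.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        contextualized = torch.sigmoid(gate) * (mix @ ctx)                # gate regulates information flow
        return self.fuser(torch.cat([head, contextualized], dim=-1))      # (T, d)
```

For example, NeuralProcessorSketch(d=64)(torch.randn(12, 64)) returns a (12, 64) tensor: the block preserves the model dimension while mixing information across the assembled splits.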

Overall Computational Complexity:

  • Training: Dominated by the ranker, resulting in O(N^2 d) complexity, similar to standard Transformers (a rough accounting of this bound is sketched after this list).
  • Inference: Dominated by the ranker comparing the current split with preceding splits, giving a theoretical cost of O(Nd) per generated token when the cumulative cost over the sequence is amortized. The paper's empirical Time to First Token (TTFT) benchmarks show that Avey scales significantly better than Transformer++, Mamba, and RWKV-7. This practical efficiency is attributed to the ranker being invoked only once per forward pass, adding minimal overhead during generation.
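As a rough back-of-the-envelope check on the training bound (our own accounting, assuming a fixed split width C and a MaxSim that compares all embedding pairs between two splits):

```latex
% Hedged accounting of the ranker's training cost (assumptions: split width C,
% MaxSim over all C x C embedding pairs of two splits, about N/C splits in total).
\[
\underbrace{\tfrac{1}{2}\left(\tfrac{N}{C}\right)^{2}}_{\text{split pairs}}
\times
\underbrace{O\!\left(C^{2} d\right)}_{\text{one MaxSim}}
= O\!\left(N^{2} d\right)
\]
```

This matches the stated O(N^2 d) training complexity and is independent of the choice of split width C.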

Experimental Evaluation:

The authors conducted extensive experiments, including over 200 runs for design choices and ablations, using a fixed budget of 100 billion tokens from the FineWeb dataset. Avey was compared against strong open-source baselines, Transformer++ (an optimized Transformer implementation), Mamba, and RWKV-7, across small (~150M), medium (~500M), and large (~1.5B) parameter scales. Evaluations used the LM Evaluation Harness on standard zero-shot NLP benchmarks (ARC-E/C, HellaSwag, PIQA, OBQA, SIQA, Winogrande) and the RULER S-NIAH suite for long-context retrieval.

  • Short-Range Results: Avey performs comparably to or slightly outperforms Transformer++, Mamba, and RWKV-7 at small and medium sizes. At the large size, Avey slightly underperforms the baselines. The authors note that Avey's performance on these benchmarks was based on configurations optimized for downstream task performance rather than perplexity.
  • Scaling Laws: Experiments following Chinchilla scaling laws (increasing training tokens proportionally with model size) show that Avey exhibits a steeper perplexity-reduction curve than Transformer++, Mamba, and RWKV-7, suggesting stronger scaling behavior with increased compute.
  • Long-Range Results (S-NIAH): A key strength of Avey is demonstrated on the S-NIAH benchmark. Despite being trained with a short context window (512 tokens vs. 2048 for the baselines), Avey shows remarkable extrapolation capability, maintaining high accuracy on retrieval tasks with sequence lengths up to 64k tokens. In contrast, the performance of Transformer++, Mamba, and RWKV-7 drops significantly beyond their trained context windows. Avey's accuracy tends, perhaps surprisingly, to improve with longer haystacks, potentially because the ranker has a larger pool of splits from which to select relevant context.

Design Choices and Ablations:

Extensive ablation studies validated the contribution of Avey's key components. Removing, disabling, or replacing:

  • Dynamic parameterization in the contextualizer increased perplexity and reduced downstream performance.
  • Partial-embedding bypassing significantly increased perplexity and reduced performance.
  • Embedding expansion in the enricher (setting expansion factor to 1) substantially increased perplexity and reduced performance.
  • Weighting selected splits by normalized MaxSim scores increased perplexity and slightly reduced performance.
  • The ranker (while still processing full sequences) slightly increased perplexity but reduced average downstream performance, confirming its benefit beyond just extrapolation.
  • Replacing the neural processor with self-attention increased perplexity and reduced performance, suggesting the neural processor is more effective within Avey's design.

Limitations:

The authors acknowledge that the work is limited to textual data and autoregressive language modeling; Avey's ability to learn bidirectional representations (as in BERT) was not investigated. While the theoretical training complexity is O(N^2 d), the current implementation is slower than the optimized baselines, indicating a need for further engineering optimization; the paper focuses on the effectiveness of the architecture rather than the efficiency of its current implementation.

In summary, Avey presents a promising new direction for language modeling by replacing attention and recurrence with a ranker and a selective neural processor. Its ability to decouple sequence length from context width and to extrapolate effectively to very long sequences, as demonstrated on the S-NIAH benchmark, highlights its potential for applications that require long-context handling, such as processing lengthy documents or sustaining extended conversations, while its empirical TTFT performance suggests suitability for latency-sensitive deployments.
