Attention-Only Transformers
- Attention-only transformers are minimalist sequence models that rely exclusively on self-attention and residual connections to execute symbolic reasoning and associative memory tasks.
- They provide a tractable framework for mechanistic interpretability by decomposing reasoning into additive and contrastive circuits without feed-forward complexities.
- Architectural modifications like normalized attention pooling enhance expressivity while preserving parameter efficiency compared to traditional transformer models.
Attention-only transformers are a class of sequence models derived from the transformer architecture in which the network is constructed solely from attention-based operations and their associated residual connections, explicitly omitting feed-forward (MLP) and convolutional sublayers. Originating from a systematic reduction of full transformer models, attention-only transformers have become essential objects of study for dissecting the minimal mechanisms by which attention architectures can implement symbolic reasoning, memorization, and generalization, and they have motivated both theoretical and empirical work in mechanistic interpretability, expressivity, and the design of machine learning systems (Adhikari, 28 Oct 2025; Huben et al., 2023).
1. Foundational Architectures and Definition
The standard transformer, introduced in “Attention Is All You Need” (Vaswani et al., 2017), consists of sequential layers, each composed of multi-head attention and a position-wise feed-forward (MLP) network, together with residual connections and layer norm. The attention-only transformer (AoT) removes the MLP and normalization sublayers, leaving only the attention layers interleaved with residual paths. Mathematically, the canonical attention-only layer is

$$X^{(\ell+1)} = X^{(\ell)} + \sum_{h=1}^{H} A_h\big(X^{(\ell)}\big)\, X^{(\ell)} W_V^h W_O^h,$$

with $A_h(X) = \operatorname{softmax}\!\big(X W_Q^h (X W_K^h)^\top / \sqrt{d_k}\big)$, $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{d \times d_k}$, and $W_O^h \in \mathbb{R}^{d_k \times d}$, where $X \in \mathbb{R}^{n \times d}$ is the token sequence, $H$ is the number of attention heads, $d$ is the hidden dimension, and $d_k$ is the head dimension.
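This layer can be sketched directly in NumPy. The sketch below is illustrative only: the head-projection names (`Wq`, `Wk`, `Wv`, `Wo`) and random initialization are assumptions for the example, not part of any cited model.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def aot_layer(X, heads, causal=True):
    """One attention-only layer: residual stream plus a sum over heads.
    No MLP, no LayerNorm. X: (n, d); each head holds Wq, Wk, Wv: (d, d_k)
    and Wo: (d_k, d)."""
    n, d = X.shape
    out = X.copy()                                  # residual path
    for h in heads:
        Q, K, V = X @ h["Wq"], X @ h["Wk"], X @ h["Wv"]
        logits = Q @ K.T / np.sqrt(h["Wq"].shape[1])
        if causal:                                  # strictly causal mask
            logits = np.where(np.tril(np.ones((n, n), bool)), logits, -np.inf)
        A = softmax(logits, axis=-1)                # rows live on the simplex
        out = out + A @ V @ h["Wo"]                 # additive head contribution
    return out

# Toy usage: two heads, d = 8, d_k = 4, five tokens.
rng = np.random.default_rng(0)
heads = [{k: rng.normal(size=s) for k, s in
          [("Wq", (8, 4)), ("Wk", (8, 4)), ("Wv", (8, 4)), ("Wo", (4, 8))]}
         for _ in range(2)]
X = rng.normal(size=(5, 8))
Y = aot_layer(X, heads)

# Perturb only the last token; under causal masking, earlier outputs
# must be unchanged while the last output moves.
X2 = X.copy()
X2[-1] += 1.0
Y2 = aot_layer(X2, heads)
```

The causal check at the end makes the residual/masking structure concrete: information flows only leftward through the attention pattern, while the residual stream carries each token forward unchanged.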
Minimal AoT models have been demonstrated in symbolic reasoning settings: e.g., a single-layer, two-head architecture with no MLP and no layer norm achieves perfect accuracy on the Indirect Object Identification (IOI) symbolic benchmark, indicating that pure attention mechanisms can implement nontrivial relational computations in an extremely compact form (Adhikari, 28 Oct 2025).
2. Expressivity and Theoretical Characterizations
Expressivity of attention-only transformers is characterized through their ability to represent functions, encode logical patterns, and memorize associations, compared to the full transformer family. Notable results include:
- Unique hard attention (UHA) variants, in which attention focuses deterministically on a single position per query, have been proven to capture exactly the class of linear-temporal-logic (LTL) definable (equivalently, star-free) languages when allowed to break ties in both leftmost and rightmost directions. Restricting to leftmost-only tie-breaking (or to softmax attention in finite precision), expressivity drops to the PastLTL fragment; rightmost-only corresponds to FutLTL (Jerad et al., 18 Mar 2025).
- Attention-only transformers with finite-precision softmax (and strictly causal masking) are thus strictly weaker than unrestricted LTL: for example, they cannot implement “output the symbol immediately to the left” (local look-back by one), which lies outside PastLTL.
- The capacity for exact associative memorization in a single-layer AoT obeys a linear bound: the number T of input–output pairings that can be stored scales linearly with the number of heads H and the internal head dimension d_h, relative to the residual dimension d (Dana et al., 2024). This sharply improves upon past sequence-memory bounds, notably removing restrictions based on sequence length and confirming that AoTs are parameter-efficient associative memories.
- In the approximate regime, an AoT can minimize the expected KL divergence to a target distribution down to the sequence-encoder lower bound, with the capacity required for ε-approximate memorization scaling with the probability mass of the input subset being covered.
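The role of tie-breaking direction in UHA can be made concrete with a one-query sketch: when two keys share the maximal score, the leftmost and rightmost conventions select different positions, which is exactly the degree of freedom that separates the PastLTL and FutLTL fragments above. The function name below is an assumption for illustration.

```python
import numpy as np

def unique_hard_attention(scores, tie="leftmost"):
    """Unique hard attention for a single query: attend to exactly one
    key, the one with the maximal score; break ties in the given
    direction."""
    m = scores.max()
    tied = np.flatnonzero(scores == m)        # all positions at the max
    return int(tied[0]) if tie == "leftmost" else int(tied[-1])

# Tie between positions 1 and 2: the two conventions diverge.
scores = np.array([0.3, 0.9, 0.9, 0.1])
left = unique_hard_attention(scores, "leftmost")
right = unique_hard_attention(scores, "rightmost")
```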
3. Implementability of MLPs and Universality
It has been proven that any MLP neuron (with activation from a wide analytic family including SiLU and close approximations of ReLU and GeLU) can be exactly simulated by a masked attention head of internal dimension 1, provided the residual stream is augmented with a constant “bias” token. Thus, every alternating MLP layer in a standard transformer can be replaced with rank-1 attention heads (one per MLP neuron), yielding a construction in which bias-augmented, attention-only layers suffice to emulate the full original model (Huben et al., 2023). The trade-off is an increase in the number of attention heads, but not in total parameter count.
Furthermore, one can explicitly decompose the MLP operation x ↦ W₂ σ(W₁x + b) into three kinds of attention heads: linear heads for the input projection W₁x + b, activation heads applying the nonlinearity σ coordinatewise to each entry, and a final set of linear heads for the output projection W₂. Masking constraints required for these constructions (such as isolating the bias token) can be implemented via extremely large negative logits, which sharpens the connection between mask padding and attention weights.
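The identity at the heart of this construction can be checked directly for SiLU: a head that scores a token x against a constant bias slot (logit 0, value 0) assigns softmax weight sigmoid(x) to the token, so weighting the value x gives sigmoid(x)·x = SiLU(x). The sketch below shows only this two-slot identity, not the full masked-head construction of Huben et al.; the function names are assumptions.

```python
import numpy as np

def silu(x):
    """Reference SiLU: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def rank1_head_silu(x):
    """Simulate a SiLU neuron with a single attention head of internal
    dimension 1. The head attends over two slots: the token itself
    (logit x, value x) and a constant bias token (logit 0, value 0).
    The softmax weight on the token is sigmoid(x), so the head outputs
    sigmoid(x) * x = SiLU(x)."""
    logits = np.array([x, 0.0])       # [token slot, bias-token slot]
    values = np.array([x, 0.0])
    w = np.exp(logits - logits.max())
    w = w / w.sum()                    # softmax over the two slots
    return float(w @ values)
```

Replicating one such head per MLP neuron is what lets a bias-augmented, attention-only stack emulate the full model at unchanged parameter count.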
4. Mechanistic Interpretability and Minimal Circuits
Mechanistic analysis of AoT models trained on synthetic reasoning tasks reveals highly interpretable structures:
- In the IOI task, a one-layer, two-head AoT decomposes into additive and contrastive subcircuits. “Head 0” attends equally to both names in the clause, summing their embeddings, e_IO + e_S. “Head 1” attends to the main-clause subject (the non-IO name), subtracting its embedding, −e_S. The total residual at the output position is therefore (e_IO + e_S) − e_S = e_IO, which the softmax unembedding reads off with certainty (Adhikari, 28 Oct 2025).
- Ablations (e.g., removing positional embeddings) reduce the model’s accuracy to roughly 70%, demonstrating that both positional and content-based mechanisms are necessary.
- In two-layer, one-head AoTs, layerwise composition allows “re-entrance” over residual streams, recovering similar additive–contrastive operator patterns.
- These results suggest that task-constrained, attention-only models consistently favor highly interpretable, linearly decomposable “add & subtract” circuits for relational reasoning, and are more disentangled than the distributed, multi-hop circuits seen in full transformers.
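The add & subtract circuit above can be illustrated with plain vector arithmetic. This is an idealized sketch: it treats each head’s output as a clean sum or difference of name embeddings, ignoring value/output projections and attention scaling, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
e_io = rng.normal(size=d)   # embedding of the indirect-object name
e_s = rng.normal(size=d)    # embedding of the subject name

# Head 0: attends equally to both names, summing their embeddings.
head0 = e_io + e_s
# Head 1: attends to the subject (the non-IO name), subtracting it.
head1 = -e_s

# Residual stream at the output position: only the IO name survives.
residual = head0 + head1
```

The contrastive cancellation (e_IO + e_S) − e_S = e_IO is what makes the circuit linearly decomposable and therefore easy to read out with a softmax unembedding.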
5. Alternative Attention Mechanisms and Architectural Modifications
Attention-only models can be further varied and analyzed along several axes:
- Employing normalized attention pooling (NAP), where the softmax is replaced by per-query layer normalization over the logits, produces attention weights that are not restricted to the probability simplex. This removes the “convex hull cage”: outputs can be arbitrary linear combinations of the value vectors, whereas the convexity imposed by softmax restricts each head’s output to the convex hull of its values. NAP thus allows a single attention head to implement XOR, which is impossible for softmax attention, and achieves improved expressivity and robustness across task families and hyperparameters (Richter et al., 2020).
- Compositionality via stacked attention-only layers enables information propagation and complex circuit formation even in the absence of feed-forward (MLP) units, explaining empirical success on tasks such as parsing and coreference (Adhikari, 28 Oct 2025).
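A minimal sketch of NAP, assuming per-query standardization of the logits (mean zero, unit variance) in place of softmax; learned LayerNorm gain/offset parameters are omitted for brevity:

```python
import numpy as np

def nap_weights(logits, eps=1e-6):
    """Normalized attention pooling: per-query LayerNorm over the
    logits instead of softmax. The resulting weights may be negative
    and sum to zero rather than one, so they leave the simplex."""
    mu = logits.mean(axis=-1, keepdims=True)
    sd = logits.std(axis=-1, keepdims=True)
    return (logits - mu) / (sd + eps)

# One query over three keys: some weights come out negative.
logits = np.array([[2.0, -1.0, 0.5]])
w = nap_weights(logits)

# The pooled output is a signed combination of the value vectors,
# so it can lie outside their convex hull.
V = np.array([[1.0], [0.0], [0.5]])
out = w @ V
```

Because the weights are mean-centered rather than normalized to sum to one, a single NAP head can subtract one value vector from another, which is the mechanism behind its ability to represent XOR.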
6. Analytical and Empirical Performance
Extensive evaluation of AoTs includes:
| Domain / Model | Params | Top-1 Acc. (Vision) | LM Val. Loss | LAMBADA Acc. | SOTA Reference |
|---|---|---|---|---|---|
| AoT-MSSA-V (ImageNetV2) | 22M | 71.7% | — | — | CRATE: 79.5% [params:39M] |
| AoT-MHSA-V (ViT-Base comp.) | 15M | 69.5% | — | — | ViT-Base: 72.4% [22M] |
| AoT-MHSA-L Base (LM, OWT) | 122M | — | ~4.42 | 0.38 | GPT-2 Base: ~4.32/0.40 |
| AoT-MSSA-L Base (LM, OWT) | 102M | — | ~4.70 | — | — |
AoTs perform at near parity with MLP-augmented transformers on language modeling and vision tasks, with only a modest parameter overhead to offset the loss of MLP capacity. For hallucination-free memorization and explicit look-up tasks, AoTs are as parameter-efficient as (and sometimes outperform) MLP-based networks of equivalent capacity (Dana et al., 2024; Wang et al., 4 Jun 2025).
7. Interpretive and Practical Implications
- Attention-only transformers, by construction, reveal the minimal circuit complexity required for symbolic and relational reasoning, providing uniquely tractable systems for mechanistic interpretability (Adhikari, 28 Oct 2025).
- The elimination of MLPs, while limiting model expressivity on tasks requiring non-linear or multi-task generalization, is generally not a barrier to basic relational, memory, or denoising-based learning objectives.
- Theoretical work demonstrates that the details of attention policy (e.g., directionality in hard or biased attention, masking, normalization) have direct and sometimes drastic implications for the set of languages/functions implementable in finite precision (Jerad et al., 18 Mar 2025, Richter et al., 2020).
- In practice, architectural modifications such as NAP or softmax thresholding, as well as task-specific augmentations with bias tokens or custom heads, can systematically extend or refine AoT capabilities, sometimes surpassing standard variants in robustness and generality (Richter et al., 2020).
Attention-only transformers form both a theoretical minimum and a practical backbone for the study and deployment of interpretable, compositional, and highly modular sequence models. They provide direct insight into the foundations and limits of deep attention architectures, and inform architectural and training choices for efficient and robust model design in contemporary machine learning.