One-layer Multi-head Transformers

Updated 18 August 2025
  • One-layer multi-head transformers are neural architectures that deploy multiple parallel attention heads to learn diverse and distinct representations from input embeddings.
  • They enable efficient sequence modeling, contextual memorization, and universal function approximation through mechanisms like multi-query and multi-branch attention.
  • Their design offers rapid training convergence and scalability across domains, though they face challenges in handling long-range dependencies compared to deeper models.

A one-layer multi-head transformer is a neural architecture characterized by a single layer in which multiple attention “heads” operate in parallel, each learning distinct projections and attention patterns over shared or distinct input embeddings. This architectural configuration, central to both the original Transformer and many of its modern adaptations, offers an expressive, highly parallelizable mechanism for sequence modeling, contextual representation, memorization, in-context learning, and reasoning, and it scales well across domains, while presenting distinct theoretical and practical trade-offs relative to deeper or single-head architectures.

1. Architectural Principles and Mathematical Formulation

A one-layer multi-head transformer consists of an input projection (typically for queries, keys, values), a parallel collection of attention heads, a concatenation and final output projection, and (optionally) a feedforward block. The canonical mathematical formulation for the multi-head attention mechanism is:

For input $X \in \mathbb{R}^{n \times d}$ ($n$ = sequence length, $d$ = embedding dimension), $h$ heads, and per-head projections $W_q^{(i)}, W_k^{(i)}, W_v^{(i)} \in \mathbb{R}^{d \times d_h}$:

  • For $i = 1, \dots, h$:
    • $Q_i = X W_q^{(i)}$,
    • $K_i = X W_k^{(i)}$,
    • $V_i = X W_v^{(i)}$,
    • $\operatorname{Head}_i = \operatorname{softmax}\!\left(Q_i K_i^{\top} / \sqrt{d_h}\right) V_i$.
  • The outputs $\{\operatorname{Head}_i\}$ are concatenated and projected: $Y = \operatorname{Concat}(\operatorname{Head}_1, \dots, \operatorname{Head}_h)\, W_o$.
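
The computation above translates directly into a few lines of code. The following is a minimal NumPy sketch of a single multi-head attention layer following these equations; the weights are random placeholders and the dimensions are arbitrary, so it illustrates shapes and data flow rather than a trained model.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """One-layer multi-head attention.

    X          : (n, d)       input embeddings
    Wq, Wk, Wv : (h, d, d_h)  per-head projections
    Wo         : (h * d_h, d) output projection
    """
    h, d, d_h = Wq.shape
    heads = []
    for i in range(h):
        Q, K, V = X @ Wq[i], X @ Wk[i], X @ Wv[i]        # (n, d_h) each
        A = softmax(Q @ K.T / np.sqrt(d_h), axis=-1)     # (n, n) attention weights
        heads.append(A @ V)                              # (n, d_h) head output
    return np.concatenate(heads, axis=-1) @ Wo           # (n, d)

# Example: n=5 tokens, d=16 model dim, h=4 heads, d_h=4 per-head dim
rng = np.random.default_rng(0)
n, d, h, d_h = 5, 16, 4, 4
X  = rng.standard_normal((n, d))
Wq = rng.standard_normal((h, d, d_h)) / np.sqrt(d)
Wk = rng.standard_normal((h, d, d_h)) / np.sqrt(d)
Wv = rng.standard_normal((h, d, d_h)) / np.sqrt(d)
Wo = rng.standard_normal((h * d_h, d)) / np.sqrt(h * d_h)
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)  # (5, 16)
```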

Many variants exist: "multi-query" attention (Shazeer, 2019) reduces memory usage during incremental decoding by sharing a single set of keys and values across all heads; "multi-branch" attention averages outputs from multiple independently parameterized multi-head blocks (Fan et al., 2020); and further architectural augmentations (such as parallelization of attention and MLP blocks (Zhong et al., 2022), horizontally/vertically reweighted heads (Yu et al., 2022), or explicit overlap between heads (Zhang et al., 18 Oct 2024)) have also been developed. Variants like Multiformer (Sant et al., 2022) allow each head to employ heterogeneous attention mechanisms; HydraViT (Haberer et al., 26 Sep 2024) enables dynamic stacking/dropping of heads for resource-scalable deployment.
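
To make the first of these variants concrete, the sketch below contrasts the standard formulation with multi-query attention, in which a single key/value projection is shared by all heads; the function name and shapes are illustrative assumptions, not drawn from any particular implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(X, Wq, Wk, Wv, Wo):
    """Multi-query attention: per-head query projections, one shared K/V projection.

    X      : (n, d)       input embeddings
    Wq     : (h, d, d_h)  per-head query projections
    Wk, Wv : (d, d_h)     single key/value projections shared by all heads
    Wo     : (h * d_h, d) output projection
    """
    h, _, d_h = Wq.shape
    K, V = X @ Wk, X @ Wv                      # computed once and reused by every head
    heads = []
    for i in range(h):
        Q = X @ Wq[i]
        A = softmax(Q @ K.T / np.sqrt(d_h), axis=-1)
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ Wo
```

Because each position contributes a single key/value pair regardless of the number of heads, the decoder-side K/V cache shrinks by roughly a factor of h during incremental decoding, which is the source of the memory-bandwidth savings reported by Shazeer (2019).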

The expressivity of such a layer can be formally characterized—not just as an ensemble of rank-constrained attention maps but, under certain conditions, as a universal approximator of permutation-equivariant functions (Kajitsuka et al., 2023).

2. Expressive Capacity, Memorization, and Universal Approximation

One-layer multi-head transformers possess significant theoretical capacity for contextual modeling and memorization. It has been shown that:

  • Under mild linear independence conditions (e.g., Kruskal rank at least $n$), a single multi-head attention layer with $H$ heads, key/query dimension $d$, and context length $n$ can memorize $\Omega(Hn)$ distinct examples using $\Theta(Hd^2)$ parameters (Mahdavi et al., 2023).
  • The softmax operator’s “saturation property” allows precise allocation of individual heads to memorize disjoint subsets of the data; heads can essentially act as independent lookup units, driving near one-hot attention patterns (Mahdavi et al., 2023). A small numerical illustration of this effect follows this list.
  • With appropriately chosen low-rank weight matrices, a one-layer self-attention mechanism with softmax functions is a universal approximator for continuous permutation-equivariant functions on compact domains (provided it is sandwiched by suitable feedforward networks) (Kajitsuka et al., 2023). This result formally establishes that the majority of the theoretical expressive power attributed to transformers is already present in the one-layer multi-head configuration, rather than deriving purely from depth or width.
  • The settings where one-layer transformers are “functionally sufficient” depend heavily on input encoding. If key–value pairings are preserved in a single token or via redundant position encoding, one-layer architectures can exactly simulate arbitrary function evaluations with size $h \cdot d \cdot p$ (head count × head dimension × precision) that is only polylogarithmic in $n$ (Strobl et al., 28 Mar 2025). Conversely, input formats that force information to be aggregated and disambiguated across different positions (e.g., keys and values only accessible in consecutive but permuted tokens) push the lower bound on $h \cdot d \cdot p$ up to $\Omega(n \log n)$, unless depth is increased (Sanford et al., 26 Aug 2024, Strobl et al., 28 Mar 2025).
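
The saturation effect in the second bullet can be seen in a small numerical experiment. The sketch below (with arbitrary toy dimensions, not the construction from Mahdavi et al., 2023) scales the attention logits by a factor beta and shows the attention weights collapsing onto a single stored key, so that the head returns the associated value like a lookup table.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy associative memory: n stored (key, label) pairs inside one attention head.
rng = np.random.default_rng(1)
n, d_h = 8, 64
K = rng.standard_normal((n, d_h)) / np.sqrt(d_h)   # one key vector per stored example
labels = rng.integers(0, 10, size=n)
V = np.eye(10)[labels]                              # value i one-hot encodes label i

q = K[3]                                            # query matching the 4th stored key
print("true label:", labels[3])
for beta in (1.0, 10.0, 100.0):                     # beta plays the role of the logit scale
    attn = softmax(beta * (K @ q))                  # attention weights over the n stored slots
    retrieved = attn @ V                            # mixture of stored values
    print(f"beta={beta:6.1f}  max attention weight={attn.max():.2f}  "
          f"retrieved label={retrieved.argmax()}")
```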

3. Optimization, Generalization, and Training Dynamics

From an optimization standpoint, shallow (one-layer) multi-head transformers exhibit several notable properties:

  • Training via gradient descent is locally stable and converges rapidly (empirically and theoretically) when sufficiently overparameterized in the number of heads, under a “realizability” condition, i.e., that a model with low loss exists not far from initialization (Deora et al., 2023). Stability, generalization (an $O(1/n)$ generalization gap after $O(n)$ steps), and robustness to initialization can all be formally quantified. For models trained on tokenized mixtures or other semi-structured inputs, a single (randomized) gradient step from zero initialization suffices to achieve strong Neural Tangent Kernel (NTK) separability.
  • Multi-head attention improves training stability over single-head, deep alternatives. While a deep stack of single-head layers can, in principle, attend jointly to as many positions as a shallow multi-head layer, the former is much more susceptible to optimization difficulties like shattered gradients, unless advanced initialization (e.g., Admin) is applied (Liu et al., 2021).
  • In practical in-context learning tasks (sparse regression, nonlinear regression), one-layer multi-head transformers provably converge linearly to global minima and learn contextual abstract templates (i.e., solve ridge regression over basis functions), thus achieving robust generalization to new inputs and even unseen tasks (Chen et al., 8 Aug 2024, Yang et al., 19 Aug 2024); a toy training sketch of this setting appears after this list.
  • The optimization and learning-theoretic properties extend to scenarios where inputs are highly non-Gaussian or Boolean; provable, sample-efficient algorithms (polynomial in problem dimensions for fixed number of heads) exist for recovering all parameters of the attention heads from random input–output pairs, although with exponential complexity in h in the worst case (Chen et al., 6 Feb 2024).
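
As a concrete, highly simplified illustration of the in-context regression setting in the third bullet, the PyTorch sketch below trains a one-layer multi-head attention module by gradient descent on randomly drawn linear-regression prompts (context pairs plus a query token with its label hidden). The token format, dimensions, and scalar read-out are assumptions made for this sketch, not the constructions analyzed by Chen et al. (2024) or Yang et al. (2024); it shows the training setup rather than reproducing their guarantees.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneLayerMHA(nn.Module):
    """One-layer multi-head self-attention with a linear read-out; no MLP, no depth."""
    def __init__(self, d_tok, n_heads, d_head):
        super().__init__()
        self.h, self.d_head = n_heads, d_head
        self.Wq = nn.Linear(d_tok, n_heads * d_head, bias=False)
        self.Wk = nn.Linear(d_tok, n_heads * d_head, bias=False)
        self.Wv = nn.Linear(d_tok, n_heads * d_head, bias=False)
        self.Wo = nn.Linear(n_heads * d_head, 1, bias=False)   # scalar prediction read-out

    def forward(self, X):                                       # X: (batch, seq, d_tok)
        B, n, _ = X.shape
        q = self.Wq(X).view(B, n, self.h, self.d_head).transpose(1, 2)  # (B, h, n, d_head)
        k = self.Wk(X).view(B, n, self.h, self.d_head).transpose(1, 2)
        v = self.Wv(X).view(B, n, self.h, self.d_head).transpose(1, 2)
        A = F.softmax(q @ k.transpose(-1, -2) / self.d_head ** 0.5, dim=-1)
        out = (A @ v).transpose(1, 2).reshape(B, n, -1)
        return self.Wo(out)[:, -1, 0]                           # prediction at the query token

def make_batch(B=64, n_ctx=16, d=8):
    """Random linear-regression tasks: context pairs (x_i, w·x_i) plus a query token."""
    w = torch.randn(B, d, 1)
    x = torch.randn(B, n_ctx + 1, d)
    y = (x @ w).squeeze(-1)                                     # (B, n_ctx + 1)
    tok = torch.cat([x, y.unsqueeze(-1)], dim=-1)               # token = [x_i ; y_i]
    tok[:, -1, -1] = 0.0                                        # hide the query token's label
    return tok, y[:, -1]

model = OneLayerMHA(d_tok=9, n_heads=4, d_head=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):
    tok, target = make_batch()
    loss = F.mse_loss(model(tok), target)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:
        print(step, loss.item())
```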

4. Functional Specialization, Head Coordination, and Practical Usage

The multi-head mechanism enables diverse forms of specialization, supporting sophisticated algorithms even in a single layer:

  • In chain-of-thought or multi-step symbolic reasoning tasks (e.g., path-finding in trees), different heads autonomously specialize to distinct sub-tasks and coordinate sequential outputs in autoregressive fashion—one head “tracks content” while another “controls stage transitions” (e.g., detecting when to reverse a path) (Yang et al., 11 Aug 2025).
  • In in-context learning (for regression or few-shot tasks), all heads in the first layer are jointly utilized to “preprocess” or “reweight” input features, with subsequent layers then executing simple iterative optimization (e.g., gradient descent), usually using only a single dominant head (Chen et al., 8 Aug 2024). These patterns have been empirically supported by head-masking and ablation studies; a minimal ablation probe is sketched after this list.
  • Heterogeneous-attention models like Multiformer (Sant et al., 2022) configure each head with a different attention mechanism (e.g., local, global, ConvAttention), allowing the extraction of more diverse token interactions and promoting a uniform distribution of head relevance, which correlates with improved translation quality.
  • Modifications such as horizontal and vertical attention (Yu et al., 2022), multi-branch architectures (Fan et al., 2020), or overlapping heads (Zhang et al., 18 Oct 2024) further enhance the capacity of a one-layer transformer to learn distinctive, robust representations and to allow richer interaction across heads/channels while keeping parameter increases and resource overhead minimal.
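
The head-masking and ablation methodology mentioned above can be mimicked with a small probe: zero out one head's contribution before the output projection and measure how much the layer's output changes (for a trained model one would track task loss instead of output norm). The weights below are random and purely illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mha_with_mask(X, Wq, Wk, Wv, Wo, head_mask):
    """Multi-head attention where head_mask[i] in {0, 1} zeroes out head i's output."""
    h, d, d_h = Wq.shape
    heads = []
    for i in range(h):
        Q, K, V = X @ Wq[i], X @ Wk[i], X @ Wv[i]
        A = softmax(Q @ K.T / np.sqrt(d_h), axis=-1)
        heads.append(head_mask[i] * (A @ V))
    return np.concatenate(heads, axis=-1) @ Wo

# Head-importance probe: compare the full output against each single-head ablation.
rng = np.random.default_rng(2)
n, d, h, d_h = 6, 16, 4, 4
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((h, d, d_h)) / np.sqrt(d) for _ in range(3))
Wo = rng.standard_normal((h * d_h, d)) / np.sqrt(h * d_h)

full = mha_with_mask(X, Wq, Wk, Wv, Wo, np.ones(h))
for i in range(h):
    mask = np.ones(h); mask[i] = 0.0
    drop = np.linalg.norm(full - mha_with_mask(X, Wq, Wk, Wv, Wo, mask))
    print(f"ablating head {i}: output change {drop:.3f}")
```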

5. Scalability, Efficiency, and Deployment Strategies

One-layer multi-head transformers offer design flexibility, including:

  • Multi-query attention (shared keys/values for all heads) minimizes memory bandwidth and substantially accelerates incremental decoding in autoregressive inference, at minimal quality cost (Shazeer, 2019).
  • HydraViT and similar approaches enable training a universal model with stacked heads and partitioned embeddings, so that any number of heads (and hence capacity/compute footprint) can be dynamically selected at inference time without retraining (Haberer et al., 26 Sep 2024). This strategy achieves high efficiency and adaptability across a spectrum of hardware resources, with improved accuracy–throughput/GMACs trade-offs over baselines; a schematic of the head-stacking idea follows this list.
  • Modularity and resource proportionality: The head stacking, branch-drop, and head-overlap techniques are universal and can be plugged into any one-layer multi-head transformer variant with little architectural disruption (Fan et al., 2020, Yu et al., 2022, Zhang et al., 18 Oct 2024).
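
The following is a schematic of the head-stacking idea only, under the assumption that heads are ordered and the output projection can be sliced to match; HydraViT's actual procedure additionally trains the nested subnetworks jointly so that every prefix of heads yields a usable model.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mha_first_k_heads(X, Wq, Wk, Wv, Wo, k):
    """Run only the first k of h stacked heads; Wo is sliced to match the narrower
    concatenated output, so compute and memory scale with k."""
    h, d, d_h = Wq.shape
    heads = []
    for i in range(k):
        Q, K, V = X @ Wq[i], X @ Wk[i], X @ Wv[i]
        A = softmax(Q @ K.T / np.sqrt(d_h), axis=-1)
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ Wo[: k * d_h]   # (n, d)

rng = np.random.default_rng(3)
n, d, h, d_h = 6, 16, 8, 4
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((h, d, d_h)) / np.sqrt(d) for _ in range(3))
Wo = rng.standard_normal((h * d_h, d)) / np.sqrt(h * d_h)

for k in (2, 4, 8):   # pick the head budget at inference time
    print(k, mha_first_k_heads(X, Wq, Wk, Wv, Wo, k).shape)
```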

6. Limitations, Theoretical Barriers, and Comparative Analysis with Depth

Despite their expressive power, one-layer multi-head transformers have specific theoretical and empirical limitations:

  • There exist algorithmic tasks (notably, the induction-heads task for sequence modeling) for which any one-layer transformer requires size $h \cdot d \cdot p$ growing linearly with the input length $n$, whereas a two-layer construction achieves the same with only $O(1)$ parameters and logarithmic precision (Sanford et al., 26 Aug 2024). This result follows from a communication-complexity reduction, highlighting a fundamental barrier to efficiency and scalability in shallow architectures; a plain-Python specification of the task itself is given after this list.
  • In function evaluation problems with input representations that “split” information across positions in an unaligned manner, concise one-layer transformers cannot succeed (their size $h \cdot d \cdot p$ must grow as $\Omega(n \log n)$), but concise two-layer architectures can perform these tasks (Strobl et al., 28 Mar 2025).
  • The absence of cross-head interaction within a single layer constrains the compositionality of features; while innovations like overlapping heads (Zhang et al., 18 Oct 2024) and multi-branch modules (Fan et al., 2020) alleviate this, stacked depth fundamentally enables more efficient and robust composition.
  • For learning long-range or deeply hierarchical dependencies, adding additional layers is often more parameter- and compute-efficient than expanding width, given otherwise similar model sizes and resource constraints (Liu et al., 2021).
  • The typical failures of one-layer designs on induction, memorization, and complex lookup can serve as guidelines for practical transformer design: shallow, multi-head-wide architectures are ideal for tasks requiring fast, parallel pattern extraction over local features, but depth (even to just two layers) is often indispensable for information propagation and integration across positions or for nonlocal symbolic tasks (Sanford et al., 26 Aug 2024, Strobl et al., 28 Mar 2025).
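
For reference, the induction-heads task from the first bullet can be stated as a few lines of plain Python that compute the target output directly, under its common formulation: at each position, emit the token that followed the most recent earlier occurrence of the current token. A two-layer transformer realizes this with a previous-token head feeding a matching head, and it is precisely this composition that a single layer cannot express concisely.

```python
def induction_heads_task(tokens, null=None):
    """Reference output for the induction-heads task: at each position t, emit the
    token that immediately followed the most recent earlier occurrence of tokens[t],
    or `null` if tokens[t] has not been seen before."""
    last_next = {}                      # token -> token that followed its latest occurrence
    out = []
    for t, tok in enumerate(tokens):
        out.append(last_next.get(tok, null))
        if t + 1 < len(tokens):
            last_next[tok] = tokens[t + 1]
    return out

print(induction_heads_task(list("abcabda")))
# [None, None, None, 'b', 'c', None, 'b']
```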

7. Empirical Performance and Domain-specific Considerations

One-layer multi-head transformers have been deployed successfully, sometimes with state-of-the-art results, across multiple domains:

  • In machine reading comprehension, a one-layer dual co-attention module suffices for multiple-choice reasoning when paired with strong encoders and multi-task learning, outperforming deeper attention stacks (Wan, 2020).
  • In sequence generation and translation, multi-branch (MAT) and multi-query variants consistently improve BLEU scores and decoding speed while maintaining interpretability (Shazeer, 2019, Fan et al., 2020).
  • In speech tasks, custom per-head attention designs (local, convolutional) reduce information loss and enhance generalization (Sant et al., 2022).
  • In vision, head overlap and head stacking have boosted accuracy and efficiency across ImageNet-scale models, enabled resource-adaptive deployment, and demonstrated consistent head-importance ordering (Haberer et al., 26 Sep 2024, Zhang et al., 18 Oct 2024).

These empirical results reinforce theoretical predictions regarding capacity, efficiency, and the crucial role of architectural customization—particularly for shallow transformer models.


Overall, one-layer multi-head transformers constitute a powerful and versatile class of neural architectures. Their capacity, theoretical foundation, and empirical performance depend heavily on the configuration of heads, the nature of input representations, and design enhancements for efficiency and specialization. While fundamentally limited by communication complexity in certain algorithmic tasks, their utility in practical modeling, and their role as the critical foundation for deeper transformer networks, remains both empirically and theoretically well supported.
