
When Do Transformers Outperform Feedforward and Recurrent Networks? A Statistical Perspective (2503.11272v1)

Published 14 Mar 2025 in stat.ML and cs.LG

Abstract: Theoretical efforts to prove advantages of Transformers in comparison with classical architectures such as feedforward and recurrent neural networks have mostly focused on representational power. In this work, we take an alternative perspective and prove that even with infinite compute, feedforward and recurrent networks may suffer from larger sample complexity compared to Transformers, as the latter can adapt to a form of dynamic sparsity. Specifically, we consider a sequence-to-sequence data generating model on sequences of length $N$, in which the output at each position depends only on $q$ relevant tokens with $q \ll N$, and the positions of these tokens are described in the input prompt. We prove that a single-layer Transformer can learn this model if and only if its number of attention heads is at least $q$, in which case it achieves a sample complexity almost independent of $N$, while recurrent networks require $N^{\Omega(1)}$ samples on the same problem. If we simplify this model, recurrent networks may achieve a complexity almost independent of $N$, while feedforward networks still require $N$ samples. Consequently, our proposed sparse retrieval model illustrates a natural hierarchy in sample complexity across these architectures.

Summary

  • The paper rigorously investigates the sample complexity of Transformers, feedforward, and recurrent networks in sequence-to-sequence models with sparse dependencies.
  • The study finds that Transformers with at least $q$ attention heads achieve sample complexity nearly independent of the sequence length $N$, unlike recurrent networks ($N^{\Omega(1)}$ samples) or feedforward networks ($N$ samples, even on a simplified model).
  • This statistical efficiency highlights that attention mechanisms enable Transformers to effectively utilize dynamic sparsity, creating a hierarchy of sample complexity favoring Transformers for long sequences with sparse relevance.

Introduction and Problem Setting

The paper "When Do Transformers Outperform Feedforward and Recurrent Networks? A Statistical Perspective" (2503.11272) rigorously investigates the statistical efficiency of different network architectures under a sequence-to-sequence model. The study considers input sequences of length NN, with outputs depending solely on qq crucial tokens, where q≪Nq \ll N. This setting captures a form of dynamic sparsity inherent in many real-world tasks. In contrast to traditional analyses focusing on representational capacity, the work evaluates architectures from a sample complexity perspective, highlighting where Transformers can achieve statistically efficient learning relative to feedforward and recurrent networks.

Theoretical Results and Sample Complexity Analysis

The key theoretical contribution is the derivation of sample complexity bounds for the considered sparse retrieval model. Notably:

  • Transformer Architecture: A single-layer Transformer is shown to learn the model effectively if and only if the number of attention heads is at least $q$. Under this condition, the sample complexity is nearly independent of the sequence length $N$. This indicates that Transformers can exploit the sparsity in the data by focusing on the relevant tokens, thus avoiding the curse of dimensionality typically associated with long sequences.
  • Recurrent Networks: Recurrent architectures are proven to require at least $N^{\Omega(1)}$ samples under the same model setting, implying a polynomial dependence on the sequence length. This reveals a significant disadvantage in scenarios where $N$ is large, as the sample complexity escalates with increasing sequence length.
  • Feedforward Networks: Even on a simplified variant of the model, on which recurrent networks can achieve a sample complexity almost independent of $N$, feedforward networks still require $N$ samples for effective learning. This contrasts with the nearly $N$-independent complexity achieved by Transformers on the full model, underscoring the benefits of attention-based mechanisms.

These results present a sharp contrast: whereas the Transformer mitigates the sample dependence on $N$ through sufficient head allocation (i.e., at least $q$ heads), both recurrent and feedforward models inherently suffer from growing sample requirements as sequence lengths increase. A minimal architectural sketch of the single-layer, multi-head setup is given below.
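For intuition on what a "single-layer Transformer with at least $q$ heads" looks like in practice, here is a minimal PyTorch sketch. The layer widths, the use of nn.MultiheadAttention, and the linear readout are our own illustrative choices rather than the paper's exact parameterization; the point is simply that the head count is an explicit hyperparameter that must be set to at least $q$.

```python
import torch
import torch.nn as nn

class SingleLayerAttentionRegressor(nn.Module):
    """Illustrative single-layer, multi-head attention model for the sparse
    retrieval task; a sketch, not the paper's exact parameterization."""

    def __init__(self, d_model=64, num_heads=4):   # num_heads should be >= q
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.readout = nn.Linear(d_model, 1)

    def forward(self, x):
        # x: (batch, N, d_model) -- token values and the prompt-specified
        # positions are assumed to be embedded into d_model dimensions upstream.
        h, _ = self.attn(x, x, x)           # self-attention over the sequence
        return self.readout(h).squeeze(-1)  # one scalar prediction per position

model = SingleLayerAttentionRegressor(d_model=64, num_heads=4)  # covers q <= 4 in this toy setup
out = model(torch.randn(8, 16, 64))    # batch of 8 sequences of length N = 16
print(out.shape)                       # torch.Size([8, 16])
```

Training such a model on synthetic data like the generator above is one way to probe empirically how performance degrades once num_heads drops below $q$.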

Architectural Implications

The study emphasizes several architectural implications pertinent to model design:

  • Attention Mechanism as Dynamic Sparsity: By incorporating multiple attention heads, Transformers selectively attend to the subset of $q$ relevant tokens. This dynamic sparsity mechanism is central to their statistical efficiency, a feature absent in both feedforward and recurrent networks.
  • Head-Count Necessity: The condition that the Transformer must have at least $q$ attention heads is not merely an architectural guideline but a fundamental threshold that underpins its ability to generalize with a sample complexity that remains almost invariant with respect to $N$. This threshold refutes any notion that single-head or under-provisioned configurations might suffice in sparse retrieval settings.
  • Model Hierarchy Insights: The paper establishes a natural hierarchy in sample complexity: Transformers (with at least $q$ heads) outperform recurrent networks, which in turn outperform feedforward networks when confronted with tasks involving sparse dependencies within long sequences. The attention computation underpinning this hierarchy is sketched below.
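To connect the head-count condition to the mechanism itself, recall the standard multi-head attention computation (written here in its common textbook form, which may differ from the paper's exact parameterization):

$$\mathrm{head}_h(X) = \mathrm{softmax}\!\left(\frac{X W_Q^{(h)} \left(X W_K^{(h)}\right)^{\top}}{\sqrt{d_h}}\right) X W_V^{(h)}, \qquad h = 1, \dots, H,$$

with the $H$ head outputs concatenated and linearly combined. Intuitively, when the prompt encodes the positions of the $q$ relevant tokens, each head can concentrate its softmax weights on one of those positions, so $H \ge q$ heads can retrieve all of them in a single layer; the paper's "only if" direction shows that fewer than $q$ heads cannot.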

Discussion and Concluding Remarks

The paper provides a comprehensive statistical perspective on when Transformers achieve superior performance relative to traditional architectures. It rigorously demonstrates that the attention mechanism, particularly when sufficiently diversified into at least $q$ heads, allows Transformers to bypass the usual limitations imposed by long sequence lengths. While recurrent networks incur a polynomial-in-$N$ penalty and feedforward networks require a number of samples growing linearly in $N$ even on a simplified variant of the model, Transformers enjoy a sample complexity that is almost independent of the sequence length, capturing the essence of dynamic sparsity.

In summary, the work establishes that Transformers, given an adequate number of attention heads, present a statistically superior alternative to feedforward and recurrent networks in environments characterized by dynamic sparsity, with significant sample complexity improvements that could impact architectural decisions in sequence modeling tasks.
