Analysis of "Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot"
This work investigates the learning capabilities of the transformer architecture, specifically its ability to learn the Sparse Token Selection ($q$-sparse averaging) task. The authors, Zixuan Wang, Stanley Wei, Daniel Hsu, and Jason D. Lee, develop a theoretical and empirical framework showing that transformers can provably learn this sparse token selection task efficiently, while fully-connected networks (FCNs) fail to do so in both worst-case and average-case settings.

### Theoretical Insights

The paper begins by positioning the transformer architecture as a dominant force across a multitude of deep learning tasks, powered by its self-attention mechanism, which lets transformers efficiently capture and exploit structural information embedded in token sequences. The authors extend existing results by analyzing a training procedure in which a one-layer transformer trained with gradient descent (GD) provably converges on the $q$-sparse averaging task. The input consists of tokens drawn from a standard Gaussian distribution together with a size-$q$ subset of positions sampled uniformly at random; the target is the average of the tokens indexed by that subset. They show that a one-layer transformer with a single self-attention layer can both express and learn this task, in contrast with the prohibitively large width required for FCNs merely to approximate it.
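As a concrete illustration of the task, here is a minimal data-generation sketch based only on the description above (i.i.d. standard Gaussian tokens, a uniformly random size-$q$ subset, and the subset average as the target); the function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def sample_sts_instance(seq_len, dim, q, rng):
    """Sample one instance of the sparse token selection (q-sparse averaging) task.

    Tokens are i.i.d. standard Gaussian vectors, the subset S of size q is drawn
    uniformly at random, and the target is the average of the tokens indexed by S.
    """
    X = rng.standard_normal((seq_len, dim))         # tokens x_1, ..., x_T ~ N(0, I_d)
    S = rng.choice(seq_len, size=q, replace=False)  # uniformly random subset, |S| = q
    y = X[S].mean(axis=0)                           # target: (1/q) * sum_{i in S} x_i
    return X, S, y

rng = np.random.default_rng(0)
X, S, y = sample_sts_instance(seq_len=16, dim=8, q=3, rng=rng)
```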
#### Gradient Descent Dynamics
A central theorem in the paper characterizes the behavior of the transformer parameters under GD with stochastic positional encoding. The authors show that, for a given sequence length and sparsity level $q$, with suitable initialization and regularization, the gradient descent updates drive the model parameters to the theoretical optimum. The iterative updates ensure that the transformer recovers the correct sparse averaging behavior, as indicated by the convergence of the attention parameters to specific structured forms.
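To convey the flavor of this training procedure, the following sketch uses several simplifications that are not in the paper: a single trainable attention matrix acting on the positional encodings, a query formed by summing the positional encodings of the target subset, squared loss with plain SGD, and Gaussian encodings resampled at every step as a stand-in for stochastic positional encoding. The paper's parameterization, initialization, and regularization are more involved than this.

```python
import torch

torch.manual_seed(0)
T, d, d_p, q = 16, 8, 64, 3   # sequence length, token dim, positional-encoding dim, sparsity

# Single trainable attention matrix, acting only on the positional-encoding block.
W = torch.zeros(d_p, d_p, requires_grad=True)

def sample_batch(bs, P):
    X = torch.randn(bs, T, d)                                    # i.i.d. Gaussian tokens
    S = torch.stack([torch.randperm(T)[:q] for _ in range(bs)])  # random size-q subsets
    y = X.gather(1, S.unsqueeze(-1).expand(-1, -1, d)).mean(1)   # q-sparse average target
    query = P[S].sum(1)                                          # sum of positional encodings over S
    return X, query, y

def forward(X, query, P):
    scores = query @ W @ P.T                  # attention logits over the T positions
    attn = torch.softmax(scores, dim=-1)
    return torch.einsum("bt,btd->bd", attn, X)

opt = torch.optim.SGD([W], lr=1.0)
for step in range(2000):
    P = torch.randn(T, d_p) / d_p ** 0.5      # fresh near-orthogonal encodings at every step
    X, query, y = sample_batch(64, P)
    loss = ((forward(X, query, P) - y) ** 2).sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, W can be inspected and compared with the structured limit
# predicted by the paper's convergence analysis.
```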
### Empirical Verification
Empirically, the authors validate their theoretical results on synthetic datasets. The experiments demonstrate that transformers trained with random near-orthogonal positional encodings outperform those trained with fixed positional encodings, both in training and, crucially, in out-of-distribution (OOD) generalization, particularly length generalization to longer input sequences.
#### Strong Numerical Findings
Numerical simulations affirm the theoretical claims:
- Convergence to Zero Error: Experiments show near-zero in-distribution training loss for stochastic positional encoding, confirming the theoretical global convergence guarantees.
- Superior Length Generalization: Out-of-distribution length generalization experiments reveal significantly lower validation loss for transformers using stochastic positional encoding compared to those with fixed encoding (a minimal evaluation sketch follows below).
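Why stochastic positional encoding supports this kind of length generalization can be illustrated with a small, self-contained sketch. It assumes, purely for illustration, that the trained attention matrix behaves like a scaled identity on the positional-encoding block (one natural reading of the structured limit described above); the dimensions, scale constant, and evaluation lengths below are illustrative choices rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_p, q, c = 8, 1024, 3, 8.0   # token dim, positional-encoding dim, sparsity, attention scale

def sts_mse(T, n_trials=500):
    """MSE of an idealized attention head (W = c * I) on q-sparse averaging at length T,
    drawing fresh near-orthogonal random positional encodings for every trial."""
    errs = []
    for _ in range(n_trials):
        P = rng.standard_normal((T, d_p)) / np.sqrt(d_p)  # near-orthogonal encodings
        X = rng.standard_normal((T, d))                   # Gaussian tokens
        S = rng.choice(T, size=q, replace=False)          # target subset
        scores = c * (P[S].sum(axis=0) @ P.T)             # query = sum_{j in S} p_j, keys = p_i
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()
        errs.append(np.sum((attn @ X - X[S].mean(axis=0)) ** 2))
    return float(np.mean(errs))

# The same scaled-identity attention works with any fresh set of near-orthogonal
# encodings, so the error stays well below that of the trivial zero predictor
# (expected error d/q ~ 2.7) even at lengths beyond a typical training length.
for T in (16, 64, 256):
    print(T, round(sts_mse(T), 3))
```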
### Implications and Future Directions
#### Practical Implications
These findings have significant implications for practical applications in NLP and other domains where data exhibit sparse structure and efficient learning models are needed:
- Transformer Efficiency: The research shows that transformers require substantially less memory and computation than FCNs for the sparse selection tasks studied here. This efficiency can be leveraged to optimize cutting-edge NLP models, computer vision systems, and reinforcement learning systems.
- Robustness: The paper illustrates that stochastic positional encoding not only provides a robust mechanism for generalization across sequence lengths, but also holds promise for enhancing the extrapolation capabilities of transformers in other sparse data settings.
#### Theoretical Implications and Future Work
The theoretical insights extend the scope of the transformer’s inductive bias, suggesting several future directions:
- Higher-Dimensional Generalization: Future work could explore transformers' capabilities on higher-dimensional or more complex sparse tasks.
- Multi-Layer Transformers: While this paper focuses on one-layer transformers, extending the theoretical framework to multi-layer architectures remains an open and valuable area.
- Deep Dive into Positional Encodings: Further investigations into the properties and potentials of different types of positional encodings could provide richer theoretical insights and practical tools.
- Sample Complexity: Introducing empirical risk minimization into the dynamics studied could lead to more practical learning guarantees, especially regarding sample complexity.
### Conclusion
This paper successfully demonstrates that transformers can provably outperform FCNs in learning specific sparse token selection tasks, both theoretically and empirically, with the significant advantage of being efficiently trainable with gradient descent. By addressing the learnability and expressive power of transformers in sparse data tasks, this work advances our understanding of why transformers have become the cornerstone architecture in contemporary AI systems.