Separations in the Representational Capabilities of Transformers and Recurrent Architectures
The paper presents a rigorous investigation of the representational capabilities of Transformers versus Recurrent Neural Networks (RNNs) on several computational tasks, including index lookup, bounded Dyck languages, string equality, and nearest neighbors. It provides a theoretical basis for understanding the inherent differences between these two widely used architectures in terms of memory requirements and model size.
One of the primary contributions is a set of separation results between the two architectures in terms of model size relative to input length. For the index lookup task, the authors show that a one-layer Transformer of poly-logarithmic size suffices, whereas a recurrent model requires a hidden state whose size grows linearly with the input length. The separation stems from the Transformer's ability to use attention to retrieve an arbitrary token directly, something an RNN cannot do without keeping essentially the entire sequence in its state.
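To make this intuition concrete, here is a minimal numpy sketch of how sharp attention can implement index lookup. It uses one-hot positional keys rather than the paper's poly-logarithmic construction, and the function name and `scale` parameter are illustrative choices, not taken from the paper.

```python
import numpy as np

def attention_index_lookup(tokens, query_index, scale=50.0):
    """Toy sketch: one attention head retrieves tokens[query_index]
    by matching a query against one-hot positional keys."""
    n = len(tokens)
    keys = np.eye(n)                     # positional encodings as one-hot keys
    query = scale * keys[query_index]    # query encodes the requested index;
                                         # a large scale sharpens the softmax
    scores = keys @ query                # dot-product attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax concentrates on query_index
    values = np.array(tokens, dtype=float)
    return weights @ values              # ~ tokens[query_index]

print(attention_index_lookup([3, 7, 1, 9, 4], query_index=3))  # ~9.0
```

A fixed-size recurrent model has no analogous shortcut: because the queried index is not known while the sequence is being read, its hidden state must retain information about every token, which is the intuition behind the linear lower bound.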
Conversely, for bounded Dyck languages, a hallmark of the hierarchical dependencies found in natural language, one-layer Transformers require linear model size, whereas prior work has shown that constant-size RNNs suffice. This inverse separation highlights the RNN's effectiveness on tasks with inherently sequential structure. The lower bounds for both architectures rest on communication complexity arguments: a sufficiently small model would yield a low-communication protocol for a problem known to require more communication, a contradiction.
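The following sketch illustrates why constant state suffices for a recurrent recognizer. It uses Dyck-1 with a single bracket type, a simplification of the bounded Dyck languages studied in the paper, and the function name and depth bound are illustrative: a counter capped at the depth bound is all the memory required, independent of input length.

```python
def is_bounded_dyck1(s, max_depth=3):
    """Minimal sketch (not the paper's RNN construction): recognize
    Dyck-1 strings whose nesting depth never exceeds max_depth."""
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
            if depth > max_depth:    # exceeds the allowed nesting bound
                return False
        elif ch == ')':
            depth -= 1
            if depth < 0:            # unmatched closing bracket
                return False
        else:
            return False             # invalid symbol
    return depth == 0                # every bracket matched

print(is_bounded_dyck1("(()(()))"))  # True: depth stays within 3
print(is_bounded_dyck1("((((", 3))   # False: depth bound exceeded
```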
The investigation extends to Boolean functions and associative recall. While one-layer Transformers and RNNs require linear size to represent Boolean functions such as Equality or Disjointness, the paper shows that two-layer Transformers escape these limitations: small two-layer Transformers can represent both the nearest neighbor task and broad classes of Boolean functions, indicating that even one additional layer of attention provides expressive power that shallow architectures lack.
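The retrieval step at the heart of nearest neighbor can be illustrated with a single sharp attention head. This is only a sketch of the idea, not the paper's two-layer construction; the function name, temperature parameter, and inner-product similarity are assumptions made for illustration.

```python
import numpy as np

def soft_nearest_neighbor(query, keys, labels, temp=0.02):
    """Sketch: sharp softmax attention over key-query inner products
    approximates nearest-neighbor label retrieval as temp -> 0."""
    scores = keys @ query / temp             # scaled similarity scores
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ labels                  # ~ label of the closest key

keys = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
labels = np.array([0.0, 1.0, 2.0])
print(soft_nearest_neighbor(np.array([0.9, 0.1]), keys, labels))  # ~0.0
```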
The experiments corroborate the theoretical predictions, particularly on the index lookup task, where one-layer Transformers significantly outperform RNNs of comparable or larger size. This alignment between empirical results and theory reinforces the conclusions about the expressive capacities of the two architectures.
In summary, the paper offers a clear theoretical framework for the differing representational efficiencies of Transformers and RNNs in finite-precision settings. By drawing on communication complexity, the authors turn intuitive architectural differences into formal separation results. A natural next step is a deeper study of multi-layer Transformers, dissecting how depth and attention combine to surpass recurrent models. The findings also point to open questions in the scaling and design of large language models for NLP and other applications.