- The paper demonstrates that two-layer Transformers can copy sequences exponentially longer than their size, whereas GSSMs are fundamentally limited by their fixed-size latent state.
- The study employs rigorous theoretical analysis and experiments with 160M-parameter models to compare architectural trade-offs in memory and generalization.
- Evaluations of pretrained LLMs show that Transformers outperform GSSMs at copying and retrieving information from context, even when the GSSMs achieve similar or lower perplexity.
Introduction
Transformers have set strong performance benchmarks across sequence-modeling tasks. A growing line of work proposes Generalized State Space Models (GSSMs) as an alternative promising gains in inference-time efficiency. This paper rigorously examines whether those promises hold when GSSMs are compared with Transformers on tasks that require making full use of the input context.
Theoretical Analysis
A central part of the paper is a theoretical analysis of string copying, a simple task that serves as a litmus test for a model's ability to use its context. The authors give a constructive proof that a two-layer Transformer can copy sequences exponentially longer than its size by storing and retrieving n-grams of the input. GSSMs, in contrast, cannot: copying a sequence requires retaining roughly as much information as the sequence itself contains, so any model whose latent state has a fixed size must fail once the sequence length exceeds what that state can encode.
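As a minimal illustration of this storage-and-retrieval idea (a sketch loosely based on the mechanism described above, not the paper's formal construction), the Python snippet below copies a sequence via hash-table lookups over n-grams: each n-gram in the prefix is mapped to the token that follows it, and the copy is regenerated by repeated lookups. The function name, the choice of n, and the example string are illustrative; the scheme only succeeds when n is large enough that every n-gram in the sequence is unique, the kind of uniqueness condition such an argument needs.

```python
# Minimal sketch of hash-based n-gram copying (illustrative, not the
# paper's formal construction). "Storage": map each n-gram in the prefix
# to the token that follows it. "Retrieval": regenerate the sequence by
# repeatedly looking up the last n emitted tokens.

def ngram_copy(prefix, n):
    table = {}
    for i in range(len(prefix) - n):
        table[tuple(prefix[i:i + n])] = prefix[i + n]

    out = list(prefix[:n])          # seed the copy with the first n tokens
    while len(out) < len(prefix):
        out.append(table[tuple(out[-n:])])
    return out


if __name__ == "__main__":
    seq = list("the quick brown fox jumps over the lazy dog")
    # n = 5 is large enough that every 5-gram in this string is unique,
    # so the lookup never collides and the copy is exact.
    assert ngram_copy(seq, n=5) == seq
```

The key point the sketch makes is that exact copying needs only a content-addressable lookup over the context, which attention can implement, rather than compressing the whole sequence into a state of fixed size.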
Empirical Validation
Complementing the theory, the authors run experiments on models with approximately 160 million parameters. The results favor Transformers, which are both more training-efficient and better at generalizing to longer inputs on synthetic tasks that require copying the context. The experiments also point to the storage-and-retrieval mechanism Transformers use to copy, matching the theoretical construction.
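To make this kind of synthetic evaluation concrete, here is a hedged sketch of how a copy benchmark along these lines can be run. Everything in it is illustrative rather than the authors' code: `generate_fn` stands in for whatever model interface is being evaluated, and the alphabet, prompt format, and sample counts are assumptions.

```python
# Hedged sketch of a synthetic copy benchmark: sample random strings,
# prompt the model to repeat them, and score exact-match accuracy per
# length, including lengths beyond a hypothetical training cap to probe
# length generalization. Names and formats here are placeholders.

import random

ALPHABET = list("abcdefghijklmnopqrstuvwxyz")

def make_copy_example(length, rng):
    """Random string plus a prompt asking the model to repeat it."""
    seq = "".join(rng.choice(ALPHABET) for _ in range(length))
    return f"{seq} <COPY> ", seq

def copy_accuracy(generate_fn, lengths, n_samples=50, seed=0):
    """Exact-match copy accuracy per sequence length.

    `generate_fn(prompt, max_new_tokens)` is assumed to wrap the model
    under test and return its completion as a string.
    """
    rng = random.Random(seed)
    results = {}
    for length in lengths:
        correct = 0
        for _ in range(n_samples):
            prompt, target = make_copy_example(length, rng)
            completion = generate_fn(prompt, max_new_tokens=length)
            correct += completion[:length] == target
        results[length] = correct / n_samples
    return results

# Example usage with a hypothetical model wrapper, testing lengths
# longer than those seen in training:
# acc = copy_accuracy(my_model_generate, lengths=[20, 50, 100, 200])
```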
Performance on Pre-trained Models
Extending the investigation to pretrained LLMs, the paper evaluates copying and information retrieval in large-scale models. Despite achieving similar or lower perplexity, GSSMs consistently lag behind Transformers on tasks that require extensive access to the context, such as copying long passages and looking up facts stated earlier in the prompt. The gap underscores that architectural choices shape model capabilities in ways that training perplexity alone does not capture.
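The paper's retrieval probes include a phone-book style lookup over a list given in context. The sketch below is a hypothetical reconstruction of such a probe, not the authors' harness: `generate_fn`, the name and number formats, and the prompt wording are all assumptions.

```python
# Hedged sketch of a phone-book retrieval probe: place a synthetic list
# of name/number pairs in the prompt, query one entry, and check whether
# the model's completion contains the correct number.

import random

def phone_book_prompt(n_entries, rng):
    """Build a synthetic phone book plus a query for one random entry."""
    names = [f"Person{i}" for i in range(n_entries)]
    numbers = [f"{rng.randrange(10**6):06d}" for _ in range(n_entries)]
    book = "\n".join(f"{name}: {num}" for name, num in zip(names, numbers))
    idx = rng.randrange(n_entries)
    prompt = f"{book}\n\nWhat is {names[idx]}'s number?\nAnswer:"
    return prompt, numbers[idx]

def lookup_accuracy(generate_fn, sizes, n_samples=20, seed=0):
    """Fraction of queries answered with the correct number, per book size."""
    rng = random.Random(seed)
    results = {}
    for size in sizes:
        hits = 0
        for _ in range(n_samples):
            prompt, answer = phone_book_prompt(size, rng)
            hits += answer in generate_fn(prompt, max_new_tokens=12)
        results[size] = hits / n_samples
    return results
```

Growing the book size in such a probe stresses exactly the capability at issue: a model with a fixed-size state must eventually drop entries, while an attention-based model can keep looking them up directly in the context.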
Conclusions
The paper presents theoretical and empirical evidence that Transformers outperform GSSMs on tasks requiring close interaction with the input context. While GSSMs offer better computational scaling with sequence length, they fall short on memorization and retrieval from context, capabilities at which Transformers excel. By separating out these capabilities, the paper sharpens the understanding of the trade-offs between the two families of sequence-modeling architectures.