Repeat After Me: Transformers are Better than State Space Models at Copying (2402.01032v2)

Published 1 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on the sequence length, which we refer to as "generalized state space models" (GSSMs). In this paper we show that while GSSMs are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context. We start with a theoretical analysis of the simple task of string copying and prove that a two layer transformer can copy strings of exponential length while GSSMs are fundamentally limited by their fixed-size latent state. Empirically, we find that transformers outperform GSSMs in terms of efficiency and generalization on synthetic tasks that require copying the context. Finally, we evaluate pretrained LLMs and find that transformer models dramatically outperform state space models at copying and retrieving information from context. Taken together, these results suggest a fundamental gap between transformers and GSSMs on tasks of practical interest.

Citations (48)

Summary

  • The paper demonstrates that two-layer Transformers can copy exponentially longer sequences than their size, outperforming GSSMs limited by fixed latent states.
  • The study employs rigorous theoretical analysis and experiments with 160M-parameter models to compare architectural trade-offs in memory and generalization.
  • Empirical results on pretrained LLMs confirm Transformers' superior ability at copying and retrieving information from context, even when GSSMs achieve similar or lower perplexity.

Introduction

Within the domain of sequence modeling, Transformers have set remarkable performance benchmarks on numerous tasks. A stream of research looking to innovate beyond Transformers has introduced Generalized State Space Models (GSSMs), which maintain a fixed-size latent state and thus promise gains in inference-time efficiency. This paper rigorously examines how GSSMs compare to Transformers on tasks that require copying and retrieving information from the input context.

Theoretical Analysis

A central part of the paper is a theoretical analysis of string copying, a simple paradigmatic task that serves as a litmus test for model capabilities. The authors constructively prove that a two-layer Transformer can copy sequences exponentially longer than its size, capitalizing on its ability to store and retrieve information from the context. Conversely, GSSMs fundamentally lack this capacity: because their latent state has a fixed size independent of sequence length, it cannot encode enough information to reproduce arbitrarily long inputs.
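
The flavor of this storage-and-retrieval argument can be conveyed with a small, non-neural sketch: to copy a string, it suffices to store every n-gram of the input and, at each step, look up the most recently emitted n-gram and output the token that followed it. The Python function below is an illustrative assumption of that idea, not the paper's transformer construction, which realizes the lookup with attention.

```python
# Illustrative, non-neural sketch of copying via n-gram storage and retrieval.
# Not the paper's construction: the paper implements the lookup with attention
# in a two-layer Transformer.
import random


def copy_by_ngram_lookup(source, n=5):
    """Reproduce `source`, assuming all of its n-grams are distinct."""
    # "Storage": map each n-gram to the index of the token that follows it.
    follower = {tuple(source[i:i + n]): i + n for i in range(len(source) - n)}

    output = list(source[:n])            # seed with the first n tokens
    while len(output) < len(source):
        key = tuple(output[-n:])         # "retrieval": look up the last n-gram
        output.append(source[follower[key]])
    return output


if __name__ == "__main__":
    def random_string(length=200, vocab=26):
        return [random.randrange(vocab) for _ in range(length)]

    s = random_string()
    # Resample in the unlikely event of a repeated 5-gram, which this
    # simple lookup does not handle.
    while len({tuple(s[i:i + 5]) for i in range(len(s) - 5)}) != len(s) - 5:
        s = random_string()
    assert copy_by_ngram_lookup(s) == s
```

The contrast with GSSMs is information-theoretic: a latent state of b bits can distinguish at most 2^b prefixes, so reliably copying arbitrary length-L strings over a vocabulary of size v requires on the order of L·log2(v) bits of state, which a fixed-size recurrence cannot supply once L grows large.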

Empirical Validation

Building on the theoretical insights, the authors conduct empirical studies with models of roughly 160 million parameters. The outcomes clearly favor Transformers, which learn synthetic copying tasks from fewer training examples and generalize more robustly to sequences longer than those seen during training. These experiments also surface the storage-and-retrieval mechanism employed by Transformers, in line with the authors' theoretical construction.
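
As a concrete reference point, the copying setup can be pictured as follows; the sketch below is a minimal illustration whose vocabulary, special tokens, and masking convention are assumptions rather than the paper's exact configuration: the model sees a random string followed by a separator and is trained with a next-token loss applied only to the copied suffix.

```python
# Minimal sketch of a string-copying training example. Vocabulary, special
# tokens, and the masking convention are illustrative assumptions.
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz")
BOS, SEP = "<bos>", "<sep>"


def make_copy_example(length):
    """Return (inputs, targets) for next-token training on the copy task."""
    s = random.choices(VOCAB, k=length)
    full = [BOS] + s + [SEP] + s           # <bos> x1..xL <sep> x1..xL
    inputs = full[:-1]                      # standard next-token shift
    targets = [None] * (length + 1) + s     # loss only on the copied suffix
    return inputs, targets


if __name__ == "__main__":
    x, y = make_copy_example(6)
    print(x)
    print(y)
```

Masking the prefix keeps the training signal focused on copying ability rather than on predicting the random tokens themselves, which no model could do.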

Performance on Pre-trained Models

Extending the investigation to pretrained LLMs, the paper evaluates the copying and information-retrieval abilities of large-scale models. Despite achieving similar or even lower perplexity, GSSM-based models consistently lag behind Transformers on tasks that require extensive access to the context. This performance gap underscores how architectural choices shape LLM capabilities in ways that training perplexity alone does not capture.
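
A retrieval probe of this kind can be quite simple; the sketch below is a hedged illustration rather than the paper's evaluation harness. It assumes only a generic `generate(prompt)` callable and measures exact-match accuracy on lookups from a synthetic directory placed in the context.

```python
# Hedged sketch of a context-retrieval probe. The directory format, prompt
# wording, and the `generate` callable are assumptions, not the paper's setup.
import random


def lookup_probe(generate, n_entries=50, n_queries=20, seed=0):
    """Place a synthetic name->number directory in the context and measure
    exact-match accuracy of the retrieved numbers."""
    rng = random.Random(seed)
    numbers = {f"Person{i}": str(rng.randrange(10**9, 10**10))
               for i in range(n_entries)}
    directory = "\n".join(f"{name}: {num}" for name, num in numbers.items())

    correct = 0
    for name in rng.sample(list(numbers), n_queries):
        prompt = (f"{directory}\n\n"
                  f"What is {name}'s number? Answer with the number only:")
        correct += generate(prompt).strip() == numbers[name]
    return correct / n_queries
```

Probes like this separate in-context retrieval from perplexity: two models with comparable perplexity can differ sharply in accuracy here, which is precisely the gap the paper reports between Transformers and GSSMs.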

Conclusions

The authors present compelling theoretical and empirical evidence that Transformers outperform GSSMs on tasks requiring intricate interaction with the input context. While GSSMs offer better computational efficiency as sequence length grows, they fall short on memorization and retrieval, capabilities at which Transformers excel. By delineating these differences, the paper contributes to a more nuanced understanding of the trade-offs among sequence modeling architectures.
