Information Over-squashing in Language Tasks: Analytical Insights into Decoder-only Transformers
This paper presents a detailed analysis of how information propagates within decoder-only Transformers, the fundamental architecture behind most LLMs. The core investigation centers on representational collapse, a phenomenon where distinct input sequences yield nearly identical final token representations, an effect further exacerbated by the low-precision floating-point formats commonly used when running LLMs.
Key Findings
- Representational Collapse: The authors demonstrate that sequences differing only in their last token can produce nearly identical representations at the final layer. This convergence poses significant challenges for tasks that require distinct responses, such as counting or copying sequences. Using signal-propagation arguments, the analysis establishes that the total variation between the softmax distributions induced by such sequences tends to zero as sequence length increases.
- Over-squashing: The paper draws parallels between over-squashing in graph neural networks (GNNs) and the loss of sensitivity to specific input tokens in Transformers. The unidirectional causal mask in decoder-only models limits the paths through which information can flow, thereby leading to over-squashing, where information is compressed or lost as it passes through the network's layers (see the path-counting sketch after this list).
- Empirical Evidence: Experiments on state-of-the-art LLMs, such as the Gemini and Gemma models, reveal persistent failures on simple tasks like counting and copying. The error patterns substantiate the theoretical claims, showing that the models struggle most on long sequences of repeated tokens, consistent with representational collapse.
- Impact of Floating-Point Precision: Lower-precision arithmetic amplifies these issues, imposing practical constraints when LLMs are deployed in massively parallel environments. Quantization, commonly used to speed up inference, further erodes the model's ability to distinguish long sequences, underscoring limitations inherent in current architectural formulations (a numerical sketch of this effect follows the list).
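As a rough numerical sketch of the first and last points above (not the paper's exact construction), the toy script below pools token values by uniform averaging, a crude stand-in for attention whose weights are nearly uniform over a long run of identical tokens, and checks whether two sequences that differ only in their final token remain distinguishable once the pooled value is stored in a given floating-point format. The function name and the specific values are illustrative assumptions.

```python
import numpy as np

def pooled_repr(token_values, dtype):
    # Uniform average of token values, stored in the target precision.
    # This stands in for attention with near-uniform weights over a long
    # repeated prefix; it is not the paper's exact construction.
    return dtype(np.mean(token_values))

for n in (10, 100, 1_000, 10_000):
    a = [1.0] * n + [0.0]    # n repeated tokens followed by a distinct final token
    b = [1.0] * (n + 1)      # same prefix, but the final token also repeats
    collapsed_fp16 = pooled_repr(a, np.float16) == pooled_repr(b, np.float16)
    collapsed_fp32 = pooled_repr(a, np.float32) == pooled_repr(b, np.float32)
    print(f"n={n:>6}  indistinguishable in fp16: {collapsed_fp16}, in fp32: {collapsed_fp32}")
```

For long enough prefixes the two pooled values become bit-identical in float16 while float32 still separates them, illustrating how lower precision brings the collapse forward to practically relevant sequence lengths.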
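To make the topological intuition behind the over-squashing point concrete, here is a small path-counting sketch (a toy model under simplifying assumptions, not the paper's formal sensitivity bound): in a stack of causally masked attention layers, position k at each layer can draw information from any position j ≤ k, and we count how many distinct layer-by-layer paths connect each input position to the final position.

```python
def num_paths_to_last(i, n, num_layers):
    # Count directed paths from input position i to the final position n
    # through `num_layers` causally masked attention layers, where position k
    # at each layer may aggregate from any position j <= k at the layer below.
    paths = [1 if p == i else 0 for p in range(n + 1)]  # paths at the input layer
    for _ in range(num_layers):
        running, nxt = 0, []
        for p in range(n + 1):
            running += paths[p]   # sum over all admissible sources j <= p
            nxt.append(running)
        paths = nxt
    return paths[n]

n, num_layers = 16, 4
for i in (0, 4, 8, 12, 16):
    print(f"position {i:>2}: {num_paths_to_last(i, n, num_layers)} paths to the final token")
```

Under this toy model, earlier positions reach the final representation through far more paths than later ones, which is one way to see how the causal topology biases what the final token's representation can retain.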
Implications and Solutions
The findings have important theoretical and practical implications. Theoretically, they establish limits on the expressive capabilities of decoder-only Transformers under realistic constraints, such as finite precision and causal masking. Practically, they underscore the need for architectural innovations that address representational collapse and over-squashing in LLMs.
To mitigate these problems, the paper suggests introducing additional tokens to break up repetitive patterns. While this is a straightforward procedural fix (a minimal sketch is shown below), broader architectural adaptations, such as adjustments to the attention mechanism, may further improve model robustness against these phenomena.
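A minimal sketch of this kind of token-insertion preprocessing, with the separator and spacing chosen purely for illustration (they are assumptions, not values prescribed by the paper):

```python
def break_repetition(tokens, sep=",", every=8):
    # Insert a separator after every `every` tokens so that long runs of
    # identical tokens no longer share arbitrarily long identical prefixes.
    # Both `sep` and `every` are illustrative choices, not prescribed values.
    out = []
    for idx, tok in enumerate(tokens, start=1):
        out.append(tok)
        if idx % every == 0:
            out.append(sep)
    return out

print("".join(break_repetition(["1"] * 20)))
# -> 11111111,11111111,1111
```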
Future Directions
The paper paves the way for further inquiries into information propagation in neural architectures. Future research might explore:
- Advanced positional encodings and their effects on representational robustness.
- Innovative attention mechanisms that minimize over-squashing without sacrificing computational efficiency.
- Larger empirical studies connecting theoretical insights with practical Transformer deployments in diverse applications.
By providing a formal account of information dynamics within Transformers, this paper makes a significant contribution to the foundational understanding of LLM design and points future research toward improving the efficacy and reliability of these widely used models.