- The paper reveals that large language models achieve position generalization through a learned disentanglement of positional and semantic components within their attention mechanisms.
- Theoretical analysis supports this disentanglement by approximating attention logits with a ternary linear model, a fit that tracks the observed logits with a linear correlation of 0.959.
- Experimental validation demonstrates LLMs' robustness to position and feature transpositions, showing that moderate positional perturbations have only a minimal impact on perplexity and downstream task performance.
An Examination of Position Generalization in LLMs
The phenomenon of position generalization in LLMs parallels a human linguistic capability: both systems tolerate positional perturbations in text. Position generalization refers to the ability of LLMs to understand and respond appropriately to sequences whose positional configurations differ from those encountered during training. The paper "Computation Mechanism Behind LLM Position Generalization" offers a computational perspective on this phenomenon, highlighting the disentanglement of positional and semantic components within the models' attention mechanisms.
Positional Tolerance and Attention Disentanglement
The paper begins by examining the capacity of LLMs to handle text whose word positions and lengths differ from anything seen during training. This capacity is studied through the lens of the self-attention mechanism, the core component of Transformer-based models that governs how information is combined across tokens. The authors reveal an intriguing property of modern LLMs: despite the sophisticated design of self-attention, LLMs learn a disentanglement of attention logits, which can be linearly decomposed into two components corresponding to positional and semantic relevance. Remarkably, this linear decomposition achieves a linear correlation coefficient of 0.959. This points to a counterintuitively simple structure in the models' attention computations: the semantics of the content and its position exert their influence largely independently.
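In notation, one way to write such a two-part decomposition (the symbols here are illustrative rather than the paper's own) is

$$
\text{logit}_{ij} \;\approx\; \alpha\, s(\mathbf{q}_i, \mathbf{k}_j) \;+\; \beta\, p(i - j) \;+\; \gamma,
$$

where $s(\cdot,\cdot)$ depends only on the content of the query and key, $p(\cdot)$ depends only on their relative position, and $\alpha$, $\beta$, $\gamma$ are fitted scalars. On this reading, the 0.959 figure reports how closely such a linear form tracks the actual attention logits.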
Theoretical Foundations
To explain this disentanglement, the authors provide theoretical evidence rooted in patterns observed in the intermediate representations of LLMs. These patterns diverge from what one would predict under random initialization of model parameters, suggesting a learned behavior rather than a structural artifact of the architecture. The theoretical grounding involves approximating the attention logits with a ternary linear model that captures the interaction among the query, key, and positional elements of the input sequence.
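A minimal sketch of how such a fit could be measured is shown below, assuming a toy rotary-embedding head and a simple relative-distance term; this is not the paper's code, and on random vectors the correlation will naturally be far from the reported 0.959, which the paper obtains on trained LLM activations.

```python
# Illustrative sketch of fitting a ternary linear model to one head's
# attention logits. The RoPE toy setup and the exact positional term are
# assumptions for demonstration, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
n, d = 128, 64
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))

def rope(x, base=10000.0):
    """Minimal rotary position embedding for a toy attention head."""
    half = x.shape[1] // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.arange(x.shape[0])[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

logits = rope(Q) @ rope(K).T / np.sqrt(d)        # position-aware logits

i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
causal = i >= j
sem = (Q @ K.T)[causal]                          # content-content (semantic) term
pos = -np.abs(i - j)[causal].astype(float)       # relative-distance (positional) term
y = logits[causal]

X = np.stack([sem, pos, np.ones_like(sem)], axis=1)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # alpha, beta, gamma
r = np.corrcoef(X @ coef, y)[0, 1]
print(f"Pearson r between linear fit and true logits: {r:.3f}")
```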
Practical Implications and Experimental Validation
Experimentally, the paper demonstrates this tolerance through tests involving text and feature transposition as well as manipulations of the positional encodings inside LLMs. The findings show that LLMs, like humans, can withstand moderate permutations of word order, reflected in minimal impact on model perplexity and downstream task performance. This robustness to order perturbation further supports the proposed disentanglement of position and semantics. The paper reports that even when transposing or altering nearly 5% of the input positional data, downstream task performance declines only marginally.
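A hedged sketch of one such perturbation test follows, using the Hugging Face transformers API to shuffle roughly 5% of a sequence's position ids and compare perplexity before and after; the checkpoint name and the swap scheme are placeholders rather than the paper's protocol.

```python
# Sketch of a position-perturbation test (placeholder protocol, not the
# paper's): shuffle ~5% of position ids among themselves and compare the
# model's perplexity with and without the perturbation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"   # placeholder; any causal LM that accepts position_ids works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

text = "The quick brown fox jumps over the lazy dog. " * 20
ids = tok(text, return_tensors="pt").input_ids

def perplexity(input_ids, position_ids=None):
    with torch.no_grad():
        out = model(input_ids, position_ids=position_ids, labels=input_ids)
    return torch.exp(out.loss).item()

n = ids.shape[1]
pos = torch.arange(n).unsqueeze(0)
chosen = torch.randperm(n)[: max(2, int(0.05 * n))]           # ~5% of positions
pos[0, chosen] = pos[0, chosen[torch.randperm(len(chosen))]]  # shuffle them among themselves

print("baseline perplexity:    ", perplexity(ids))
print("perturbed-position ppl: ", perplexity(ids, position_ids=pos))
```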
Length Generalization in LLMs
Beyond positionally perturbed text, the paper also addresses how LLMs achieve length generalization, handling sequences far longer than anything observed during training. This is linked to techniques such as LM-Infinite and InfLLM, which modify how relative positions are processed within the self-attention mechanism. The insight that attended features act as a pool of semantic entities separate from positional information explains why models can stretch beyond their training lengths: as long as the positional component stays within the range the model has learned, the semantic component remains usable and outputs do not drift from the learned distribution.
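As a rough illustration of the kind of relative-position adjustment such methods rely on (a simplified sketch under assumed behavior, not LM-Infinite's or InfLLM's actual implementation), one can cap the relative distances fed to the positional encoding at the largest distance seen during training, so far-away keys still contribute semantically without pushing the positional component out of distribution.

```python
# Simplified sketch of distance bounding for length generalization
# (assumed behavior for illustration, not the published implementations):
# relative distances beyond the training window are clamped to that window.
import numpy as np

def effective_distances(seq_len: int, train_window: int) -> np.ndarray:
    """Causal relative-distance matrix with a ceiling at train_window."""
    i, j = np.meshgrid(np.arange(seq_len), np.arange(seq_len), indexing="ij")
    dist = i - j                           # how far each key lies in the past
    return np.minimum(dist, train_window)  # cap out-of-range distances

print(effective_distances(seq_len=8, train_window=4))
# Entries above 4 are clamped to 4, so the positional term stays within
# the range the model saw during training even for longer sequences.
```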
Theoretical and Practical Contributions
The implications of this research are twofold. Theoretically, it offers a more nuanced understanding of how semantic meaning and text structure interact within LLMs. Practically, it lays groundwork for making LLMs more robust to text variability in real-world applications and for designing models that process extended contexts more efficiently and effectively.
The paper is among the first to delineate LLM position generalization from a computational angle, prompting a deeper look into the architectural and training choices that afford such flexibility. Future work could refine these insights to optimize LLM architectures for even broader applications in artificial intelligence.