- The paper introduces a theoretical framework showing that length generalization arises when each predicted token depends on only a small, fixed number of previous tokens, formalized via k-sparse planted correlations.
- It analyzes an idealized sparse functional attention model and introduces predictive position coupling (PPC), which relaxes locality assumptions and improves generalization on both synthetic and natural language tasks.
- Empirical results on tasks such as sparse parity show near-perfect length generalization when the training sparsity is small, underscoring the critical role of sparsity in transformer length generalization.
This paper introduces a theoretical framework to study length generalization in decoder-only transformers for next-token prediction. The key idea is that length generalization occurs when each predicted token depends on a small, fixed number of previous tokens, formalized as k-sparse planted correlation distributions. The authors demonstrate that an idealized transformer model with generalizing attention heads can successfully length-generalize on such tasks and provide theoretical justifications for techniques like position coupling.
Key Concepts and Definitions
The paper introduces several key concepts to formalize the problem of length generalization:
- k-sparse planted correlations: A class of data distributions where each token depends on a small number k of previous tokens. This captures the intuition that many tasks have a sparse dependency structure.
- Sparse functional attention: A class of models generalizing attention heads that attend to subsets of k tokens. This is an idealized model of transformers.
- Length generalization: The ability of a model trained on sequences of length ≤ L to accurately predict tokens in sequences of longer length L̄ > L (formalized below).
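To make this criterion concrete, one way to state it formally is sketched below; the exact loss and quantifiers are those used in the paper, and ε, ε′ are illustrative placeholders:

```latex
% Length generalization for a predictor f on the ensemble (P_\ell)_{\ell \in \mathbb{N}}:
% small error on all training lengths \ell \le L should imply small error at a
% longer test length \bar{L} > L.
\text{if}\quad \mathbb{E}_{(X,Y)\sim P_\ell}\!\left[\mathbf{1}\{f(X)\neq Y\}\right] \le \varepsilon
\quad \text{for all } \ell \le L,
\qquad \text{then}\quad
\mathbb{E}_{(X,Y)\sim P_{\bar{L}}}\!\left[\mathbf{1}\{f(X)\neq Y\}\right] \le \varepsilon'
\quad \text{for } \bar{L} > L.
```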
The formal definition of k-sparse planted correlations is given as follows:
Definition. Fix a positive integer $k \in \mathbb{N}$. We say that a distribution ensemble $P = (P_\ell)_{\ell \in \mathbb{N}}$ has $k$-sparse planted correlations if there are distributions $\mu \in \Delta(V)$, $Q_{\mathrm{pos}}^{\ell} \in \Delta(\mathrm{Sets}([\ell], k))$ for $\ell \in \mathbb{N}$, $Q_{\mathrm{voc}} \in \Delta(V^k)$, and a function $g^\ast : V^k \to Y$ such that the following holds. For each $\ell \in \mathbb{N}$, a sample $(X, Y) \sim P_\ell$ may be drawn as follows: first, draw $S^\ast \sim Q_{\mathrm{pos}}^{\ell}$ and $Z \sim Q_{\mathrm{voc}}$, and set
$$X_{S^\ast} = Z, \qquad X_i \sim \mu \ \ \forall\, i \notin S^\ast, \qquad Y = g^\ast(Z).$$
This definition captures the core idea that only a small subset of k tokens is relevant for predicting the next token.
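As a concrete illustration, here is a minimal sampler for a k-sparse planted correlation distribution; the binary vocabulary, the uniform choices for μ, Q_pos, and Q_voc, and the parity-style g* are illustrative assumptions, not the paper's specific instantiations:

```python
import random

VOCAB = [0, 1]          # illustrative token vocabulary V
K = 3                   # sparsity parameter k

def g_star(z):
    """Illustrative target function g*: parity of the k planted tokens."""
    return sum(z) % 2

def sample(ell, k=K):
    """Draw (X, Y) ~ P_ell for a k-sparse planted correlation distribution.

    Assumes Q_pos^ell is uniform over k-subsets of [ell] and that mu, Q_voc are
    uniform over the vocabulary; the paper allows arbitrary such distributions.
    """
    s_star = sorted(random.sample(range(ell), k))      # S* ~ Q_pos^ell
    z = [random.choice(VOCAB) for _ in range(k)]       # Z ~ Q_voc
    x = [random.choice(VOCAB) for _ in range(ell)]     # X_i ~ mu for i not in S*
    for pos, tok in zip(s_star, z):
        x[pos] = tok                                   # X_{S*} = Z
    return x, g_star(z)                                # Y = g*(Z)

# Longer sequences only add irrelevant tokens; the planted dependency stays k-sparse.
x, y = sample(ell=20)
```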
Theoretical Results
The paper presents two main theoretical results:
- Provable length generalization: Under certain assumptions, a sparse functional attention class can achieve length generalization with respect to a distribution ensemble with sparse planted correlations.
- Position coupling: A theoretical abstraction of position coupling can remove the locality requirement, providing a justification for this technique.
Figure 1: Parity with scratchpad and predictive position coupling (PPC).
The first result (the length-extrapolation theorem in the paper) relies on two key assumptions:
- Locality: The attention mechanism only attends to tokens within a local window of the current position (a mask sketch follows this list).
- Bounded coverage: The distributions over position embeddings have bounded coverage, so that the positional configurations arising at longer test lengths are adequately represented during training.
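For intuition about what the locality assumption corresponds to in practice, here is a minimal sketch of a windowed causal attention mask in PyTorch; the window size `w` and the masking convention are illustrative assumptions, not the paper's exact construction:

```python
import torch

def local_causal_mask(seq_len: int, w: int) -> torch.Tensor:
    """Boolean mask where position i may attend only to positions j with i - w < j <= i.

    This is one common way to realize a 'local' attention head: each prediction
    depends only on a bounded window of recent tokens, independent of seq_len.
    """
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - w)

# The same window size applies at any sequence length, which is what makes the
# mechanism length-independent.
mask = local_causal_mask(seq_len=12, w=4)
```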
The second result shows that position coupling can relax the locality assumption, which is often violated in practice.
Experimental Validation
The theoretical results are supported by experiments on synthetic tasks and natural language data. The synthetic tasks, such as sparse parity, are designed to control the sparsity of the dependency structure, and the results show that length generalization improves as the sparsity parameter k decreases, i.e., when each prediction depends on fewer tokens. For natural language data, the paper provides evidence that length-generalizing transformers make accurate predictions using only a small number of past tokens.
The sparse parity task involves predicting the parity of k bits within a sequence of length 2ℓ. The results demonstrate that when the training sparsity k_train is small enough, the model exhibits near-perfect length generalization up to lengths of 500. However, performance deteriorates rapidly for test sparsity values k_test > k_train.
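As a rough illustration of how the task's length and sparsity can be varied independently, here is a minimal data-generation sketch; the function name and the exact encoding (markers, scratchpad tokens, the 2ℓ layout) are assumptions and may differ from the paper's setup:

```python
import random

def sparse_parity_example(ell: int, k: int, seed=None):
    """Generate one sparse-parity example of length ell: the label is the parity of
    k randomly chosen bits. This only shows how length ell and sparsity k are
    controlled independently; the paper's token-level encoding may differ."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(ell)]
    relevant = sorted(rng.sample(range(ell), k))      # the k planted positions
    label = sum(bits[i] for i in relevant) % 2        # parity of the relevant bits
    return bits, relevant, label

# Train with small k_train at short lengths, then evaluate the same k at much longer
# lengths; evaluating with k_test > k_train is where performance degrades rapidly.
train_example = sparse_parity_example(ell=50, k=3)
test_example = sparse_parity_example(ell=500, k=3)
```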
Inspired by the theory, the authors introduce Predictive Position Coupling (PPC), a modification of position coupling that applies to tasks where the coupled position IDs are input-dependent. Experiments on a variable assignment task demonstrate that PPC enables significant length generalization.
Predictive Position Coupling
Predictive Position Coupling (PPC) is introduced as a novel technique that extends the applicability of position coupling to tasks where the coupled position IDs are input-dependent. Unlike standard position coupling, PPC trains the transformer to predict the coupled position ID for each next token. This is achieved by adding an additional output embedding module that predicts both the next token ID and its corresponding coupled position ID.
Figure 2: Absolute positional embeddings with random shift.
The implementation of PPC involves the following key steps (a minimal sketch follows the list):
- Architecture Modification: Augment the transformer architecture to include an additional output embedding layer for predicting the coupled position ID.
- Training Process: Train the model to predict both the next token and its coupled position ID simultaneously.
- Inference Phase: At generation time, feed the predicted token and coupled position ID as the next input token and position ID, respectively.
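Here is a minimal PyTorch sketch of these steps under assumed module and parameter names (`PPCHead`, `backbone`, `d_model`, `max_pos` are hypothetical); the paper's actual architecture and decoding details may differ:

```python
import torch
import torch.nn as nn

class PPCHead(nn.Module):
    """Output module that predicts both the next token and its coupled position ID.

    The backbone transformer is unchanged; only the output embedding is augmented
    with a second linear head for position-ID logits.
    """
    def __init__(self, d_model: int, vocab_size: int, max_pos: int):
        super().__init__()
        self.token_head = nn.Linear(d_model, vocab_size)   # next-token logits
        self.pos_head = nn.Linear(d_model, max_pos)        # coupled-position-ID logits

    def forward(self, hidden):
        return self.token_head(hidden), self.pos_head(hidden)

def generate_step(backbone, head, tokens, pos_ids):
    """One greedy decoding step: the predicted coupled position ID is fed back as
    the position ID of the next input token (the core of PPC inference)."""
    hidden = backbone(tokens, pos_ids)                      # assumed backbone signature
    tok_logits, pos_logits = head(hidden[:, -1])            # predictions at last position
    next_tok = tok_logits.argmax(dim=-1, keepdim=True)
    next_pos = pos_logits.argmax(dim=-1, keepdim=True)
    return torch.cat([tokens, next_tok], dim=1), torch.cat([pos_ids, next_pos], dim=1)
```

During training, a cross-entropy loss on the position-ID logits would be added to the usual next-token loss, matching the second step above; the exact weighting is an implementation choice rather than something the summary specifies.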
The experimental results on tasks like variable assignment demonstrate that PPC significantly improves length generalization compared to traditional position coupling methods.
Implications and Future Directions
The paper's findings have several important implications:
- Sparsity is a key factor in length generalization. Models should be designed to exploit sparse dependency structures.
- Position coupling is a powerful technique for improving length generalization, and PPC extends its applicability.
- The theoretical framework provides a foundation for understanding and improving length generalization in transformers.
Future research directions include:
- Extending the theoretical framework to more complex models and tasks.
- Investigating the role of sparsity and locality in other types of generalization.
- Developing new techniques for exploiting sparse dependency structures in transformers.
Conclusion
This paper provides a valuable contribution to the understanding of length generalization in transformers. The theoretical framework, supported by experimental results, highlights the importance of sparsity and locality. The introduction of PPC is a promising step toward enabling length generalization on a wider range of tasks.