- The paper shows that leveraging auxiliary tasks allows transformers to extrapolate to longer sequences beyond the training range.
- It validates the transfer effect across diverse tasks such as reverse addition, string manipulation, and maze navigation.
- Experimental results indicate that shared attention heads are pivotal in facilitating extrapolation, paving the way for more robust AI models.
Extrapolation by Association: Length Generalization Transfer in Transformers
The paper "Extrapolation by Association: Length Generalization Transfer in Transformers" explores the nuanced domain of length generalization in transformer models, focusing particularly on the phenomenon where models trained on tasks at specific lengths can extrapolate effectively to longer sequences when leveraged with auxiliary tasks. The paper explores the mechanics of such generalization across various algorithmic domains, including arithmetic operations, string manipulations, and maze navigation.
Overview of Length Generalization Transfer
Transformer models have demonstrated robust generalization capabilities across various domains, yet understanding the intricacies of such skills remains an ongoing challenge. In this paper, the authors investigate length generalization—a form of out-of-distribution (OOD) generalization where models extrapolate from short to long inputs. They introduce the concept of length generalization transfer, where the ability to handle longer sequences in one task can be transferred to related tasks trained on shorter sequences.
The research showcases this phenomenon through experiments on diverse synthetic tasks. When models are trained jointly with auxiliary tasks that involve longer inputs, they show an improved capacity to generalize to unseen, longer inputs on the target task. This length generalization transfer is observed across domains such as arithmetic operations (e.g., reverse addition), string manipulations (e.g., string copy and transformations), and maze-solving tasks.
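The joint-training setup can be made concrete with a small data-mixing sketch. This is a minimal illustration under assumed conventions: the task encodings (reversed-digit addition, a `copy:` prefix for string copy), the length ranges, and the 50/50 mixing ratio are placeholders for exposition, not the paper's exact configuration.

```python
import random

def reverse_addition_example(max_digits):
    """One reverse-addition sample: operands and answer are written
    least-significant digit first (assumed format)."""
    n = random.randint(1, max_digits)
    a, b = random.randint(0, 10**n - 1), random.randint(0, 10**n - 1)
    rev = lambda x: str(x)[::-1]
    return f"{rev(a)}+{rev(b)}={rev(a + b)}"

def string_copy_example(max_len):
    """One string-copy sample: the target is simply the input string repeated."""
    n = random.randint(1, max_len)
    s = "".join(random.choice("abcde") for _ in range(n))
    return f"copy:{s}|{s}"

def build_joint_dataset(n_samples, main_max_digits=10, aux_max_len=40, aux_fraction=0.5):
    """Mix a short-length main task with a long-length auxiliary task.
    The main task never exceeds `main_max_digits`; only the auxiliary task
    covers the longer range whose handling we hope will transfer at test time."""
    data = []
    for _ in range(n_samples):
        if random.random() < aux_fraction:
            data.append(string_copy_example(aux_max_len))           # auxiliary task, long
        else:
            data.append(reverse_addition_example(main_max_digits))  # main task, short
    return data

print(build_joint_dataset(4))
```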
Empirical Results and Claims
The paper presents empirical results demonstrating that transformer models can inherit length generalization properties from auxiliary tasks. Notably, joint training on related tasks extends a model's generalization up to the length of the auxiliary task, even though the main task is never exposed to such lengths. For instance, in reverse addition, introducing auxiliary arithmetic problems such as no-carry addition or reverse subtraction enabled extrapolation beyond the main task's training range.
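To make the auxiliary arithmetic tasks concrete, the sketch below contrasts reverse addition with no-carry addition under an assumed least-significant-digit-first encoding; reverse subtraction would follow the same reversed-digit convention. The exact tokenization used in the paper may differ.

```python
def reverse_addition(a, b):
    """Ordinary addition, with the result written least-significant digit first."""
    return str(a + b)[::-1]

def no_carry_addition(a, b):
    """Digit-wise addition modulo 10: each output digit depends only on the
    corresponding input digits, so no carry ever propagates."""
    da, db = str(a)[::-1], str(b)[::-1]
    n = max(len(da), len(db))
    da, db = da.ljust(n, "0"), db.ljust(n, "0")
    return "".join(str((int(x) + int(y)) % 10) for x, y in zip(da, db))

# 123 + 456: no digit sum exceeds 9, so both variants give '975'.
print(reverse_addition(123, 456), no_carry_addition(123, 456))
# 58 + 67: carries matter, so the outputs diverge ('521' vs '51').
print(reverse_addition(58, 67), no_carry_addition(58, 67))
```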
Additionally, the paper highlights the role of similar computational structures, such as shared attention heads, in driving these transfer effects. The observation that models seem to leverage the same attention heads across tasks suggests a mechanistic underpinning to these capabilities.
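One simple way to quantify the shared-head observation is to compare per-head importance scores across tasks. The sketch below assumes importance is measured by ablation (zeroing a head's output and recording the accuracy drop on each task) and then checks how many of the top-ranked heads coincide; both this measurement choice and the random scores are illustrative assumptions, not the paper's exact analysis.

```python
import numpy as np

def head_overlap(importance_a, importance_b, top_k=10):
    """Fraction of the top-k most important heads shared between two tasks.

    `importance_a` / `importance_b`: arrays of shape (n_layers, n_heads),
    e.g. the accuracy drop on each task when that head's output is zeroed.
    """
    top_a = set(np.argsort(importance_a, axis=None)[-top_k:])
    top_b = set(np.argsort(importance_b, axis=None)[-top_k:])
    return len(top_a & top_b) / top_k

# Hypothetical scores for a 4-layer, 4-head model on the main and auxiliary task.
rng = np.random.default_rng(0)
main_task_importance = rng.random((4, 4))
aux_task_importance = 0.7 * main_task_importance + 0.3 * rng.random((4, 4))
print(f"fraction of shared top heads: {head_overlap(main_task_importance, aux_task_importance):.2f}")
```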
Implications and Future Prospects in AI
This research provides valuable insights into the mechanics of generalization in transformer architectures, emphasizing the interplay between task associations and model capabilities. The findings have practical implications, suggesting that multitask training strategies can be harnessed to improve length generalization on tasks whose training data is limited to shorter sequences.
Moreover, the paper opens avenues for further exploration into other types of generalization, such as compositional reasoning and cross-domain transferability. Understanding how transformers can reuse learned structures across disparate tasks could inform the design of more versatile and adaptive models, driving progress in artificial intelligence towards models capable of handling increasingly complex and varied inputs.
Conclusion
By examining the transfer effects in transformer models, the paper sheds light on the underexplored potential of shared inductive structures. The insights derived suggest that a more strategic approach to multitask training could substantially enhance length generalization capability, thereby improving out-of-distribution performance in AI systems. This work contributes significantly to the body of knowledge on transformer generalization mechanisms and points towards novel methodologies for further optimizing model training in diverse and complex environments.