
Looped Transformers for Length Generalization (2409.15647v2)

Published 24 Sep 2024 in cs.LG

Abstract: Recent work has shown that Transformers trained from scratch can successfully solve various arithmetic and algorithmic tasks, such as adding numbers and computing parity. While these Transformers generalize well on unseen inputs of the same length, they struggle with length generalization, i.e., handling inputs of unseen lengths. In this work, we demonstrate that looped Transformers with an adaptive number of steps significantly improve length generalization. We focus on tasks with a known iterative solution, involving multiple iterations of a RASP-L operation - a length-generalizable operation that can be expressed by a finite-sized Transformer. We train looped Transformers using our proposed learning algorithm and observe that they learn highly length-generalizable solutions for various tasks.

Authors (4)
  1. Ying Fan (47 papers)
  2. Yilun Du (113 papers)
  3. Kannan Ramchandran (129 papers)
  4. Kangwook Lee (70 papers)
Citations (4)

Summary

Looped Transformers for Length Generalization

"Looped Transformers for Length Generalization" addresses the pivotal issue that standard Transformer architectures face in handling inputs of arbitrary, unseen lengths. While conventional Transformers have demonstrated proficiency on various arithmetic and algorithmic tasks within fixed input lengths, they falter when engaging with length generalization. This paper proposes the utilization of looped Transformers—structures with an adaptive number of processing steps, thus enhancing their ability to generalize across diverse input lengths.

Problem Statement and Motivation

The paper sets out to address a foundational challenge in computational tasks that inherently involve variable input lengths, such as arithmetic, code, and reasoning tasks, where complexity commonly scales with input length. Models trained on inputs of bounded length perform strongly within that bound but tend to degrade on longer inputs. Simply scaling up compute and training data has not effectively resolved length generalization.

Architectural Approach

The core proposal is a looped Transformer architecture that adjusts its amount of computation to the complexity of the input. Unlike vanilla Transformers with a fixed depth, looped Transformers apply a recurrent mechanism: the same block processes the sequence multiple times, iteratively refining intermediate outputs and thereby accommodating longer, more complex inputs.

  • Model Definition: The base model is a decoder-only Transformer. Each input is processed through a sequential looping mechanism in which the output of each iteration informs the next (a minimal sketch follows this list).
  • Task Considerations: The focus is on tasks that are reducible to multiple iterations of a RASP-L operation—a length-generalizable process executable by finite-sized Transformers.
  • Training Paradigm: Training is conducted end-to-end without intermediate supervision, relying only on input-output pairs together with the number of loop iterations associated with each example. This lets the Transformer infer intermediate steps and adapt the number of loop iterations to task complexity.
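
A minimal PyTorch sketch of this looping scheme is shown below. It is an illustration under assumptions stated here (weight sharing across loop steps, input re-injection at every iteration, positional encodings omitted), not the authors' exact implementation; the class name, arguments, and default sizes are illustrative.

```python
import torch
import torch.nn as nn


class LoopedDecoder(nn.Module):
    """A weight-shared decoder-only block applied for a variable number of
    loop steps, with the input embedding re-injected at every step."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(block, num_layers=n_layers)  # reused at every loop step
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor, n_steps: int) -> torch.Tensor:
        x = self.embed(tokens)                     # input embedding, kept for re-injection
        h = torch.zeros_like(x)                    # initial hidden state
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        for _ in range(n_steps):                   # the same weights are applied n_steps times
            h = self.decoder(h + x, mask=causal)   # input injection at each iteration
        return self.head(h)                        # per-position logits after the final step
```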

Methodology

  1. Problem Representation: The authors introduce n-RASP-L programs, a class of algorithmic tasks that can be decomposed into repeated iterations of a simpler RASP-L program. This framework covers tasks such as copying sequences, parity checking, and binary addition, which lack known length-generalizable next-token prediction solutions.
  2. Looped Transformer Training: The proposed training scheme supervises only the final output and the number of loop steps, without explicit intermediate-step labels. Training on sequences of varying lengths with their corresponding iteration counts fosters robust length generalization.
  3. Inference Techniques: Two stopping criteria for looped inference are outlined:
    • Oracle-based: Using a pre-defined step count.
    • Maximum Confidence: Using cross-entropy loss as a confidence heuristic to dynamically determine the number of iterations at inference time (a sketch follows this list).
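
To make the maximum-confidence criterion concrete, the sketch below reruns the loop for increasing step counts and keeps the output whose greedy decoding the model is most confident about, scoring confidence here as the cross-entropy between the output distribution and its own argmax labels (one reasonable reading of the criterion). It assumes a looped model with the interface of the earlier sketch; the helper name and step budget are hypothetical.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def decode_max_confidence(model, tokens: torch.Tensor, max_steps: int) -> torch.Tensor:
    """Return the greedy decoding from the loop step (1..max_steps) where the
    model is most confident, i.e. where the cross-entropy between its output
    distribution and its own argmax predictions is lowest."""
    best_loss, best_preds = float("inf"), None
    for n_steps in range(1, max_steps + 1):
        logits = model(tokens, n_steps)            # (batch, seq_len, vocab) logits at this step count
        preds = logits.argmax(dim=-1)              # greedy decoding at this step count
        self_ce = F.cross_entropy(logits.flatten(0, 1), preds.flatten())
        if self_ce.item() < best_loss:             # lower self-cross-entropy = higher confidence
            best_loss, best_preds = self_ce.item(), preds
    return best_preds
```

Rerunning the loop from scratch for each candidate step count keeps the sketch short; in practice the intermediate states of a single run can be scored instead, with the same selection rule.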

Experimental Results

The empirical evaluation compares looped Transformers against baselines such as vanilla next-token prediction (NTP), NTP with pause tokens, and weight-tied NTP models. The looped model showed superior length generalization across all studied tasks. Specific findings include:

  • Parity tasks demonstrated near-perfect generalization to inputs significantly longer than those seen during training.
  • Copy and Addition tasks, challenging for fixed-depth NTP models, were adeptly handled by the looped setup.
  • Variants like NTP-Pause and tied NTP layers offered incremental improvements but did not approach the flexibility and generalization capability of adaptive-depth looped Transformers.

Theoretical and Practical Implications

The implications extend both theoretically and practically:

  • Algorithmic Insight: Establishing that algorithmic tasks expressible as n-RASP-L programs can be solved by looped Transformers offers insight into how recurrent architectures can emulate iterative problem-solving procedures.
  • Scalability: Demonstrating that such architectures can handle progressively more complex tasks opens avenues for deployment in more computation- and reasoning-intensive applications.

Conclusion and Future Work

This paper positions looped Transformers as a strong approach to overcoming the limitations of fixed-depth models in length generalization. By iteratively refining intermediate representations, the model adapts its computation to input complexity, which is crucial for handling sequences of varying length. Future research may extend the framework to more complex problem sets, improve training efficiency for very large numbers of loop iterations, and investigate alternative positional embeddings to further bolster performance.

Overall, the paper provides a solid methodological framework and substantial empirical evidence that looped Transformer architectures markedly enhance length generalization on a range of algorithmic tasks.
