Universal Length Generalization with Turing Programs
The paper "Universal Length Generalization with Turing Programs," authored by Kaiying Hou, David Brandfonbrener, Sham Kakade, Samy Jelassi, and Eran Malach, introduces a novel approach to enhancing length generalization capabilities of transformer-based models, particularly on algorithmic tasks. This work proposes the use of Turing Programs as an extension to the Chain-of-Thought (CoT) methodology, enabling the transformer to break down algorithmic tasks into steps reminiscent of a Turing Machine. This essay outlines the key contributions, empirical results, and theoretical implications of the research.
The primary motivation behind the paper is a well-documented limitation of transformers: they struggle to generalize to sequences longer than those encountered during training, a property termed length generalization. Various techniques and modifications have been explored to address this, but they often cater to specific tasks and lack universality. The authors introduce Turing Programs, a versatile scratchpad strategy that mimics the stepwise computation of a Turing machine and is therefore applicable to a wide spectrum of algorithmic tasks. Each scratchpad step copies text from the context with only minimal modifications, keeping the format simple and task-agnostic.
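To make this copy-with-small-edit structure concrete, the following is a minimal sketch of what a Turing-Program-style scratchpad for multi-digit addition could look like. The state format, delimiters, and choice of recorded fields are illustrative assumptions, not the exact scratchpad used in the paper.

```python
def addition_turing_program(a: str, b: str) -> list[str]:
    """Emit a Turing-Program-style scratchpad for a + b (digit strings).

    Each scratchpad state is a near-copy of the previous one: the operands
    lose one trailing digit, the carry is updated, and one output digit is
    appended, mirroring a Turing machine that makes a small local edit to
    its tape at every step.
    """
    states = []
    carry, out = 0, ""
    while a or b or carry:
        da = int(a[-1]) if a else 0
        db = int(b[-1]) if b else 0
        carry, digit = divmod(da + db + carry, 10)
        a, b = a[:-1], b[:-1]
        out = str(digit) + out
        states.append(f"{a}+{b}|carry={carry}|out={out}")
    return states

# Example: 857 + 64 = 921; each line differs from the previous one only locally.
for state in addition_turing_program("857", "64"):
    print(state)
```

Printing the states shows that consecutive lines agree almost everywhere, which is precisely what makes the format task-agnostic: the model mostly copies from its context and only occasionally writes a new symbol.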
Key Contributions
- Proposing Turing Programs: Turing Programs formalize the decomposition of algorithmic tasks into sequential steps, each represented as a slightly modified copy of the previous step, akin to the operation of a Turing Machine.
- Empirical Validation: The authors demonstrate robust length generalization with Turing Programs across multiple algorithmic tasks, notably addition, multiplication, and in-context SGD, with models maintaining high accuracy when extrapolating from the training lengths to substantially longer sequences.
- Theoretical Foundation: The paper proves that transformers can execute Turing Programs by constructing RASP programs that simulate arbitrary Turing machines, establishing the theoretical feasibility of applying Turing Programs in practice (a minimal sketch of a single simulated step appears after this list).
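The sketch below shows the computation such a construction simulates, written in plain Python. The transition table (a unary increment machine) and the configuration format are hypothetical examples rather than the paper's construction; the point is only that each successive tape is a copy of the previous one with a single local edit, which is exactly the structure a Turing Program scratchpad asks the transformer to reproduce.

```python
# Hypothetical transition table: (state, symbol) -> (new_state, written_symbol, head_move).
# An illustrative unary-increment machine, not a construction from the paper.
DELTA = {
    ("scan", "1"): ("scan", "1", +1),   # move right over existing 1s
    ("scan", "_"): ("halt", "1", 0),    # write a 1 at the first blank and halt
}

def step(state: str, tape: str, head: int) -> tuple[str, str, int]:
    """One Turing machine step: the next tape is the current tape with at
    most one cell changed, plus an updated state and head position."""
    symbol = tape[head] if head < len(tape) else "_"
    new_state, write, move = DELTA[(state, symbol)]
    tape = tape.ljust(head + 1, "_")
    tape = tape[:head] + write + tape[head + 1:]
    return new_state, tape, head + move

state, tape, head = "scan", "111_", 0
while state != "halt":
    state, tape, head = step(state, tape, head)
    print(state, tape, head)   # each printed tape differs only locally from the last
```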
Empirical Results
The research places a strong emphasis on empirical validation. Notable results include 98% accuracy on 100-digit addition when models are trained on 50-digit examples. The method also extends to multiplication tasks with 1-digit and 3-digit multipliers, where models exhibit over 97% accuracy on 100-digit operations. Additionally, transformers trained to execute stochastic gradient descent (SGD) on linear regression in context generalize from prompts with 50 examples to prompts with 80.
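The in-context SGD experiment fits the same template: each scratchpad state can be read as the previous weight estimate copied forward with one gradient update applied. The sketch below uses an illustrative 1-D regression problem, learning rate, and state format, none of which are taken from the paper, to show what such a trace looks like.

```python
import numpy as np

def sgd_trace(xs: np.ndarray, ys: np.ndarray, lr: float = 0.1) -> list[str]:
    """Run SGD on 1-D linear regression and record one scratchpad state per
    example. Each state is the previous weight with a single gradient update
    applied: the same copy-with-small-edit pattern as the tasks above.
    (Format and learning rate are illustrative, not the paper's.)"""
    w = 0.0
    states = []
    for x, y in zip(xs, ys):
        grad = 2 * (w * x - y) * x        # gradient of the squared error on one example
        w -= lr * grad
        states.append(f"w={w:.3f}")
    return states

rng = np.random.default_rng(0)
xs = rng.normal(size=8)
ys = 1.5 * xs + 0.1 * rng.normal(size=8)  # true slope 1.5 plus noise
print(sgd_trace(xs, ys))                   # the weight moves toward the true slope step by step
```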
Theoretical Implications and Future Directions
From a theoretical standpoint, the Hard-ALiBi positional encoding, which replaces soft positional biases with hard masking that constrains attention to a fixed number of recent tokens, is highlighted as a pivotal factor enabling this generalization. The authors' construction of RASP programs that simulate arbitrary Turing machines underscores the computational universality of the proposed Turing Programs, bridging the gap between theoretical models and practical implementations.
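The following is a minimal sketch of the hard local masking described above, written as an additive attention bias for a single head; the window size and the single-head framing are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def hard_local_mask(seq_len: int, window: int) -> np.ndarray:
    """Additive attention bias for one head: a query at position i may attend
    only to keys j with i - window < j <= i. Returns 0 where attention is
    allowed and -inf where it is blocked, to be added to the attention logits
    before the softmax. (The window size is an illustrative choice.)"""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    allowed = (j <= i) & (j > i - window)
    return np.where(allowed, 0.0, -np.inf)

print(hard_local_mask(6, window=3))
```

Because the mask depends only on the relative distance between query and key positions, it is defined identically at every sequence length, which is what makes it compatible with evaluating on sequences far longer than those seen in training.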
The implications of this research extend to the broader landscape of AI, suggesting that CoT-style scratchpads grounded in Turing-complete models of computation can improve the robustness and adaptability of transformers. Future work could further optimize the Turing Program format and the positional encoding, potentially increasing the efficiency and applicability of these methods across more complex and diverse algorithmic challenges.
In conclusion, this paper is a rigorous advance in the understanding and implementation of length generalization in transformers. By introducing Turing Programs, the authors present a robust, theoretically grounded, and empirically validated approach that sets a precedent for future work on algorithmic reasoning in artificial intelligence.