- The paper introduces a novel, scratchpad-free training method that enables a tiny Transformer to generalize addition across arbitrary digit lengths, achieving 100% accuracy on 1000-digit tasks.
- It employs a right-to-left autoregressive generation technique combined with two distinct training instance types to effectively manage carry-over operations.
- The research showcases practical advances in arithmetic generalization for small Transformers, paving the way for broader applications in numerical computation.
Arbitrary Length Generalization for Addition
Abstract
This paper presents a novel training methodology that enables a small Transformer model to generalize the addition of two numbers to digit lengths unseen during training. The method uses an autoregressive generation procedure that proceeds from right to left, mirroring the way large numbers are commonly added by hand. The approach does not rely on scratchpads, and it achieves 100% accuracy when adding numbers with up to one thousand digits. The R code for reproducing the results is available in the accompanying GitHub repository.
Introduction
Transformer architectures, introduced by Vaswani et al. (2017), handle a wide range of NLP tasks. However, as Nogueira et al. (2021) demonstrated, Transformers struggle to generalize simple arithmetic operations such as addition when the numbers are longer than those seen during training. Previous works such as Nye et al. (2021) and Lee et al. (2024) explored scratchpads to aid this process, but they fall short of generalizing to numbers of arbitrary digit length. This paper introduces a novel, scratchpad-free training approach built on specially designed training instances, and the resulting model consistently achieves 100% accuracy for numbers with up to one thousand digits.
Representative Training Instances
The foundation of the training data comprises two distinct types of training instances, which are critical for enabling the model to perform addition of numbers with arbitrary lengths.
Instances of the First Type
- Simple additions involving single-digit numbers, with the target being the final solution without using scratchpads.
- Examples include:
  - Input: 12, Target: 3S for $1 + 2 = 3$
  - Input: 83, Target: 11S for $8 + 3 = 11$
Instances of the first type illustrate the basic operation, but on their own they do not generalize to arbitrary digit lengths because they never demonstrate a carry-over strategy.
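A minimal base-R sketch of how such first-type instances could be constructed is shown below; the helper name make_first_type and the uniform digit sampling are illustrative assumptions, not the authors' released code.

```r
# Sketch: first-type instances pair two single digits with their sum plus the stop token "S".
make_first_type <- function(n) {
  a <- sample(0:9, n, replace = TRUE)
  b <- sample(0:9, n, replace = TRUE)
  data.frame(
    input  = paste0(a, b),        # e.g. "12"
    target = paste0(a + b, "S"),  # e.g. "3S"
    stringsAsFactors = FALSE
  )
}
```

Calling make_first_type(1000), for instance, would produce rows such as input 83 with target 11S.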
Instances of the Second Type
- These combine the previous step's output with the next two single digits, incorporating carry-over information: the carry is 1 whenever the previous output is at least 10, and 0 otherwise.
- Examples include:
  - Input: 3C14, Target: 5S for $1 + 4 + 0 = 5$ in $11 + 42 = 53$
  - Input: 15C29, Target: 12S for $2 + 9 + 1 = 12$ in $26 + 99 = 125$
By integrating these two types of training instances, the model learns to handle carry-overs and to extend the addition procedure, digit by digit, to numbers of arbitrary length.
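A corresponding sketch for second-type instances, under the same caveats (make_second_type is a hypothetical helper): following the examples above, the carry is taken to be 1 exactly when the previous step's output reached 10.

```r
# Sketch: second-type instances have the form <previous output>C<digit a><digit b>,
# with the carry inferred from whether the previous output is at least 10.
make_second_type <- function(n) {
  prev  <- sample(0:19, n, replace = TRUE)  # plausible range for a previous step's output
  a     <- sample(0:9,  n, replace = TRUE)
  b     <- sample(0:9,  n, replace = TRUE)
  carry <- as.integer(prev >= 10)
  data.frame(
    input  = paste0(prev, "C", a, b),     # e.g. "15C29"
    target = paste0(a + b + carry, "S"),  # e.g. "12S"
    stringsAsFactors = FALSE
  )
}
```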
Model Setting
The model follows a decoder-only configuration, implemented in the R statistical software with the torch package. The input and initial output matrices are concatenated, embedded into a 64-dimensional space, and then augmented with positional encodings. Key aspects include:
- Multi-head attention with two heads
- Feed-forward network (FFN) with a hidden dimension of 256 and two layers
- Final linear weight converting the embedding to the vocabulary dimension
Training used the Adam optimizer with a weight decay of 0.01 and a learning rate of $5 \times 10^{-4}$, with dropout rates of 20% applied to both the attention and FFN sub-layers.
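The settings above map onto a small decoder in torch for R roughly as follows. This is a hedged sketch rather than the released implementation: the 13-token vocabulary, the single decoder block, the layer-norm placement, and the omission of positional encodings and causal masking are all illustrative assumptions.

```r
library(torch)

vocab_size <- 13L  # assumed vocabulary: digits 0-9 plus the C, S, and start tokens

tiny_decoder <- nn_module(
  "TinyDecoder",
  initialize = function(d_model = 64, n_heads = 2, d_ff = 256, p_drop = 0.2) {
    self$embed <- nn_embedding(vocab_size, d_model)
    self$attn  <- nn_multihead_attention(embed_dim = d_model, num_heads = n_heads,
                                         dropout = p_drop)
    self$ffn   <- nn_sequential(
      nn_linear(d_model, d_ff), nn_relu(), nn_dropout(p_drop),
      nn_linear(d_ff, d_model)
    )
    self$norm1 <- nn_layer_norm(d_model)
    self$norm2 <- nn_layer_norm(d_model)
    self$out   <- nn_linear(d_model, vocab_size)  # final projection to the vocabulary
  },
  forward = function(tokens) {
    # tokens: (seq_len, batch); positional encodings and the causal mask are omitted here
    x <- self$embed(tokens)            # (seq_len, batch, d_model)
    a <- self$attn(x, x, x)[[1]]       # self-attention output
    x <- self$norm1(x + a)
    x <- self$norm2(x + self$ffn(x))
    self$out(x)                        # logits over the vocabulary
  }
)

model     <- tiny_decoder()
optimizer <- optim_adam(model$parameters, lr = 5e-4, weight_decay = 0.01)
```

The optimizer line mirrors the stated hyperparameters (Adam, learning rate $5 \times 10^{-4}$, weight decay 0.01).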
Generating Tokens from the Trained Model
The model generates tokens autoregressively, starting from an initial output token |n| and stopping when S is generated. Each intermediate result is fed back into the next step's input, so carry-overs propagate correctly through the whole addition.
Example
For $65785 + 8765$:
- Initial input: 55, output: 10S
- Transition input: 10C86, output: 15S
- Proceeding similarly, the inputs 15C77, 15C58, and 14C60 yield 15S, 14S, and 7S; reading off the last digit of each step's output, from the final step back to the first, gives the correct result 74550.
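The worked example can be reproduced with the outer loop below. This is a self-contained sketch of the generation procedure, not the paper's code: predict_step() stands in for the trained model (here stubbed with exact arithmetic so the carry handling and digit collection run as-is), and add_by_generation() is a hypothetical driver name.

```r
# Stub for the trained model: maps one step's input string to its target, e.g. "10C86" -> "15S".
predict_step <- function(input) {
  parts  <- strsplit(input, "C", fixed = TRUE)[[1]]
  digits <- as.integer(strsplit(parts[length(parts)], "")[[1]])
  carry  <- if (length(parts) == 2 && as.integer(parts[1]) >= 10) 1L else 0L
  paste0(sum(digits) + carry, "S")
}

# Right-to-left driver: feeds each step's output back into the next step's input.
add_by_generation <- function(x, y) {
  width <- max(nchar(x), nchar(y))
  pad   <- function(s) paste0(strrep("0", width - nchar(s)), s)
  xd <- rev(as.integer(strsplit(pad(x), "")[[1]]))   # least-significant digit first
  yd <- rev(as.integer(strsplit(pad(y), "")[[1]]))

  prev <- NULL
  digits_out <- character(0)
  for (i in seq_len(width)) {
    input <- if (is.null(prev)) paste0(xd[i], yd[i]) else paste0(prev, "C", xd[i], yd[i])
    prev  <- sub("S$", "", predict_step(input))                          # e.g. "15S" -> "15"
    digits_out <- c(substr(prev, nchar(prev), nchar(prev)), digits_out)  # keep the units digit
  }
  if (as.integer(prev) >= 10) digits_out <- c("1", digits_out)  # leading carry, if any
  paste0(digits_out, collapse = "")
}

add_by_generation("65785", "8765")  # "74550", matching the example above
```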
Experiment
More than one thousand addition tasks, involving numbers with up to one thousand digits each, were tested, and the model achieved 100% accuracy in every case.
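A sketch of how such a check could be scripted, assuming the add_by_generation() driver sketched above and the gmp package as an exact big-integer reference; in the actual experiment the trained model, not an arithmetic stub, supplies each step's prediction.

```r
library(gmp)  # arbitrary-precision integers for the reference result

# Random operand with a given number of digits and no leading zero.
random_number <- function(n_digits) {
  paste0(c(sample(1:9, 1), sample(0:9, n_digits - 1, replace = TRUE)), collapse = "")
}

set.seed(1)
correct <- replicate(100, {               # scaled down from the paper's >1000 tasks
  x <- random_number(sample(2:1000, 1))
  y <- random_number(sample(2:1000, 1))
  add_by_generation(x, y) == as.character(as.bigz(x) + as.bigz(y))
})
mean(correct)  # fraction of exact matches
```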
Conclusion
The methodology outlined in this paper provides a training framework that enables small Transformer models to generalize addition to arbitrary digit lengths. By combining the two types of representative training instances, the approach bypasses the need for scratchpads while reaching 100% accuracy in the reported experiments. Future research could investigate the optimal ratio of the two instance types and ways to reduce the number of autoregressive generation steps for more efficient computation.
References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS).
- Lee, N., Sreenivasan, K., Lee, J. D., Lee, K., & Papailiopoulos, D. (2024). Teaching Arithmetic to Small Transformers. OpenReview.
- R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
- Nogueira, R., Jiang, Z., & Lin, J. (2021). Investigating the Limitations of Transformers with Simple Arithmetic Tasks. Mathematical Reasoning in General Artificial Intelligence Workshop, ICLR.
- Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., et al. (2021). Show Your Work: Scratchpads for Intermediate Computation with Language Models. arXiv preprint.