- The paper introduces a novel, scratchpad-free training method that enables a tiny Transformer to generalize addition across arbitrary digit lengths, achieving 100% accuracy on 1000-digit tasks.
- It employs a right-to-left autoregressive generation technique combined with two distinct training instance types to effectively manage carry-over operations.
- The research showcases practical advances in arithmetic generalization for small Transformers, paving the way for broader applications in numerical computation.
Arbitrary Length Generalization for Addition
Abstract
This paper presents a novel training methodology that enables a small Transformer model to generalize the addition of two numbers to digit lengths unseen during training. The method uses an autoregressive generation procedure that proceeds from right to left, mirroring the way large numbers are commonly added by hand. The approach does not rely on scratchpads, and it achieves 100% accuracy when adding numbers with up to one thousand digits. The R code for reproducing the results is available in the accompanying GitHub repository.
Introduction
Transformer architectures, introduced by Vaswani et al. (2017), handle a wide range of NLP tasks. However, as Nogueira et al. (2021) demonstrated, Transformers struggle to generalize simple arithmetic operations such as addition when the numbers are longer than those seen during training. Previous works such as Nye et al. (2021) and Lee et al. (2024) explored scratchpads to aid this process, but they fall short of generalizing to numbers of arbitrary digit length. This paper introduces a novel, scratchpad-free training approach built on specially designed training instances, and the resulting model consistently achieves 100% accuracy for numbers with up to one thousand digits.
Representative Training Instances
The foundation of the training data comprises two distinct types of training instances, which are critical for enabling the model to perform addition of numbers with arbitrary lengths.
Instances of the First Type
- Simple additions involving single-digit numbers, with the target being the final solution without using scratchpads.
- Examples include:
  - Input: 12, Target: 3S for $1 + 2 = 3$
  - Input: 83, Target: 11S for $8 + 3 = 11$
Instances of the first type illustrate the basic operation, but on their own they do not generalize to arbitrary digit lengths because they never demonstrate a carry-over strategy.
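A minimal base-R sketch of how such first-type instances could be constructed is shown below; the helper name make_first_type and the uniform digit sampling are illustrative assumptions, not the authors' released code.

```r
# Sketch: first-type instances pair two single digits with their sum plus the stop token "S".
make_first_type <- function(n) {
  a <- sample(0:9, n, replace = TRUE)
  b <- sample(0:9, n, replace = TRUE)
  data.frame(
    input  = paste0(a, b),        # e.g. "12"
    target = paste0(a + b, "S"),  # e.g. "3S"
    stringsAsFactors = FALSE
  )
}
```

Calling make_first_type(1000), for instance, would produce rows such as input 83 with target 11S.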
Instances of the Second Type
- These combine the previous step's output with the next two single digits, incorporating carry-over information: the carry is 1 whenever the previous output is at least 10, and 0 otherwise.
- Examples include:
  - Input: 3C14, Target: 5S for $1 + 4 + 0 = 5$ in $11 + 42 = 53$
  - Input: 15C29, Target: 12S for $2 + 9 + 1 = 12$ in $26 + 99 = 125$
By integrating these two types of training instances, the model learns to handle carry-overs and to extend the addition procedure, digit by digit, to numbers of arbitrary length.
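A corresponding sketch for second-type instances, under the same caveats (make_second_type is a hypothetical helper): following the examples above, the carry is taken to be 1 exactly when the previous step's output reached 10.

```r
# Sketch: second-type instances have the form <previous output>C<digit a><digit b>,
# with the carry inferred from whether the previous output is at least 10.
make_second_type <- function(n) {
  prev  <- sample(0:19, n, replace = TRUE)  # plausible range for a previous step's output
  a     <- sample(0:9,  n, replace = TRUE)
  b     <- sample(0:9,  n, replace = TRUE)
  carry <- as.integer(prev >= 10)
  data.frame(
    input  = paste0(prev, "C", a, b),     # e.g. "15C29"
    target = paste0(a + b + carry, "S"),  # e.g. "12S"
    stringsAsFactors = FALSE
  )
}
```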
Model Setting
The model follows a decoder-only configuration, implemented in the R statistical software with the torch package. The input and initial output matrices are concatenated, embedded into a 64-dimensional space, and then augmented with positional encodings. Key aspects include:
- Multi-head attention with two heads
- Feed-forward network (FFN) with a hidden dimension of 256 and two layers
- Final linear weight converting the embedding to the vocabulary dimension
Training used the Adam optimizer with a weight decay of 0.01 and a learning rate of $5 \times 10^{-4}$, with dropout rates of 20% applied to both the attention and FFN sub-layers.
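The settings above map onto a small decoder in torch for R roughly as follows. This is a hedged sketch rather than the released implementation: the 13-token vocabulary, the single decoder block, the layer-norm placement, and the omission of positional encodings and causal masking are all illustrative assumptions.

```r
library(torch)

vocab_size <- 13L  # assumed vocabulary: digits 0-9 plus the C, S, and start tokens

tiny_decoder <- nn_module(
  "TinyDecoder",
  initialize = function(d_model = 64, n_heads = 2, d_ff = 256, p_drop = 0.2) {
    self$embed <- nn_embedding(vocab_size, d_model)
    self$attn  <- nn_multihead_attention(embed_dim = d_model, num_heads = n_heads,
                                         dropout = p_drop)
    self$ffn   <- nn_sequential(
      nn_linear(d_model, d_ff), nn_relu(), nn_dropout(p_drop),
      nn_linear(d_ff, d_model)
    )
    self$norm1 <- nn_layer_norm(d_model)
    self$norm2 <- nn_layer_norm(d_model)
    self$out   <- nn_linear(d_model, vocab_size)  # final projection to the vocabulary
  },
  forward = function(tokens) {
    # tokens: (seq_len, batch); positional encodings and the causal mask are omitted here
    x <- self$embed(tokens)            # (seq_len, batch, d_model)
    a <- self$attn(x, x, x)[[1]]       # self-attention output
    x <- self$norm1(x + a)
    x <- self$norm2(x + self$ffn(x))
    self$out(x)                        # logits over the vocabulary
  }
)

model     <- tiny_decoder()
optimizer <- optim_adam(model$parameters, lr = 5e-4, weight_decay = 0.01)
```

The optimizer line mirrors the stated hyperparameters (Adam, learning rate $5 \times 10^{-4}$, weight decay 0.01).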
Generating Tokens from the Trained Model
The model generates tokens autoregressively, starting from an initial output token |n| and stopping when S is generated. Each intermediate result is fed back into the next step's input, so carry-overs propagate correctly through the whole addition.
Example
For $65785 + 8765$:
- Initial input: 55, output: 10S
- Transition input: 10C86, output: 15S
- Proceeding similarly, the inputs 15C77, 15C58, and 14C60 yield 15S, 14S, and 7S; reading off the last digit of each step's output, from the final step back to the first, gives the correct result 74550.
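The worked example can be reproduced with the outer loop below. This is a self-contained sketch of the generation procedure, not the paper's code: predict_step() stands in for the trained model (here stubbed with exact arithmetic so the carry handling and digit collection run as-is), and add_by_generation() is a hypothetical driver name.

```r
# Stub for the trained model: maps one step's input string to its target, e.g. "10C86" -> "15S".
predict_step <- function(input) {
  parts  <- strsplit(input, "C", fixed = TRUE)[[1]]
  digits <- as.integer(strsplit(parts[length(parts)], "")[[1]])
  carry  <- if (length(parts) == 2 && as.integer(parts[1]) >= 10) 1L else 0L
  paste0(sum(digits) + carry, "S")
}

# Right-to-left driver: feeds each step's output back into the next step's input.
add_by_generation <- function(x, y) {
  width <- max(nchar(x), nchar(y))
  pad   <- function(s) paste0(strrep("0", width - nchar(s)), s)
  xd <- rev(as.integer(strsplit(pad(x), "")[[1]]))   # least-significant digit first
  yd <- rev(as.integer(strsplit(pad(y), "")[[1]]))

  prev <- NULL
  digits_out <- character(0)
  for (i in seq_len(width)) {
    input <- if (is.null(prev)) paste0(xd[i], yd[i]) else paste0(prev, "C", xd[i], yd[i])
    prev  <- sub("S$", "", predict_step(input))                          # e.g. "15S" -> "15"
    digits_out <- c(substr(prev, nchar(prev), nchar(prev)), digits_out)  # keep the units digit
  }
  if (as.integer(prev) >= 10) digits_out <- c("1", digits_out)  # leading carry, if any
  paste0(digits_out, collapse = "")
}

add_by_generation("65785", "8765")  # "74550", matching the example above
```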
Experiment
More than one thousand addition tasks, involving numbers with up to one thousand digits each, were tested, and the model achieved 100% accuracy in every case.
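A sketch of how such a check could be scripted, assuming the add_by_generation() driver sketched above and the gmp package as an exact big-integer reference; in the actual experiment the trained model, not an arithmetic stub, supplies each step's prediction.

```r
library(gmp)  # arbitrary-precision integers for the reference result

# Random operand with a given number of digits and no leading zero.
random_number <- function(n_digits) {
  paste0(c(sample(1:9, 1), sample(0:9, n_digits - 1, replace = TRUE)), collapse = "")
}

set.seed(1)
correct <- replicate(100, {               # scaled down from the paper's >1000 tasks
  x <- random_number(sample(2:1000, 1))
  y <- random_number(sample(2:1000, 1))
  add_by_generation(x, y) == as.character(as.bigz(x) + as.bigz(y))
})
mean(correct)  # fraction of exact matches
```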
Conclusion
The methodology outlined in this paper provides a training framework that enables small Transformer models to generalize addition to arbitrary digit lengths. By combining the two types of representative training instances, the approach bypasses the need for scratchpads while reaching 100% accuracy in the reported experiments. Future research could investigate the optimal ratio of the two instance types and ways to reduce the number of autoregressive generation steps for more efficient computation.
References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS).
- Lee, N., Sreenivasan, K., Lee, J. D., Lee, K., & Papailiopoulos, D. (2024). Teaching Arithmetic to Small Transformers. OpenReview.
- R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
- Nogueira, R., Jiang, Z., & Lin, J. (2021). Investigating the Limitations of Transformers with Simple Arithmetic Tasks. Mathematical Reasoning in General Artificial Intelligence Workshop, ICLR.
- Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., et al. (2021). Show Your Work: Scratchpads for Intermediate Computation with Language Models. arXiv preprint.