The paper investigates the length generalization capabilities of Transformers, specifically in the context of N-digit decimal addition. Length generalization is defined as the ability to extrapolate from shorter training sequences to longer test sequences. The paper explores how different position encodings and data formats affect the Transformer's ability to perform length generalization.
The authors demonstrate that a standard Transformer architecture can generalize to sequence lengths longer than those seen during training, achieving high accuracy in 100-digit addition after being trained on addition problems up to 40 digits. This result was achieved through a combination of:
- Functional Interpolation for Relative Positional Encoding (FIRE)
- Randomized positions
- Reversed format
- Index hints
However, the paper also highlights the fragility of length generalization, noting its sensitivity to factors such as random weight initialization and the order of training data.
The key contributions of the paper are:
- A demonstration that position encoding and data format strongly influence the success of length generalization, with FIRE position encodings enabling extrapolation to sequences 2.5x longer than those seen during training.
- Evidence that the effectiveness of data formatting and augmentation techniques is contingent on the choice of position encoding.
- The finding that length generalization is fragile, depending heavily on factors such as random weight initialization and training data order.
The paper evaluates various position encoding techniques, including:
- Absolute Positional Encoding (APE)
- Additive Relative Positional Encoding (RPE), including T5's relative position bias, ALiBi, KerpleLog, and FIRE (sketched after this list)
- Rotary Positional Encoding (RoPE)
- No Positional Encoding (NoPE)
- Randomized Position Encoding
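To make two of these concrete, the sketch below shows a FIRE-style additive attention bias and a randomized-position sampler. It is a minimal illustration, not the paper's code: the bias roughly follows the FIRE formulation b(i, j) = f_theta(psi(i - j) / psi(max(L, i))) with psi(x) = log(c*x + 1), and the class names, MLP width, and `max_pos` value are assumptions.

```python
# Hedged sketch (not the authors' implementation): a FIRE-style additive
# relative-position bias, assuming b(i, j) = f_theta(psi(i - j) / psi(max(L, i)))
# with psi(x) = log(c * x + 1), f_theta a small MLP, and L a learnable threshold.
import torch
import torch.nn as nn

class FIREBias(nn.Module):
    def __init__(self, num_heads: int, mlp_width: int = 32,
                 init_c: float = 0.1, init_L: float = 512.0):
        super().__init__()
        # f_theta maps a scalar normalized distance to one bias per attention head
        # (whether heads share one MLP is an assumption here).
        self.mlp = nn.Sequential(nn.Linear(1, mlp_width), nn.ReLU(),
                                 nn.Linear(mlp_width, num_heads))
        self.c = nn.Parameter(torch.tensor(init_c))  # slope of the log transform psi
        self.L = nn.Parameter(torch.tensor(init_L))  # threshold for progressive interpolation

    def psi(self, x: torch.Tensor) -> torch.Tensor:
        return torch.log(torch.clamp(self.c, min=1e-6) * x + 1)

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        # positions: (n,) integer position indices (possibly randomized, see below).
        i = positions.float().unsqueeze(1)          # query positions, shape (n, 1)
        j = positions.float().unsqueeze(0)          # key positions,   shape (1, n)
        rel = torch.clamp(i - j, min=0.0)           # causal model: only j <= i matters
        denom = self.psi(torch.maximum(i, self.L))  # normalize by psi(max(L, i))
        bias = self.mlp((self.psi(rel) / denom).unsqueeze(-1))  # (n, n, num_heads)
        return bias.permute(2, 0, 1)                # (num_heads, n, n), added to attention logits

def randomized_positions(seq_len: int, max_pos: int = 2048) -> torch.Tensor:
    """Randomized positions (sketch): sample a sorted subset of indices from a
    range larger than the training length, so position values beyond the
    training length are already seen during training. max_pos is an assumed value."""
    return torch.sort(torch.randperm(max_pos)[:seq_len]).values
```

In use, the (num_heads, n, n) bias would be added to the pre-softmax attention logits in each layer.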
The paper also evaluates various data formatting techniques, including:
- Reversed format
- Index Hints
- Random Space Augmentation
The best-performing model combines FIRE position encodings, randomized positions, the reversed format, and index hints; ablation experiments remove each of these components in turn.
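As a concrete, hedged illustration of the two formatting components used by the best model, the snippet below builds a single training example in reversed format with index hints. The hint alphabet, zero-padding, and separators are assumptions for illustration; the paper's exact tokenization may differ (and 100-digit operands would need more than 26 hint symbols).

```python
# Illustrative sketch, not the paper's data pipeline: format an addition example
# with index hints (a letter tag before each digit) and the reversed format
# (least-significant digit first). Padding and hint alphabet are assumptions.
import string

def format_example(a: int, b: int, reverse: bool = True, hints: bool = True) -> str:
    result = a + b
    width = max(len(str(a)), len(str(b)), len(str(result)))

    def render(n: int) -> str:
        digits = str(n).zfill(width)   # pad so operands and answer align digit-wise
        if reverse:
            digits = digits[::-1]      # least-significant digit first
        if hints:
            tags = string.ascii_lowercase[:width]  # same tag marks the same place value everywhere
            return "".join(t + d for t, d in zip(tags, digits))
        return digits

    return f"{render(a)}+{render(b)}={render(result)}"

print(format_example(576, 361))   # -> a6b7c5+a1b6c3=a7b3c9
```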
The experimental setup involved training a 25M-parameter Transformer with 6 blocks, a hidden size of 512, and a feedforward hidden dimension of 2048. The model was trained with the AdamW optimizer (weight decay 0.1, no dropout) for 50,000 steps at a learning rate of 3e-4. The training set consisted of 30M examples with input lengths of 1-40 digits, and the test set contained 1,000 examples per input length.
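The reported hyperparameters can be collected into a small configuration sketch; the field names below are placeholders rather than the authors' code, and the values are those stated above.

```python
# Hyperparameters as reported in the summary above; field names are illustrative.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    num_blocks: int = 6                  # Transformer blocks
    hidden_size: int = 512
    ffn_hidden_size: int = 2048
    dropout: float = 0.0                 # no dropout
    learning_rate: float = 3e-4          # AdamW
    weight_decay: float = 0.1
    train_steps: int = 50_000
    train_examples: int = 30_000_000     # operand lengths 1-40
    test_examples_per_length: int = 1_000
```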
The results indicate that FIRE enables markedly better length generalization than the other position encodings. Index hints are crucial for length generalization: models trained without them exhibit poor generalization even in distribution. The reversed format outperforms the standard format across all position encodings. The efficacy of random space augmentation depends on the position encoding, benefiting RoPE and KerpleLog but hurting NoPE and FIRE.
The paper also found that length generalization is sensitive to weight initialization and training data order, with high variance across different random seeds. Error analysis revealed no significant difference between errors with and without carry, suggesting that carry propagation does not impede length generalization.
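To make the carry analysis concrete, a small helper like the one below could bucket test examples by the number of carries; this is an illustrative sketch, not the paper's analysis code.

```python
# Count the number of carry operations produced when adding two non-negative
# integers digit by digit (used here only to illustrate carry-based error analysis).
def count_carries(a: int, b: int) -> int:
    carries, carry = 0, 0
    while a > 0 or b > 0:
        digit_sum = (a % 10) + (b % 10) + carry
        carry = 1 if digit_sum >= 10 else 0
        carries += carry
        a //= 10
        b //= 10
    return carries

assert count_carries(55, 66) == 2   # 5+6 carries, then 5+6+1 carries again
assert count_carries(12, 34) == 0   # no digit sum reaches 10
```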
The paper also presents the following results:
- With the standard (non-reversed) format, FIRE still excels in length generalization, even matching RoPE trained on the reversed format.
- Training with the reversed format leads to a sharp transition in performance, reminiscent of the "grokking" phenomenon.
- Randomized PE enhances FIRE but degrades KerpleLog.
- Increasing the training length significantly improves length generalization in FIRE, achieving near-perfect accuracy at length 100.
- Model size variation has a minor effect on length generalization. Larger models slightly improve generalization in short digit regimes (1 to 10 and 1 to 20 digit additions) but yield mixed results in longer regimes.
- Higher weight decay values slightly increase the likelihood of achieving successful length generalization.
In summary, this paper demonstrates that Transformers can achieve strong length generalization in the decimal addition task through a careful combination of position encoding and data formatting techniques. However, the generalization performance is fragile and sensitive to various factors, highlighting the need for further research in this area.