Transformers Can Achieve Length Generalization But Not Robustly (2402.09371v1)

Published 14 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Length generalization, defined as the ability to extrapolate from shorter training sequences to longer test ones, is a significant challenge for LLMs. This issue persists even with large-scale Transformers handling relatively straightforward tasks. In this paper, we test the Transformer's ability of length generalization using the task of addition of two integers. We show that the success of length generalization is intricately linked to the data format and the type of position encoding. Using the right combination of data format and position encodings, we show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length. Nevertheless, unlike in-distribution generalization, length generalization remains fragile, significantly influenced by factors like random weight initialization and training data order, leading to large variances across different random seeds.

The paper investigates the length generalization capabilities of Transformers in the context of N-digit decimal addition, where length generalization is defined as the ability to extrapolate from shorter training sequences to longer test sequences. It explores how different position encodings and data formats affect this ability.

The authors demonstrate that a standard Transformer architecture can generalize to sequence lengths 2.5× longer than those seen during training, achieving high accuracy on 100-digit addition after being trained on addition problems of up to 40 digits. This result was achieved through a combination of the following (a formatting sketch follows the list):

  • Functional Interpolation for Relative Positions Encoding (FIRE) position encodings
  • Randomized positions
  • Reversed format
  • Index hints
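
As a concrete illustration of the last two items, here is a minimal Python sketch of the data formatting; the exact tokenization, the index-hint letters, and the choice to reverse both operands and the answer are assumptions for illustration rather than the authors' precise format.

```python
# Illustrative sketch (not the paper's exact tokenizer): format an addition
# example with index hints and reversed (least-significant-digit-first) numbers.
import string

def format_example(a: int, b: int, reverse: bool = True, hints: bool = True) -> str:
    def encode(n: int) -> str:
        digits = str(n)
        if reverse:
            digits = digits[::-1]  # least significant digit first
        if hints:
            # prefix each digit with a shared hint letter a, b, c, ...
            # (assumed scheme; illustrative only, breaks past 26 digits)
            return "".join(h + d for h, d in zip(string.ascii_lowercase, digits))
        return digits

    return f"{encode(a)}+{encode(b)}={encode(a + b)}"

print(format_example(576, 361))
# -> 'a6b7c5+a1b6c3=a7b3c9'   (576 + 361 = 937, written in reverse as 739)
```

The index hints give attention an explicit signal for which operand digits should be aligned, while the reversed order lets the model emit the answer starting from the least significant digit, matching the direction in which carries propagate.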

However, the paper also highlights the fragility of length generalization, noting its sensitivity to factors such as random weight initialization and the order of training data.

The key contributions of the paper are:

  • Demonstration of the marked influence of position encoding and data format on the success of length generalization, achieving extrapolation to lengths 2.5× longer than the training data using FIRE position encodings.
  • Evidence that the effectiveness of data formatting and augmentation techniques for length generalization is contingent on the choice of position encoding.
  • The discovery that length generalization is fragile and heavily relies on factors such as random weight initialization and training data order.

The paper evaluates various position encoding techniques, including:

  • Absolute Positional Encoding (APE)
  • Additive Relative Positional Encoding (RPE)
    • T5
    • ALiBi
    • KerpleLog
    • FIRE
  • Rotary Positional Encoding (RoPE)
  • No Positional Encoding (NoPE)
  • Randomized Position Encoding
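
Among these, FIRE is the encoding behind the paper's strongest results. The following is a minimal sketch of a FIRE-style additive attention bias, assuming the parameterization b(i, j) = f_θ(ψ(i − j) / ψ(max(i, L))) with ψ(x) = log(cx + 1) and f_θ a small MLP; the authors' exact implementation and hyperparameters may differ.

```python
# Sketch of a FIRE-style additive bias for attention logits (assumed form;
# not the authors' exact implementation).
import torch
import torch.nn as nn

class FIREBias(nn.Module):
    def __init__(self, num_heads: int, hidden: int = 32,
                 init_c: float = 0.1, init_L: float = 512.0):
        super().__init__()
        # small MLP mapping a normalized relative distance to one bias per head
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_heads))
        self.c = nn.Parameter(torch.tensor(init_c))  # scale of the log transform
        self.L = nn.Parameter(torch.tensor(init_L))  # length threshold

    def psi(self, x: torch.Tensor) -> torch.Tensor:
        return torch.log(torch.abs(self.c) * x + 1.0)

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len, dtype=torch.float32)
        rel = (pos[:, None] - pos[None, :]).clamp(min=0.0)    # i - j for j <= i
        norm = self.psi(torch.maximum(pos[:, None], self.L))  # psi(max(i, L))
        x = self.psi(rel) / norm                               # normalized distance
        bias = self.mlp(x.unsqueeze(-1))                       # (i, j, heads)
        return bias.permute(2, 0, 1)                           # add to attention logits

bias = FIREBias(num_heads=8)(seq_len=16)   # shape (8, 16, 16)
```

Because the bias is a learned function of a normalized relative distance rather than a lookup over absolute positions, it can be queried at distances never seen during training, which makes it a natural fit for length generalization.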

The paper also evaluates various data formatting techniques, including:

  • Reversed format
  • Index Hints
  • Random Space Augmentation

The best-performing model uses FIRE position encodings with randomized positions, the reversed format, and index hints; ablation experiments were performed by removing each of these components in turn.
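
The "randomized positions" component can be sketched simply: during training, position indices are drawn as a sorted random subset of a range larger than the training length, so the model encounters large position values even on short sequences. The maximum position value below is an assumed placeholder, not the authors' setting.

```python
# Sketch of randomized position indices (assumed max_pos; illustrative only).
import random

def randomized_positions(seq_len: int, max_pos: int = 512) -> list[int]:
    # sample seq_len distinct indices from [0, max_pos) and keep them ordered
    return sorted(random.sample(range(max_pos), seq_len))

print(randomized_positions(10))   # e.g. [3, 41, 87, 120, 133, 250, 301, 322, 404, 498]
```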

The experimental setup involved training a 25M-parameter Transformer with 6 blocks, a hidden size of 512, and a feedforward layer with a hidden dimension of 2048. The model was trained with the AdamW optimizer (weight decay 0.1, no dropout) for 50,000 steps at a learning rate of 3e-4. The training set consisted of 30M examples with input lengths 1-40, and the test set contained 1,000 examples per input length.
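
For reference, the reported hyperparameters can be collected into a small configuration sketch; values not stated above (such as the number of attention heads) are placeholders rather than the authors' settings.

```python
# Configuration sketch of the reported setup; unreported values are assumptions.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    num_layers: int = 6               # Transformer blocks
    hidden_size: int = 512
    ffn_hidden_size: int = 2048
    num_heads: int = 8                # assumed, not reported above
    dropout: float = 0.0
    optimizer: str = "adamw"
    weight_decay: float = 0.1
    learning_rate: float = 3e-4
    train_steps: int = 50_000
    max_train_digits: int = 40
    num_train_examples: int = 30_000_000
    eval_examples_per_length: int = 1_000
```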

The results indicate that FIRE enables significantly better length generalization than the other positional encodings. Index hints are crucial for length generalization, as models trained without them exhibit poor in-distribution generalization. The reversed format outperforms the standard format across all position encodings. The efficacy of random space augmentation is position-encoding-dependent, benefiting RoPE and KerpleLog but degrading NoPE and FIRE.

The paper also found that length generalization is sensitive to weight initialization and training data order, with high variance across different random seeds. Error analysis revealed no significant difference between errors with and without carry, suggesting that carry propagation does not impede length generalization.

The paper also presents the following results:

  • Even with the standard (non-reversed) format, FIRE excels at length generalization, matching the performance RoPE achieves with the reversed format.
  • Training with the reversed format leads to a sharp phase transition in performance, reminiscent of the "grokking" phenomenon.
  • Randomized PE enhances FIRE but degrades KerpleLog.
  • Increasing the training length significantly improves length generalization in FIRE, achieving near-perfect accuracy at length 100.
  • Varying model size has only a minor effect on length generalization: larger models slightly improve generalization in shorter training regimes (1-10 and 1-20 digit addition) but give mixed results in longer regimes.
  • Higher weight decay values slightly enhance the likelihood of achieving effective length generalization.

In summary, this paper demonstrates that Transformers can achieve strong length generalization in the decimal addition task through a careful combination of position encoding and data formatting techniques. However, the generalization performance is fragile and sensitive to various factors, highlighting the need for further research in this area.

Authors (6)
  1. Yongchao Zhou (7 papers)
  2. Uri Alon (40 papers)
  3. Xinyun Chen (80 papers)
  4. Xuezhi Wang (64 papers)
  5. Rishabh Agarwal (47 papers)
  6. Denny Zhou (65 papers)
Citations (27)