Effectiveness of partial rotary positional embedding for length extrapolation
Determine the effectiveness of applying Rotary Position Embedding (RoPE) to only a subset of attention head dimensions—thereby combining RoPE and no positional embedding (NoPE) as done in GPT-J and GPT-NeoX—for length extrapolation in transformer language models, and characterize its behavior across long-context settings.
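To make the setup concrete, the sketch below applies RoPE to only the leading fraction of each head's dimensions and leaves the remaining dimensions position-free (NoPE), in the spirit of the GPT-NeoX-style partial-rotary option. This is a minimal illustration, not the paper's implementation: the function names, the `rotary_pct` parameter, and the tensor shapes are assumptions made for the example.

```python
# Minimal sketch of partial rotary embedding: RoPE is applied to the first
# `rotary_dim` dimensions of each attention head, while the rest pass through
# without any positional signal (NoPE). Names and shapes are illustrative.
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Rotate paired dimensions: (x1, x2) -> (-x2, x1)."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_partial_rope(q: torch.Tensor, k: torch.Tensor,
                       positions: torch.Tensor,
                       rotary_pct: float = 0.25,
                       base: float = 10000.0):
    """q, k: (batch, heads, seq_len, head_dim); positions: (seq_len,)."""
    head_dim = q.shape[-1]
    rotary_dim = int(head_dim * rotary_pct)  # dimensions that receive RoPE

    # Standard RoPE frequencies, computed for the rotary slice only.
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2,
                                            dtype=torch.float32) / rotary_dim))
    angles = positions[:, None].float() * inv_freq[None, :]  # (seq, rotary_dim/2)
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)    # (seq, rotary_dim)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)

    def rotate(x):
        x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
        x_rot = x_rot * cos + rotate_half(x_rot) * sin
        # The pass-through slice is left untouched (NoPE).
        return torch.cat((x_rot, x_pass), dim=-1)

    return rotate(q), rotate(k)


# Example usage with illustrative shapes.
q = torch.randn(1, 8, 128, 64)   # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 128, 64)
q_rot, k_rot = apply_partial_rope(q, k, torch.arange(128), rotary_pct=0.25)
```

Under this scheme, only the rotated slice of each query/key contributes position-dependent attention scores, while the untouched slice behaves as NoPE; the open question above concerns how this split affects extrapolation to sequence lengths beyond those seen in training.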
References
Earlier transformer variants, such as GPT-J and GPT-NeoX, also explored combining RoPE and NoPE by applying rotational embeddings to a portion of the head dimensions. However, the effectiveness of these approaches in length extrapolation remains an open question and an area of ongoing research.
— Rope to Nope and Back Again: A New Hybrid Attention Strategy
(2501.18795 - Yang et al., 30 Jan 2025) in Section 3, Model Architecture