
Effectiveness of partial rotary positional embedding for length extrapolation

Determine the effectiveness of applying Rotary Position Embedding (RoPE) to only a subset of attention head dimensions—thereby combining RoPE and no positional embedding (NoPE) as done in GPT-J and GPT-NeoX—for length extrapolation in transformer language models, and characterize its behavior across long-context settings.


Background

The paper analyzes attention mechanisms and positional encoding strategies for long-context LLMs and proposes a hybrid architecture (RNoPE-SWA) that interleaves NoPE and RoPE layers with sliding-window attention in RoPE layers. This design aims to preserve strong retrieval behavior while leveraging recency bias for local processing and computational efficiency.
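As a rough illustration of the interleaving idea, the sketch below builds a per-layer plan in which most layers use RoPE with sliding-window attention and periodic layers use NoPE with full attention. The 1-in-4 ratio, the window size, and the configuration keys are placeholder assumptions for illustration, not the paper's actual hyperparameters.

def rnope_swa_layer_plan(n_layers, nope_every=4, window=4096):
    # Assumed pattern: every `nope_every`-th layer is a NoPE layer with full
    # (global) attention; all other layers use RoPE with sliding-window attention.
    plan = []
    for i in range(n_layers):
        if i % nope_every == 0:
            plan.append({"layer": i, "pos_emb": "none", "attention": "full"})
        else:
            plan.append({"layer": i, "pos_emb": "rope",
                         "attention": f"sliding_window_{window}"})
    return plan

for layer_cfg in rnope_swa_layer_plan(8):
    print(layer_cfg)

The point of the pattern is that the NoPE layers handle long-range retrieval over the full context, while the RoPE layers attend only within a local window, which is where recency bias and the efficiency savings come from.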

Earlier transformer variants, including GPT-J and GPT-NeoX, explored an alternative way of combining RoPE and NoPE by applying rotary embeddings to only a portion of each head's dimensions (a partial-rotary approach). The authors explicitly note that, despite this prior use, it remains unknown how effective the partial-rotary design is for length extrapolation, marking it as an open research direction relevant to long-context modeling.
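To make the partial-rotary idea concrete, the sketch below applies rotary embeddings to only the first rotary_dim dimensions of each attention head and leaves the remaining dimensions position-free. The rotary_dim value, the interleaved pairing convention, and the tensor shapes are illustrative assumptions and not the exact GPT-J or GPT-NeoX implementation, which differ in such details.

import torch

def build_rope_cache(seq_len, rotary_dim, base=10000.0):
    # Frequencies for the rotated sub-dimensions (rotary_dim must be even).
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)   # (seq_len, rotary_dim // 2)
    return torch.cos(angles), torch.sin(angles)

def apply_partial_rope(x, cos, sin, rotary_dim):
    # x: (batch, seq_len, n_heads, head_dim). Rotate only the first
    # rotary_dim dimensions of each head; the remaining head_dim - rotary_dim
    # dimensions pass through unchanged (NoPE-like).
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    cos = cos[None, :, None, :]                 # broadcast over batch and heads
    sin = sin[None, :, None, :]
    rotated = torch.stack(
        (x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1
    ).flatten(-2)
    return torch.cat([rotated, x_pass], dim=-1)

# Example: rotate a quarter of each head's dimensions as a stand-in for a
# partial-rotary setting; the open question is how such a split behaves when
# the model is evaluated beyond its training context length.
batch, seq_len, n_heads, head_dim, rotary_dim = 2, 16, 4, 64, 16
q = torch.randn(batch, seq_len, n_heads, head_dim)
cos, sin = build_rope_cache(seq_len, rotary_dim)
q_rope = apply_partial_rope(q, cos, sin, rotary_dim)
print(q_rope.shape)  # torch.Size([2, 16, 4, 64])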

References

Earlier transformer variants, such as GPT-J and GPT-NeoX, also explored combining RoPE and NoPE by applying rotational embeddings to a portion of the head dimensions. However, the effectiveness of these approaches in length extrapolation remains an open question and an area of ongoing research.

Rope to Nope and Back Again: A New Hybrid Attention Strategy (2501.18795 - Yang et al., 30 Jan 2025) in Section 3, Model Architecture