
PI-RoPE: Interpolated Rotary Encoding

Updated 27 July 2025
  • The paper introduces a scalable method that interpolates token positions to adapt rotary embedding angles, preserving attention stability and extending context length.
  • It refines the original RoPE by employing dynamic scaling and layer-specific optimization to improve positional resolution and mitigate rapid decay over long sequences.
  • Empirical evaluations demonstrate enhanced performance in language modeling, retrieval, and multimodal tasks, evidencing practical gains for extended-context transformer architectures.

Position-Interpolated Rotary Positional Encoding (PI-RoPE) and related rotary positional embedding methods represent a class of positional encoding techniques designed to embed both absolute and relative position information in transformer architectures using parameterized rotations in embedding space. PI-RoPE is closely tied to the original Rotary Position Embedding (RoPE) concept but refines and extends it to support longer sequence extrapolation, improved resolution, and compatibility with diverse model architectures by means of interpolation and scaling strategies. The method yields practical benefits for long-context modeling, extrapolation, and computational efficiency, and serves as a theoretical basis for further developments in positional encoding.

1. Mathematical Formulation and Rotary Encoding Principle

The core of rotary positional encoding is its multiplicative embedding scheme, where each token embedding is rotated by a position-dependent angle in 2D subspaces of the embedding dimension. In the standard case, let $w_m$ denote the (query/key) vector at position $m$, and $\theta_i$ the frequency parameters over the $d/2$ two-dimensional subspaces. The rotary transform is $f_{q/k}(w_m, m) = R_{\Theta, m} w_m$ with

$$R_{\Theta, m} = \bigoplus_{i=1}^{d/2} \begin{bmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{bmatrix}$$

where $\theta_i = 10000^{-2(i-1)/d}$.

In the attention mechanism, the dot product between two position-encoded tokens at positions $m, n$ is

$$(R_{\Theta, m} w_m)^\top (R_{\Theta, n} w_n) = w_m^\top \left(R_{\Theta, m}^\top R_{\Theta, n}\right) w_n$$

Since $R_{\Theta, m}^\top R_{\Theta, n}$ is the relative rotation for displacement $n - m$, the attention calculation becomes a function of the relative position: crucially, the self-attention mechanism naturally encodes both absolute (per-token) and relative (inter-token) information as a consequence of this construction (Su et al., 2021, Li et al., 2021).
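
In practice the block-diagonal rotation is applied with elementwise sine/cosine products rather than explicit matrices. The following is a minimal NumPy sketch of the standard rotary transform defined above; the function name `apply_rope` and the pairing of adjacent dimensions are illustrative choices rather than any particular library's API.

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Rotate pairs of embedding dimensions by position-dependent angles.

    x:         (seq_len, d) query or key vectors, with d even
    positions: (seq_len,) token positions m
    Returns f_{q/k}(w_m, m) = R_{Theta,m} w_m for each row.
    """
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)     # theta_i = 10000^(-2(i-1)/d)
    angles = positions[:, None] * theta[None, :]       # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                    # the d/2 two-dimensional subspaces
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                 # 2D rotation in each subspace
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# The dot product of rotated vectors depends only on the displacement n - m:
q = np.random.default_rng(0).normal(size=(1, 64))
score_a = apply_rope(q, np.array([3])) @ apply_rope(q, np.array([7])).T    # displacement 4
score_b = apply_rope(q, np.array([13])) @ apply_rope(q, np.array([17])).T  # displacement 4
assert np.allclose(score_a, score_b)
```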

2. Position Interpolation for Extrapolation

Transformers pretrained with RoPE are limited in context length by the range of positions seen during training. When evaluated on longer sequences, their rotation angles can exceed those experienced during training, leading to numerical instability and degraded performance. Position interpolation addresses this by scaling position indices during inference so that their effective range matches that observed during training. Specifically, for new input positions $p$ over a new maximum length $L' > L$ (the training maximum), the rescaled position is

$$p' = p \cdot \frac{L}{L'}$$

This remapping compresses the effective angles for the rotary operation, keeping attention score magnitudes stable and significantly improving extrapolation capability (Al-Khateeb et al., 2023). Experiments on language modeling benchmarks demonstrated that this modification can double the effective input context with little performance degradation.

PI-RoPE can also be viewed as a form of dynamic scaling within the rotation: for each layer (or globally), the position argument $m$ in $R_{\Theta, m}$ is replaced with $m/\sigma$ for some scaling factor $\sigma$ determined by the ratio of inference to training context length, or via per-layer optimization (see §5).
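
As a concrete sketch, and reusing the hypothetical `apply_rope` helper from §1, position interpolation amounts to dividing the position indices before the rotation is applied; the 2048/4096 lengths below are arbitrary example values, not settings from any cited experiment.

```python
import numpy as np

def interpolated_positions(seq_len, train_max_len):
    """Rescale positions p -> p * L / L' so rotation angles stay in the trained range."""
    positions = np.arange(seq_len, dtype=np.float64)
    scale = min(1.0, train_max_len / seq_len)   # only compress when exceeding L
    return positions * scale

# Example: a model trained with L = 2048 evaluated on a 4096-token input.
p_prime = interpolated_positions(seq_len=4096, train_max_len=2048)
print(p_prime[-1])                  # 2047.5: the largest rescaled position stays below L
# q_rot = apply_rope(q, p_prime)    # then rotate queries/keys with the compressed positions
```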

3. Decay, Resolution, and Locus of Attention

A crucial intrinsic property of rotary encodings is the decay of the inner product between token representations with increasing relative distance. Mathematically, the real part of the self-attention score for a frequency band $\theta$ is proportional to $\cos((m-n)\theta)$, which oscillates and decays as tokens move further apart, matching the inductive bias that distant tokens interact less strongly (Su et al., 2021). However, overly rapid decay can cause information in long sequences to be underrepresented or lost (the "lost-in-the-middle" problem (Wang et al., 6 Mar 2025)). Linear position interpolation slows this decay, effectively "stretching out" the neighborhood of strong interactions.

Resolution is another critical distinction. When positions are compressed to fit longer contexts via uniform interpolation, the effective positional resolution, i.e. the smallest distance between distinguishable positions, shrinks as $\Delta \sim L_{\text{pretrain}} / L_{\text{test}}$. Interpolated RoPE variants with chunking and additional rotation dimensions, such as the 3D generalization of (Ma et al., 14 Jun 2024), improve on this by preserving or enhancing resolution.
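
The decay is easiest to see by summing the per-band terms $\cos((m-n)\theta_i)$ directly, i.e. the rotary inner product for constant content vectors. The short sketch below, which assumes the standard frequency schedule and an arbitrary interpolation factor of 4, prints this (oscillatory, non-monotone) sum at several relative distances; dividing the distance by the factor visibly stretches the region of strong interaction.

```python
import numpy as np

d, base = 64, 10000.0
theta = base ** (-2.0 * np.arange(d // 2) / d)    # theta_i = 10000^(-2(i-1)/d)

def relative_score(rel_dist, scale=1.0):
    """Sum of cos((m - n) * theta_i) over the d/2 frequency bands.

    This is the rotary inner product for all-ones query/key content; its decay
    with |m - n| is the locality bias discussed above, and dividing by `scale`
    mimics linear position interpolation.
    """
    return np.sum(np.cos((rel_dist / scale) * theta))

for dist in (0, 1, 8, 64, 512, 4096):
    print(f"|m-n|={dist:5d}  raw={relative_score(dist):7.2f}  "
          f"interpolated_x4={relative_score(dist, scale=4.0):7.2f}")
```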

4. Extensions, Generalizations, and Theoretical Foundations

The theoretical underpinnings of PI-RoPE and extensions such as STRING (Schenck et al., 4 Feb 2025), ComRoPE (Yu et al., 4 Jun 2025), and 3D-RPE (Ma et al., 14 Jun 2024) are rooted in Lie group and Lie algebra theory. RoPE and its generalizations are position-dependent orthogonal transformations, i.e. matrix exponentials of skew-symmetric generators $A_i$:

$$R(x) = \exp\Big(\sum_{i} A_i x_i\Big)$$

These generators must commute to ensure the additivity/relativity property $R(x)^\top R(y) = R(y - x)$. Standard RoPE is a special case using block-diagonal 2D rotations with hand-designed frequencies; STRING and ComRoPE generalize this by introducing arbitrary trainable commuting angle matrices, supporting higher-dimensional, multimodal (e.g., RGB+D, video), and even learnable position encodings (Liu et al., 7 Apr 2025, Schenck et al., 4 Feb 2025, Yu et al., 4 Jun 2025).

The block-diagonal structure ensures computational tractability while allowing efficient dot product computation and translation invariance.
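
A minimal numerical check of this additivity/relativity property is shown below, assuming two block-diagonal (hence commuting) skew-symmetric generators and SciPy's `expm`; the 4x4 toy generators are illustrative and do not correspond to any published parameterization.

```python
import numpy as np
from scipy.linalg import expm

# Two commuting skew-symmetric generators, each acting on a separate 2D block.
A1 = np.zeros((4, 4)); A1[0, 1], A1[1, 0] = -1.0, 1.0    # rotation in dims (0, 1)
A2 = np.zeros((4, 4)); A2[2, 3], A2[3, 2] = -0.3, 0.3    # rotation in dims (2, 3)

def R(x):
    """Position-dependent orthogonal map R(x) = exp(sum_i A_i x_i)."""
    return expm(A1 * x[0] + A2 * x[1])

x, y = np.array([1.7, -0.4]), np.array([3.2, 2.5])

# Relativity: R(x)^T R(y) = R(y - x) because the generators commute.
assert np.allclose(R(x).T @ R(y), R(y - x))
# Orthogonality: skew-symmetric generators exponentiate to orthogonal matrices.
assert np.allclose(R(x).T @ R(x), np.eye(4))
```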

5. Scaling, Interpolation Strategies, and Optimization

Uniform linear scaling is the prototypical form of position interpolation; however, more refined strategies apply distinct scaling factors to different layers ("layer-specific scaling") or heads (Wang et al., 6 Mar 2025). In this approach, layerwise scaling factors $\sigma_l$ are selected (via genetic algorithms with a Bézier-curve parameterization that reduces the search space) such that

$$f_{\text{scaled}}(x, m; \sigma_l) = R_{\Theta, m/\sigma_l}\, x$$

This improves accuracy by up to 20% on key-value retrieval benchmarks and alleviates the tendency of uniform scaling to sacrifice end-of-context retrieval for improved middle-context attention. Layer-specific interpolation enables models to better allocate representational capacity across layers, yielding improved extrapolation, lower perplexity on extended contexts, and more balanced long-range attention.
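
The sketch below illustrates one way such a layerwise schedule could be wired in, generating per-layer factors $\sigma_l$ from a quadratic Bézier curve over the layer index and reusing the hypothetical `apply_rope` helper from §1; the control values are arbitrary and this is only the shape of the parameterization, not the search procedure of Wang et al.

```python
import numpy as np

def bezier_sigmas(num_layers, c0=2.0, c1=6.0, c2=4.0):
    """Per-layer scaling factors sigma_l from a quadratic Bezier curve.

    Parameterizing the schedule with a few control points (here arbitrary
    illustrative values) keeps the search space far smaller than optimizing
    every sigma_l independently.
    """
    t = np.linspace(0.0, 1.0, num_layers)
    return (1 - t) ** 2 * c0 + 2 * (1 - t) * t * c1 + t ** 2 * c2

def scaled_positions(seq_len, sigma_l):
    """Positions m / sigma_l fed to the rotary transform of layer l."""
    return np.arange(seq_len, dtype=np.float64) / sigma_l

sigmas = bezier_sigmas(num_layers=32)
for layer, sigma in enumerate(sigmas):
    p = scaled_positions(seq_len=8192, sigma_l=sigma)
    # q_rot = apply_rope(q_layer, p)   # rotate this layer's queries/keys with m / sigma_l
```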

6. Empirical Evaluations and Practical Applications

The PI-RoPE framework has consistently yielded practical benefits across tasks:

| Application Area | Empirical Benefit | Reference |
|---|---|---|
| Language modeling | Better perplexity for long contexts | (Al-Khateeb et al., 2023) |
| Summarization/retrieval (16k+) | Dramatic ROUGE and retrieval gains | (Al-Khateeb et al., 2023) |
| ASR (speech recognition) | Lower error, faster training | (Zhang et al., 10 Jan 2025) |
| Vision, video, robotics | Improved accuracy, 3D/2D invariance | (Schenck et al., 4 Feb 2025, Liu et al., 17 Feb 2025) |
| Long-context LLMs | Up to 20% better retrieval; balanced accuracy | (Wang et al., 6 Mar 2025) |

In real-world settings, PI-RoPE and its generalizations (e.g., STRING, VRoPE, 3D-RPE) are deployed in LLMs for document understanding, code and reasoning tasks, ASR systems, Vision Transformers, robotics, and multimodal transformers for video and images (Schenck et al., 4 Feb 2025, Ma et al., 14 Jun 2024, Liu et al., 17 Feb 2025).

No additional retraining is required for position interpolation—scaling can be applied post hoc during inference, making it a simple and practical approach for context window extension.
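
Several open-source stacks expose this as a configuration switch. The snippet below shows the general shape such a setting takes for LLaMA-style models in Hugging Face `transformers`; the exact key names (e.g. `"type"` vs. `"rope_type"`) and supported modes vary by library version, so treat it as an assumed illustration rather than a pinned API.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Assumed checkpoint; any RoPE-based causal LM whose config has a `rope_scaling`
# field works the same way. factor=2.0 corresponds to L'/L = 2 (e.g. 4k -> 8k).
name = "meta-llama/Llama-2-7b-hf"
config = AutoConfig.from_pretrained(name)
config.rope_scaling = {"type": "linear", "factor": 2.0}   # key names vary by version

model = AutoModelForCausalLM.from_pretrained(name, config=config)
# Position indices are now divided by 2.0 inside the rotary embedding at
# inference time, with no additional retraining.
```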

7. Limitations, Challenges, and Future Directions

While position-interpolated rotary encoding efficiently enables flexible, scalable context windows, empirical findings reveal important subtleties:

  • Decay rate slows with scaling, which may spread attention too broadly if not tuned per layer, undermining focus on recent tokens—necessitating layer/head-specific optimization (Wang et al., 6 Mar 2025).
  • In certain settings, explicit relative position encodings (e.g., T5's relative bias) still outperform interpolated RoPE for pure length generalization, and Transformers with no explicit PE can sometimes implicitly learn positional structure (Kazemnejad et al., 2023).
  • Designs using only high-frequency components (e.g., HoPE (Chen et al., 28 Oct 2024)) or block-interpolated/chunked geometries (e.g., 3D-RPE (Ma et al., 14 Jun 2024)) show promise for future methods, with improved robustness, controllable decay, and enhanced resolution.
  • For cross-modal settings (vision-language), geometric encoding (e.g., Circle-RoPE (Wang et al., 22 May 2025)) and layerwise alternation balance specificity and modality decoupling.
  • Integration with other architectures (e.g., state-space models via Unified RoPE (Wu et al., 11 Jun 2025)) enables seamless long-context and hybrid sequence modeling.

A plausible implication is that the future of positional encoding will involve a diverse toolbox: flexible, theoretically grounded transformations (e.g., trainable commuting matrices), position interpolation or scaling adapted per layer or modality, and composition with bias, polynomial, or neural encodings tailored to model depth and task.

