Native-RoPE: Disentangled Rotary Position Embeddings
- The paper introduces Native-RoPE, a method that cleanly separates content from positional information in attention models.
- It leverages the representation theory of Lie algebras via maximal abelian subalgebras to enable efficient, learnable, and multidimensional positional encoding.
- Empirical results show that Native-RoPE and its instantiation PoPE outperform standard RoPE in accuracy, perplexity, and long-context extrapolation tasks.
Disentangled Rotary Position Embeddings (Native-RoPE) are a theoretical and algorithmic refinement of rotary position embedding (RoPE), proposed to achieve clean separation of content (“what”) and positional (“where”) information in attention-based neural architectures. Native-RoPE grounds its constructions in the representation theory of Lie algebras—specifically, maximal abelian subalgebras (MASAs) of the special orthogonal Lie algebra—providing a principled unified blueprint for positional embeddings in one and higher dimensions. This framework explains, extends, and unifies prior RoPE variants while enabling both efficient algorithmic implementation and new design freedom for learnable interaction across spatial dimensions. Additionally, Native-RoPE (and its instantiations such as Polar Coordinate Position Embeddings, PoPE) are shown to outperform standard RoPE on tasks requiring disentangled content and position matching and on long-context sequence extrapolation tasks (Liu et al., 7 Apr 2025, Gopalakrishnan et al., 5 Sep 2025).
1. Mathematical Foundations of RoPE and Disentanglement
Rotary Position Embeddings inject relative positional information into attention-based models by applying parameterized rotations to query and key vectors. Formally, for a position $m$, RoPE constructs a rotation matrix $R(m)$ that encodes position as a product of elemental rotations. The standard construction for one-dimensional sequences partitions embedding vectors into two-dimensional subspaces and applies independent planar rotations (parameterized by distinct frequencies $\theta_j$) per subspace (Su et al., 2021). The core RoPE operations satisfy two axioms:
- Relativity: For any two positions $m$ and $n$, the composition satisfies $R(m)^{\top} R(n) = R(n-m)$.
- Reversibility (Injectivity): The mapping from position to rotation is injective; that is, $R(m) = R(n)$ implies $m = n$ (Liu et al., 7 Apr 2025).
Native-RoPE generalizes the construction by requiring the generator matrices to form a basis of a MASA—maximal abelian subalgebra—ensuring commutativity and linear independence, so that the mapping remains injective.
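The two axioms above can be checked numerically for the standard one-dimensional block-diagonal construction. The following is a minimal NumPy sketch (function names are illustrative, not from the papers) that builds $R(m)$ from per-pair planar rotations and verifies relativity:

```python
import numpy as np

def rope_matrix(m, thetas):
    """Block-diagonal RoPE rotation R(m): one 2x2 planar rotation per frequency."""
    d = 2 * len(thetas)
    R = np.zeros((d, d))
    for j, t in enumerate(thetas):
        a = m * t
        R[2*j:2*j+2, 2*j:2*j+2] = [[np.cos(a), -np.sin(a)],
                                   [np.sin(a),  np.cos(a)]]
    return R

thetas = 10000.0 ** (-np.arange(4) / 4.0)  # standard geometric frequency grid
m, n = 3.0, 7.0

# Relativity: R(m)^T R(n) == R(n - m)
lhs = rope_matrix(m, thetas).T @ rope_matrix(n, thetas)
rhs = rope_matrix(n - m, thetas)
assert np.allclose(lhs, rhs)
```

Injectivity holds on any position range smaller than the period of the slowest frequency, which is what the geometric frequency grid is designed to guarantee in practice.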
2. Construction via Maximal Abelian Subalgebras and Orthogonal Mixing
Standard RoPE corresponds to choosing the maximal toral MASA in $\mathfrak{so}(d)$, resulting in block-diagonal skew-symmetric matrices acting independently on each coordinate pair. This yields

$$R(m) = \mathrm{diag}\big(R_2(m\theta_1), \ldots, R_2(m\theta_{d/2})\big),$$

with $R_2(\alpha)$ denoting rotation by $\alpha$ in the plane. To enable structured or learned interactions among position dimensions while preserving relativity and reversibility, Native-RoPE introduces an orthogonal change of basis via $Q \in O(d)$:

$$\widetilde{R}(m) = Q\,R(m)\,Q^{\top}.$$
This approach preserves all necessary algebraic properties; conjugation by $Q$ yields another MASA and enables full-rank inter-dimensional mixing without sacrificing the computational tractability of block-wise rotation (Liu et al., 7 Apr 2025).
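That conjugation preserves both axioms can be verified directly: if $R$ satisfies relativity, so does $Q R Q^{\top}$ for any fixed orthogonal $Q$. A small NumPy check under these assumptions (the random $Q$ here stands in for a learned or structured mixing matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

def rope_matrix(m, thetas):
    """Block-diagonal RoPE rotation: one 2x2 planar rotation per frequency."""
    d = 2 * len(thetas)
    R = np.zeros((d, d))
    for j, t in enumerate(thetas):
        a = m * t
        R[2*j:2*j+2, 2*j:2*j+2] = [[np.cos(a), -np.sin(a)],
                                   [np.sin(a),  np.cos(a)]]
    return R

thetas = 10000.0 ** (-np.arange(4) / 4.0)
# Random orthogonal mixing matrix Q via QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

def mixed(m):
    # Conjugated rotation: still orthogonal, still commutes across positions
    return Q @ rope_matrix(m, thetas) @ Q.T

m, n = 2.0, 5.0
assert np.allclose(mixed(m).T @ mixed(n), mixed(n - m))       # relativity preserved
assert np.allclose(mixed(m) @ mixed(n), mixed(n) @ mixed(m))  # commutativity preserved
```

In practice $Q$ is not random but parameterized (e.g., via a Cayley transform or Givens rotations, as in Section 4) so that it can be learned while remaining exactly orthogonal.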
3. Disentanglement of Content and Position: Theory and Implementation
Standard RoPE entangles content ("what") information (magnitude and phase of each subspace) with positional ("where") information (rotation phase), as the attention logit becomes

$$\sum_j |q_j|\,|k_j| \cos\big((m-n)\theta_j + \phi_{q,j} - \phi_{k,j}\big),$$

where $\phi_{q,j}, \phi_{k,j}$ are content-dependent phases, interfering with pure relative position encoding (Gopalakrishnan et al., 5 Sep 2025).
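The entanglement is visible when the RoPE logit is written in complex per-channel form: the content phases of $q$ and $k$ add directly to the positional rotation inside the cosine. A short NumPy demonstration of this identity (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d2 = 4
thetas = 10000.0 ** (-np.arange(d2) / d2)

# Complex per-channel query/key: each channel has a magnitude and a
# content-dependent phase
q = rng.normal(size=d2) + 1j * rng.normal(size=d2)
k = rng.normal(size=d2) + 1j * rng.normal(size=d2)
m, n = 3, 11

# RoPE logit in complex form: Re sum_j q_j conj(k_j) exp(i (m-n) theta_j)
logit = np.sum(q * np.conj(k) * np.exp(1j * (m - n) * thetas)).real

# Equivalent magnitude/phase form: |q||k| cos((m-n) theta + phi_q - phi_k);
# the content phases phi_q - phi_k shift the positional phase
phase_form = np.sum(np.abs(q) * np.abs(k) *
                    np.cos((m - n) * thetas + np.angle(q) - np.angle(k)))
assert np.allclose(logit, phase_form)
```

Because $\phi_{q,j} - \phi_{k,j}$ varies with content, the effective relative-position signal seen by each channel shifts from token to token, which is the confound PoPE removes.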
Native-RoPE (as in PoPE) achieves clean separation by constructing query/key as complex vectors whose phases depend deterministically only on position:
- Content ($|q_j|$, $|k_j|$): Encoded by magnitudes via a non-negative activation (e.g., softplus).
- Position ($m\theta_j$): Encoded solely through a controlled per-channel phase for token position $m$.
Attention then becomes

$$\sum_j |q_j|\,|k_j| \cos\big((m-n)\theta_j + \delta_j\big),$$

eliminating any content-driven phase interaction. The per-channel bias $\delta_j$ is introduced for maximum control, optionally learned and clamped to a bounded interval. This structure collectively enables:
- Content-parametrized magnitude.
- Position-parametrized (not content-shifted) phase (Gopalakrishnan et al., 5 Sep 2025).
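A minimal sketch of this magnitude/phase factorization, assuming softplus magnitudes and per-channel phase biases on both query and key (the exact placement of the bias in the published PoPE formulation may differ):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

rng = np.random.default_rng(2)
d2 = 4
thetas = 10000.0 ** (-np.arange(d2) / d2)
# Hypothetical per-channel phase biases (learned and clamped in practice)
delta_q = rng.uniform(-1.0, 1.0, size=d2)
delta_k = rng.uniform(-1.0, 1.0, size=d2)

def pope_vec(c, pos, delta):
    # Magnitude from content via softplus; phase from position (and bias) only
    return softplus(c) * np.exp(1j * (pos * thetas + delta))

cq, ck = rng.normal(size=d2), rng.normal(size=d2)
m, n = 5, 9
logit = np.sum(pope_vec(cq, m, delta_q) * np.conj(pope_vec(ck, n, delta_k))).real

# Phases enter only through (m - n) and the fixed biases: content never shifts them
expected = np.sum(softplus(cq) * softplus(ck) *
                  np.cos((m - n) * thetas + delta_q - delta_k))
assert np.allclose(logit, expected)
```

Unlike standard RoPE, changing the content vectors `cq`, `ck` here rescales each cosine term but cannot move its phase, which is exactly the disentanglement property claimed above.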
An alternative, block-linear disentangled construction reserves distinct subspaces for content and position, applying RoPE only to the position projection and leaving the content subspace invariant. The attention score then decomposes directly into content and position similarity:

$$\mathrm{score}(m,n) = q_c^{\top} k_c + q_p^{\top} R(n-m)\, k_p.$$
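A sketch of this block-linear split, with illustrative subspace dimensions: rotating only the position subspace makes the inner product separate into a position-independent content term plus a purely relative position term.

```python
import numpy as np

rng = np.random.default_rng(3)
dc, dp = 4, 4  # content and position subspace dims (illustrative choice)

def rope_matrix(m, thetas):
    """Block-diagonal RoPE rotation acting on the position subspace."""
    d = 2 * len(thetas)
    R = np.zeros((d, d))
    for j, t in enumerate(thetas):
        a = m * t
        R[2*j:2*j+2, 2*j:2*j+2] = [[np.cos(a), -np.sin(a)],
                                   [np.sin(a),  np.cos(a)]]
    return R

thetas = 10000.0 ** (-np.arange(dp // 2) / (dp // 2))
qc, kc = rng.normal(size=dc), rng.normal(size=dc)
qp, kp = rng.normal(size=dp), rng.normal(size=dp)
m, n = 2, 6

# Apply RoPE only on the position subspace; content subspace left untouched
q = np.concatenate([qc, rope_matrix(m, thetas) @ qp])
k = np.concatenate([kc, rope_matrix(n, thetas) @ kp])
score = q @ k

# Decomposes into content similarity + relative-position similarity
assert np.allclose(score, qc @ kc + qp @ (rope_matrix(n - m, thetas) @ kp))
```

The decomposition follows from relativity: $(R(m)q_p)^{\top}(R(n)k_p) = q_p^{\top}R(n-m)k_p$, while the content block contributes $q_c^{\top}k_c$ unchanged.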
4. Algorithmic Steps and Practical Instantiations
The Native-RoPE/PoPE procedures operate as follows:
- Input: Spatial dimension $n$; frequencies $\theta_1, \ldots, \theta_{d/2}$; optional learnable skew parameters (for inter-dimensional mixing).
- Generator Construction: Build toral generators $B_1, \ldots, B_n$ as block-diagonal skew-symmetric matrices.
- Orthogonal Mixing: Compute $Q$ using a Cayley transform, matrix exponential, or Givens rotations; form $\widetilde{B}_k = Q B_k Q^{\top}$.
- Position Mapping: For absolute position $p = (p_1, \ldots, p_n)$, compute $A(p) = \sum_k p_k \widetilde{B}_k$.
- Rotation Application: Evaluate $\exp(A(p)) = Q \exp\big(\sum_k p_k B_k\big) Q^{\top}$, exploiting block structure for efficiency:
- Application to Q/K: Rotate input vectors, or in the PoPE instantiation, assemble $q_j = \sigma_{+}(c_{q,j})\, e^{i(m\theta_j + \delta_j)}$ with non-negative magnitude $\sigma_{+}$ (e.g., softplus), phase $m\theta_j + \delta_j$ (with $\delta_j$ the per-channel bias), and similarly for key vectors.
Implementation can be fused for efficiency. In PoPE, frequencies are regularly spaced, and the magnitude-phase computation is kernel-fusible requiring only a single additional multiply compared to RoPE (Gopalakrishnan et al., 5 Sep 2025).
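The algebraic fact underlying the position-mapping and rotation steps is that the toral generators commute, so the matrix exponential of a linear combination factors into a product of cheap per-generator rotations. A small NumPy verification (the power-series `exp_skew` is only a demo; production code uses the closed-form 2x2 rotations):

```python
import numpy as np

def toral_generator(d, j):
    """Skew-symmetric generator acting on coordinate pair (2j, 2j+1)."""
    B = np.zeros((d, d))
    B[2*j, 2*j+1] = -1.0
    B[2*j+1, 2*j] = 1.0
    return B

def exp_skew(A, terms=60):
    # Plain power-series matrix exponential (adequate for this small demo)
    E = np.eye(A.shape[0])
    P = np.eye(A.shape[0])
    for k in range(1, terms):
        P = P @ A / k
        E = E + P
    return E

d = 4
B1, B2 = toral_generator(d, 0), toral_generator(d, 1)
# Toral MASA generators act on disjoint blocks, hence commute
assert np.allclose(B1 @ B2, B2 @ B1)

p = np.array([0.7, -1.3])  # a 2-D position
A = p[0] * B1 + p[1] * B2
# exp(p1 B1 + p2 B2) = exp(p1 B1) exp(p2 B2) because the generators commute
assert np.allclose(exp_skew(A), exp_skew(p[0] * B1) @ exp_skew(p[1] * B2))
```

Commutativity is precisely what the MASA requirement buys: without it the exponential would not factor, and block-wise evaluation would no longer be exact.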
5. Empirical Evaluations and Comparative Results
Native-RoPE (PoPE) demonstrates empirical advantages in both synthetic and real modeling tasks:
- Indirect Indexing (pointer arithmetic): PoPE achieves ≈95% accuracy vs RoPE’s ≈11%, confirming robust separation of content and position signals.
- Autoregressive Sequence Modeling (music, genomics, natural language): Across tasks such as symbolic music (JSB, MAESTRO), genomic sequence modeling (NLL), and language modeling (OpenWebText perplexity), PoPE consistently outperforms RoPE across model scales (124M to 774M parameters).
- Length Extrapolation: On long-context benchmarks (e.g., PG-19 at lengths beyond the training context), Native-RoPE maintains flat perplexity, while RoPE degrades and specialized extrapolation methods (e.g., YaRN) require fine-tuning and frequency interpolation.
- Downstream Zero-shot Tasks: PoPE surpasses RoPE across LAMBADA, BLIMP, CBT, HellaSwag, PIQA, ARC-E.
- Frequency Utilization: PoPE engages a full range of frequencies, as demonstrated by frequency-usage heatmaps, while RoPE underutilizes higher channels due to phase entanglement (Gopalakrishnan et al., 5 Sep 2025).
| Task | RoPE | Native-RoPE (PoPE) |
|---|---|---|
| Indirect Indexing Acc. | ≈11% | ≈95% |
| OWT-Perplexity (124M–774M) | 21.55–15.85 | 21.33–15.45 |
| JSB NLL | $0.5081$ | $0.4889$ |
| MAESTRO NLL | $1.501$ | $1.486$ |
| Genome NLL | $4.217$ | $4.152$ |
6. Limitations, Trade-offs, and Future Directions
Native-RoPE increases memory and bandwidth requirements per attention head, as real and imaginary parts for complex representations require doubled storage relative to RoPE; however, kernel-fused implementations can mitigate this overhead. The per-frequency bias introduces additional parameters per head. Compute cost increases marginally (one extra element-wise multiply per position-channel pair) but is insignificant compared to total attention compute.
Several open directions remain:
- PoPE and Native-RoPE rely on fixed frequency grids; learning the frequencies or adopting non-uniform spacing remains unexplored.
- Broader evaluation for cross-attention, encoder-decoder architectures, and bidirectional tasks is outstanding.
- Optimization of fused-kernel implementations to eliminate any remaining overhead.
- Empirical scaling studies for models with 10B parameters and in the context of long-context pretraining and fine-tuning strategies (Gopalakrishnan et al., 5 Sep 2025).
A plausible implication is that Native-RoPE’s algebraic generality and empirical superiority on sequence and structure-sensitive tasks render it a strong candidate for deployment in domains demanding robust length generalization, hierarchical reasoning, and high-fidelity localization, such as genomics, symbolic music, and document-scale language modeling.
7. Relation to Broader Context and Advancements
The mathematical unification of RoPE and its generalization through MASAs supplies a rigorous foundation for all rotary-type position encodings, reconciling empirical design with representation-theoretic principles. Disentangled approaches extend RoPE’s proven relative position sensitivity while removing confounds in compositional attention. This advances the field toward more interpretable, reliably extensible, and theoretically grounded positional modeling in neural sequence architectures (Liu et al., 7 Apr 2025, Su et al., 2021, Gopalakrishnan et al., 5 Sep 2025).