Axial Rotary Positional Embeddings

Updated 16 April 2026

Axial rotary positional embeddings are a class of relative encodings that extend 1D RoPE to multi-axis domains like images and videos.
They leverage independent sinusoidal rotations along each spatial or spatiotemporal axis to capture geometric structure and support resolution scaling.
Advanced variants such as Spiral RoPE, GeoPE, and LieRE overcome axis-aligned limitations by encoding oblique and joint rotations to enhance Transformer performance.

Axial rotary positional embeddings are a class of relative positional encoding schemes for Transformer models that generalize one-dimensional Rotary Position Embeddings (RoPE) to multi-axis domains such as images (2D) and videos (3D), enabling efficient and extrapolatable encoding of token positions in high-dimensional data. These embeddings combine the mathematical advantages of sinusoidal and rotational encodings with structural awareness of the underlying data grids, providing a unified attention mechanism that supports resolution scaling, improved geometric modeling, and efficient use of computational resources.

1. Mathematical Foundations of Rotary and Axial RoPE

The standard 1D Rotary Positional Embedding (RoPE) represents a token’s position $p$ by rotating each pair of consecutive embedding channels $(x_{2t-1}, x_{2t})$ in the complex plane, governed by a frequency $\omega_t$ drawn from a geometric sequence:

$\tilde{x}_t = x_{2t-1} + i\,x_{2t};\quad \mathrm{RoPE}(\tilde{x}_t, p) = \tilde{x}_t\,e^{i\,\omega_t p}$

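This construction makes the attention score between a query at position $p$ and a key at position $q$ depend only on the relative displacement $p - q$:

$\langle \mathrm{RoPE}(\mathbf{q}, p),\, \mathrm{RoPE}(\mathbf{k}, q) \rangle = \mathbf{q}^\top R(p-q) \mathbf{k}$

Here $R(p-q)$ denotes the block-diagonal matrix of $2 \times 2$ rotations whose angles are set by $\omega_t (p-q)$, so absolute positions cancel. The extrapolation property of RoPE stems from these trigonometric rotations being well-defined for any $p$, including positions not seen during training.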
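The relative-position property can be checked numerically. The following is a minimal sketch in plain Python, assuming a geometric schedule $\omega_t = b^{-t/n}$ with base $b = 10000$ (a common choice, not specified in this article); the function names and sample vectors are illustrative only.

```python
import cmath

def rope_frequencies(num_pairs, base=10000.0):
    """Geometric schedule omega_t = base ** (-t / num_pairs); base is an assumption."""
    return [base ** (-t / num_pairs) for t in range(num_pairs)]

def rope_1d(pairs, position, freqs):
    """Rotate each complex channel pair by the phase omega_t * position."""
    return [z * cmath.exp(1j * w * position) for z, w in zip(pairs, freqs)]

def attention_score(q_pairs, k_pairs):
    """Real inner product of two vectors stored as complex channel pairs."""
    return sum((a * b.conjugate()).real for a, b in zip(q_pairs, k_pairs))

freqs = rope_frequencies(num_pairs=4)
query = [1 + 2j, -0.5 + 0.3j, 0.7 - 1.1j, 0.2 + 0.9j]
key = [0.4 - 0.6j, 1.5 + 0.1j, -0.8 + 0.25j, 0.3 - 0.7j]

# Same offset p - q = 3, evaluated at two different absolute positions.
score_near = attention_score(rope_1d(query, 5, freqs), rope_1d(key, 2, freqs))
score_far = attention_score(rope_1d(query, 105, freqs), rope_1d(key, 102, freqs))
```

Both scores agree up to floating-point error, because only the offset enters the phase $e^{i\,\omega_t (p-q)}$.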

In axial rotary positional embeddings, this principle is extended to multidimensional grids by partitioning the embedding vector and applying independent RoPE encodings along each spatial or spatiotemporal axis. For a 2D image token at position $(x, y)$, the embedding is split and separately rotated with respect to $x$ and $y$, typically allocating half the embedding channels to each axis:

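$\mathrm{RoPE}_{2\mathrm{D}}(\tilde{x}_t, (x, y)) = \tilde{x}_t\,e^{i\,\omega_t x},\quad 1 \le t \le d/4$

$\mathrm{RoPE}_{2\mathrm{D}}(\tilde{x}_t, (x, y)) = \tilde{x}_t\,e^{i\,\omega_{t-d/4}\, y},\quad d/4 < t \le d/2$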

This approach allows Transformers to capture axis-aligned structure while maintaining extrapolation to arbitrary resolutions (Heo et al., 2024).
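A minimal sketch of the axial scheme follows, assuming the first half of the complex channel pairs is rotated by the $x$ coordinate and the second half by the $y$ coordinate, with both halves sharing one frequency list. The function names, frequencies, and sample vectors are illustrative assumptions, not taken from the cited work.

```python
import cmath

def axial_rope_2d(pairs, x, y, freqs):
    """Rotate the first half of the pairs by x and the second half by y."""
    half = len(pairs) // 2
    rot_x = [z * cmath.exp(1j * w * x) for z, w in zip(pairs[:half], freqs)]
    rot_y = [z * cmath.exp(1j * w * y) for z, w in zip(pairs[half:], freqs)]
    return rot_x + rot_y

def score(a, b):
    """Real inner product of two vectors stored as complex channel pairs."""
    return sum((u * v.conjugate()).real for u, v in zip(a, b))

freqs = [1.0, 0.1]  # illustrative frequencies, one per pair in each half
query = [1 + 2j, -0.5 + 0.3j, 0.7 - 1.1j, 0.2 + 0.9j]
key = [0.4 - 0.6j, 1.5 + 0.1j, -0.8 + 0.25j, 0.3 - 0.7j]

# Offset (2, 3) on a small grid and far outside it.
inside = score(axial_rope_2d(query, 3, 4, freqs), axial_rope_2d(key, 1, 1, freqs))
outside = score(axial_rope_2d(query, 503, 204, freqs), axial_rope_2d(key, 501, 201, freqs))
```

The two scores match up to rounding, which illustrates why the encoding remains usable at resolutions larger than those seen in training.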

2. Limitations and Axis-Aligned Bias of Standard Axial RoPE

The standard axial 2D RoPE formulation encodes relative positional offsets only along coordinate axes. For two patches at $(x, y)$ and $(x+1, y+1)$, this scheme interprets the offset as a unit step along each axis but fails to differentiate diagonal displacements, due to the independence of the rotations. In the 2D Fourier domain, the resulting basis functions are strictly axis-aligned, expressed as $e^{i\,\omega_t x}$ or $e^{i\,\omega_t y}$. Consequently, diagonal or oblique relationships in the data cannot be directly encoded, impairing the model's ability to reconstruct non-axis-aligned structures, such as circles or arbitrary contours, and leading to visual artifacts and suboptimal attention patterns (Liu et al., 3 Feb 2026).
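As a worked illustration of this limitation, the axial RoPE attention score for a relative offset $(\Delta x, \Delta y)$ can be expanded term by term. This is a sketch: the amplitudes $a_t, b_t$, phases $\alpha_t, \beta_t$, and channel index sets $T_x, T_y$ are notation introduced here for exposition and do not come from the cited work.

$s(\Delta x, \Delta y) = \sum_{t \in T_x} a_t \cos(\omega_t \Delta x + \alpha_t) + \sum_{t \in T_y} b_t \cos(\omega_t \Delta y + \beta_t)$

Each term follows from $\mathrm{Re}\left[\tilde{q}_t\,\overline{\tilde{k}_t}\,e^{i\,\omega_t \Delta}\right] = |\tilde{q}_t|\,|\tilde{k}_t| \cos\left(\omega_t \Delta + \arg(\tilde{q}_t\,\overline{\tilde{k}_t})\right)$. The score is therefore the sum of a function of $\Delta x$ alone and a function of $\Delta y$ alone. No term carries a phase that mixes the two offsets, such as $\cos(\omega(\Delta x + \Delta y))$, which is what a diagonal wave would require.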

3. Multi-Directional and Geometric Generalizations

To overcome axis-aligned constraints, several geometric generalizations of axial rotary positional embeddings have been introduced.

Spiral RoPE: This method partitions the embedding channels into $K$ groups, each assigned to one of $K$ uniformly distributed planar directions. Within each group, embedding rotation is governed by the projection of the token’s position onto the associated direction vector:

$\theta_k = \frac{k\pi}{K},\quad k = 0, \dots, K-1$

$p_k = x\cos\theta_k + y\sin\theta_k$

Each group is then rotated as:

$\mathrm{RoPE}(\tilde{x}_t, p_k) = \tilde{x}_t\,e^{i\,\omega_t p_k}$

With a sufficiently large $K$ and isotropic frequency allocation, Spiral RoPE provides coverage for all directions, encoding oblique and curved spatial relationships (Liu et al., 3 Feb 2026).
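The direction-projection idea can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the directions are assumed to be spaced evenly over $[0, \pi)$, every group is assumed to share one frequency list, and all function names are hypothetical.

```python
import cmath
import math

def direction_angles(num_directions):
    """Assumed spacing: angles spread evenly over [0, pi)."""
    return [k * math.pi / num_directions for k in range(num_directions)]

def multi_direction_rope(pairs, x, y, freqs, num_directions):
    """Rotate channel group k by the projection of (x, y) onto direction k."""
    group_size = len(pairs) // num_directions
    if group_size * num_directions != len(pairs) or len(freqs) != group_size:
        raise ValueError("pairs must split evenly, with one frequency per group slot")
    rotated = []
    for k, theta in enumerate(direction_angles(num_directions)):
        # Scalar projection of the 2D position onto the unit direction vector.
        proj = x * math.cos(theta) + y * math.sin(theta)
        group = pairs[k * group_size:(k + 1) * group_size]
        rotated += [z * cmath.exp(1j * w * proj) for z, w in zip(group, freqs)]
    return rotated

# Four directions (0, 45, 90, 135 degrees), one unit-frequency pair per direction.
out = multi_direction_rope([1 + 0j] * 4, 1.0, 1.0, [1.0], num_directions=4)
phases = [cmath.phase(z) for z in out]
```

For the diagonal position $(1, 1)$, the 45-degree group receives the largest phase ($\sqrt{2}$) and the 135-degree group receives approximately zero, so diagonal displacement is represented directly. Under these assumptions, two directions reduce to the axis-aligned case.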

GeoPE: GeoPE introduces a unified geometric approach using quaternion-based rotations to account for all axes symmetrically. By computing the geometric (log-exp) mean of the axis-aligned rotations $R_x$ and $R_y$ in the Lie algebra $\mathfrak{so}(3)$, GeoPE generates a rotation operator whose phase is proportional to the Euclidean distance:

$R(x, y) = \exp\!\left(\tfrac{1}{2}\left(\log R_x(x) + \log R_y(y)\right)\right)$

$\phi(x, y) \propto \sqrt{x^2 + y^2}$

This isometric construction ensures sensitivity to true spatial displacement, not just axis-projected offset (Yao et al., 4 Dec 2025).
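A toy illustration of a log-exp mean of two axis rotations is sketched below, using unit quaternions in plain Python. It assumes the $x$ coordinate drives a rotation about the first imaginary axis and the $y$ coordinate a rotation about the second, both with the same frequency `omega`. These choices and all function names are assumptions for illustration and are not taken from the GeoPE paper.

```python
import math

def axis_rotation_log(axis, angle):
    """Log of the unit quaternion rotating by `angle` about unit `axis` (pure-imaginary part)."""
    return [0.5 * angle * c for c in axis]

def quat_exp(v):
    """Exponential of a pure-imaginary quaternion v, returned as (w, x, y, z)."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return (1.0, 0.0, 0.0, 0.0)
    s = math.sin(norm) / norm
    return (math.cos(norm), s * v[0], s * v[1], s * v[2])

def log_exp_mean_rotation(x, y, omega):
    """Average the two axis rotations in the Lie algebra, then map back to the group."""
    log_x = axis_rotation_log((1.0, 0.0, 0.0), omega * x)
    log_y = axis_rotation_log((0.0, 1.0, 0.0), omega * y)
    mean = [0.5 * (a + b) for a, b in zip(log_x, log_y)]
    return quat_exp(mean)

def rotation_angle(q):
    """Rotation angle encoded by a unit quaternion (w, x, y, z)."""
    vec_norm = math.sqrt(q[1] ** 2 + q[2] ** 2 + q[3] ** 2)
    return 2.0 * math.atan2(vec_norm, q[0])

# Positions (3, 4) and (5, 0) are both at Euclidean distance 5 from the origin.
angle_diag = rotation_angle(log_exp_mean_rotation(3.0, 4.0, 0.1))
angle_axis = rotation_angle(log_exp_mean_rotation(5.0, 0.0, 0.1))
```

In this toy setting the two angles coincide, since the averaged generator has norm proportional to $\sqrt{x^2 + y^2}$. Only the rotation axis changes with the direction of the displacement.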

LieRE: Lie Relative Encodings further generalize these ideas to arbitrary dimensions, employing a learned linear map from $n$-dimensional positions to $d \times d$ skew-symmetric generators, then applying the matrix exponential to obtain full joint rotations in the Transformer head space. This formulation enables the encoding of complex inter-axis couplings and supports non-commutative rotations (Ostmeier et al., 2024).
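The generator-then-exponential pipeline can be sketched in plain Python as below. The `weights` matrix stands in for learned parameters and holds arbitrary illustrative values. The truncated Taylor series is used only to keep the example dependency-free; a practical implementation would more likely call a library matrix exponential. All names are hypothetical.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_exp(a, terms=30):
    """Matrix exponential by truncated Taylor series (adequate for small norms)."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, a)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result

def skew(upper, n):
    """Build an n x n skew-symmetric matrix from its strict upper-triangle entries."""
    m = [[0.0] * n for _ in range(n)]
    idx = 0
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j], m[j][i] = upper[idx], -upper[idx]
            idx += 1
    return m

def lie_rotation(position, weights, n):
    """Map a position linearly to a skew-symmetric generator, then exponentiate."""
    upper = [sum(w * p for w, p in zip(row, position)) for row in weights]
    return mat_exp(skew(upper, n))

# 4 x 4 head space needs 6 generator entries; each row maps a 2D position to one entry.
weights = [[0.3, -0.1], [0.05, 0.2], [-0.25, 0.15], [0.1, 0.1], [0.2, -0.3], [-0.15, 0.05]]
rotation = lie_rotation((2.0, 3.0), weights, 4)
transpose = [list(col) for col in zip(*rotation)]
gram = mat_mul(transpose, rotation)  # should be close to the identity
```

Because the generator is skew-symmetric, its exponential is orthogonal, so `gram` is the identity up to rounding and the encoding preserves vector norms.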

4. Empirical Performance and Applications

Axial and multi-directional rotary embeddings provide consistent accuracy improvements across computer vision tasks, including classification, detection, segmentation, and generative modeling. Empirical evaluation on ImageNet-1k, ADE20k, and MS-COCO demonstrates clear quantitative gains over absolute and axis-aligned positional encoding baselines, as summarized in the following table (numbers extracted from the cited sources):

Method	ImageNet Top-1 Acc. (%)	ADE20k mIoU (%)	COCO Box mAP
APE	82.36	46.91	49.4
Axial RoPE	83.15	–	50.8
Spiral RoPE	83.39	49.12	–
GeoPE	82.5	–	51.3
LieRE (2D)	69.4*	–	–

*Reported for CIFAR-100 (Ostmeier et al., 2024).

Qualitative analysis reveals that Spiral RoPE yields sharper, more object-centric attention maps that better respect local boundaries and capture diagonal or curved structures, in contrast to the less discriminative, axis-biased patterns observed with standard axial RoPE (Liu et al., 3 Feb 2026).

5. High-Dimensional and Spatiotemporal Extensions

Axial rotary positional embeddings extend naturally to 3D and spatiotemporal domains. For video models, RoPE-3D adapts the embedding by splitting channels among temporal and spatial axes, yielding independent axis-aligned phases. However, this approach can induce positional bias and modality discontinuities.

VRoPE addresses these issues by introducing diagonal spatial coordinates and symmetric bias mitigation. Each spatial position is mapped onto diagonal coordinates, and both positive and negative direction encodings are summed for each group of embedding channels, thereby balancing long-term decay and aligning attention distributions across space and modality boundaries. VRoPE has been shown to significantly boost retrieval and reasoning metrics in video-LLMs, particularly for long-form sequences (Liu et al., 17 Feb 2025).

6. Theoretical Properties and Analysis

The spectral structure of rotary positional encodings induces a multiresolution, band-pass filtering effect. Each frequency $\omega_t$ defines a particular scale of positional sensitivity, and Transformer attention combined with MLP nonlinearities can synthesize higher-order harmonics, enabling flexible, wavelet-like processing (Ruscio et al., 2024). This multi-scale behavior is critical for the observed length extrapolation and resolution scaling advantages.
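The link between frequency and positional scale can be made concrete by converting each frequency into a wavelength $2\pi/\omega_t$, the offset over which that channel pair completes one full rotation. The sketch below assumes a geometric schedule with base 10000, which is a common choice and not a value given in this article.

```python
import math

def rope_frequencies(num_pairs, base=10000.0):
    """Geometric schedule omega_t = base ** (-t / num_pairs); base is an assumption."""
    return [base ** (-t / num_pairs) for t in range(num_pairs)]

freqs = rope_frequencies(8)
# Offset (in tokens or patches) after which each channel pair's phase wraps around.
wavelengths = [2 * math.pi / w for w in freqs]
# Consecutive scales differ by a constant factor, as in a multiresolution filter bank.
ratios = [b / a for a, b in zip(wavelengths, wavelengths[1:])]
```

The shortest wavelength is $2\pi$ positions and each subsequent pair covers a scale larger by the same constant factor, so high-frequency pairs resolve local offsets while low-frequency pairs vary slowly over long ranges.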

Lie group and quaternion-based constructions (LieRE, GeoPE) further ensure that the composite rotations are mathematically isometric, symmetric, and sensitive to global geometric structure. These schemes overcome expressivity deficiencies of strictly axis-aligned or uncoupled embeddings, demonstrating enhanced generalization and data efficiency (Yao et al., 4 Dec 2025, Ostmeier et al., 2024).

7. Implementation Considerations and Future Directions

Efficient implementations typically apply block-diagonal or groupwise rotations using precomputed frequency schedules. Spiral RoPE and related methods introduce no parameter or runtime overhead compared to standard RoPE, making them attractive for large-scale vision and video transformers (Liu et al., 3 Feb 2026). The use of learned generator matrices (as in LieRE) trades additional compute cost for greater flexibility.
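A block-diagonal rotation with precomputed tables might look like the following sketch. It operates on real-valued vectors and rotates consecutive channel pairs with $2 \times 2$ blocks; the table layout and function names are illustrative assumptions.

```python
import math

def build_tables(num_positions, freqs):
    """Precompute cos and sin of omega_t * p for every position p and frequency index t."""
    cos_table = [[math.cos(w * p) for w in freqs] for p in range(num_positions)]
    sin_table = [[math.sin(w * p) for w in freqs] for p in range(num_positions)]
    return cos_table, sin_table

def apply_rotation(vec, position, cos_table, sin_table):
    """Rotate consecutive channel pairs (vec[2t], vec[2t+1]) by their 2x2 block."""
    out = []
    for t in range(len(vec) // 2):
        c, s = cos_table[position][t], sin_table[position][t]
        a, b = vec[2 * t], vec[2 * t + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

freqs = [1.0, 0.1]  # illustrative frequencies, one per channel pair
cos_table, sin_table = build_tables(num_positions=16, freqs=freqs)
vector = [1.0, 2.0, -0.5, 0.3]
rotated = apply_rotation(vector, 7, cos_table, sin_table)
```

The tables are built once and reused for every token and head, so the per-token cost is a handful of multiply-adds per channel pair rather than a dense matrix product.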

Current research directions include adaptive or learned direction weighting, non-uniform direction/path sampling, extensions to non-Euclidean settings, and hybrid integration with global-local or bias-based positional schemes. Generalization to higher dimensions (e.g., spatiotemporal volumes, point clouds) remains open for further study (Liu et al., 3 Feb 2026, Yao et al., 4 Dec 2025, Ostmeier et al., 2024).

References:

"Spiral RoPE: Rotate Your Rotary Positional Embeddings in the 2D Plane" (Liu et al., 3 Feb 2026)
"GeoPE: A Unified Geometric Positional Embedding for Structured Tensors" (Yao et al., 4 Dec 2025)
"LieRE: Lie Rotational Positional Encodings" (Ostmeier et al., 2024)
"Rotary Position Embedding for Vision Transformer" (Heo et al., 2024)
"VRoPE: Rotary Position Embedding for Video LLMs" (Liu et al., 17 Feb 2025)
"Beyond Position: the emergence of wavelet-like properties in Transformers" (Ruscio et al., 2024)