
Axial Rotary Positional Embeddings

Updated 16 April 2026
  • Axial rotary positional embeddings are a class of relative encodings that extend 1D RoPE to multi-axis domains like images and videos.
  • They leverage independent sinusoidal rotations along each spatial or spatiotemporal axis to capture geometric structure and support resolution scaling.
  • Advanced variants such as Spiral RoPE, GeoPE, and LieRE overcome axis-aligned limitations by encoding oblique and joint rotations to enhance Transformer performance.

Axial rotary positional embeddings are a class of relative positional encoding schemes for Transformer models that generalize one-dimensional Rotary Position Embeddings (RoPE) to multi-axis domains such as images (2D) and videos (3D), enabling efficient and extrapolatable encoding of token positions in high-dimensional data. These embeddings combine the mathematical advantages of sinusoidal and rotational encodings with structural awareness of the underlying data grids, providing a unified attention mechanism that supports resolution scaling, improved geometric modeling, and efficient use of computational resources.

1. Mathematical Foundations of Rotary and Axial RoPE

The standard 1D Rotary Positional Embedding (RoPE) represents a token’s position $p$ by rotating each pair of consecutive embedding channels $(x_{2t-1}, x_{2t})$ in the complex plane, governed by a frequency $\omega_t$ drawn from a geometric sequence:

$$\tilde{x}_t = x_{2t-1} + i\,x_{2t};\quad \mathrm{RoPE}(\tilde{x}_t, p) = \tilde{x}_t\,e^{i\,\omega_t p}$$

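This approach allows the attention score between a query at position $p$ and a key at position $q$ to depend purely on their relative displacement:

$$\langle \mathrm{RoPE}(\mathbf{q}, p),\, \mathrm{RoPE}(\mathbf{k}, q) \rangle = \mathbf{q}^\top R(p)^\top R(q)\,\mathbf{k} = \mathbf{q}^\top R(q-p)\,\mathbf{k}$$

Here $\mathbf{q}$ and $\mathbf{k}$ are the query and key vectors (distinct from the position $q$), and $R(\cdot)$ is the block-diagonal rotation matrix equivalent to the complex multiplication above.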

The extrapolation property of RoPE stems from the periodic structure of trigonometric rotations, which are well-defined for any position $p$.
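The relative-displacement property can be checked numerically. The following is a minimal sketch, not a reference implementation: it uses plain Python complex numbers, the function names are hypothetical, and the base of 10000 for the geometric frequency schedule is an assumption that the text above does not specify.

```python
import cmath

def rope_frequencies(num_pairs, base=10000.0):
    # Assumed geometric schedule: omega_t = base ** (-t / num_pairs), t = 0..num_pairs-1.
    return [base ** (-t / num_pairs) for t in range(num_pairs)]

def rope_1d(pairs, position, freqs):
    # Rotate each complex channel pair by the angle omega_t * position.
    return [z * cmath.exp(1j * w * position) for z, w in zip(pairs, freqs)]

def score(q_pairs, k_pairs):
    # Real dot product of two vectors stored as complex channel pairs.
    return sum((q * k.conjugate()).real for q, k in zip(q_pairs, k_pairs))

freqs = rope_frequencies(num_pairs=2)
query = [1.0 + 2.0j, -0.5 + 0.3j]   # arbitrary illustrative values
key = [0.7 - 1.1j, 2.0 + 0.4j]

# Same displacement between query and key, different absolute positions.
s_near = score(rope_1d(query, 5, freqs), rope_1d(key, 8, freqs))
s_far = score(rope_1d(query, 105, freqs), rope_1d(key, 108, freqs))
# A different displacement generally gives a different score.
s_other = score(rope_1d(query, 5, freqs), rope_1d(key, 9, freqs))
```

Here `s_near` and `s_far` agree up to floating-point error because only the displacement enters the score, while `s_other` differs.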

In axial rotary positional embeddings, this principle is extended to multidimensional grids by partitioning the embedding vector and applying independent RoPE encodings along each spatial or spatiotemporal axis. For a 2D image token at position $(x, y)$, the embedding is split and separately rotated with respect to $x$ and $y$, typically allocating half the embedding channels to each axis:

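$$\mathrm{RoPE}_x(\tilde{x}_t, x) = \tilde{x}_t\,e^{i\,\omega_t x} \quad \text{(channel pairs assigned to the $x$-axis)}$$

$$\mathrm{RoPE}_y(\tilde{x}_t, y) = \tilde{x}_t\,e^{i\,\omega_t y} \quad \text{(channel pairs assigned to the $y$-axis)}$$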

This approach allows Transformers to capture axis-aligned structure while maintaining extrapolation to arbitrary resolutions (Heo et al., 2024).
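The channel split can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the formulation of the cited work: the function names are hypothetical, the first half of the channel pairs is assigned to the horizontal coordinate and the second half to the vertical one, and both halves reuse the same frequency list.

```python
import cmath

def axial_rope_2d(pairs, x, y, freqs):
    # First half of the channel pairs encodes x, second half encodes y.
    # Assumes an even number of pairs and len(freqs) == len(pairs) // 2.
    half = len(pairs) // 2
    rotated = []
    for t, z in enumerate(pairs):
        coord = x if t < half else y
        rotated.append(z * cmath.exp(1j * freqs[t % half] * coord))
    return rotated

def score(q_pairs, k_pairs):
    # Real dot product of two vectors stored as complex channel pairs.
    return sum((q * k.conjugate()).real for q, k in zip(q_pairs, k_pairs))

freqs = [1.0, 0.1]   # assumed values, shared by both axes
query = [1.0 + 2.0j, -0.5 + 0.3j, 0.2 - 0.9j, 1.5 + 0.1j]
key = [0.7 - 1.1j, 2.0 + 0.4j, -1.3 + 0.6j, 0.4 + 0.8j]

# The same 2D offset (+3, -2) at two different absolute grid locations.
s_a = score(axial_rope_2d(query, 2, 3, freqs), axial_rope_2d(key, 5, 1, freqs))
s_b = score(axial_rope_2d(query, 12, 13, freqs), axial_rope_2d(key, 15, 11, freqs))
```

The two scores match up to floating-point error, which is the 2D counterpart of the relative-position property and the reason the scheme carries over to grids of other sizes.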

2. Limitations and Axis-Aligned Bias of Standard Axial RoPE

The standard axial 2D RoPE formulation encodes relative positional offsets only along coordinate axes. For two patches at $(x, y)$ and $(x+1, y+1)$, this scheme interprets the offset as a unit step along each axis but fails to differentiate diagonal displacements, due to the independence of the rotations. In the 2D Fourier domain, the resulting basis functions are strictly axis-aligned, expressed as $e^{i\,\omega_t x}$ or $e^{i\,\omega_t y}$. Consequently, diagonal or oblique relationships in the data cannot be directly encoded, impairing the model's ability to reconstruct non-axis-aligned structures, such as circles or arbitrary contours, and leading to visual artifacts and suboptimal attention patterns (Liu et al., 3 Feb 2026).
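One way to see the axis-aligned bias is to write out the attention score for a fixed query and key. In the sketch below, the symbols are introduced here for illustration only: $T_x$ and $T_y$ are the sets of complex channel pairs assigned to each axis, $q_t$ and $k_t$ are the query and key channel pairs, and $(\Delta x, \Delta y)$ is the relative offset between the two patches.

```latex
\begin{aligned}
s(\Delta x, \Delta y)
  &= \sum_{t \in T_x} \mathrm{Re}\!\left(q_t\,\bar{k}_t\,e^{i\,\omega_t \Delta x}\right)
   + \sum_{t \in T_y} \mathrm{Re}\!\left(q_t\,\bar{k}_t\,e^{i\,\omega_t \Delta y}\right) \\
  &= f(\Delta x) + g(\Delta y)
\end{aligned}
```

The score is the sum of a function of $\Delta x$ alone and a function of $\Delta y$ alone, so no single term oscillates along an oblique direction such as $\Delta x + \Delta y$.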

3. Multi-Directional and Geometric Generalizations

To overcome axis-aligned constraints, several geometric generalizations of axial rotary positional embeddings have been introduced.

Spiral RoPE: This method partitions the embedding channels into $K$ groups, each assigned to one of $K$ uniformly distributed planar directions. Within each group, embedding rotation is governed by the projection of the token’s position onto the associated direction vector:

$$\mathbf{u}_k = (\cos\theta_k,\ \sin\theta_k)$$

$$p_k = \langle (x, y),\, \mathbf{u}_k \rangle = x\cos\theta_k + y\sin\theta_k$$

Each group is then rotated as:

$$\tilde{x}_t^{(k)} \mapsto \tilde{x}_t^{(k)}\,e^{i\,\omega_t p_k}$$

With the direction angles $\theta_k$ spread uniformly and isotropic frequency allocation, Spiral RoPE provides coverage for all directions, encoding oblique and curved spatial relationships (Liu et al., 3 Feb 2026).
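The projection-based rotation can be illustrated with a short sketch. It is a toy rendering of the idea described above, not the Spiral RoPE implementation: the function name is hypothetical, and spacing the direction angles evenly over $[0, \pi)$ is an assumption made here.

```python
import cmath
import math

def directional_rope(groups, x, y, freqs):
    # groups[k] holds the complex channel pairs assigned to direction k.
    num_dirs = len(groups)
    rotated = []
    for k, group in enumerate(groups):
        theta = math.pi * k / num_dirs   # assumed: even spacing over [0, pi)
        proj = x * math.cos(theta) + y * math.sin(theta)   # position projected on direction k
        rotated.append([z * cmath.exp(1j * w * proj) for z, w in zip(group, freqs)])
    return rotated

freqs = [1.0, 0.1]   # assumed values for illustration
# Four directions: 0, 45, 90 and 135 degrees.
groups = [[1.0 + 2.0j, -0.5 + 0.3j] for _ in range(4)]

# A step along the anti-diagonal (1, -1) leaves the 45-degree group unrotated,
# while a step along the diagonal (1, 1) rotates it by omega_t * sqrt(2).
anti = directional_rope(groups, 1, -1, freqs)
diag = directional_rope(groups, 1, 1, freqs)
```

With two groups the sketch reduces to the axis-aligned split (directions at 0 and 90 degrees). Adding directions gives some channel groups that respond directly to diagonal offsets.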

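GeoPE: GeoPE introduces a unified geometric approach using quaternion-based rotations to account for all axes symmetrically. By computing the geometric (log-exp) mean of the axis-aligned rotations in the associated Lie algebra, GeoPE generates a rotation operator whose phase is proportional to the Euclidean distance:

$$\mathbf{r} = \exp\!\Big(\tfrac{1}{n}\sum_{a=1}^{n} \log \mathbf{r}_a\Big)$$

$$\theta \propto \omega_t\,\lVert \mathbf{p} \rVert_2$$

Here $\mathbf{r}_a$ is the rotation for axis $a$ of $n$ axes, $\mathbf{p}$ is the position vector, and $\theta$ is the resulting rotation phase. This isometric construction ensures sensitivity to true spatial displacement, not just axis-projected offset (Yao et al., 4 Dec 2025).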

LieRE: Lie Relative Encodings further generalize these ideas to arbitrary dimensions, employing a learned linear map from $n$-dimensional positions to $d \times d$ skew-symmetric generators, then applying the matrix exponential to obtain full joint rotations in the Transformer head space. This formulation enables the encoding of complex inter-axis couplings and supports non-commutative rotations (Ostmeier et al., 2024).
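The pipeline of linear map, skew-symmetric generator, and matrix exponential can be sketched in dependency-free Python. This is a toy illustration, not the LieRE implementation: all names are hypothetical, the "learned" parameters are fixed numbers chosen for the example, and the matrix exponential is a truncated Taylor series that is only adequate for small matrices of modest norm.

```python
def skew(params, dim):
    # Fill the strict upper triangle from params, then antisymmetrize.
    A = [[0.0] * dim for _ in range(dim)]
    it = iter(params)
    for i in range(dim):
        for j in range(i + 1, dim):
            v = next(it)
            A[i][j], A[j][i] = v, -v
    return A

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def expm(A, terms=30):
    # Truncated Taylor series: I + A + A^2/2! + ...
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, A)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result

def liere_rotation(position, basis_params, dim):
    # Linear map: position (p_1..p_n) -> sum_a p_a * G_a, with each G_a skew-symmetric.
    n_params = dim * (dim - 1) // 2
    combined = [sum(p * g[i] for p, g in zip(position, basis_params)) for i in range(n_params)]
    return expm(skew(combined, dim))

# Stand-ins for learned parameters: one generator per position axis, head dimension 4.
basis_params = [
    [0.10, -0.20, 0.05, 0.30, -0.15, 0.25],
    [-0.05, 0.15, 0.20, -0.10, 0.35, 0.05],
]
R = liere_rotation((2.0, 3.0), basis_params, dim=4)
```

Because the generator is skew-symmetric, `R` is an orthogonal matrix, so it preserves vector norms in the same way the 2x2 RoPE blocks do. Unlike the block-diagonal case, it can mix all channels of the head jointly.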

4. Empirical Performance and Applications

Axial and multi-directional rotary embeddings are reported to improve accuracy across computer vision tasks, including classification, detection, segmentation, and generative modeling. Evaluations on ImageNet-1k, ADE20k, and MS-COCO report quantitative gains over absolute and axis-aligned positional encoding baselines, as summarized in the following table (numbers extracted from the cited sources; a dash marks a value that is not reported):

| Method | ImageNet Top-1 Acc. (%) | ADE20k mIoU (%) | COCO Box mAP |
|---|---|---|---|
| APE | 82.36 | 46.91 | 49.4 |
| Axial RoPE | 83.15 | – | 50.8 |
| Spiral RoPE | 83.39 | 49.12 | – |
| GeoPE | 82.5 | – | 51.3 |
| LieRE (2D) | 69.4* | – | – |

*Reported for CIFAR-100, not ImageNet (Ostmeier et al., 2024).

Qualitative analysis reveals that Spiral RoPE yields sharper, more object-centric attention maps that better respect local boundaries and capture diagonal or curved structures, in contrast to the less discriminative, axis-biased patterns observed with standard axial RoPE (Liu et al., 3 Feb 2026).

5. High-Dimensional and Spatiotemporal Extensions

Axial rotary positional embeddings extend naturally to 3D and spatiotemporal domains. For video models, RoPE-3D adapts the embedding by splitting channels among temporal and spatial axes, yielding independent axis-aligned phases. However, this approach can induce positional bias and modality discontinuities.

VRoPE addresses these issues by introducing diagonal spatial coordinates and symmetric bias mitigation. Each spatial position is mapped onto diagonal coordinates, and both positive and negative direction encodings are summed for each group of embedding channels, thereby balancing long-term decay and aligning attention distributions across space and modality boundaries. VRoPE has been shown to significantly boost retrieval and reasoning metrics in video-LLMs, particularly for long-form sequences (Liu et al., 17 Feb 2025).

6. Theoretical Properties and Analysis

The spectral structure of rotary positional encodings induces a multiresolution, band-pass filtering effect. Each frequency $\omega_t$ defines a particular scale of positional sensitivity, and Transformer attention combined with MLP nonlinearities can synthesize higher-order harmonics, enabling flexible, wavelet-like processing (Ruscio et al., 2024). This multi-scale behavior is critical for the observed length extrapolation and resolution scaling advantages.
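To make the notion of scale concrete, the wavelength associated with each frequency can be computed directly. The sketch below assumes a geometric schedule with base 10000, which is not stated in the text, and the function name is hypothetical.

```python
import math

def wavelengths(num_pairs, base=10000.0):
    # Assumed geometric schedule: omega_t = base ** (-t / num_pairs).
    freqs = [base ** (-t / num_pairs) for t in range(num_pairs)]
    # A channel pair with frequency omega completes one full turn every 2*pi/omega positions.
    return [2.0 * math.pi / w for w in freqs]

scales = wavelengths(num_pairs=4)
# About 6.28, 62.8, 628 and 6283 positions, from the finest to the coarsest scale.
```

Under this assumption, the fastest channel pair distinguishes neighboring positions, while the slowest one varies only over thousands of positions. This spread of scales is the multiresolution structure described above.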

Lie group and quaternion-based constructions (LieRE, GeoPE) further ensure that the composite rotations are mathematically isometric, symmetric, and sensitive to global geometric structure. These schemes overcome expressivity deficiencies of strictly axis-aligned or uncoupled embeddings, demonstrating enhanced generalization and data efficiency (Yao et al., 4 Dec 2025, Ostmeier et al., 2024).

7. Implementation Considerations and Future Directions

Practical implementation requires block-diagonal or groupwise rotations for efficiency, with precomputed frequency schedules. Spiral RoPE and related methods introduce no parameter or runtime overhead compared to standard RoPE, making them attractive for large-scale vision and video transformers (Liu et al., 3 Feb 2026). The use of learned generator matrices (as in LieRE) trades off compute cost for flexibility.
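A precomputed schedule combined with a block-diagonal rotation might look like the following sketch. It is a plain-Python illustration with hypothetical names and assumed frequency values. A practical implementation would use vectorized tensor operations instead.

```python
import math

def build_rotation_table(max_position, freqs):
    # table[p][t] = (cos(omega_t * p), sin(omega_t * p)); computed once and reused.
    return [[(math.cos(w * p), math.sin(w * p)) for w in freqs] for p in range(max_position)]

def apply_rotation(vec, position, table):
    # Block-diagonal rotation: one 2x2 block per consecutive channel pair.
    out = []
    for t, (c, s) in enumerate(table[position]):
        a, b = vec[2 * t], vec[2 * t + 1]
        out.extend((a * c - b * s, a * s + b * c))
    return out

freqs = [1.0, 0.1]   # assumed values for illustration
table = build_rotation_table(max_position=16, freqs=freqs)
vec = [1.0, 2.0, -0.5, 0.3]
rotated = apply_rotation(vec, 7, table)
```

Each channel pair costs four multiplications and two additions, with no trigonometric evaluation at attention time, and the rotation leaves the vector norm unchanged.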

Current research directions include adaptive or learned direction weighting, non-uniform direction/path sampling, extensions to non-Euclidean settings, and hybrid integration with global-local or bias-based positional schemes. Generalization to higher dimensions (e.g., spatiotemporal volumes, point clouds) remains open for further study (Liu et al., 3 Feb 2026, Yao et al., 4 Dec 2025, Ostmeier et al., 2024).

