
3D Rotary Positional Embeddings

Updated 14 December 2025
  • 3D rotary positional embeddings are methods that encode 3D spatial and temporal coordinates using rotations, enabling translation equivariance and efficient attention computations.
  • They apply axis-wise or joint rotations via block-diagonal and Lie-group based approaches to incorporate multidimensional geometric information with minimal computational overhead.
  • Empirical results demonstrate enhanced performance in video modeling, 3D medical segmentation, robotics, and texture synthesis by capturing cross-axis relationships.

A three-dimensional rotary positional embedding (3D-RoPE) is a class of positional encoding methods that generalize the principle of rotary position encodings (RoPE) to structured, multidimensional data such as videos, volumetric medical images, 3D geometric data, or long sequences with quantum-inspired structure. 3D-RoPE methods leverage higher-dimensional coordinate information to inject geometric or spatiotemporal priors into Transformer-based architectures using block-diagonal or Lie-group–based rotations, thus encoding relative 3D position information directly within the attention mechanism. They exhibit translation equivariance, separability, and maintain a low computational overhead, with extensive applications and empirical validation across video-LLMs, vision transformers with depth, 3D medical segmentation, multimodal LLMs, and geometric deep learning.

1. Fundamentals of Rotary Positional Encodings

Rotary positional encodings rotate token feature representations in the planes spanned by adjacent channel pairs, using position-dependent angles. In 1D RoPE, an embedding $x(p) \in \mathbb{R}^D$ at position $p$ is partitioned into $D/2$ complex pairs $(x_{2i}, x_{2i+1})$, each rotated by an angle $\theta_i(p) = p / 10000^{2i/D}$. This is implemented as

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos\theta_i(p) & -\sin\theta_i(p) \\ \sin\theta_i(p) & \cos\theta_i(p) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$

The key algebraic property $R(p_1)\,R(p_2) = R(p_1 + p_2)$ ensures that attention correlations depend on position differences rather than absolute positions, yielding translation-invariant relative encoding (Schenck et al., 4 Feb 2025).
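As a concrete illustration, the pairwise rotation and its relative-position property can be written in a few lines of NumPy (a minimal sketch, not any particular paper's implementation):

```python
import numpy as np

def rope_1d(x, p, base=10000.0):
    """Rotate consecutive channel pairs of x by position-dependent angles."""
    d = x.shape[-1]
    # One angle per complex pair: theta_i(p) = p / base**(2i/d)
    angles = p * base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin   # 2x2 rotation applied pairwise
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# R(p1)^T R(p2) = R(p2 - p1): dot products depend only on offsets.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
same = np.allclose(rope_1d(q, 5) @ rope_1d(k, 2),
                   rope_1d(q, 13) @ rope_1d(k, 10))  # both offsets equal 3
```

Because the rotation matrices compose additively in position, the two query–key scores above coincide: only the offset of 3 matters, not the absolute positions.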

Extension to higher dimensions involves either (i) assigning rotations independently along each coordinate axis, or (ii) constructing a unified geometric rotation using Lie group or quaternionic algebra to jointly encode spatial and/or temporal positions (Wang et al., 17 Jun 2025, Yao et al., 4 Dec 2025, Ostmeier et al., 14 Jun 2024).

2. 3D Rotary Formulations: Axis-wise and Joint Rotations

Axis-wise 3D RoPE (widely used in video and volumetric vision) applies independent 1D rotary embeddings along each of the three axes, typically after splitting the channel dimension:

  • For a 3D position $(x, y, z)$ and channel dimension $d = 6p$, partition the features into three blocks and rotate each 2D subblock by its axis-specific frequency and coordinate. In RomanTex:

$$\alpha_k = \begin{cases} x\,\theta_k, & k < p \\ y\,\theta_{k-p}, & p \le k < 2p \\ z\,\theta_{k-2p}, & 2p \le k < 3p \end{cases}$$

The rotation is applied as in the original RoPE (Feng et al., 24 Mar 2025).

  • In video architectures (EVA02-AT, RoMedFormer), the angles from all axes are summed for each complex pair in the representation, resulting in a single joint rotation per subblock:

$$x \mapsto R\big(\theta^{(H)}_i(h) + \theta^{(W)}_j(w) + \theta^{(T)}_k(t)\big)\,x$$

(Wang et al., 17 Jun 2025, Li et al., 18 Mar 2025).
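Both axis-wise schemes can be sketched in NumPy. This is an illustrative reading of the formulas above, not the released code of RomanTex or EVA02-AT; in particular, the per-axis frequency tables passed to `summed_angle_rope3d` are an assumption:

```python
import numpy as np

def _rotate_pairs(x, angles):
    """Apply a 2x2 rotation to each consecutive channel pair."""
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def axiswise_rope3d(x, pos, base=10000.0):
    """Split channels into three blocks; rotate block a by coordinate pos[a]."""
    d = x.shape[-1]
    assert d % 6 == 0, "need d = 6p: three blocks of complex pairs"
    b = d // 3
    freqs = base ** (-np.arange(0, b, 2) / b)
    return np.concatenate(
        [_rotate_pairs(x[..., a * b:(a + 1) * b], pos[a] * freqs)
         for a in range(3)], axis=-1)

def summed_angle_rope3d(x, pos, freq_tables):
    """One rotation per pair; the angle sums the per-axis contributions."""
    angles = sum(c * f for c, f in zip(pos, freq_tables))
    return _rotate_pairs(x, angles)
```

In either scheme the rotation angle is linear in the coordinates, so query–key dot products depend only on coordinate differences.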

Joint or Geometric 3D RoPE leverages the full structure of 3D rotations, using Lie groups ($SO(3)$), quaternions, or block-diagonal matrix composition. For example, in STRING and LieRE, a token's rotation is given by

$$\mathbf{R}(x, y, z) = \exp(\mathbf{L}_x\,x + \mathbf{L}_y\,y + \mathbf{L}_z\,z)$$

where each $\mathbf{L}_k$ is a skew-symmetric generator, guaranteeing commutativity and efficient invertibility (Schenck et al., 4 Feb 2025, Ostmeier et al., 14 Jun 2024). GeoPE constructs composite quaternion rotations in $\mathbb{H}$, averaging the $\mathfrak{so}(3)$ logarithms to achieve symmetric, Euclidean-coupled embeddings (Yao et al., 4 Dec 2025).
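A toy construction makes the additive law concrete. Here the generators are assumed to be block-diagonal and to act on disjoint channel blocks (so they commute by construction); STRING and LieRE structure or learn their generators differently:

```python
import numpy as np

def expm_series(A, terms=40):
    """Truncated Taylor series for exp(A); adequate for small-norm A."""
    out, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for n in range(1, terms):
        term = term @ A / n
        out = out + term
    return out

def skew_generator(freqs, d, offset):
    """Skew-symmetric generator acting on one disjoint block of channels."""
    L = np.zeros((d, d))
    for i, f in enumerate(freqs):
        a, b = offset + 2 * i, offset + 2 * i + 1
        L[a, b], L[b, a] = -f, f
    return L

d, p = 12, 2
freqs = 10000.0 ** (-np.arange(p) / p)
Lx = skew_generator(freqs, d, 0)
Ly = skew_generator(freqs, d, 2 * p)
Lz = skew_generator(freqs, d, 4 * p)

def R(x, y, z):
    """exp of a sum of commuting skew generators: an orthogonal rotation."""
    return expm_series(Lx * x + Ly * y + Lz * z)
```

Because the blocks are disjoint, the products $\mathbf{L}_x \mathbf{L}_y$ etc. vanish, the generators commute, and $\mathbf{R}(a)\,\mathbf{R}(b) = \mathbf{R}(a+b)$ holds exactly.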

3. Integration into Attention Mechanisms

For all variants, 3D RoPE is applied to the query ($Q$) and key ($K$) projections just after their linear transformation and before the attention dot product:

  • For axis-wise (blockwise) schemes:

$$Q' = R_{(x,y,z)}\,Q, \quad K' = R_{(x,y,z)}\,K$$

where $R_{(x,y,z)}$ is either a block-diagonal product of $R_x(x)$, $R_y(y)$, and $R_z(z)$ or a single rotation by the sum of axis-specific angles (Wang et al., 17 Jun 2025, Li et al., 18 Mar 2025, Feng et al., 24 Mar 2025).

  • For joint geometric encodings:

$$Q' = \mathbf{R}(x,y,z)\,Q, \quad K' = \mathbf{R}(x,y,z)\,K$$

with $\mathbf{R}$ as above (Schenck et al., 4 Feb 2025, Yao et al., 4 Dec 2025).

  • In all cases, the value projection ($V$) is left unrotated.

This structure is compatible with all variants of transformer-based attention, including self-attention, cross-attention, and multi-attention blocks (e.g., RomanTex MVA) (Feng et al., 24 Mar 2025).

The overhead incurred is negligible: typically two $D \times D$ block-diagonal multiplies per token. No extra trainable parameters are introduced, and memory is dominated by the cached sin/cos tables, whose size is the number of positions times the number of frequencies (Wang et al., 17 Jun 2025).
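Putting the pieces together, a minimal single-head sketch (hypothetical shapes and helper names; the axis-wise scheme is chosen for concreteness) of rotating $Q$ and $K$ before the dot product:

```python
import numpy as np

def rope3d(x, pos, base=10000.0):
    """Axis-wise 3D RoPE over an (N, d) token matrix; pos is (N, 3)."""
    n, d = x.shape
    b = d // 3
    freqs = base ** (-np.arange(0, b, 2) / b)
    blocks = []
    for a in range(3):
        ang = pos[:, a:a + 1] * freqs                  # (N, b/2) angles
        xa = x[:, a * b:(a + 1) * b]
        x1, x2 = xa[:, 0::2], xa[:, 1::2]
        ra = np.empty_like(xa)
        ra[:, 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
        ra[:, 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
        blocks.append(ra)
    return np.concatenate(blocks, axis=-1)

def attention_with_rope3d(Q, K, V, pos):
    """Rotate Q and K only; V is left unrotated, as described above."""
    Qr, Kr = rope3d(Q, pos), rope3d(K, pos)
    s = Qr @ Kr.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))      # softmax over keys
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```

Because the rotation is additive in position, shifting every token's 3D coordinate by the same offset leaves the attention output unchanged, which is exactly the translation equivariance the section describes.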

4. Empirical Performance and Benchmarks

The adoption of 3D RoPE yields consistent improvements across diverse domains:

| Task / Domain | Baseline PE (score) | 3D-RoPE variant (score) | Gain | Reference |
| --- | --- | --- | --- | --- |
| Video MIR (EK-100, mAP, zero-shot) | split-3×RoPE (32.9) | Joint 3D ST-RoPE (34.3) | +1.4 | Wang et al., 17 Jun 2025 |
| Video MIR (EK-100, fine-tune) | MI-MM loss (51.8) | EVA02-AT+SMS (59.0) | +7.2 | Wang et al., 17 Jun 2025 |
| Open-vocabulary 3D box prediction | Baseline (49.77) | Circulant-STRING (58.95) | +9.18 | Schenck et al., 4 Feb 2025 |
| Medical segmentation (Dice, qualitative) | APE | 3D-RoPE (improved boundaries) | +1.1* | Li et al., 18 Mar 2025 |
| 3D texture synthesis (LAD) | MVA only (0.123) | 3D-RoPE+MVA (0.119) | ~5% lower LAD | Feng et al., 24 Mar 2025 |
| Video-LLM retrieval (long video, Video-NIAH) | RoPE-3D (72.81) | VRoPE (87.03) | +14.22 | Liu et al., 17 Feb 2025 |
| Robotics, multi-task success | RoPE (41.7) | STRING (45.8) | +4.1 | Schenck et al., 4 Feb 2025 |

*RoMedFormer ablation; not in main paper.

Ablation studies consistently demonstrate gains for full 3D rotary encodings over split or axis-wise versions, particularly in capturing cross-axis relationships, improving spatial/temporal localization, and reducing artifacts (e.g., view seams, texture discontinuities, or attention bias) (Wang et al., 17 Jun 2025, Yao et al., 4 Dec 2025, Feng et al., 24 Mar 2025, Liu et al., 17 Feb 2025).

5. Variants and Extensions

Several frameworks have extended 3D-RoPE:

  • STRING: A generalization using commuting generators in Lie algebra; enables fast Cayley or circulant implementations, and supports arbitrary additional positioning modalities such as depth or semantics (Schenck et al., 4 Feb 2025).
  • LieRE: Parameterizes trainable 3D rotations via exponentiation of learned skew-symmetric matrices, achieving translation equivariance on SO(3)SO(3) (Ostmeier et al., 14 Jun 2024).
  • GeoPE: Constructs rotations as quaternionic sandwich products, ensuring geometrically isotropic positional coupling via so(3) mean-averaged phases—a critical advance over axis-wise RoPE for 3D spatial and shape-sensitive applications (Yao et al., 4 Dec 2025).
  • VRoPE: Designed for Video-LLMs, introduces continuity and symmetry index transforms over 3D+1 (spatiotemporal+text) positional indices to mitigate bias and enable seamless cross-modal attention (Liu et al., 17 Feb 2025).
  • 3D-RPE (Bloch Sphere): Inspired by quantum state embeddings, splits long sequences into two-angle (polar + azimuthal) rotations for improved position resolution and decay, outperforming RoPE in long-context NLU and LM (Ma et al., 14 Jun 2024).
  • RomanTex: Incorporates 3D-aware RoPE in multi-attention blocks for consistent, geometry-aware texture synthesis on 3D assets; canonicalizes pixel-wise 3D positions through coordinate maps (Feng et al., 24 Mar 2025).

6. Design Challenges and Controversies

A recurring limitation among naive 3D RoPE designs is their tendency to treat the axes independently, thereby failing to encode true Euclidean or cross-modal relationships. This issue was addressed by GeoPE and LieRE through geometric coupling and Lie-algebraic averaging, and by VRoPE through symmetric index interleaving and cross-modal continuity (Yao et al., 4 Dec 2025, Liu et al., 17 Feb 2025, Ostmeier et al., 14 Jun 2024). Furthermore, chunking strategies in quantum-inspired 3D-RPE frameworks improve long-term attention decay and position interpolation in LLMs, outperforming PI/NTK scaling (Ma et al., 14 Jun 2024).

Another challenge is the non-commutativity of rotations in three dimensions. GeoPE resolves this by mapping quaternions to the Lie algebra, averaging, and re-exponentiating to form a symmetric, isotropic rotation, in contrast to the bias introduced by ordered axis-wise composition (Yao et al., 4 Dec 2025).
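GeoPE itself works with quaternions; the underlying order-independence trick can be illustrated on 3×3 rotation matrices (an SO(3) sketch using Rodrigues' formula, assumed here for illustration rather than taken from GeoPE's actual construction):

```python
import numpy as np

def exp_so3(v):
    """Rodrigues' formula: rotation vector in so(3) -> rotation matrix."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return np.eye(3)
    k = v / t
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(t) * K + (1 - np.cos(t)) * (K @ K)

def log_so3(R):
    """Principal logarithm: rotation matrix -> rotation vector."""
    t = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if t < 1e-12:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return t / (2 * np.sin(t)) * w

def symmetric_rotation(axis_rotations):
    """Average the so(3) logs, then re-exponentiate: the result is
    independent of the order of the per-axis rotations."""
    logs = [log_so3(R) for R in axis_rotations]
    return exp_so3(np.mean(logs, axis=0))
```

Ordered products such as $R_x R_y R_z$ generally differ from $R_z R_y R_x$; the log-average is invariant under any permutation of its inputs, which is the symmetry property GeoPE exploits.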

Computational efficiency is generally preserved; all practical schemes ensure $O(Nd)$ complexity per layer, with no extra parameters, and exploit fast table lookup or FFT tricks where possible (Wang et al., 17 Jun 2025, Schenck et al., 4 Feb 2025).

7. Applications and Outlook

3D Rotary Positional Embeddings are now foundational across diverse domains, including video-language models, vision transformers with depth, 3D medical segmentation, robotics, and geometry-aware texture synthesis.

Further generalizations to higher-order geometries (e.g., 4D, arbitrary Lie groups) and cross-modal multimodal fusions are anticipated (Liu et al., 17 Feb 2025, Yao et al., 4 Dec 2025).

