
3D Rotary Positional Embeddings

Updated 14 December 2025
  • 3D rotary positional embeddings are methods that encode 3D spatial and temporal coordinates using rotations, enabling translation equivariance and efficient attention computations.
  • They apply axis-wise or joint rotations via block-diagonal and Lie-group based approaches to incorporate multidimensional geometric information with minimal computational overhead.
  • Empirical results demonstrate enhanced performance in video modeling, 3D medical segmentation, robotics, and texture synthesis by capturing cross-axis relationships.

A three-dimensional rotary positional embedding (3D-RoPE) is a class of positional encoding methods that generalize the principle of rotary position encodings (RoPE) to structured, multidimensional data such as videos, volumetric medical images, 3D geometric data, or long sequences with quantum-inspired structure. 3D-RoPE methods leverage higher-dimensional coordinate information to inject geometric or spatiotemporal priors into Transformer-based architectures using block-diagonal or Lie-group–based rotations, thus encoding relative 3D position information directly within the attention mechanism. They exhibit translation equivariance, separability, and maintain a low computational overhead, with extensive applications and empirical validation across video-LLMs, vision transformers with depth, 3D medical segmentation, multimodal LLMs, and geometric deep learning.

1. Fundamentals of Rotary Positional Encodings

Rotary positional encodings rotate token feature representations in the planes spanned by adjacent channel pairs, using position-dependent angles. In 1D RoPE, an embedding $x(p) \in \mathbb{R}^D$ at position $p$ is partitioned into $D/2$ complex pairs $(x_{2i}, x_{2i+1})$, each rotated by an angle $\theta_i(p) = p / 10000^{2i/D}$. This is implemented as

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos\theta_i(p) & -\sin\theta_i(p) \\ \sin\theta_i(p) & \cos\theta_i(p) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$

The key algebraic property $R(p_1)\,R(p_2) = R(p_1 + p_2)$ ensures that attention correlations depend on position differences rather than absolute positions, yielding translation-invariant relative encoding (Schenck et al., 4 Feb 2025).
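As a concrete illustration, the pairwise rotation and its relative-position property can be written in a few lines of NumPy (a minimal sketch, not any particular paper's implementation):

```python
import numpy as np

def rope_1d(x, p, base=10000.0):
    """Rotate consecutive channel pairs of x by position-dependent angles."""
    d = x.shape[-1]
    # One angle per complex pair: theta_i(p) = p / base**(2i/d)
    angles = p * base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin   # 2x2 rotation applied pairwise
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# R(p1)^T R(p2) = R(p2 - p1): dot products depend only on offsets.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
same = np.allclose(rope_1d(q, 5) @ rope_1d(k, 2),
                   rope_1d(q, 13) @ rope_1d(k, 10))  # both offsets equal 3
```

Because the rotation matrices compose additively in position, the two query–key scores above coincide: only the offset of 3 matters, not the absolute positions.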

Extension to higher dimensions involves either (i) assigning rotations independently along each coordinate axis, or (ii) constructing a unified geometric rotation using Lie group or quaternionic algebra to jointly encode spatial and/or temporal positions (Wang et al., 17 Jun 2025, Yao et al., 4 Dec 2025, Ostmeier et al., 14 Jun 2024).

2. 3D Rotary Formulations: Axis-wise and Joint Rotations

Axis-wise 3D RoPE (widely used in video and volumetric vision) applies independent 1D rotary embeddings along each of the three axes, typically after splitting the channel dimension:

  • For a 3D position $(x, y, z)$ and channel dimension $d = 6p$, partition the features into three blocks and rotate each 2D subblock by its axis-specific frequency and coordinate. In RomanTex:

$$\alpha_k = \begin{cases} x\,\theta_k, & k < p \\ y\,\theta_{k-p}, & p \le k < 2p \\ z\,\theta_{k-2p}, & 2p \le k < 3p \end{cases}$$

The rotation is applied as in the original RoPE (Feng et al., 24 Mar 2025).

  • In video architectures (EVA02-AT, RoMedFormer), the angles from all axes are summed for each complex pair in the representation, resulting in a single joint rotation per subblock:

$$x \mapsto R\big(\theta^{(H)}_i(h) + \theta^{(W)}_j(w) + \theta^{(T)}_k(t)\big)\,x$$

(Wang et al., 17 Jun 2025, Li et al., 18 Mar 2025).
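Both axis-wise schemes can be sketched in NumPy. This is an illustrative reading of the formulas above, not the released code of RomanTex or EVA02-AT; in particular, the per-axis frequency tables passed to `summed_angle_rope3d` are an assumption:

```python
import numpy as np

def _rotate_pairs(x, angles):
    """Apply a 2x2 rotation to each consecutive channel pair."""
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def axiswise_rope3d(x, pos, base=10000.0):
    """Split channels into three blocks; rotate block a by coordinate pos[a]."""
    d = x.shape[-1]
    assert d % 6 == 0, "need d = 6p: three blocks of complex pairs"
    b = d // 3
    freqs = base ** (-np.arange(0, b, 2) / b)
    return np.concatenate(
        [_rotate_pairs(x[..., a * b:(a + 1) * b], pos[a] * freqs)
         for a in range(3)], axis=-1)

def summed_angle_rope3d(x, pos, freq_tables):
    """One rotation per pair; the angle sums the per-axis contributions."""
    angles = sum(c * f for c, f in zip(pos, freq_tables))
    return _rotate_pairs(x, angles)
```

In either scheme the rotation angle is linear in the coordinates, so query–key dot products depend only on coordinate differences.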

Joint or Geometric 3D RoPE leverages the full structure of 3D rotations, using Lie groups ($SO(3)$), quaternions, or block-diagonal matrix composition. For example, in STRING and LieRE, a token's rotation is given by

$$\mathbf{R}(x, y, z) = \exp(\mathbf{L}_x\,x + \mathbf{L}_y\,y + \mathbf{L}_z\,z)$$

where each $\mathbf{L}_k$ is a skew-symmetric generator, guaranteeing commutativity and efficient invertibility (Schenck et al., 4 Feb 2025, Ostmeier et al., 14 Jun 2024). GeoPE constructs composite quaternion rotations in $\mathbb{H}$, averaging the $\mathfrak{so}(3)$ logarithms to achieve symmetric, Euclidean-coupled embeddings (Yao et al., 4 Dec 2025).
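A toy construction makes the additive law concrete. Here the generators are assumed to be block-diagonal and to act on disjoint channel blocks (so they commute by construction); STRING and LieRE structure or learn their generators differently:

```python
import numpy as np

def expm_series(A, terms=40):
    """Truncated Taylor series for exp(A); adequate for small-norm A."""
    out, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for n in range(1, terms):
        term = term @ A / n
        out = out + term
    return out

def skew_generator(freqs, d, offset):
    """Skew-symmetric generator acting on one disjoint block of channels."""
    L = np.zeros((d, d))
    for i, f in enumerate(freqs):
        a, b = offset + 2 * i, offset + 2 * i + 1
        L[a, b], L[b, a] = -f, f
    return L

d, p = 12, 2
freqs = 10000.0 ** (-np.arange(p) / p)
Lx = skew_generator(freqs, d, 0)
Ly = skew_generator(freqs, d, 2 * p)
Lz = skew_generator(freqs, d, 4 * p)

def R(x, y, z):
    """exp of a sum of commuting skew generators: an orthogonal rotation."""
    return expm_series(Lx * x + Ly * y + Lz * z)
```

Because the blocks are disjoint, the products $\mathbf{L}_x \mathbf{L}_y$ etc. vanish, the generators commute, and $\mathbf{R}(a)\,\mathbf{R}(b) = \mathbf{R}(a+b)$ holds exactly.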

3. Integration into Attention Mechanisms

For all variants, 3D RoPE is applied to the query ($Q$) and key ($K$) projections just after their linear transformation and before the attention dot product:

  • For axis-wise (blockwise) schemes:

$$Q' = R_{(x,y,z)}\,Q, \quad K' = R_{(x,y,z)}\,K$$

where $R_{(x,y,z)}$ is either a block-diagonal product of $R_x(x)$, $R_y(y)$, and $R_z(z)$ or a single rotation by the sum of axis-specific angles (Wang et al., 17 Jun 2025, Li et al., 18 Mar 2025, Feng et al., 24 Mar 2025).

  • For joint geometric encodings:

$$Q' = \mathbf{R}(x,y,z)\,Q, \quad K' = \mathbf{R}(x,y,z)\,K$$

with $\mathbf{R}$ as above (Schenck et al., 4 Feb 2025, Yao et al., 4 Dec 2025).

  • In all cases, the value projection ($V$) is left unrotated.

This structure is compatible with all variants of transformer-based attention, including self-attention, cross-attention, and multi-attention blocks (e.g., RomanTex MVA) (Feng et al., 24 Mar 2025).

The overhead incurred is negligible: typically two $D \times D$ block-diagonal multiplies per token. No extra trainable parameters are introduced, and memory is dominated by the cached sin/cos tables, whose size is the number of positions times the number of frequencies (Wang et al., 17 Jun 2025).
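Putting the pieces together, a minimal single-head sketch (hypothetical shapes and helper names; the axis-wise scheme is chosen for concreteness) of rotating $Q$ and $K$ before the dot product:

```python
import numpy as np

def rope3d(x, pos, base=10000.0):
    """Axis-wise 3D RoPE over an (N, d) token matrix; pos is (N, 3)."""
    n, d = x.shape
    b = d // 3
    freqs = base ** (-np.arange(0, b, 2) / b)
    blocks = []
    for a in range(3):
        ang = pos[:, a:a + 1] * freqs                  # (N, b/2) angles
        xa = x[:, a * b:(a + 1) * b]
        x1, x2 = xa[:, 0::2], xa[:, 1::2]
        ra = np.empty_like(xa)
        ra[:, 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
        ra[:, 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
        blocks.append(ra)
    return np.concatenate(blocks, axis=-1)

def attention_with_rope3d(Q, K, V, pos):
    """Rotate Q and K only; V is left unrotated, as described above."""
    Qr, Kr = rope3d(Q, pos), rope3d(K, pos)
    s = Qr @ Kr.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))      # softmax over keys
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```

Because the rotation is additive in position, shifting every token's 3D coordinate by the same offset leaves the attention output unchanged, which is exactly the translation equivariance the section describes.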

4. Empirical Performance and Benchmarks

The adoption of 3D RoPE yields consistent improvements across diverse domains:

| Task / Domain | Baseline PE (score) | 3D-RoPE variant (score) | Gain | Reference |
| --- | --- | --- | --- | --- |
| Video MIR (EK-100, mAP, zero-shot) | split-3×RoPE (32.9) | Joint 3D ST-RoPE (34.3) | +1.4 | Wang et al., 17 Jun 2025 |
| Video MIR (EK-100, fine-tune) | MI-MM loss (51.8) | EVA02-AT+SMS (59.0) | +7.2 | Wang et al., 17 Jun 2025 |
| Open-vocabulary 3D box prediction | Baseline (49.77) | Circulant-STRING (58.95) | +9.18 | Schenck et al., 4 Feb 2025 |
| Medical segmentation (Dice, qualitative) | APE | 3D-RoPE (improved boundaries) | +1.1* | Li et al., 18 Mar 2025 |
| 3D texture synthesis (LAD) | MVA only (0.123) | 3D-RoPE+MVA (0.119) | ~5% lower LAD | Feng et al., 24 Mar 2025 |
| Video-LLM retrieval (long video, Video-NIAH) | RoPE-3D (72.81) | VRoPE (87.03) | +14.22 | Liu et al., 17 Feb 2025 |
| Robotics, multi-task success | RoPE (41.7) | STRING (45.8) | +4.1 | Schenck et al., 4 Feb 2025 |

*RoMedFormer ablation; not in main paper.

Ablation studies consistently demonstrate gains for full 3D rotary encodings over split or axis-wise versions, particularly in capturing cross-axis relationships, improving spatial/temporal localization, and reducing artifacts (e.g., view seams, texture discontinuities, or attention bias) (Wang et al., 17 Jun 2025, Yao et al., 4 Dec 2025, Feng et al., 24 Mar 2025, Liu et al., 17 Feb 2025).

5. Variants and Extensions

Several frameworks have extended 3D-RoPE:

  • STRING: A generalization using commuting generators in Lie algebra; enables fast Cayley or circulant implementations, and supports arbitrary additional positioning modalities such as depth or semantics (Schenck et al., 4 Feb 2025).
  • LieRE: Parameterizes trainable 3D rotations via exponentiation of learned skew-symmetric matrices, achieving translation equivariance on SO(3)SO(3) (Ostmeier et al., 14 Jun 2024).
  • GeoPE: Constructs rotations as quaternionic sandwich products, ensuring geometrically isotropic positional coupling via so(3) mean-averaged phases—a critical advance over axis-wise RoPE for 3D spatial and shape-sensitive applications (Yao et al., 4 Dec 2025).
  • VRoPE: Designed for Video-LLMs, introduces continuity and symmetry index transforms over 3D+1 (spatiotemporal+text) positional indices to mitigate bias and enable seamless cross-modal attention (Liu et al., 17 Feb 2025).
  • 3D-RPE (Bloch Sphere): Inspired by quantum state embeddings, splits long sequences into two-angle (polar + azimuthal) rotations for improved position resolution and decay, outperforming RoPE in long-context NLU and LM (Ma et al., 14 Jun 2024).
  • RomanTex: Incorporates 3D-aware RoPE in multi-attention blocks for consistent, geometry-aware texture synthesis on 3D assets; canonicalizes pixel-wise 3D positions through coordinate maps (Feng et al., 24 Mar 2025).

6. Design Challenges and Controversies

A recurring limitation among naive 3D RoPE designs is their tendency to treat the axes independently, thereby failing to encode true Euclidean or cross-modal relationships. This issue was addressed by GeoPE and LieRE through geometric coupling and Lie-algebraic averaging, and by VRoPE through symmetric index interleaving and cross-modal continuity (Yao et al., 4 Dec 2025, Liu et al., 17 Feb 2025, Ostmeier et al., 14 Jun 2024). Furthermore, chunking strategies in quantum-inspired 3D-RPE frameworks improve long-term attention decay and position interpolation in LLMs, outperforming PI/NTK scaling (Ma et al., 14 Jun 2024).

Another challenge is the non-commutativity of rotations in three dimensions. GeoPE resolves this by mapping quaternions to the Lie algebra, averaging, and re-exponentiating to form a symmetric, isotropic rotation, in contrast to the bias introduced by ordered axis-wise composition (Yao et al., 4 Dec 2025).
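GeoPE itself works with quaternions; the underlying order-independence trick can be illustrated on 3×3 rotation matrices (an SO(3) sketch using Rodrigues' formula, assumed here for illustration rather than taken from GeoPE's actual construction):

```python
import numpy as np

def exp_so3(v):
    """Rodrigues' formula: rotation vector in so(3) -> rotation matrix."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return np.eye(3)
    k = v / t
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(t) * K + (1 - np.cos(t)) * (K @ K)

def log_so3(R):
    """Principal logarithm: rotation matrix -> rotation vector."""
    t = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if t < 1e-12:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return t / (2 * np.sin(t)) * w

def symmetric_rotation(axis_rotations):
    """Average the so(3) logs, then re-exponentiate: the result is
    independent of the order of the per-axis rotations."""
    logs = [log_so3(R) for R in axis_rotations]
    return exp_so3(np.mean(logs, axis=0))
```

Ordered products such as $R_x R_y R_z$ generally differ from $R_z R_y R_x$; the log-average is invariant under any permutation of its inputs, which is the symmetry property GeoPE exploits.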

Computational efficiency is generally preserved; all practical schemes ensure $O(Nd)$ complexity per layer, with no extra parameters, and exploit fast table lookup or FFT tricks where possible (Wang et al., 17 Jun 2025, Schenck et al., 4 Feb 2025).

7. Applications and Outlook

3D Rotary Positional Embeddings are now foundational across diverse domains, including video-language models, vision transformers with depth, 3D medical segmentation, robotics, and geometry-aware texture synthesis.

Further generalizations to higher-order geometries (e.g., 4D, arbitrary Lie groups) and cross-modal multimodal fusions are anticipated (Liu et al., 17 Feb 2025, Yao et al., 4 Dec 2025).

