4D Rotary Position Embeddings
- 4D Rotary Position Embeddings are an advanced encoding method that extends traditional RoPE to capture spatial and temporal signals using four-dimensional rotations.
- They use techniques such as block-diagonal and axial rotations to independently encode multiple axes while preserving relative position and translation invariance.
- Practical implementations in 4D scene understanding, egocentric video, and multimodal fusion have shown significant improvements in accuracy, efficiency, and scalability.
A 4D Rotary Position Embedding (4D RoPE) generalizes the original rotary position encoding mechanism—first introduced in the context of LLMs as “rotations” in paired embedding dimensions (Su et al., 2021)—to higher-dimensional, specifically four-dimensional, spatial and temporal contexts. This extension enables Transformers and related architectures to process and reason over spatial, temporal, and other multi-axial signals with robust, relative, and expressive positional representations. The concept subsumes conventional 1D or 2D RoPE schemes and supports modular, scalable, and theoretically grounded approaches to positional encoding in domains such as video, 3D/4D scene understanding, multimodal fusion, and spatiotemporal sequence analysis.
1. Mathematical Foundations of Rotary Position Embeddings
The rotary position embedding mechanism encodes an absolute position by assigning each embedding subspace a position-dependent rotation. In standard RoPE, the $d$-dimensional vector is split into $d/2$ pairs, each rotated by a unique angle determined by the position index. For position $m$:
- The $i$-th 2D embedding pair $(x_{2i}, x_{2i+1})$ is rotated using:

$$R(m\theta_i) = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix},$$

where $\theta_i = 10000^{-2i/d}$.

When attention is computed, the inner product between two rotated query and key vectors at positions $m$ and $n$ becomes:

$$\langle R(m\theta_i)\, q,\; R(n\theta_i)\, k \rangle = \langle R((m-n)\theta_i)\, q,\; k \rangle,$$

showing that only the relative position $m - n$ remains after combining rotations.
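As a quick illustration, the following NumPy sketch (function names are illustrative, not from any published implementation) applies this pairwise rotation and checks numerically that the attention score depends only on the offset $m - n$:

```python
# Minimal sketch of standard (1D) RoPE, assuming NumPy only.
import numpy as np

def rope_rotate(x: np.ndarray, pos: float, base: float = 10000.0) -> np.ndarray:
    """Rotate each 2D pair of `x` (shape [d], d even) by pos * theta_i."""
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = base ** (-2 * i / d)           # theta_i = 10000^(-2i/d)
    angles = pos * theta                   # m * theta_i
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]       # the pair (x_{2i}, x_{2i+1})
    out = np.empty_like(x)
    out[0::2] = cos * x_even - sin * x_odd
    out[1::2] = sin * x_even + cos * x_odd
    return out

# The attention score depends only on the relative offset m - n:
q, k = np.random.randn(2, 64)
s1 = rope_rotate(q, 5.0) @ rope_rotate(k, 2.0)      # offset 3
s2 = rope_rotate(q, 103.0) @ rope_rotate(k, 100.0)  # offset 3
assert np.allclose(s1, s2)
```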
In the 4D extension, the embedding is subdivided into tuples of four (or multiple 2D/3D blocks), with each 4-tuple associated with a block-diagonal 4D rotation matrix parameterized by four (or more) independent angles. The general form is:

$$R(p) = \exp\!\Big(\sum_{k=1}^{4} p_k A_k\Big),$$

where $p = (x, y, z, t)$ is the 4D position coordinate and the $A_k$ are commuting skew-symmetric matrices (Yu et al., 4 Jun 2025). This construction leverages Lie group and Lie algebra properties to ensure compositionality and computational efficiency (Ostmeier et al., 14 Jun 2024).
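One simple way to realize commuting generators, used here purely for illustration, is to assign each 2D rotation plane of the embedding to a single axis. The sketch below builds such $A_k$ and verifies that composition is additive in position; the frequency schedule and all names are assumptions, not a specific published parameterization:

```python
# Sketch of the block-diagonal form R(p) = exp(sum_k p_k A_k), with one
# disjoint set of 2D rotation planes per axis so that the A_k commute.
import numpy as np
from scipy.linalg import expm

def make_generators(d: int, n_axes: int = 4) -> list[np.ndarray]:
    """Commuting skew-symmetric A_k: each axis acts on its own 2D planes."""
    A = [np.zeros((d, d)) for _ in range(n_axes)]
    for plane in range(d // 2):
        k = plane % n_axes                 # assign each plane to one axis
        freq = 10000 ** (-2 * plane / d)   # illustrative per-plane frequency
        i, j = 2 * plane, 2 * plane + 1
        A[k][i, j], A[k][j, i] = -freq, freq
    return A

d = 8
A = make_generators(d)
p = np.array([1.0, 2.0, -0.5, 3.0])        # (x, y, z, t)
q = np.array([0.5, 0.0, 1.0, -1.0])
R = lambda pos: expm(sum(c * Ak for c, Ak in zip(pos, A)))
# Commuting generators make composition additive in position:
assert np.allclose(R(p) @ R(q), R(p + q))
```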
2. 4D Rotary Position Embeddings: Design and Variants
The extension of RoPE to four dimensions can be instantiated in several ways, typically to capture spatial (x, y, z) and temporal (t) dimensions:
- Block-Diagonal 4D Rotation: The embedding dimension is divided into blocks of size four. Each block undergoes a rotation following 4D rotational group SO(4) structure, parameterized by multiple angles, each encoding one of the spatial or temporal axes (Unlu, 2023).
- Axial RoPE: The embedding is separated into different subspaces along each axis (spatial dimensions and time). Each axis’s rotary transformation is applied independently within its block, allowing independent encoding of x, y, z, and t, and the blocks are then composed, sometimes by multiplication or addition of the respective block matrices (Zivanovic et al., 26 May 2025); a minimal sketch follows this list.
- Lie Group-Based High-Dimensional RoPE: A learnable linear mapping projects the 4D position vector $p$ into a skew-symmetric generator space, and the rotation matrix is obtained as the matrix exponential of the resulting generator, $R(p) = \exp(A(p))$ with $A(p)$ skew-symmetric and linear in $p$, ensuring that the relative encoding property holds (Ostmeier et al., 14 Jun 2024).
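A minimal sketch of the axial variant referenced above, assuming the embedding divides evenly across four axes and a NumPy-only setting (the helper name and slicing scheme are illustrative):

```python
# Axial RoPE sketch: each coordinate axis gets its own slice of the
# embedding and an independent 1D rotary transform within that slice.
import numpy as np

def axial_rope(x: np.ndarray, coords: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """x: [d] feature vector; coords: [n_axes] position, e.g. (x, y, z, t)."""
    n_axes = coords.shape[0]
    d_axis = x.shape[-1] // n_axes           # per-axis subspace (must be even)
    out = x.copy()
    for a in range(n_axes):
        seg = slice(a * d_axis, (a + 1) * d_axis)
        i = np.arange(d_axis // 2)
        angles = coords[a] * base ** (-2 * i / d_axis)
        cos, sin = np.cos(angles), np.sin(angles)
        e, o = x[seg][0::2], x[seg][1::2]
        out[seg][0::2] = cos * e - sin * o   # rotate pairs within this axis
        out[seg][1::2] = sin * e + cos * o
    return out

# Continuous, irregular coordinates are fine: positions need not lie on a grid.
tok = np.random.randn(32)
rot = axial_rope(tok, np.array([0.3, 1.7, -2.0, 0.25]))
```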
Different designs may focus on preserving the proportionality between geometric or temporal distance and distance in embedding space (e.g., 4D spherical or spatiotemporal encodings), on translation invariance, or on computational efficiency.
3. Properties and Theoretical Guarantees
Key theoretical properties and guarantees for 4D RoPE and its higher-dimensional analogs include:
- Relative Position Encoding: As with the 1D case, combining rotations at positions $p$ and $q$ leads to a relative rotation encoding $R(p)^{\top} R(q) = R(q - p)$, preserving translational invariance and ensuring that attention inherently encodes relative, not absolute, location (Gao et al., 11 May 2024, Zivanovic et al., 26 May 2025, Yu et al., 4 Jun 2025).
- Commutativity Requirement: For $N$-dimensional RoPE, the parameter matrices $A_1, \dots, A_N$ must pairwise commute: $A_i A_j = A_j A_i$ for all $i, j$. This ensures that all orderings of axis-wise rotations result in the same total rotation, which is necessary for proper relative encoding (Yu et al., 4 Jun 2025); see the numeric check after this list.
- Decoupling and Flexibility: Embeddings can be structured so that cross-modal or cross-axis decoupling is achieved (e.g., via circular or cone-like projections in Circle-RoPE for vision-language or spatiotemporal settings), reducing spurious interdependence between unrelated modalities or axes (Wang et al., 22 May 2025).
- Admissibility for Continuous and Irregular Positions: RoPE and its extensions can be defined for real-valued (continuous) positions, enabling operation on non-grid, irregularly sampled, or natural coordinate data (e.g., spatial lattices, event timestamps, 4D scenes) (Zivanovic et al., 26 May 2025).
- Maintenance of Translation Invariance: Provided the commutativity property holds, the relative-position property is maintained, and model predictions remain invariant to global translations along any axis (Gao et al., 11 May 2024, Unlu, 2023).
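Both the commutativity requirement and the relative-rotation identity can be checked numerically; the sketch below reuses the illustrative plane-per-axis generators from Section 1 (an assumed construction, not tied to any one paper):

```python
# Numeric check: with commuting skew-symmetric generators,
# R(p)^T R(q) = R(q - p), so only the relative position survives.
import numpy as np
from scipy.linalg import expm

d, n_axes = 8, 4
A = [np.zeros((d, d)) for _ in range(n_axes)]
for plane in range(d // 2):                  # disjoint planes => A_k commute
    k, (i, j) = plane % n_axes, (2 * plane, 2 * plane + 1)
    A[k][i, j], A[k][j, i] = -1.0, 1.0

def R(p: np.ndarray) -> np.ndarray:          # R(p) = exp(sum_k p_k A_k)
    return expm(sum(pk * Ak for pk, Ak in zip(p, A)))

p = np.array([1.0, -0.5, 2.0, 0.3])
q = np.array([0.2, 0.7, -1.0, 1.1])
assert np.allclose(A[0] @ A[1], A[1] @ A[0])  # pairwise commutativity
assert np.allclose(R(p).T @ R(q), R(q - p))   # only q - p remains
```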
4. Practical Implementations and Applications
Recent work has adapted and implemented 4D and high-dimensional rotary position embeddings in several application domains:
| Domain/Task | 4D RoPE Application & Main Features | Reference |
|---|---|---|
| 4D Scene Understanding | Spatiotemporal prompts encoding (x, y, z, t) injected into visual-language features; Fourier and motion-aware encodings for each dimension | (Zhou et al., 18 May 2025) |
| Egocentric Video | Joint spatial-temporal rotary embedding integrated across all feature dimensions; no manual splitting required; enhances video-language foundation models | (Wang et al., 17 Jun 2025) |
| Multimodal Vision-Language | Decoupled circular/cone-like 4D rotary encoding for spatial tokens; achieves cross-modal bias elimination and spatial consistency | (Wang et al., 22 May 2025) |
| Multidimensional Time-Series/Image/Audio | Axial RoPE: rotary applied independently on each axis (time, channel, etc.); handles arbitrary dimensionality; supports irregular and continuous positions | (Zivanovic et al., 26 May 2025) |
| Hybrid Transformer-SSM | Unified RoPE: identical 4D rotary transformation for both attention and state-space layers, resolving positional incompatibility | (Wu et al., 11 Jun 2025) |
Such 4D RoPE adaptations have demonstrated increased modeling fidelity in spatiotemporal inference, improved robustness against timestamp translation and cross-modal misalignment, and enhanced ability to capture dynamic scene information and complex interactions.
5. Limitations and Challenges
Several limitations and challenges for 4D RoPE and its variants have been noted:
- Block-Dimensionality Constraints: Embedding dimensionality must often be a multiple of 4 (or 3 for 3D spherical encodings); this can complicate integration with pre-existing architectures and pretrained weights (Unlu, 2023).
- Higher-Dimensional Rotation Complexity: The structure of 4D rotations is more intricate (e.g., SO(4) with six rotation planes), and parameterizing these efficiently while maintaining interpretability and computational tractability is nontrivial (Ruscio et al., 23 Oct 2024, Ostmeier et al., 14 Jun 2024).
- Interaction Between Axes/Modalities: Naive composition of independent rotations can result in unintended cross-axis dependencies. Approaches like Circle-RoPE mitigate this by projecting modalities into orthogonal affine subspaces before rotary encoding (Wang et al., 22 May 2025).
- Learning and Expressivity: Fixed-frequency schemes may underutilize certain rotated dimensions in long-sequence tasks; learned or context-aware frequency selection (as in LieRE or CARoPE) addresses these limitations but may introduce additional implementation complexity (Chiang et al., 16 Feb 2025, Ostmeier et al., 14 Jun 2024, Veisi et al., 30 Jul 2025).
- Proportionality and Physical Distance: Ensuring that the distance between rotation-embedded tokens matches the actual physical or semantic distance is not automatic for all rotation parameterizations, and explicit calibration may be needed in practice (Unlu, 2023).
6. Notable Experimental Results
Published empirical results have established that 4D RoPE and its generalized forms enable state-of-the-art or superior performance in several challenging settings:
- In 4D scene-language alignment and grounding, spatiotemporal prompt injection based on 4D rotary encodings yields significant improvements over 3D-only schemes on spatial/temporal grounding accuracy and captioning scores (Zhou et al., 18 May 2025).
- In multi-instance egocentric video-language retrieval tasks, joint spatial-temporal rotary embeddings combined with unified attention modules outperform baselines by 7–9% in mean Average Precision across major datasets (Wang et al., 17 Jun 2025).
- Circle-RoPE achieves Per-Token Distance (PTD) of zero and consistent performance gains on vision-language reasoning (e.g., 52.11 on MMMU validation vs. 50.22 for hard embedding) (Wang et al., 22 May 2025).
- LieRE and ComRoPE, both using high-dimensional (including 4D) rotations parameterized by learnable (commuting) matrices, report consistent accuracy increases (up to 2.9% at high resolutions on ImageNet-1K and >1% on UCF-101 for video) over conventional RoPE (Ostmeier et al., 14 Jun 2024, Yu et al., 4 Jun 2025).
- Hybrid models (TransXSSM) with Unified RoPE achieve 42.3% and 29.5% speed improvements over Transformers alone for 4K-length sequences while outperforming both pure self-attention and SSM baselines in accuracy (Wu et al., 11 Jun 2025).
7. Future Directions
- Learnable and Context-Aware Frequency Modulation: Integrating content-sensitive rotary frequencies (as in CARoPE (Veisi et al., 30 Jul 2025)) can further generalize 4D RoPE, providing token- and head-specific encodings that modulate over multiple spatial/temporal axes.
- Continuous and Irregular Position Handling: Full support for continuous, irregular, and high-dimensional coordinate systems is a clear direction, with Axial RoPE and LieRE offering theoretically grounded mechanisms.
- Cross-Modal and Cross-Axis Decoupling: Ensuring that rotary encodings decouple unrelated structure (e.g., video-text, time-space) without discarding critical intra-modality (or intra-axis) relationships is a focus; solutions include geometric projections (Circle-RoPE), dual-frame fusions, and alternating encoding geometries.
- Combined Learnable and Physical Geometry: Some works combine explicit physical parameters (positions, time, velocities) with learnable projections, balancing feature learning and physical interpretability.
- Efficient Implementation for High-Dimensional Data: Memory- and computation-efficient implementations are critical for scaling 4D RoPE to long contexts, videos, and high-resolution spatiotemporal inputs.
In summary, 4D Rotary Position Embeddings extend the rotation-based relative positional encoding principle to high-dimensional contexts, enabling powerful, flexible, and theoretically justified representations for spatiotemporal, multimodal, and structured sequence data. Recent technological advances leverage Lie group theory, learnable angle matrices, context-aware modulations, and geometrical projections to support a large spectrum of modern machine learning applications across language, vision, audio, and dynamical systems (Su et al., 2021, Unlu, 2023, Ostmeier et al., 14 Jun 2024, Zhou et al., 18 May 2025, Wang et al., 22 May 2025, Yu et al., 4 Jun 2025, Wang et al., 17 Jun 2025, Veisi et al., 30 Jul 2025).