
4D Rotary Position Embeddings

Updated 20 September 2025
  • 4D RoPE is a method that extends rotary position embeddings to multiple dimensions, encoding spatial, temporal, and modal positions using independent, deterministic rotations.
  • It partitions the embedding space into distinct subgroups and applies block-diagonal rotations along each positional axis to maintain relative geometric consistency.
  • Experimental results show that 4D RoPE enhances training speed, reduces error rates, and effectively models long-range dependencies in diverse Transformer applications.

Rotary Position Embeddings (RoPE) generalize positional encoding in Transformers by replacing the standard additive or learned absolute positional approaches with deterministic, continuous, and inherently relative phase rotations in feature space. In their canonical form, RoPE partitions the model’s feature dimension into two-dimensional subspaces and imparts each with a rotation parameterized by position. While the original construction handles 1D sequences, the RoPE framework naturally extends to higher dimensions—including 2D (images), 3D (spatiotemporal or spherical coordinates), and 4D or beyond (video, geospatial, or multidimensional trajectories)—by structuring the rotations along multiple independent coordinate axes, each corresponding to a positional dimension. The emergence of 4D or multi-dimensional RoPE is crucial for models operating over video, geospatial, medical, or complex multimodal data, enabling contextually robust, parameter-efficient, and theoretically principled positional encoding.

1. Mathematical Structure of Multidimensional and 4D RoPE

At its core, RoPE encodes absolute positional information via repeated application of block-diagonal rotation matrices. For position $m \in \mathbb{N}$ and embedding $x \in \mathbb{R}^d$ (with $d$ even), RoPE rotates every feature pair as

$$\begin{bmatrix} x_{2i}' \\ x_{2i+1}' \end{bmatrix} = \begin{bmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{bmatrix} \begin{bmatrix} x_{2i} \\ x_{2i+1} \end{bmatrix}$$

with $\theta_i = 10000^{-2i/d}$.
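
As a concrete illustration, a minimal PyTorch sketch of the formula above (not any particular library's implementation; the interleaved pair layout and base 10000 follow the definition of $\theta_i$, and the function name is illustrative):

```python
import torch

def rope_1d(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate interleaved feature pairs (x_{2i}, x_{2i+1}) of x (shape [seq, d], d even)
    by angles m * theta_i, where m is the token position and theta_i = base^(-2i/d)."""
    seq, d = x.shape
    theta = base ** (-2.0 * torch.arange(d // 2, dtype=x.dtype) / d)   # (d/2,)
    angles = positions.to(x.dtype)[:, None] * theta[None, :]           # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin                          # 2x2 rotation per pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```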

For multi-dimensional positions $x = (x^{(1)}, \ldots, x^{(D)})$, Axial RoPE generalizes the above by splitting the embedding into $D$ groups and independently rotating each group using position-specific angles for its axis:

$$z_j^{(d)} \rightarrow R^{x^{(d)}} z_j^{(d)}$$

where $R^{x^{(d)}}$ rotates the corresponding subgroup of features by the position along axis $d$.

In 4D RoPE, this procedure partitions the embedding space into four subgroups (corresponding, for example, to width, height, time, and channel), each with its own positional rotation. Attention between queries and keys is then expressed in terms of their relative geometric positions, maintaining the core RoPE property

$$f_q(x_q, m)^\top f_k(x_k, n) = x_q^\top R_{n-m} \, x_k$$

where $R_{n-m}$ is the relative rotation composed along all axes (potentially a matrix exponential, as in ComRoPE (Yu et al., 4 Jun 2025)).
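
A hedged sketch of the axial 4D case, assuming four equal, even-sized feature groups (the (width, height, time, channel) ordering and the helper name are illustrative):

```python
import torch

def axial_rope_4d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (n_tokens, d) with d divisible by 8; pos: (n_tokens, 4) holding, e.g.,
    (width, height, time, channel) coordinates. Feature group k is rotated by the
    k-th coordinate, so attention scores depend only on per-axis offsets."""
    n, d = x.shape
    assert d % 8 == 0, "need an even sub-dimension per positional axis"
    d_axis = d // 4                                                    # features per axis
    theta = base ** (-2.0 * torch.arange(d_axis // 2, dtype=x.dtype) / d_axis)
    out = torch.empty_like(x)
    for k in range(4):
        xk = x[:, k * d_axis:(k + 1) * d_axis]
        angles = pos[:, k:k + 1].to(x.dtype) * theta[None, :]          # (n, d_axis/2)
        cos, sin = angles.cos(), angles.sin()
        out[:, k * d_axis:(k + 1) * d_axis:2] = xk[:, 0::2] * cos - xk[:, 1::2] * sin
        out[:, k * d_axis + 1:(k + 1) * d_axis:2] = xk[:, 0::2] * sin + xk[:, 1::2] * cos
    return out
```

Rotating queries and keys by their own 4D positions with such a helper and taking their dot product then yields scores that depend only on the per-axis offsets, i.e. the relative form above with $R_{n-m}$ composed blockwise across the four axes.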

2. Geometric and Physical Interpretability in Higher Dimensions

RoPE’s rotation-based parameterization allows an explicit geometric interpretation for multi-dimensional data. For spherical (geospatial) coordinates, the embedding rotation is defined by Euler angles, most typically longitude $\theta$ and latitude $\phi$, leading to a block-diagonal matrix with each $3 \times 3$ block parameterized by the physical coordinates:

$$R(\theta, \phi) = \begin{bmatrix} \cos\theta & -\cos\phi\sin\theta & \sin\phi\sin\theta \\ \sin\theta & \cos\phi\cos\theta & -\sin\phi\cos\theta \\ 0 & \sin\phi & \cos\phi \end{bmatrix}$$

This approach guarantees that Euclidean distance in the embedding space reflects the angular (and thus physical) geospatial distance (Unlu, 2023, Unlu, 23 Mar 2024).
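
A minimal sketch of applying this spherical block, assuming angles in radians and a feature dimension divisible by 3; the cited papers additionally scale the angles per block with different frequencies, which is omitted here for clarity (helper names are illustrative):

```python
import math
import torch

def spherical_block(theta: float, phi: float) -> torch.Tensor:
    """3x3 rotation block parameterized by longitude theta and latitude phi (radians)."""
    ct, st = math.cos(theta), math.sin(theta)
    cp, sp = math.cos(phi), math.sin(phi)
    return torch.tensor([
        [ct, -cp * st,  sp * st],
        [st,  cp * ct, -sp * ct],
        [0.0,      sp,       cp],
    ])

def spherical_rope(x: torch.Tensor, theta: float, phi: float) -> torch.Tensor:
    """Rotate every consecutive feature triple of x (length divisible by 3)
    by the token's geographic coordinates."""
    R = spherical_block(theta, phi)
    return (x.reshape(-1, 3) @ R.T).reshape(-1)
```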

Video and spatiotemporal data necessitate further extension to 4D rotary embeddings. Methods such as VRoPE (Liu et al., 17 Feb 2025) construct a 4D index for each token: spatial coordinates (width, height), temporal index (frame), and a cross-modal dimension for video-text transitions. Rotations are assigned to guarantee smooth locality-preserving transitions across all dimensions and modalities.
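
For intuition, a hypothetical sketch of what such a 4D token index could look like for $T \times H \times W$ video patches followed by text tokens; the exact coordinate assignment, especially at the video-text boundary, is VRoPE-specific and differs from this placeholder:

```python
import torch

def video_text_index(T: int, H: int, W: int, n_text: int) -> torch.Tensor:
    """Illustrative 4D position index (width, height, frame, modality) for T*H*W video
    patch tokens followed by n_text text tokens. The boundary handling here is a
    placeholder, not VRoPE's exact scheme."""
    t, h, w = torch.meshgrid(torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij")
    zeros_v = torch.zeros(T * H * W, dtype=torch.long)
    video = torch.stack([w.flatten(), h.flatten(), t.flatten(), zeros_v], dim=-1)
    zeros_t = torch.zeros(n_text, dtype=torch.long)
    text = torch.stack([zeros_t, zeros_t, T + torch.arange(n_text),
                        torch.ones(n_text, dtype=torch.long)], dim=-1)
    return torch.cat([video, text], dim=0)          # (T*H*W + n_text, 4)
```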

3. Implementation in Transformer Architectures and Extensions

To integrate 4D RoPE in Transformers, the query and key features are divided according to the number of positional axes; each group undergoes its respective rotation before the attention scores are computed:

  • In video LLMs, VRoPE (Liu et al., 17 Feb 2025) transforms spatial indices into composite $(u, v)$ coordinates via $[u; v] = [w + h, \; w - h + H - 1]$, applies symmetric positive/negative rotations, and ensures continuity at video-text boundaries, yielding a 4-way partition spanning spatial and cross-modal alignment.
  • In masked autoencoders for regular and irregular multidimensional data, RoMAE (Zivanovic et al., 26 May 2025) utilizes Axial RoPE with continuous-valued positions, facilitating learning across arbitrary multi-channel settings (including 4D).
  • ComRoPE (Yu et al., 4 Jun 2025) further generalizes rotation matrices from fixed parameters to learnable, pairwise-commuting skew-symmetric matrices, exponentially increasing flexibility and allowing the use of trainable angles in arbitrary dimensions. The commutativity constraint guarantees that attention depends only on the positional offset and maintains robust relative encoding.
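
A minimal sketch of the ComRoPE-style construction, assuming $D$ trainable generators projected to skew-symmetric form; the pairwise-commutativity constraint that ComRoPE imposes through its parameterization is noted but not enforced here:

```python
import torch

def comrope_rotation(pos: torch.Tensor, generators: torch.Tensor) -> torch.Tensor:
    """pos: (D,) positional coordinates; generators: (D, d, d) trainable matrices.
    Each generator is projected to a skew-symmetric A_k, and the rotation is
    R(pos) = exp(sum_k pos_k * A_k). If the A_k pairwise commute (a constraint
    ComRoPE enforces by construction, not checked here), then
    R(m)^T R(n) = R(n - m), so attention depends only on the positional offset."""
    A = 0.5 * (generators - generators.transpose(-1, -2))   # skew-symmetric => exp() is orthogonal
    S = torch.einsum("k,kij->ij", pos.to(A.dtype), A)
    return torch.linalg.matrix_exp(S)
```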

Table: Illustration of 4D RoPE Integration by Application Domain

| Domain | Positional Axes | Partitioning in Embedding | Rotational Construction |
|---|---|---|---|
| Geospatial | Longitude, latitude | Groups of 3 | Spherical Euler-angle blocks (Unlu, 2023, Unlu, 23 Mar 2024) |
| Video LLMs | Width, height, frame, modality | Groups of 4 | VRoPE multi-axis with cross-modal transitions (Liu et al., 17 Feb 2025) |
| Irregular multivariate | Each signal dimension | D groups | Axial RoPE with continuous position input (Zivanovic et al., 26 May 2025) |
| General N-D data | All axes | D groups/blocks | ComRoPE trainable matrices per axis (Yu et al., 4 Jun 2025) |

4. Experimental Results and Empirical Properties

  • In speech and ASR tasks, multidimensional RoPE achieves lower error rates and accelerates training. For Conformer models, RoPE produces up to 21% faster training and consistently lower WERs versus relative position embeddings (Zhang et al., 10 Jan 2025, Li et al., 2021).
  • In video-LLMs, VRoPE yields superior performance in both general video understanding and long-context retrieval, achieving high retrieval accuracy even for lengthy videos and outperforming RoPE-3D (Liu et al., 17 Feb 2025).
  • In geospatial transformers, spherical RoPE encodings yield lower loss in spatial distance prediction, confirming proportionality between physical and embedding-space distances (Unlu, 2023, Unlu, 23 Mar 2024).
  • In multidimensional masked autoencoders, RoMAE surpasses specialized sequence models in irregular time-series and maintains competitive performance in vision tasks (Zivanovic et al., 26 May 2025).
  • When using trainable commuting rotation matrices (ComRoPE), classification accuracy improves over state-of-the-art fixed-parameter alternatives, with gains magnified in high-resolution and high-dimensional (3D/4D) settings (Yu et al., 4 Jun 2025).

5. Analysis, Interpretation, and Theoretical Limitations

The RoPE mechanism, even in 4D, is associated with critical theoretical properties:

  • Maintaining attention as a function solely of the relative offset enables invariance to absolute position and supports robust length generalization and extrapolation (Su et al., 2021, Zhong et al., 19 Jun 2024).
  • Multi-resolution and wavelet-like properties emerge naturally as each rotary frequency encodes a different scale. This decomposition underpins the Transformer's ability to balance local (high-frequency) and global (low-frequency) context, resembling scale-space or wavelet analysis (Ruscio et al., 23 Oct 2024, Oka et al., 4 Feb 2025); the short sketch after this list makes these per-pair scales explicit.
  • Empirically, the low-frequency (later-index) rotary components are pivotal in modeling long-range dependencies, especially in the "positional heads" of long-context LLMs (Hong et al., 11 Oct 2024). However, dimension underutilization is an observed phenomenon: the high-frequency axes (the first dimensions of each rotary block) contribute little at long distances, implying a practical inefficiency for long-context retrieval (Chiang et al., 16 Feb 2025).
  • RoPE-based Transformers, despite their empirical generalization strength, have formally bounded circuit complexity, with constant-depth threshold circuit expressivity unless architectural or depth modifications are made (Chen et al., 12 Nov 2024).
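
The multi-scale picture in the points above can be made concrete by computing the wavelength of each rotary pair, i.e. the relative distance over which it completes one full rotation, under the standard $\theta_i = 10000^{-2i/d}$ schedule (a small sketch; the function name is illustrative):

```python
import torch

def rope_wavelengths(d: int, base: float = 10000.0) -> torch.Tensor:
    """Wavelength (in positions) of each rotary pair: lambda_i = 2*pi / theta_i.
    Early pairs rotate quickly and resolve fine, local offsets; late (low-frequency)
    pairs rotate slowly and remain informative at long relative distances."""
    theta = base ** (-2.0 * torch.arange(d // 2, dtype=torch.float64) / d)
    return 2 * torch.pi / theta

# e.g. for d = 128, wavelengths span roughly 6 positions up to ~54,000 positions
```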

6. Implementation, Deployment, and Practical Considerations

  • Accurate and efficient implementation of 4D RoPE requires careful partitioning of the embedding, efficient rotational computation (preferably via block-diagonal matrix operations), and preservation of commutativity in learnable extensions.
  • Techniques such as Fast RoPE Attention (Alman et al., 17 May 2025) accelerate attention calculation in the presence of rotary embeddings by polynomial approximation and FFT of Toeplitz-like matrices, enabling near-linear scaling—crucial for long multimodal contexts.
  • Integration in major frameworks (e.g., Hugging Face Transformers for RoFormer, SpeechBrain for speech, and open-sourced VRoPE/ComRoPE for video and high-D vision models) streamlines adoption across language, speech, and vision domains (Su et al., 2021, Zhang et al., 10 Jan 2025, Liu et al., 17 Feb 2025, Yu et al., 4 Jun 2025).
  • Proper tuning of positional parameters, axis partitioning, and, when necessary, positional ID remapping (to prevent attention decay across modalities in LVLMs; see ID-Align (Li et al., 27 May 2025)) is vital for high-dimensional integration.
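
As a usage pattern (a sketch assuming a rotation helper such as the axial one sketched in Section 3), queries and keys are each rotated by their own positions before the scaled dot product; the relative property then makes the scores depend only on positional offsets:

```python
import torch

def rope_attention_scores(q, k, q_pos, k_pos, rope_fn):
    """q, k: (n_q, d) and (n_k, d); q_pos, k_pos: matching (possibly multi-dimensional)
    position tensors; rope_fn: any rotation helper with signature rope_fn(x, pos)."""
    q_rot, k_rot = rope_fn(q, q_pos), rope_fn(k, k_pos)
    return (q_rot @ k_rot.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
```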

7. Variants, Extensions, and Future Research Directions

Advanced adaptations—CARoPE (context-aware RoPE) (Veisi et al., 30 Jul 2025), learnable matrix exponentials (ComRoPE) (Yu et al., 4 Jun 2025), and hybrid/cone-like mappings for cross-modal alignment (Circle-RoPE) (Wang et al., 22 May 2025)—demonstrate a trend toward more adaptive, robust, and context-sensitive multi-dimensional rotation schemes. Open problems include:

  • Generalizing the commutativity constraint in trainable matrix exponentials to further expand the expressivity of rotary embeddings while maintaining the key relative property.
  • Addressing dimension underutilization by frequency remapping or hybridizing RoPE with multi-scale (e.g., wavelet-based) positional representations (Oka et al., 4 Feb 2025).
  • Efficient implementation for streaming and real-time inference in high-dimensional, long-context tasks, leveraging structure-aware algorithms like Fast RoPE Attention (Alman et al., 17 May 2025).
  • Automated positional remapping and context conditioning to mitigate attention decay and cross-modal bias in dense and multimodal regimes (Li et al., 27 May 2025, Wang et al., 22 May 2025).

In summary, 4D Rotary Position Embeddings operationalize position encoding in arbitrarily high dimensions via deterministic, composable, and often learnable rotations. This enables robust, efficient, and scalable modeling of sequence, spatial, and spatiotemporal positional dependencies in modern Transformer architectures.
