4D Rotary Position Embedding (4D RoPE)
- 4D RoPE is a generalization of rotary position encoding that injects 4D positional information via multidimensional rotations, ensuring attention depends only on relative differences.
- Two main constructions exist: a block-diagonal method with fixed sinusoidal frequencies and a learnable Lie-algebraic (LieRE-style) approach for encoding high-dimensional coordinates.
- Empirical results show that integrating 4D RoPE into Transformers improves spatiotemporal detection and high-dimensional classification metrics while adding little parameter or computational overhead.
A 4D Rotary Position Embedding (4D RoPE) generalizes the rotary position encoding paradigm, originally devised for one-dimensional token sequences, to settings where each token is associated with a position in a four-dimensional coordinate space. This approach enables Transformer architectures to encode relative positional biases across modalities requiring high-dimensional structure, such as spatiotemporal detection or volumetric data, while retaining the theoretical advantages of RoPE: relative-only attention dependence, continuity beyond fixed grid resolutions, and compatibility with orthogonality-preserving attention mechanisms. Unlike fixed or learnable absolute embeddings, 4D RoPE injects positional information by applying multidimensional rotations to query and key vectors in attention modules, yielding representations intrinsically sensitive to spatial and/or temporal displacements.
1. Theoretical Frameworks for 4D RoPE
Rotary Position Embedding (RoPE) originates from RoFormer (Su et al., 2021), which encodes scalar positional indices by rotating each 2D subspace of a model's embedding space via block-diagonal orthogonal matrices. RoPE's critical property is that the resulting inner product between rotated queries and keys depends solely on relative position differences, not absolute locations.
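To make the relative-only property concrete, the following standard derivation (not taken from any one of the cited papers) shows it for a single 2D subspace rotated at rate $\theta$; the block-diagonal 4D construction below applies the same identity plane by plane:

$$
R_2(\alpha) = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}, \qquad
\langle R_2(\theta m)\, q,\; R_2(\theta n)\, k \rangle
= q^{\top} R_2(\theta m)^{\top} R_2(\theta n)\, k
= q^{\top} R_2\!\big(\theta (n-m)\big)\, k,
$$

so the attention score depends on the positions $m$ and $n$ only through their difference $n - m$, because 2D rotations compose additively in their angles.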
For the extension to four dimensions, two major frameworks are established. The first, block-diagonal 4D RoPE, constructs the global rotation as a direct product of independent 2D plane rotations, assigning each of the four positional coordinates (e.g., spatial axes and time) to distinct 2D rotational subblocks. The resulting full rotation matrix is built by stacking these blocks, ensuring orthogonality and the relative-only dependence (Su et al., 2021, Ji et al., 17 Apr 2025). Mathematically, for a token position vector $p = (p_1, p_2, p_3, p_4)$, the rotation is

$$
R(p) \;=\; \bigoplus_{k=1}^{4} \;\bigoplus_{j=1}^{d/8} R_2\big(\theta_j\, p_k\big),
$$

where $d$ is the head dimension and each 2D block $R_2(\theta_j p_k)$ rotates its plane by angle $\theta_j p_k$.
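As an illustration, the following is a minimal PyTorch sketch (not taken from the cited papers; the function names and the exponential frequency schedule are assumptions made for concreteness) that materializes $R(p)$ explicitly as a block-diagonal orthogonal matrix and checks the relative-only property:

```python
import torch

def rot2d(angle: torch.Tensor) -> torch.Tensor:
    """2x2 rotation matrix R_2(angle)."""
    c, s = torch.cos(angle), torch.sin(angle)
    return torch.stack([torch.stack([c, -s]), torch.stack([s, c])])

def block_diag_rope4d(p: torch.Tensor, d: int, base: float = 10000.0) -> torch.Tensor:
    """Dense block-diagonal 4D RoPE rotation R(p) for one position p = (p1, p2, p3, p4).

    Each coordinate owns d/4 embedding dimensions, i.e. d/8 independent 2D planes,
    and plane j of coordinate k is rotated by angle theta_j * p_k.
    """
    assert d % 8 == 0 and p.numel() == 4
    n_pairs = d // 8
    # Fixed exponential frequency schedule, analogous to 1D RoPE, per coordinate group.
    theta = base ** (-2.0 * torch.arange(n_pairs, dtype=p.dtype) / (d // 4))
    blocks = [rot2d(theta[j] * p[k]) for k in range(4) for j in range(n_pairs)]
    return torch.block_diag(*blocks)  # (d, d), orthogonal

# Orthogonality and relative-only dependence: R(a)^T R(b) == R(b - a).
a, b = torch.tensor([1.0, 2.0, 3.0, 0.5]), torch.tensor([2.0, 0.0, 1.0, 1.5])
R_a, R_b = block_diag_rope4d(a, 16), block_diag_rope4d(b, 16)
assert torch.allclose(R_a.T @ R_a, torch.eye(16), atol=1e-5)
assert torch.allclose(R_a.T @ R_b, block_diag_rope4d(b - a, 16), atol=1e-5)
```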
The second framework generalizes beyond block-diagonality, learning a linear map from the token’s 4D coordinates directly to a skew-symmetric matrix in the Lie algebra $\mathfrak{so}(4)$, as in Lie Relative Encodings (LieRE) (Ostmeier et al., 14 Jun 2024). The rotation applied is then

$$
R(p) = \exp\!\big(A(p)\big),
$$

where $A$ is a learned linear map into $\mathfrak{so}(4)$ and $p$ is the 4D position. This approach exploits the full expressive capacity of the special orthogonal group SO(4), but requires computation of the matrix exponential.
2. Explicit Construction and Implementation
The block-diagonal (plane-wise) 4D RoPE is implemented by splitting the model’s hidden dimension $d$ into four groups (typically with $d$ divisible by $8$), assigning each quarter of the embedding to one coordinate, and applying pairwise 2D rotations with frequency-scaled angles. Sinusoidal frequencies are typically drawn from fixed exponential schedules analogous to original RoPE, e.g. $\theta_j = 10000^{-2j/(d/4)}$ for $j = 0, \dots, d/8 - 1$.
For a query $q$ at 4D position $p = (p_1, p_2, p_3, p_4)$ (with a dummy axis for alignment), $q$ is partitioned as $q = (q^{(1)}, q^{(2)}, q^{(3)}, q^{(4)})$, and each $q^{(k)}$ is sub-divided into 2D pairs. Pair $j$ of $q^{(k)}$ is rotated by an angle $\theta_j p_k$, yielding the rotated query. This same construction is mirrored for key vectors.
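In practice the dense rotation matrix is never formed; the rotation is applied elementwise with precomputed cosine/sine tables. The sketch below is a hedged illustration of this application (the array layout, function name, and pairing of even/odd channels are implementation choices, not mandated by the cited papers); its final assertion numerically verifies the relative-offset property made explicit in the next equation.

```python
import torch

def apply_rope4d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate x (..., d) by the block-diagonal 4D RoPE rotation at positions pos (..., 4).

    The head dimension d is split into four contiguous quarters, one per coordinate;
    within each quarter, consecutive (even, odd) channel pairs form the 2D planes.
    """
    d = x.shape[-1]
    assert d % 8 == 0 and pos.shape[-1] == 4
    n_pairs = d // 8
    theta = base ** (-2.0 * torch.arange(n_pairs, dtype=x.dtype, device=x.device) / (d // 4))
    # Angles per 2D pair: (..., 4, n_pairs) flattened to (..., d/2), matching the pair layout of x.
    ang = (pos.unsqueeze(-1) * theta).reshape(*pos.shape[:-1], -1)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Shifting all positions by a common 4D offset leaves the attention score unchanged.
q, k = torch.randn(16), torch.randn(16)
pq, pk = torch.tensor([1.0, 2.0, 3.0, 0.5]), torch.tensor([4.0, 0.0, 1.0, 2.5])
shift = torch.tensor([10.0, -3.0, 7.0, 1.0])
s1 = apply_rope4d(q, pq) @ apply_rope4d(k, pk)
s2 = apply_rope4d(q, pq + shift) @ apply_rope4d(k, pk + shift)
assert torch.allclose(s1, s2, atol=1e-4)
```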
The inner product between rotated queries and keys yields an explicit attention bias:

$$
\langle R(p_q)\, q,\; R(p_k)\, k \rangle \;=\; q^{\top} R(p_q)^{\top} R(p_k)\, k \;=\; q^{\top} R(p_k - p_q)\, k,
$$

ensuring that self-attention is sensitive only to the relative offset $p_k - p_q$ along each coordinate.
In the LieRE formulation, $A(p)$ is a linear combination of the six standard generators of $\mathfrak{so}(4)$, with learned coefficients that depend linearly on $p$. The rotation matrix is then the matrix exponential of this skew-symmetric matrix, efficiently computable via scaling-and-squaring or diagonalization. Application of $R(p)$ to queries and keys is standard matrix multiplication.
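A compact sketch of this construction follows. It assumes, purely for illustration, that the learned map produces a $4 \times 4$ generator whose exponential is applied to consecutive 4-dimensional chunks of each head; the module name and the linear-layer parameterization of the six coefficients are not LieRE's reference implementation.

```python
import torch
import torch.nn as nn

def so4_generators() -> torch.Tensor:
    """The six standard basis elements of so(4): E_ij - E_ji for i < j."""
    gens = []
    for i in range(4):
        for j in range(i + 1, 4):
            g = torch.zeros(4, 4)
            g[i, j], g[j, i] = 1.0, -1.0
            gens.append(g)
    return torch.stack(gens)  # (6, 4, 4)

class LieRotary4D(nn.Module):
    """Learned Lie-algebraic 4D rotary encoding: R(p) = exp(A(p)), A(p) in so(4)."""

    def __init__(self):
        super().__init__()
        self.coeffs = nn.Linear(4, 6, bias=False)       # linear map p -> generator coefficients
        self.register_buffer("gens", so4_generators())  # (6, 4, 4)

    def forward(self, x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        """Rotate x (..., d), with d divisible by 4, at 4D positions pos (..., 4)."""
        a = self.coeffs(pos)                                   # (..., 6) coefficients
        A = torch.einsum("...g,gij->...ij", a, self.gens)      # skew-symmetric (..., 4, 4)
        R = torch.linalg.matrix_exp(A)                         # rotation in SO(4)
        chunks = x.reshape(*x.shape[:-1], -1, 4)               # (..., d/4, 4)
        return torch.einsum("...ij,...cj->...ci", R, chunks).reshape(x.shape)

rope = LieRotary4D()
q = torch.randn(2, 8, 16)     # (batch, tokens, head_dim)
pos = torch.randn(2, 8, 4)    # 4D coordinates per token
q_rot = rope(q, pos)          # same shape, rotated queries
```

Here `torch.linalg.matrix_exp` computes the batched matrix exponential in a single call, so the per-token exponentials are evaluated together rather than in a Python loop.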
3. Integration into Transformer Architectures
Incorporation of 4D RoPE into Transformer models requires minimal architectural modifications. In (Ji et al., 17 Apr 2025), 4D RoPE replaces 1D or 2D rotary position encodings in the attention modules. Position information is propagated through the model as a 4D vector (e.g., normalized spatial coordinates together with a timestamp), from which the corresponding rotations are computed as described above and applied to all queries and keys prior to dot-product attention evaluation.
In the case of streaming architectures such as StreamPETR, the normalized BEV spatial coordinates and timestamps are used to formulate the 4D positional vector, allowing the model to explicitly correlate both spatial alignment and temporal evolution within attention (Ji et al., 17 Apr 2025). This enables direct integration of spatiotemporal cues with no auxiliary branches or parameters beyond the trigonometric operations for subvector rotation.
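The following is a hypothetical end-to-end sketch rather than code from StreamPETR or RoPETR: normalized spatial reference points and timestamps are concatenated into a 4D position tensor, and queries and keys are rotated (restating the block-diagonal application from above so the snippet runs standalone) before standard scaled dot-product attention. All tensor shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def apply_rope4d(x, pos, base=10000.0):
    """Block-diagonal 4D RoPE applied elementwise (even/odd channel pairs), as sketched above."""
    d, n_pairs = x.shape[-1], x.shape[-1] // 8
    theta = base ** (-2.0 * torch.arange(n_pairs, dtype=x.dtype) / (d // 4))
    ang = (pos.unsqueeze(-1) * theta).reshape(*pos.shape[:-1], -1)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * ang.cos() - x2 * ang.sin()
    out[..., 1::2] = x1 * ang.sin() + x2 * ang.cos()
    return out

# Hypothetical query state: normalized spatial coordinates in [0, 1] and per-query timestamps.
B, N, H, Dh = 2, 900, 8, 64                 # batch, queries, heads, head dim (illustrative)
xyz = torch.rand(B, N, 3)                   # normalized spatial reference points
t = torch.rand(B, N, 1)                     # normalized timestamps
pos4d = torch.cat([xyz, t], dim=-1)         # (B, N, 4) positional vector

q = torch.randn(B, H, N, Dh)
k = torch.randn(B, H, N, Dh)
v = torch.randn(B, H, N, Dh)

# Rotate queries and keys with the shared 4D positions (broadcast over heads),
# then run standard scaled dot-product attention.
pos = pos4d.unsqueeze(1)                    # (B, 1, N, 4)
q_rot, k_rot = apply_rope4d(q, pos), apply_rope4d(k, pos)
out = F.scaled_dot_product_attention(q_rot, k_rot, v)   # (B, H, N, Dh)
```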
Similarly, the LieRE approach (Ostmeier et al., 14 Jun 2024) generalizes this integration to arbitrary $n$-dimensional coordinate spaces, as long as each token is associated with an $n$-dimensional coordinate and the rotation dimension is matched to the per-head embedding size.
4. Empirical Performance and Properties
Experimental results demonstrate the empirical benefits of 4D RoPE, particularly in camera-only 3D object detection and dense, high-dimensional classification tasks. In RoPETR, augmenting StreamPETR with 4D RoPE yields up to 30% reduction in mean Average Velocity Error (mAVE) on the NuScenes challenge compared to the baseline, while improving the NuScenes Detection Score (NDS) to 70.9% (Ji et al., 17 Apr 2025). These improvements are attributed to the rotary encoding’s ability to smoothly encode both spatial proximity and temporal consistency, critical for velocity estimation in dynamic scenes.
In LieRE, generalization to 2D and 3D image tasks achieves 1.5% and 1% improvements over state-of-the-art baselines, along with superior extrapolation to resolutions outside the training grid. Training is efficient: results on CIFAR100 are reported as reproducible in under 30 minutes on 4 A100 GPUs (Ostmeier et al., 14 Jun 2024).
Key theoretical and practical properties of 4D RoPE include:
- Relative-only dependence: Attention scores depend only on relative position differences.
- Generalization to novel resolutions: The learned or fixed frequency maps can be evaluated for arbitrary (including out-of-distribution) positions.
- Efficiency: No relative bias tables; for the block-diagonal variant, the total cost of computing the rotations is $O(Nd)$, where $N$ is the sequence length and $d$ the head size.
- Modality-agnostic: Construction and implementation are agnostic to spatial, spatiotemporal, or other continuous coordinate systems.
5. Comparison of Fixed (Sinusoidal) vs Lie-Algebraic 4D RoPE
Two approaches to 4D RoPE dominate the literature:
| Method | Parameterization | Computational Cost |
|---|---|---|
| Block-diagonal RoPE | Fixed sinusoidal frequencies; 2D plane rotations per coordinate axis | $O(d)$ per rotation |
| LieRE (SO(4) RoPE) | Learned linear map to $\mathfrak{so}(4)$; matrix exponential | Matrix exponential per token |
The block-diagonal method retains original RoPE’s parameter-free structure and is straightforward to implement. The Lie-algebraic approach offers greater expressive capacity by learning arbitrary combinations of 4D plane rotations, at the expense of more complex matrix exponential evaluations (Ostmeier et al., 14 Jun 2024).
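To make the trade-off concrete, the rough micro-benchmark sketch below compares the elementwise block-diagonal application against a per-token dense matrix exponential, assuming (as one possible realization) that the learned rotation acts at the full head dimension. Absolute numbers are illustrative only and depend on hardware, batching, and whether the exponentials can be cached.

```python
import time
import torch

N, d = 1024, 64                                 # tokens, head dim (illustrative)
x = torch.randn(N, d)
pos = torch.randn(N, 4)

def timeit(fn, reps=10):
    fn()                                        # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

# Block-diagonal: elementwise cos/sin rotation, O(N*d).
theta = 10000.0 ** (-2.0 * torch.arange(d // 8) / (d // 4))
ang = (pos.unsqueeze(-1) * theta).reshape(N, -1)
def block_diag_rotate():
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * ang.cos() - x2 * ang.sin()
    out[..., 1::2] = x1 * ang.sin() + x2 * ang.cos()
    return out

# Lie-algebraic: per-token d x d matrix exponential plus matmul, O(N*d^3).
A = torch.randn(N, d, d)
A = A - A.transpose(-1, -2)                     # skew-symmetric stand-in for A(p)
def lie_rotate():
    R = torch.linalg.matrix_exp(A)
    return torch.einsum("nij,nj->ni", R, x)

print(f"block-diagonal: {timeit(block_diag_rotate) * 1e3:.2f} ms")
print(f"matrix exp    : {timeit(lie_rotate) * 1e3:.2f} ms")
```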
6. Research Impact and Extensibility
The introduction and empirical validation of 4D RoPE have directly enabled Transformer-based architectures to handle spatiotemporal detection, volumetric image and video tasks, and any domain requiring structured coordinate encoding over four dimensions. The approach is modality-agnostic, extending seamlessly to arbitrary $n$-dimensional settings contingent on the sufficiency of coordinate information for each token (Ostmeier et al., 14 Jun 2024).
Both fixed-frequency (sinusoidal) and learnable Lie group variants preserve compatibility with attention mechanism properties, such as orthogonality and linearized kernels, broadening their applicability to efficient or scalable attention models (Su et al., 2021).
A plausible implication is that as downstream tasks demand increasingly sophisticated positional biases—across space, time, or more abstract coordinates—these high-dimensional RoPE strategies will become central to model design, superseding ad hoc token or patch embedding schemes.