Omni-Modality 3D RoPE
- Omni-Modality 3D RoPE is a unified positional encoding scheme that integrates time, height, and width to support effective cross-modal reasoning.
- It applies independent sine-cosine planar rotations on token subcomponents to preserve relative positional invariance across text, image, audio, and video data.
- Empirical results show a +7% improvement over previous models on video benchmarks, underscoring its impact on spatio-temporal data alignment.
Omni-Modality 3D Rotary Positional Encoding (3D RoPE) is a unified positional encoding scheme introduced in the Uni-MoE-2.0-Omni model to support spatio-temporal cross-modality alignment within large-scale omnimodal transformer architectures. Unlike conventional 1D or 2D positional encodings, 3D RoPE enables a transformer to jointly process text, image, audio, and video data by embedding position along orthogonal axes of time, height, and width directly into the multi-head self-attention mechanism. This method preserves the relative positional properties necessary for effective cross-modal reasoning and long-context understanding while providing a shared alignment space across previously disparate modalities (Li et al., 16 Nov 2025).
1. Motivation for 3D RoPE in Omnimodal Transformers
Traditional 1D rotary or absolute positional encodings are limited to sequential time or token axis indexing and are suitable mainly for text. They fail to capture spatial structure (as in images, which require both height and width), and omit temporal structure necessary for video and audio streams. 2D RoPE extends to cover two spatial dimensions, but still omits true temporal encoding, which leads to misalignment between temporal progression (such as video frames or audio chunks) and spatial locations (such as image or video patches).
The objectives of 3D RoPE are:
- To provide a modality-agnostic positional embedding applicable to text, images, audio, and video, parameterized by three orthogonal axes: Time ($t$), Height ($h$), Width ($w$).
- To preserve the relative-position property of rotary embeddings, whereby query–key inner products depend only on relative positional offsets.
- To enable spatio-temporal cross-modal alignment, ensuring, for example, that a video patch at position $(t, h, w)$ aligns meaningfully with the co-occurring audio chunk covering time $t$.
2. Mathematical Formulation
Let $d$ be the per-head dimensionality and assume $d$ is divisible by $6$. Each attention head's input vector is decomposed into three equal-size chunks, corresponding to $t$, $h$, and $w$, and each chunk is further split into paired subcomponents. Standard 2-dimensional rotary transformations (sine/cosine pairwise rotations) are applied independently to each.
For 1D RoPE (single axis), with position $p$ and channel pair $(x_{2i}, x_{2i+1})$:
$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(p\,\theta_i) & -\sin(p\,\theta_i) \\ \sin(p\,\theta_i) & \cos(p\,\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}, \qquad \theta_i = 10000^{-2i/d}.$$
For 3D RoPE, denoting positions as the triple $(t, h, w)$: the head vector decomposes as $x = [\,x^{(t)};\, x^{(h)};\, x^{(w)}\,]$, with each $x^{(\cdot)} \in \mathbb{R}^{d/3}$. The 3D RoPE transformation is applied as a composition of three planar rotations,
$$\mathrm{RoPE}_{3\mathrm{D}}(x;\, t, h, w) = [\, R(t)\,x^{(t)};\; R(h)\,x^{(h)};\; R(w)\,x^{(w)}\,],$$
where $R(\cdot)$ is the block-diagonal rotation matrix formed from the pairwise rotations above, evaluated at the given axis position (with frequencies defined over the $d/3$ channels of each chunk).
Finally, the head vector is reconstructed by concatenating the three rotated chunks.
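To make the decomposition concrete, here is a minimal NumPy sketch, assuming the standard interleaved channel-pair layout and the conventional $10000^{-2i/k}$ frequency schedule per axis; the function names (`rope_rotate`, `rope_3d`) are illustrative and not taken from the released implementation.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate channel pairs (x[2i], x[2i+1]) of a vector x by angle pos * theta_i."""
    k = x.shape[-1]                                # must be even
    theta = base ** (-np.arange(0, k, 2) / k)      # theta_i = base^(-2i/k), shape [k/2]
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    even, odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = even * cos - odd * sin        # planar (2D) rotation per pair
    out[..., 1::2] = even * sin + odd * cos
    return out

def rope_3d(x, t, h, w):
    """Split a head vector into (t, h, w) chunks and rotate each by its axis position."""
    d = x.shape[-1]
    assert d % 6 == 0, "per-head dim must be divisible by 6"
    x_t, x_h, x_w = np.split(x, 3, axis=-1)        # each chunk has d/3 (even) channels
    return np.concatenate(
        [rope_rotate(x_t, t), rope_rotate(x_h, h), rope_rotate(x_w, w)], axis=-1
    )
```

For a 96-dimensional head, `rope_3d(q, t=4, h=2, w=7)` rotates each 32-dimensional chunk of `q` by its own axis position, so query–key inner products depend only on the offsets $(\Delta t, \Delta h, \Delta w)$.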
3. Integration with Multi-Head Self-Attention and MoE
Query and key projections for a batch of samples and tokens are assigned 3D positions according to modality. The 3D RoPE transformation is applied independently to each token's Q and K vectors within every attention head:
$$\tilde{q}_i = \mathrm{RoPE}_{3\mathrm{D}}(q_i;\, t_i, h_i, w_i), \qquad \tilde{k}_j = \mathrm{RoPE}_{3\mathrm{D}}(k_j;\, t_j, h_j, w_j).$$
Attention is computed as standard:
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{\tilde{Q}\,\tilde{K}^{\top}}{\sqrt{d}}\right)V.$$
The routing for Mixture-of-Experts (MoE) experts, including shared, routed, and null experts, is performed after attention and does not involve the 3D RoPE transformation. The application of 3D RoPE affects only the query and key projections entering the attention computation.
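The sketch below extends the previous example to a single attention head, reusing `rope_3d`; the scaled dot-product attention shown is the generic formulation, and the post-attention MoE routing is intentionally omitted.

```python
import numpy as np  # rope_3d is assumed to be defined as in the previous sketch

def attention_with_rope3d(Q, K, V, positions):
    """Q, K, V: [n, d] single-head projections; positions: one (t, h, w) triple per token.
    3D RoPE is applied to queries and keys only; values are left unrotated."""
    d = Q.shape[-1]
    Qr = np.stack([rope_3d(Q[i], *positions[i]) for i in range(len(positions))])
    Kr = np.stack([rope_3d(K[i], *positions[i]) for i in range(len(positions))])
    scores = (Qr @ Kr.T) / np.sqrt(d)              # relative (t, h, w) offsets enter here
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```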
4. Assignment of 3D Positional Indices and Efficient Implementation
Assignment of $(t, h, w)$ indices is determined by modality (a sketch follows the list):
- Text: the token index is the only positional signal, so 3D RoPE degenerates to standard 1D rotary encoding over the sequence.
- Images: patches in row-major order; $t = 0$, $h$ = patch row index, $w$ = patch column index.
- Video: extract frames at a fixed sampling interval; within each frame, patch indices as in images. Assign $t = f \cdot \alpha$ (frame index times an angular scaling), $h$ = patch row index, $w$ = patch column index, where $\alpha$ aligns temporal and spatial angle units across modalities.
- Audio: chunk audio into fixed-duration segments; for a chunk starting at $\tau$ seconds, assign $t$ from $\tau$ using the same angular scaling, with the spatial axes unused, ensuring that only the temporal axis is used for alignment.
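The following hypothetical helper illustrates these assignments in code; the treatment of the unused axes for text and audio (zeroed here) and the angular scaling parameter `alpha` are assumptions, since the source does not fix these conventions explicitly.

```python
def assign_positions(modality, n_tokens=0, grid_hw=None, n_frames=0,
                     chunk_times=None, alpha=1.0):
    """Return one (t, h, w) triple per token for a single-modality sample.
    alpha: assumed angular scaling aligning temporal and spatial angle units."""
    if modality == "text":
        return [(i, 0, 0) for i in range(n_tokens)]           # assumption: time axis carries order
    if modality == "image":
        rows, cols = grid_hw
        return [(0, r, c) for r in range(rows) for c in range(cols)]   # row-major patch order
    if modality == "video":
        rows, cols = grid_hw
        return [(f * alpha, r, c)                             # frame index scaled into angle units
                for f in range(n_frames)
                for r in range(rows) for c in range(cols)]
    if modality == "audio":
        return [(tau * alpha, 0, 0) for tau in chunk_times]   # chunk start times, temporal axis only
    raise ValueError(f"unknown modality: {modality}")
```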
Efficient batch computation is realized by precomputing sinusoid tables $\cos(t\,\theta_i)$ and $\sin(t\,\theta_i)$ for the temporal axis, along with analogous tables for $h$ and $w$. Indexing into these tables by $(t, h, w)$ retrieves the needed trigonometric values, which are then applied via vectorized rotations on paired channels. The overall cost is linear in the number of tokens and the head dimension, and the process can be efficiently fused into a single tensor operation or kernel.
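A sketch of this table-based path is shown below, assuming integer position indices per axis (continuous times would be quantized to table entries or have their angles computed on the fly); table sizes and names are illustrative.

```python
import numpy as np

def build_tables(max_pos, k, base=10000.0):
    """Precompute cos/sin lookup tables of shape [max_pos, k/2] for one axis with k channels."""
    theta = base ** (-np.arange(0, k, 2) / k)
    angles = np.arange(max_pos)[:, None] * theta[None, :]
    return np.cos(angles), np.sin(angles)

def rope_3d_batched(x, t_idx, h_idx, w_idx, tables):
    """x: [n, d] token vectors; t_idx/h_idx/w_idx: int arrays [n]; tables: {'t'|'h'|'w': (cos, sin)}."""
    outs = []
    for chunk, idx, axis in zip(np.split(x, 3, axis=-1), (t_idx, h_idx, w_idx), "thw"):
        cos, sin = tables[axis]
        c, s = cos[idx], sin[idx]                  # gather per-token angle rows, shape [n, d/6]
        even, odd = chunk[..., 0::2], chunk[..., 1::2]
        out = np.empty_like(chunk)
        out[..., 0::2] = even * c - odd * s        # vectorized pairwise rotation
        out[..., 1::2] = even * s + odd * c
        outs.append(out)
    return np.concatenate(outs, axis=-1)           # cost linear in n * d; easily fused
```

The three tables (e.g. `{a: build_tables(4096, d // 3) for a in "thw"}`) depend only on the maximum index range and the per-axis channel count, so they can be built once and shared across layers.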
5. Comparative Evaluation and Theoretical Advantages
Omni-Modality 3D RoPE provides substantial advantages over previously used positional encoding schemes:
| Encoding Type | Positional Axes | Key Limitation |
|---|---|---|
| 1D RoPE | Time/token only | No spatial or temporal coupling |
| 2D RoPE | Height × Width | Lacks temporal/sequence indexing |
| Learned Absolute | Any (learnable) | Lacks rotational/relative properties |
| M-RoPE (Qwen2-VL) | T × H × W (heuristic assignment) | No precise temporal scaling |
| 3D RoPE | T × H × W | — |
Unlike 1D and 2D RoPE, 3D RoPE captures both spatial and temporal structures, thereby supporting precise alignment between modalities. The rotary design maintains relative displacement invariance, granting superior generalization to longer contexts compared to absolute embeddings. Compared to M-RoPE, 3D RoPE introduces patch-wise spatial traversal and exact absolute-time scaling.
Spatio-temporal proximity is thus preserved in attention: for instance, an audio chunk aligned at time $t$ will primarily attend to a video frame sampled near the same instant and to spatially co-located image patches when all are jointly processed.
6. Empirical Performance and Significance
Ablation experiments demonstrate that replacing 3D RoPE with 1D RoPE or learned absolute embeddings causes a drop of approximately 6–8 points on video-centric benchmarks (e.g., Video-MME, OmniVideoBench), which is commensurate with the +7% average improvement over prior omnimodal large models observed on eight video benchmarks. Explicitly modeling all three axes of position is critical for strong spatio-temporal cross-modal performance in Uni-MoE-2.0-Omni (Li et al., 16 Nov 2025).
A plausible implication is that future omnimodal transformer systems requiring joint spatio-temporal and cross-modal reasoning will benefit from adopting the Omni-Modality 3D RoPE scheme, as it enables accurate alignment and information fusion at the token level across text, images, audio, and video.