Omni-Modality 3D RoPE

Updated 18 November 2025
  • Omni-Modality 3D RoPE is a unified positional encoding scheme that integrates time, height, and width to support effective cross-modal reasoning.
  • It applies independent sine-cosine planar rotations on token subcomponents to preserve relative positional invariance across text, image, audio, and video data.
  • Empirical results show a +7% improvement over previous models on video benchmarks, underscoring its impact on spatio-temporal data alignment.

Omni-Modality 3D Rotary Positional Encoding (3D RoPE) is a unified positional encoding scheme introduced in the Uni-MoE-2.0-Omni model to support spatio-temporal cross-modality alignment within large-scale omnimodal transformer architectures. Unlike conventional 1D or 2D positional encodings, 3D RoPE enables a transformer to jointly process text, image, audio, and video data by embedding position along orthogonal axes of time, height, and width directly into the multi-head self-attention mechanism. This method preserves the relative positional properties necessary for effective cross-modal reasoning and long-context understanding while providing a shared alignment space across previously disparate modalities (Li et al., 16 Nov 2025).

1. Motivation for 3D RoPE in Omnimodal Transformers

Traditional 1D rotary or absolute positional encodings index positions along a single sequential (time or token) axis and are therefore suited mainly to text. They fail to capture spatial structure (as in images, which require both height and width) and omit the temporal structure needed for video and audio streams. 2D RoPE extends coverage to two spatial dimensions but still lacks true temporal encoding, which leads to misalignment between temporal progression (such as video frames or audio chunks) and spatial locations (such as image or video patches).

The objectives of 3D RoPE are:

  • To provide a modality-agnostic positional embedding applicable to text, images, audio, and video, parameterized by three orthogonal axes: Time ($T$), Height ($H$), Width ($W$).
  • To preserve the relative-position property of rotary embeddings, whereby the inner product between rotated queries and keys depends only on their relative positional offset.
  • To enable spatio-temporal cross-modal alignment, ensuring, for example, that a video patch at $t = 10\,\mathrm{s}$ aligns meaningfully with the co-occurring audio chunk covering $10\text{--}13\,\mathrm{s}$.

2. Mathematical Formulation

Let $d$ be the per-head dimensionality and assume $d$ is divisible by $6$. Each attention head's input vector is decomposed into three equal-size chunks, corresponding to $T$, $H$, and $W$, and each chunk is further split into paired subcomponents. Standard 2-dimensional rotary transformations (pairwise sine/cosine rotations) are applied independently to each chunk.

For 1D RoPE (single axis), with position $p$ and channel pair $(q_{2i}, q_{2i+1})$:

$$
\theta^{(p)}_i = \frac{p}{10000^{2i/d}}, \qquad
\begin{pmatrix} \hat q_{2i} \\ \hat q_{2i+1} \end{pmatrix} =
\begin{pmatrix} \cos\theta^{(p)}_i & -\sin\theta^{(p)}_i \\ \sin\theta^{(p)}_i & \cos\theta^{(p)}_i \end{pmatrix}
\begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}
$$

For 3D RoPE, positions are triples $(p_t, p_h, p_w)$ with per-axis angles

$$
\theta^{(t)}_{i} = \frac{p_t}{10000^{2i/(d/3)}}, \qquad
\theta^{(h)}_{i} = \frac{p_h}{10000^{2i/(d/3)}}, \qquad
\theta^{(w)}_{i} = \frac{p_w}{10000^{2i/(d/3)}}.
$$

The head vector $x \in \mathbb{R}^d$ decomposes as $x = [x^t, x^h, x^w]$, with each chunk in $\mathbb{R}^{d/3}$. The 3D RoPE transformation is applied as a composition of three independent planar rotations,

$$
\hat x^t = R(\theta^{(t)})\,x^t, \qquad
\hat x^h = R(\theta^{(h)})\,x^h, \qquad
\hat x^w = R(\theta^{(w)})\,x^w,
$$

where each rotation acts pairwise on consecutive channels:

$$
R(\theta)\bigl([a, b]\bigr) = [\,a\cos\theta - b\sin\theta,\; a\sin\theta + b\cos\theta\,].
$$

Finally, the head vector is reconstructed as

$$
\mathrm{RoPE}_{3D}(x;\,p_t, p_h, p_w) = [\,\hat x^t,\,\hat x^h,\,\hat x^w\,].
$$
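
A minimal NumPy sketch may help make the chunking and pairwise rotations concrete. The helper names (`rotate_pairs`, `rope_3d`) are illustrative rather than taken from the paper; the base of 10000 follows the formulas above.

```python
import numpy as np

def rotate_pairs(x, theta):
    """Apply 2x2 planar rotations to consecutive channel pairs of x.

    x:     (..., m) with m even; pairs are (x[..., 2i], x[..., 2i+1])
    theta: (..., m // 2) rotation angles, one per channel pair
    """
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def rope_3d(x, p_t, p_h, p_w, base=10000.0):
    """Rotate a per-head vector x (length d, divisible by 6) by its 3D position.

    x is split into equal time/height/width chunks of size d / 3, and each
    chunk gets an independent 1D RoPE rotation driven by its positional index.
    """
    d = x.shape[-1]
    m = d // 3                                    # channels per axis
    inv_freq = base ** (-np.arange(0, m, 2) / m)  # theta_i = p / base^(2i/(d/3))
    chunks = np.split(x, 3, axis=-1)
    return np.concatenate(
        [rotate_pairs(c, p * inv_freq) for c, p in zip(chunks, (p_t, p_h, p_w))],
        axis=-1,
    )

# Example: a 96-dim head vector for a video patch at frame 4, row 2, column 7.
x = np.random.randn(96)
x_rot = rope_3d(x, p_t=4, p_h=2, p_w=7)
```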

3. Integration with Multi-Head Self-Attention and MoE

Query and key projections $Q, K \in \mathbb{R}^{B \times N \times d}$ for a batch of $B$ samples and $N$ tokens are assigned 3D positions $(p_t^n, p_h^n, p_w^n)$ according to modality. The 3D RoPE transformation is applied independently to each token's query and key vectors within every attention head:

$$
\hat Q_{b,n} = \mathrm{RoPE}_{3D}(Q_{b,n};\,p^n_t, p^n_h, p^n_w), \qquad
\hat K_{b,n} = \mathrm{RoPE}_{3D}(K_{b,n};\,p^n_t, p^n_h, p^n_w).
$$

Attention is then computed as standard:

$$
A_{b,h} = \mathrm{softmax}\!\Bigl(\tfrac{1}{\sqrt{d_k}}\,\hat Q_{b,h} \hat K_{b,h}^{\top}\Bigr), \qquad
\mathrm{Out} = A V.
$$

The routing to Mixture-of-Experts (MoE) experts, including shared, routed, and null experts, is performed after attention and does not involve the 3D RoPE transformation; 3D RoPE affects only the query and key projections entering the attention computation.
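
A single-head sketch of this step, reusing the illustrative `rope_3d` helper above (batching, multi-head reshaping, and MoE routing are omitted), could look as follows.

```python
def attention_with_rope_3d(Q, K, V, positions):
    """Single-head attention with 3D RoPE applied to queries and keys.

    Q, K, V:   (N, d) per-head token projections
    positions: length-N list of (p_t, p_h, p_w) triples, one per token
    """
    d = Q.shape[-1]
    Q_hat = np.stack([rope_3d(q, *p) for q, p in zip(Q, positions)])
    K_hat = np.stack([rope_3d(k, *p) for k, p in zip(K, positions)])
    scores = Q_hat @ K_hat.T / np.sqrt(d)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax
    return A @ V                                  # MoE routing happens downstream
```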

4. Assignment of 3D Positional Indices and Efficient Implementation

Assignment of $(p_t, p_h, p_w)$ indices is determined by modality (a small sketch follows the list below):

  • Text: $(p_t = \mathtt{token\_idx},\ p_h = 0,\ p_w = 0)$.
  • Images: patches in row-major order; $(p_t = 0,\ p_h = \mathtt{row\_idx},\ p_w = \mathtt{col\_idx})$.
  • Video: frames are extracted every $\Delta$ seconds; within each frame, patch indices follow the image scheme. Assign $p_t = f \cdot \theta$ (frame index times an angular scaling), $p_h = \mathtt{row}$, $p_w = \mathtt{col}$, where $\theta$ aligns temporal and spatial angle units across modalities.
  • Audio: audio is chunked into fixed-duration segments; for the chunk starting at $t$ seconds, $(p_t = \frac{t}{\Delta}\theta,\ p_h = \frac{t}{\Delta}\theta,\ p_w = \frac{t}{\Delta}\theta)$, so that alignment is driven purely by temporal position.
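
The per-modality assignment can be sketched as small helper functions; the function names, and the convention that frame and chunk indices start at zero, are assumptions for illustration, with $\Delta$ and $\theta$ following the notation in this section.

```python
def positions_for_text(num_tokens):
    """Text: only the time axis advances; height and width stay at zero."""
    return [(i, 0, 0) for i in range(num_tokens)]

def positions_for_image(rows, cols):
    """Images: row-major patch grid on the spatial axes, time fixed at zero."""
    return [(0, r, c) for r in range(rows) for c in range(cols)]

def positions_for_video(num_frames, rows, cols, theta=1.0):
    """Video: frame index scaled by theta on the time axis, patch grid as for images."""
    return [(f * theta, r, c)
            for f in range(num_frames)
            for r in range(rows)
            for c in range(cols)]

def positions_for_audio(num_chunks, delta, theta=1.0):
    """Audio: chunk c starts at t = c * delta seconds; every axis carries (t / delta) * theta."""
    positions = []
    for c in range(num_chunks):
        t = c * delta              # chunk start time in seconds
        p = (t / delta) * theta    # shared positional index
        positions.append((p, p, p))
    return positions
```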

Efficient batch computation is realized by precomputing three sinusoid tables $\sin_t, \cos_t \in \mathbb{R}^{T \times d/3}$, with analogous tables for $H$ and $W$. Indexing into these tables by $(p_t, p_h, p_w)$ retrieves the needed trigonometric values, which are then applied via vectorized $2 \times 2$ rotations on paired channels. The overall computational complexity is $O(N \cdot d)$, and the whole operation can be fused into a single tensor operation or kernel.
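
A minimal sketch of this table-based path is shown below; it assumes integer positional indices, uses illustrative function names, and stores one table entry per channel pair rather than per channel.

```python
import numpy as np

def build_sincos_table(max_pos, m, base=10000.0):
    """Precompute sin/cos lookup tables for one axis.

    m is the per-axis chunk size (d / 3); the tables have shape
    (max_pos, m // 2), one entry per channel pair.
    """
    inv_freq = base ** (-np.arange(0, m, 2) / m)
    angles = np.arange(max_pos)[:, None] * inv_freq[None, :]
    return np.sin(angles), np.cos(angles)

def rope_3d_batched(X, pt, ph, pw, tables):
    """Apply 3D RoPE to X of shape (N, d) using precomputed per-axis tables.

    pt, ph, pw are integer index arrays of shape (N,); `tables` maps
    "t" / "h" / "w" to a (sin, cos) pair. The cost is O(N * d): one gather
    per axis plus vectorized 2x2 rotations on paired channels.
    """
    outs = []
    for chunk, idx, axis in zip(np.split(X, 3, axis=-1), (pt, ph, pw), ("t", "h", "w")):
        sin, cos = tables[axis]
        s, c = sin[idx], cos[idx]                 # (N, m // 2) lookups
        even, odd = chunk[..., 0::2], chunk[..., 1::2]
        out = np.empty_like(chunk)
        out[..., 0::2] = even * c - odd * s
        out[..., 1::2] = even * s + odd * c
        outs.append(out)
    return np.concatenate(outs, axis=-1)
```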

5. Comparative Evaluation and Theoretical Advantages

Omni-Modality 3D RoPE provides substantial advantages over previously used positional encoding schemes:

Encoding Type        Positional Axes              Key Limitation
1D RoPE              Time/token only              No spatial or temporal coupling
2D RoPE              Height × Width               Lacks temporal/sequence indexing
Learned Absolute     Any (learnable)              Lacks rotational/relative properties
M-RoPE (Qwen2-VL)    Space w/ heuristic time      No precise temporal scaling
3D RoPE              T × H × W

Unlike 1D and 2D RoPE, 3D RoPE captures both spatial and temporal structures, thereby supporting precise alignment between modalities. The rotary design maintains relative displacement invariance, granting superior generalization to longer contexts compared to absolute embeddings. Compared to M-RoPE, 3D RoPE introduces patch-wise spatial traversal and exact absolute-time scaling.

Spatio-temporal proximity is thus preserved in attention: for instance, an audio chunk aligned at $t = 12\,\mathrm{s}$ will primarily attend to a video frame sampled near the same instant and to spatially co-located image patches when all are jointly processed.

6. Empirical Performance and Significance

Ablation experiments demonstrate that replacing 3D RoPE with 1D RoPE or learned absolute embeddings causes a drop of approximately 6–8 points on video-centric benchmarks (e.g., Video-MME, OmniVideoBench), which is commensurate with the +7% average improvement over prior omnimodal large models observed on eight video benchmarks. Explicitly modeling all three axes of position is critical for strong spatio-temporal cross-modal performance in Uni-MoE-2.0-Omni (Li et al., 16 Nov 2025).

A plausible implication is that future omnimodal transformer systems requiring joint spatio-temporal and cross-modal reasoning will benefit from adopting the Omni-Modality 3D RoPE scheme, as it enables accurate alignment and information fusion at the token level across text, images, audio, and video.
