Unified Camera-Frame RoPE

Updated 24 March 2026

The paper introduces a unified camera-frame RoPE that embeds camera pose and stereo geometry into attention mechanisms, significantly improving convergence and geometric fidelity.
It decomposes attention into intra-view and epipolar row components to enforce stereo consistency and accurate viewpoint alignment across time and space.
Empirical results demonstrate reduced computational cost and enhanced performance in stereo video synthesis, VR rendering, and embodied perception tasks.

A unified camera-frame Rotary Positional Encoding (RoPE) is an architectural technique for embedding camera pose and stereo geometry into attention mechanisms, enabling generative and discriminative models to capture temporally consistent, stereo-coherent, and viewpoint-aware correspondences. This approach extends standard RoPE from language and monocular vision models into the camera-centric coordinate frame, which is crucial for joint attention across space, time, and camera viewpoint in tasks such as stereo video synthesis, binocular rendering, and multi-view perceptual geometry. By coupling this encoding with epipolar-structured attention decomposition, unified camera-frame RoPE provides both theoretical and empirical efficiency and quality gains over prior methods that lack camera or geometric grounding.

1. Architectural Overview and Mathematical Formulation

Unified camera-frame RoPE, as introduced within the StereoWorld model, operates on stacked latent representations $f^{in} \in \mathbb{R}^{B \times 2F \times H \times W \times C}$, where $B$ is the batch size, $F$ is the number of frames, $H \times W$ are the spatial dimensions, the factor 2 indexes the stereo view (left/right), and $C$ is the channel count (Sun et al., 18 Mar 2026). Each token’s positional index is mapped to a rotary embedding that incorporates both its spatiotemporal index and its camera pose, typically derived relative to a canonical world or rig-centric reference.

The encoding is applied at the attention-projection stage, folding information such as absolute or relative camera matrix parameters, subpixel stereo shifts, and any global camera motion into the learned angular phase of each token. The rotary mechanism multiplies the query and key vectors by fixpoint-computed sine and cosine factors corresponding to the specified position and camera pose:

$(Q, K) \rightarrow (\tilde Q, \tilde K) ; \quad \tilde Q = Q \cdot \text{RoPE}_{\text{cam}}(pos, cam), \quad \tilde K = K \cdot \text{RoPE}_{\text{cam}}(pos, cam)$

where $\text{RoPE}_{\text{cam}}$ denotes the camera-aware rotary embedding function. In StereoWorld this is instantiated using a "copy-init" from a pretrained video RoPE, augmented with camera pose inputs [(Sun et al., 18 Mar 2026): Tab. 5], allowing for stable initialization and consistent learning of new camera-coherent signals.
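The rotation described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the StereoWorld implementation: the source does not specify how camera pose is parameterized, so the sketch treats it as a single scalar coordinate (for example 0 for the left view and 1 for the right), splits the channels into four equal chunks for the (time, row, column, camera) axes, and uses the conventional base of 10000. The names `rope_rotate` and `rope_cam` are hypothetical.

```python
import numpy as np

def rope_rotate(x, coord, base=10000.0):
    """Rotate consecutive channel pairs of x by the angles coord * freq_k."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)       # one frequency per channel pair
    ang = np.asarray(coord)[..., None] * freqs      # (..., d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_cam(x, pos, cam):
    """Camera-aware rotary embedding (illustrative assumption).

    x:   (..., C) query or key vectors, C divisible by 8
    pos: (..., 3) spatiotemporal index (t, y, x)
    cam: (...)    scalar camera coordinate, a stand-in for the pose input
    """
    pos = np.asarray(pos, dtype=float)
    cam = np.asarray(cam, dtype=float)
    coords = np.concatenate([pos, cam[..., None]], axis=-1)   # (..., 4)
    chunks = np.split(x, 4, axis=-1)                          # one chunk per axis
    rotated = [rope_rotate(c, coords[..., i]) for i, c in enumerate(chunks)]
    return np.concatenate(rotated, axis=-1)

# The attention logit depends only on relative offsets in (t, y, x, cam).
rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)
pos_q, pos_k = np.array([3.0, 5.0, 7.0]), np.array([1.0, 5.0, 2.0])
cam_q, cam_k = 0.0, 1.0                                       # left view, right view

score = rope_cam(q, pos_q, cam_q) @ rope_cam(k, pos_k, cam_k)
shift = np.array([2.0, 4.0, 6.0])
shifted = rope_cam(q, pos_q + shift, cam_q + 0.5) @ rope_cam(k, pos_k + shift, cam_k + 0.5)
cam_only = rope_cam(q, pos_q, cam_q) @ rope_cam(k, pos_k, cam_k + 0.5)
print(np.isclose(score, shifted), np.isclose(score, cam_only))
```

Shifting the query and key by the same offset in every axis leaves the logit unchanged, while changing only the relative camera coordinate alters it. That is the sense in which the attention score becomes sensitive to relative viewpoint as well as relative position.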

2. Stereo-Aware Attention Decomposition

A unified camera-frame RoPE enables explicit separation of intra-view and cross-view dependencies through a two-branch attention decomposition:

3D intra-view attention: Standard multi-head self-attention operates independently on left and right streams, capturing temporal and spatial context within each camera.
Row (epipolar) attention: For each scanline (constant $y$) at each time step, features from both the left and right views are gathered, and attention operates exclusively within each resulting group of $2W$ tokens (where $W$ is the width).

This split enforces the epipolar constraint: feature matching is restricted along scanlines, reflecting the underlying binocular geometry, while still enabling each head to access positional information about camera pose via the RoPE. The two branch outputs are added back to the input residually:

$f^{out} = f^{in} + \text{Attn}_{3D}(f^{in}) + \text{Attn}_{\text{row}}(f^{in})$

This method leverages the camera-frame RoPE to maintain relative pose awareness, aligning stereo rows across time and viewpoint, and supporting geometric consistency without requiring explicit external depth computation (Sun et al., 18 Mar 2026).
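The token grouping behind the two branches can be made concrete with a small NumPy sketch. It is a simplified illustration rather than the published architecture: it uses a single head with identity projections, omits the rotary embedding, and assumes that the $2F$ axis stores the $F$ left-view frames followed by the $F$ right-view frames. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(tokens):
    """Single-head self-attention over the second-to-last axis (identity projections)."""
    scale = np.sqrt(tokens.shape[-1])
    scores = tokens @ np.swapaxes(tokens, -1, -2) / scale
    return softmax(scores) @ tokens

def intra_view_attention(f_in):
    """Each view attends over its own F*H*W tokens."""
    B, F2, H, W, C = f_in.shape
    F = F2 // 2
    per_view = f_in.reshape(B, 2, F * H * W, C)        # assumed layout: left frames, then right
    return attend(per_view).reshape(f_in.shape)

def row_attention(f_in):
    """For each (frame, row), attend over the 2W tokens of both views."""
    B, F2, H, W, C = f_in.shape
    F = F2 // 2
    views = f_in.reshape(B, 2, F, H, W, C)
    rows = views.transpose(0, 2, 3, 1, 4, 5).reshape(B, F, H, 2 * W, C)
    out = attend(rows).reshape(B, F, H, 2, W, C)
    return out.transpose(0, 3, 1, 2, 4, 5).reshape(f_in.shape)

def stereo_block(f_in):
    """f_out = f_in + Attn_3D(f_in) + Attn_row(f_in)."""
    return f_in + intra_view_attention(f_in) + row_attention(f_in)

rng = np.random.default_rng(0)
f_in = rng.normal(size=(1, 4, 3, 5, 8))                # B=1, 2F=4, H=3, W=5, C=8
f_out = stereo_block(f_in)
print(f_out.shape)
```

In this layout the row branch never mixes tokens from different scanlines or frames, and the intra-view branch never mixes left and right tokens, so cross-view information flows only along rows.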

3. Advantages Over Non-Camera-Conditioned Positional Encoding

Standard positional encoding, whether 1D or 2D, does not reflect camera motion, relative rig pose, or stereo displacement, which leads to scene-dependent inconsistencies in synthetic video or stereo generation. The camera-frame RoPE directly encodes these extrinsic parameters, making the attention mechanism aware of viewpoint transitions, absolute and relative camera shifts, and the cyclical/phase nature of temporal and spatial context.

The empirical advantage is demonstrated in both faster convergence and improved generalization, as evidenced by ablation studies: with copy-initialized, camera-aware RoPE, StereoWorld attains superior camera-motion fidelity (rotation error $1.16^\circ$ vs $1.81^\circ$ for zero-init), lower FID (122.41 vs 131.07), and improved view-consistency at reduced computational cost [(Sun et al., 18 Mar 2026): Tab. 5, 6]. These gains are consistently observed across stereo synthesis, VR rendering, and embodied perception.

4. Computational Complexity and Efficiency

Unified camera-frame RoPE, in conjunction with stereo attention decomposition, yields a substantial reduction in attention computation. Full joint 4D attention over all time, space, and view tokens incurs an $O((2FHW)^2 d)$ cost per head. The decomposition splits this into an intra-view term $O(2(FHW)^2 d)$ and an epipolar row term $O(FH(2W)^2 d)$, reducing overall FLOPs by approximately half for typical scene sizes (e.g., $3.11 \times 10^{10} \rightarrow 1.56 \times 10^{10}$ for $F=13, H=15, W=20, d=128$) and increasing throughput by $\sim45\%$ without degrading perceptual or temporal coherence [(Sun et al., 18 Mar 2026): Tab. 6].

These savings are only achievable because camera-frame RoPE keeps positional context consistent under spatial, temporal, and viewpoint transformations, avoiding the need for repeated re-projection or expensive global attention over non-matching keys.
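The quoted FLOP figures can be checked with a short calculation. The sketch below is a back-of-the-envelope count, not the paper's profiling method: it assumes a constant factor of 4 per head (two matrix products, $QK^\top$ and the attention-weighted sum over values, at 2 FLOPs per multiply-add), which is an assumption chosen because it reproduces the quoted magnitudes. The helper name `attention_flops` is hypothetical.

```python
def attention_flops(num_groups, tokens_per_group, d, const=4):
    """Approximate FLOPs for self-attention run independently in each group.

    const=4 is an assumption: two matmuls, each counted at 2 FLOPs per multiply-add.
    """
    return const * num_groups * tokens_per_group ** 2 * d

F, H, W, d = 13, 15, 20, 128

full = attention_flops(1, 2 * F * H * W, d)      # joint attention over all 2FHW tokens
intra = attention_flops(2, F * H * W, d)         # two views, FHW tokens each
row = attention_flops(F * H, 2 * W, d)           # FH scanline groups, 2W tokens each
decomposed = intra + row

print(f"full:       {full:.3e}")
print(f"intra-view: {intra:.3e}")
print(f"row:        {row:.3e}")
print(f"ratio:      {decomposed / full:.3f}")
```

Under these assumptions the full count is about $3.1 \times 10^{10}$ and the intra-view term about $1.56 \times 10^{10}$, in line with the quoted figures. The row term adds only about $1.6 \times 10^{8}$, so the decomposed total stays close to half of the full cost.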

5. Empirical Outcomes and Application Scope

Unified camera-frame RoPE, as deployed in StereoWorld, enhances multiple performance axes:

Viewpoint consistency: +5% absolute gain in CLIP-V metrics vs. monocular-then-convert approaches.
Generation speed: $\sim3\times$ improvement over non-decomposed architectures at comparable or lower computational load.
Disparity fidelity: Directly learns disparity-aligned geometry from RGB without auxiliary depth supervision, yielding artifact-free disparity maps compared to RGB-D world models.
Camera-motion fidelity: Reduces error in camera trajectory estimation (RotErr, TransErr) in generation benchmarks.

These advances enable novel capabilities: real-time binocular VR rendering from monocular input, geometry-coherent stereo video synthesis, direct policy learning for embodied agents with metric-scale spatial alignment, and compatibility with high-resolution, long-horizon video synthesis (Sun et al., 18 Mar 2026).

6. Position within the Broader Stereo and Attention-Modulated Vision Literature

Unified camera-frame RoPE complements a broader class of stereo-aware attention designs that leverage epipolar or geometric priors, including the mutual epipolar attention in H-Net (Huang et al., 2021), stereo cross-attention pipelined row-wise in ECSIC (Wödlinger et al., 2023), and bi-directional epipolar cross-attention in GREAT (Li et al., 19 Sep 2025). Unlike these approaches, which primarily operate within discriminative frameworks or cost-volume structures, the camera-frame RoPE generalizes explicit pose conditioning into generative token-based attention, supporting seamless integration with autoregressive and video foundation models.

A plausible implication is that unified camera-frame RoPE may further enable cross-modal or multi-agent scene synthesis where viewpoint and temporal signal must remain fully consistent across distributed observation or action sequences.

7. Limitations and Open Directions

Current unified camera-frame RoPE designs assume rectified stereo input and accurate pose calibration; unmodeled deviations or non-epipolar noise may limit efficacy. Furthermore, while RoPE is inherently equivariant to group operations (translation, rotation), real-world camera rigs may introduce non-linear optical distortions or varying photometric conditions not fully captured by sinusoidal encoding. Future research may refine the embedding to incorporate higher-order camera parameters, extend to arbitrary multi-view settings, or jointly optimize RoPE with downstream geometric consistency objectives.
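The dependence on rectification can be illustrated with a pinhole-camera example. This is a generic geometry sketch, not taken from the source: the intrinsics `K`, the 0.1 m `baseline`, the test point, and the 1 degree roll are all made-up illustrative values. In a rectified pair, a 3D point projects to the same image row in both views, so row attention sees the true correspondence; a small uncorrected roll of one camera moves the match off that row.

```python
import numpy as np

def project(point, K, R, t):
    """Pinhole projection of a 3D point given rotation R and translation t."""
    cam = R @ point + t
    uv = K @ cam
    return uv[:2] / uv[2]

def roll(deg):
    """Rotation about the optical (z) axis."""
    a = np.deg2rad(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Illustrative values only (assumptions, not from the source).
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
baseline = 0.1
point = np.array([0.3, -0.2, 2.0])
t_right = np.array([-baseline, 0.0, 0.0])

left = project(point, K, np.eye(3), np.zeros(3))
right = project(point, K, np.eye(3), t_right)       # rectified right camera
right_rolled = project(point, K, roll(1.0), t_right)  # right camera with 1 degree roll

disparity = left[0] - right[0]                      # equals focal * baseline / depth
row_offset_rectified = abs(left[1] - right[1])
row_offset_rolled = abs(left[1] - right_rolled[1])
print(disparity, row_offset_rectified, row_offset_rolled)
```

With these values the rectified pair shares a row exactly and differs only by a horizontal disparity, while the rolled camera shifts the corresponding point by more than one pixel vertically, which is enough to place it on a neighbouring scanline.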

Additionally, while empirical evidence shows near parity in stereo and temporal alignment with full 4D attention at half the compute, small discrepancies ($\sim1\%$ in CLIP-T/V) may warrant further architectural or loss-modulated adjustments in highly sensitive downstream applications (Sun et al., 18 Mar 2026).

In summary, unified camera-frame RoPE provides a scalable, geometry-grounded positional encoding framework that, when embedded in epipolar-structured attention decomposition, significantly advances stereo video modeling, viewpoint-aware generation, and geometry-coherent perception. Its core advantages derive from principled coupling of viewpoint geometry and attention, yielding both theoretical clarity and practical efficiency (Sun et al., 18 Mar 2026).