- The paper proposes a two-stage global-to-local pipeline: a preview stage for coarse, trajectory-conditioned panoramic generation, followed by a refine stage that performs high-fidelity synthesis through segment-wise conditional diffusion.
- The methodology leverages decomposed trajectory conditioning signals ("flow" and "scale") and a panoramic coordinate system that decouples translation from rotation, ensuring robust spatial consistency over long horizons.
- Quantitative evaluations show significant improvements over baseline models in loop consistency and trajectory controllability, enabling practical applications in AR/VR and 3D scene reconstruction.
OmniRoam: Long-Horizon Panoramic Video Generation for Controllable Scene Wandering
Introduction and Motivation
OmniRoam addresses the task of camera-controllable, long-horizon panoramic video generation. Existing generative models for scene synthesis predominantly operate on perspective videos, which restricts the field of view and, through cumulative errors and limited spatial memory, undermines global, structure-consistent scene exploration. Panoramic video generation, which preserves holistic geometric context, has only recently been explored; prior art often relies on inadequate architectural adaptation and fails to exploit the inherent spatial advantages of panoramic representations. The main objective of OmniRoam is to synthesize immersive and coherent panoramic sequences that enable extended, arbitrary camera trajectories—effectively supporting world wandering—with explicit and scalable control over the camera path.
Method: Global-to-Local Generation via Preview and Refine
OmniRoam introduces a two-stage global-to-local generation pipeline that distinctly separates coarse global structure modeling from subsequent high-fidelity refinement (Figure 1). In the Preview Stage, an acceleration-controlled diffusion backbone generates a mid-resolution trajectory-conditioned panoramic sequence given an input image or video and a user-specified camera trajectory. Trajectory conditioning is decomposed into two orthogonal signals: "flow" (the framewise displacement-direction sequence) and "scale" (a global speed magnitude), enabling fine-grained controllability and efficient global coverage. Conditioning frames are concatenated along the frame dimension, which preserves temporal continuity and keeps the conditioning semantics unambiguous.
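The flow/scale decomposition described above can be sketched as follows. This is a minimal illustration under the assumption that "flow" is a sequence of unit displacement directions and "scale" is a single averaged speed; the function name and exact normalization are hypothetical, not taken from the paper.

```python
import numpy as np

def decompose_trajectory(positions):
    """Split a camera trajectory into per-frame unit directions ('flow')
    and one global speed magnitude ('scale').

    positions: (T, 3) array of camera centers in the panoramic frame.
    Returns (flow, scale): flow is (T-1, 3) unit vectors, scale a scalar.
    Hypothetical sketch; the paper's exact parameterization may differ.
    """
    disp = np.diff(positions, axis=0)                      # frame-to-frame displacement
    norms = np.linalg.norm(disp, axis=1, keepdims=True)    # per-step speeds
    flow = np.where(norms > 1e-8, disp / np.maximum(norms, 1e-8), 0.0)
    scale = float(norms.mean())                            # global speed magnitude
    return flow, scale
```

Separating direction from magnitude means the same path shape can be replayed at different speeds by changing only the scalar conditioning signal.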
(Figure 1)
Figure 1: The OmniRoam pipeline integrates a preview stage for fast layout traversal and a refinement stage for temporally upsampled, high-quality video synthesis.
The Refine Stage temporally expands and spatially upscales the preview output through segment-wise conditional diffusion. Unlike conventional direct generation—which hits memory and compute bottlenecks on long sequences—this design leverages scale alignment and visibility masking to synchronize each refined segment with the relevant preview anchors, ensuring global consistency and minimizing local temporal drift. Both stages are implemented as fine-tuned Diffusion Transformers and benefit from the hierarchical separation of global and local scene priors.
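One way to picture the segment-wise scheme is as overlapping windows over the preview sequence, where the overlap frames serve as anchors tying each refined segment to its predecessor. The helper below is a hypothetical scheduling sketch; the paper does not specify its exact segmentation scheme or overlap size.

```python
def segment_windows(num_frames, seg_len, overlap):
    """Plan overlapping segment windows over a long preview sequence.

    Each window after the first starts 'overlap' frames before the
    previous one ends, so refined segments share anchor frames.
    Hypothetical helper; actual segment lengths/overlaps are assumptions.
    """
    assert 0 <= overlap < seg_len
    windows, start = [], 0
    while start < num_frames:
        end = min(start + seg_len, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start = end - overlap          # re-enter at the anchor frames
    return windows
```

Refining each window conditioned on its anchor frames (plus the preview content) keeps memory bounded per segment while the anchors propagate global consistency down the sequence.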
Dataset and Canonical Representation
OmniRoam introduces a scalable, hybrid panoramic dataset encompassing both real captured video and synthetic sequences rendered from 3D Gaussian Splatting (3DGS) scenes. The canonical panoramic coordinate system decouples rotation (roll/pitch/yaw) from translation, modeling trajectories as sequences of translational displacements in a rotation-invariant manner. This design both simplifies ground-truth supervision—since rotation in equirectangular projection reduces to cyclic pixel shifts—and ensures that the generator must encode all spatial context in the translation space, closely mirroring real-world scene wandering.
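The claim that rotation in equirectangular projection reduces to cyclic pixel shifts is easy to make concrete for yaw (rotation about the vertical axis), which maps exactly to a horizontal roll of the image columns. The snippet below is a minimal illustration of that property, not code from the paper.

```python
import numpy as np

def yaw_rotate_equirect(pano, yaw_rad):
    """Rotate an equirectangular panorama about the vertical axis.

    In equirectangular projection, longitude maps linearly to image
    columns, so a yaw rotation is exactly a cyclic horizontal shift.
    This is why the canonical coordinate system can factor rotation
    out of the supervision signal.
    """
    h, w = pano.shape[:2]
    shift = int(round(yaw_rad / (2 * np.pi) * w))  # columns per radian of yaw
    return np.roll(pano, shift, axis=1)
```

A full 2π yaw returns the panorama unchanged, and no resampling or interpolation is needed for this rotation component; only translation changes the scene content the generator must synthesize.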
Evaluation Metrics and Benchmarking
Quantitative evaluation incorporates three axes:
- Visual Quality: Measured by FAED, SSIM, and LPIPS in both short (81-frame) and long (641-frame) generations.
- Trajectory Controllability: Assessed using PSNR computed over pre-defined temporal windows, evaluating adherence to complex, out-of-distribution camera paths.
- Long-Term Spatial Consistency: A novel loop consistency metric, which quantifies the ability of the generated video to return to an initial view upon completion of a closed-loop camera path—an explicit assessment of global coherence and memory retention.
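A simple proxy for the loop-consistency idea is to compare the frame generated after completing a closed-loop trajectory against the initial frame. The sketch below uses PSNR for this comparison; note this is only an illustrative stand-in, since the paper's loop-consistency score is evidently on a different scale (reported values like 1.96 and 1.41) and its exact formulation is not reproduced here.

```python
import numpy as np

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio between two images in [0, max_val]."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(max_val ** 2 / mse)

def loop_consistency_proxy(video):
    """Compare the final frame of a closed-loop trajectory to the first.

    Illustrative proxy only: high similarity means the model 'remembered'
    the starting view after wandering away and back.
    video: (T, H, W[, C]) array of frames in [0, 1].
    """
    return psnr(video[0], video[-1])
```

The appeal of a closed-loop probe is that it tests memory retention without ground truth for the intermediate frames: only the start view needs to be known.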
Experimental Results
OmniRoam demonstrates substantial gains over contemporaneous panoramic and perspective baseline models (Matrix-3D, Imagine360), with clear improvements in all evaluation dimensions. For instance, loop consistency (full sequence, 641 frames) reaches 1.96 for OmniRoam, versus 1.41 for Matrix-3D at comparable resolution. Trajectory controllability PSNRs remain high even after 600+ frames, indicating minimal drift and accurate adherence to user specification (Table 1).
(Figure 2)
Figure 2: Qualitative comparisons reveal that OmniRoam preserves semantic and geometric structure over long trajectories, with pronounced fidelity in object boundaries and reduced artifact prevalence compared to prior work.
The ablation study rigorously isolates the contributions of panoramic representation and the global-to-local pipeline. Both perspective-based and naive autoregressive variants exhibit severe performance collapse in long-horizon, high-temporal-resolution settings, with loop consistency scores nearly halved and late-sequence PSNR dropping below operational thresholds.
Extensions: Real-Time Preview and 3D Scene Reconstruction
OmniRoam further advances real-time generation capability by introducing a distilled autoregressive previewer via self-forcing distribution matching. This student network produces 81-frame videos in under 10 seconds—three orders of magnitude faster than diffusion-based generation—while retaining layout and global consistency (Figure 3).
(Figure 3)
Figure 3: The real-time previewer enables interactive trajectory-controlled scene wandering with rapid turnaround times.
Additionally, the framework supports downstream 3D scene generation: Dense, loop-consistent panoramic trajectories are sampled into perspective views, which are directly used as input for 3D Gaussian Splatting reconstruction. This demonstrates that OmniRoam-generated content is not only visually coherent but also geometrically valid across extended paths, unlocking practical use cases in AR/VR and environment modeling.
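Sampling perspective views from a panorama, as in the 3DGS pipeline above, amounts to casting pinhole camera rays and looking up the corresponding longitude/latitude in the equirectangular image. The following is a minimal nearest-neighbour sketch of that projection step; the function name, interpolation scheme, and rotation convention are assumptions, not the paper's implementation.

```python
import numpy as np

def equirect_to_perspective(pano, fov_deg, yaw_deg, pitch_deg, out_hw):
    """Sample a pinhole perspective view from an equirectangular panorama.

    Nearest-neighbour lookup only; a real pipeline would interpolate.
    pano: (H, W[, C]) equirectangular image; out_hw: (height, width).
    """
    H, W = pano.shape[:2]
    oh, ow = out_hw
    f = 0.5 * ow / np.tan(np.radians(fov_deg) / 2)   # focal length in pixels
    # Pixel grid -> camera-space ray directions (z forward, y down).
    x = np.arange(ow) - ow / 2 + 0.5
    y = np.arange(oh) - oh / 2 + 0.5
    xx, yy = np.meshgrid(x, y)
    dirs = np.stack([xx, yy, np.full_like(xx, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Rotate rays into the panorama frame (yaw about vertical, then pitch).
    yr, pr = np.radians(yaw_deg), np.radians(pitch_deg)
    Ry = np.array([[np.cos(yr), 0, np.sin(yr)],
                   [0, 1, 0],
                   [-np.sin(yr), 0, np.cos(yr)]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pr), -np.sin(pr)],
                   [0, np.sin(pr), np.cos(pr)]])
    d = dirs @ (Ry @ Rx).T
    # Ray direction -> longitude/latitude -> equirectangular pixel.
    lon = np.arctan2(d[..., 0], d[..., 2])           # [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1, 1))       # [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return pano[v, u]
```

Rendering many such views along a loop-consistent trajectory yields the multi-view image set that a 3D Gaussian Splatting optimizer consumes, which is where geometric validity of the generated panoramas actually gets tested.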
Implications and Future Directions
OmniRoam operationalizes the representational advantage of panoramic vision within scene-level video generation. The global-to-local factorization not only facilitates scalable, arbitrarily long, and controllable wandering but also provides a foundation for future integration with interactive and editable 3D world models. The explicit decoupling of translation and rotation, coupled with hierarchical conditioning, enables further research in universal world models, editable scene synthesis, and cross-modal alignment (e.g., text-to-panorama).
Given the emergence of holistic trajectory-conditioned and globally consistent generation pipelines demonstrated here, future developments may integrate real-time control, multi-agent exploration of generated scenes, and end-to-end differentiable 3D world learning leveraging synthetic wandering for self-supervision.
Conclusion
OmniRoam presents a principled, scalable framework for long-horizon panoramic video generation, facilitating high-fidelity and explicit camera-controllable world wandering. Through panoramic representation, decomposed camera conditioning, and staged global-to-local generation, OmniRoam achieves state-of-the-art performance in visual quality, trajectory-following, and global spatial coherence. The system further underpins practical extensions to real-time generation and 3D scene reconstruction, substantiating panoramic video synthesis as a core substrate for scene-level modeling and environment generation (2603.30045).