Trajectory-Guided Panoramic Video Diffusion
- Trajectory-guided panoramic video diffusion frameworks integrate trajectory and scene cues, often via dual or unified diffusion pipelines, to produce high-fidelity panoramic videos with robust geometric and temporal consistency.
- They employ U-Net and Transformer-based denoisers with spherical epipolar attention to correct equirectangular distortions and enable controlled camera motion.
- Experimental results show consistent gains in metrics such as PSNR and LPIPS, underscoring the efficacy of trajectory conditioning and specialized training strategies.
Trajectory-guided panoramic video diffusion refers to the class of generative frameworks that synthesize high-fidelity panoramic (often 360°) videos explicitly conditioned on a user-specified camera trajectory. This paradigm goes beyond classical single-image view synthesis by leveraging the spatiotemporal structure of video diffusion models and integrating precise geometric or semantic guidance across the sequence, ensuring geometric correctness, temporal consistency, and controllable camera motion in panoramic settings (Kwak et al., 2023, Ye et al., 31 Oct 2024, Ji et al., 24 Sep 2025). Methods in this domain typically operate in equirectangular projection and introduce distinct architectural and conditioning innovations to address the unique challenges of panoramic and trajectory-aware video generation.
1. Model Architectures and Conditioning Mechanisms
Trajectory-guided panoramic video diffusion models are built on the latent diffusion framework, deploying either U-Net or Transformer-based denoisers that act on compressed video latents (Kwak et al., 2023, Ji et al., 24 Sep 2025, Yin et al., 29 Sep 2025). Central to these architectures is the multifaceted conditioning that injects trajectory and/or scene structural information:
- Dual Model Fusion: Some systems combine a view-conditioned latent diffusion model (e.g., extended Zero-1-to-3 XL) that provides explicit 3D-aware per-frame guidance with a pre-trained video backbone (e.g., ZeroScope) that enforces spatiotemporal and appearance continuity. The noise predictions from both denoisers are linearly combined during sampling, with time-varying weighting favoring video consistency at early diffusion steps and fine geometric detail at late steps (Kwak et al., 2023).
- Plücker Embeddings and Pose Encoders: Approaches such as CamPVG and PanoWorld-X inject camera trajectory information at the pixel or feature level using Plücker coordinates derived from the spherical projection of each pixel in each frame, concatenated across time and space (Ji et al., 24 Sep 2025, Yin et al., 29 Sep 2025). A compact pose encoder is typically trained to project these six-dimensional embeddings into the latent space, where they are either concatenated with the video latents or fused via FiLM or cross-attention; a minimal construction of such an embedding is sketched after this list.
- Spherical Epipolar or Sphere-Aware Attention: Cross-view or cross-temporal consistency is enabled by explicitly constructing attention masks along great-circle epipolar lines in equirectangular space or by deploying sphere-aware transformer blocks that use spherical distances as attention gating, correcting for geometric distortion and adjacency in the 360° domain (Ye et al., 31 Oct 2024, Ji et al., 24 Sep 2025, Yin et al., 29 Sep 2025).
- Mesh-Conditioned or Structure-Conditioned Diffusion: Some frameworks (e.g., Matrix-3D, VideoFrom3D) generate per-frame 3D mesh or edge map renders from a coarse reconstruction, providing these as additional conditions to the diffusion model through concatenation or cross-attention (Yang et al., 11 Aug 2025, Kim et al., 22 Sep 2025).
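As a concrete illustration of the Plücker-style conditioning above, the sketch below builds a per-pixel six-dimensional ray embedding for a single equirectangular frame from its camera pose. The coordinate conventions, pixel-to-angle mapping, and function name are illustrative assumptions rather than the published CamPVG or PanoWorld-X implementation.

```python
import numpy as np

def plucker_embedding(R, t, H, W):
    """Per-pixel 6D Plucker ray embedding (direction, moment) for one
    equirectangular frame of size H x W.

    R: (3, 3) world-from-camera rotation; t: (3,) camera center in world
    coordinates. Conventions here are illustrative assumptions.
    """
    # Pixel centers -> longitude in [-pi, pi), latitude in [-pi/2, pi/2].
    u = (np.arange(W) + 0.5) / W
    v = (np.arange(H) + 0.5) / H
    lon, lat = np.meshgrid((u - 0.5) * 2.0 * np.pi, (0.5 - v) * np.pi)

    # Unit ray directions on the sphere, in the camera frame.
    d_cam = np.stack([np.cos(lat) * np.cos(lon),
                      np.cos(lat) * np.sin(lon),
                      np.sin(lat)], axis=-1)                  # (H, W, 3)

    # Rotate into world coordinates; the moment o x d encodes the ray origin.
    d_world = d_cam @ R.T                                     # (H, W, 3)
    m_world = np.cross(t, d_world)                            # (H, W, 3)

    return np.concatenate([d_world, m_world], axis=-1)        # (H, W, 6)
```

Stacking this array over all frames of a trajectory yields the dense conditioning volume that a pose encoder can compress to the latent resolution.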
2. Mathematical Formulation and Diffusion Process
All trajectory-guided panoramic video diffusion frameworks inherit the noise-injection and denoising processes from DDPMs or related latent SDEs (Kwak et al., 2023, Ye et al., 31 Oct 2024, Pan et al., 21 Jun 2025, Ji et al., 24 Sep 2025, Yin et al., 29 Sep 2025):
- Forward Process: Given a clean latent sequence $x_0 = \{x_0^{(1)}, \dots, x_0^{(N)}\}$ (with $N$ the number of trajectory frames), noise is added via
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$
- Reverse Process: At each diffusion step $t$, the denoiser(s) produce a noise estimate $\hat{\epsilon}_\theta(x_t, t, c)$, which is used to iteratively map
$$x_t \mapsto x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \hat{\epsilon}_\theta(x_t, t, c) \right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),$$
where $\hat{\epsilon}_\theta$ may be a weighted combination of image- and video-conditioned denoisers, or produced by a unified denoiser incorporating trajectory features (Kwak et al., 2023, Ji et al., 24 Sep 2025); a minimal sketch of such a fusion follows this list.
- Loss Functions: Training is generally performed via the denoising score matching loss
$$\mathcal{L} = \mathbb{E}_{x_0, c, \epsilon, t}\left[ \left\| \epsilon - \epsilon_\theta(x_t, t, c) \right\|_2^2 \right],$$
possibly augmented by explicit multi-view geometric consistency losses or flow-based motion losses for increased cross-frame coherence (Ye et al., 31 Oct 2024, Lei et al., 25 Sep 2025).
- Trajectory Embedding: Camera trajectories are parameterized as sequences of SE(3) poses or 6D Plücker embeddings. Spherical linear interpolation (SLERP) and linear translation are often used to sample smooth scanpaths (Kwak et al., 2023).
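To make the weighted combination in the reverse process concrete, the following sketch blends the noise predictions of a view-conditioned and a video-conditioned denoiser with a time-varying weight before a standard DDPM update. The linear weighting schedule, variance choice, and model interfaces are assumptions for illustration, not the exact formulation of Kwak et al. (2023).

```python
import torch

@torch.no_grad()
def fused_ddpm_step(x_t, t, eps_view, eps_video, alphas, alphas_bar, num_steps):
    """One reverse DDPM step fusing two noise estimates.

    eps_view / eps_video: callables returning epsilon predictions from the
    3D-aware view-conditioned and the video-conditioned denoisers (interfaces
    assumed). alphas, alphas_bar: 1D tensors of the noise schedule. Early,
    high-noise steps favour the video prior; late steps favour per-frame
    geometric detail, as described in the text.
    """
    w_video = t / max(num_steps - 1, 1)          # 1 at t = T-1, 0 at t = 0
    eps_hat = w_video * eps_video(x_t, t) + (1.0 - w_video) * eps_view(x_t, t)

    alpha_t, alpha_bar_t = alphas[t], alphas_bar[t]
    mean = (x_t - (1.0 - alpha_t) / torch.sqrt(1.0 - alpha_bar_t) * eps_hat) \
           / torch.sqrt(alpha_t)

    if t == 0:
        return mean
    sigma_t = torch.sqrt(1.0 - alpha_t)          # a simple variance choice
    return mean + sigma_t * torch.randn_like(x_t)
```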
3. Spherical and Panoramic-Specific Mechanisms
Accurate geometric modeling of equirectangular and 360° imagery requires explicit handling of spherical geometry, cyclic boundary conditions, and distortion-aware feature propagation:
- Spherical Epipolar Attention: Systems like DiffPano and CamPVG construct great-circle epipolar lines based on relative camera poses. These curves serve as pathways for cross-view attention and feature matching, enforcing geometric correctness (Ye et al., 31 Oct 2024, Ji et al., 24 Sep 2025).
- Sphere-Aware Attention and Token Reprojection: PanoWorld-X integrates a sphere-aware DiT block that back-projects equirectangular pixel locations into spherical $(\theta, \phi)$ coordinates, computes great-circle distances between points, and modulates transformer attention using binary masks or gating functions based on these distances. This corrects for adjacency distortions and aligns self-attention with actual spatial relationships (Yin et al., 29 Sep 2025); a minimal masking sketch follows this list.
- Offset-Shifting and Rotating Windows: To achieve seamless, arbitrarily large panoramic generation with efficient VRAM usage, DynamicScaler employs a windowed denoising strategy in which a fixed-size spatial patch is swept, with boundary noise carefully managed and cyclicity enforced at the left/right edges (Liu et al., 15 Dec 2024).
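The sphere-aware attention gating described above can be sketched by lifting equirectangular token positions onto the unit sphere, computing pairwise great-circle distances, and masking out tokens beyond an angular radius. Grid size, radius, and the boolean masking convention are illustrative assumptions.

```python
import math
import torch

def sphere_attention_mask(h, w, max_angle=0.5):
    """Boolean self-attention mask over an h x w equirectangular token grid.

    A token may attend to another only if their great-circle distance is at
    most `max_angle` radians, so neighbourhoods respect spherical adjacency
    (including the left/right seam and the poles) rather than pixel distance.
    """
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    lon = (u.float() + 0.5) / w * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v.float() + 0.5) / h * math.pi

    # Unit directions on the sphere, flattened to (h * w, 3).
    dirs = torch.stack([torch.cos(lat) * torch.cos(lon),
                        torch.cos(lat) * torch.sin(lon),
                        torch.sin(lat)], dim=-1).reshape(-1, 3)

    # Great-circle distance is the arccos of the pairwise dot products.
    angle = torch.acos((dirs @ dirs.T).clamp(-1.0, 1.0))      # (h*w, h*w)
    return angle <= max_angle                                  # True = attend
```

Such a mask, or a soft bias derived from the angles, can then be supplied as the attention mask of a standard transformer block (e.g., via `torch.nn.functional.scaled_dot_product_attention`, where boolean `True` marks positions allowed to attend).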
4. Trajectory Guidance, Conditioning, and Control
Precise camera control and trajectory adherence are realized by:
- Plücker or Pose Embedding: Direct per-pixel or per-token encoding of the camera pose for each frame enables explicit trajectory control. These encodings can be updated dynamically according to the position along the user-specified scan path (Ji et al., 24 Sep 2025, Yin et al., 29 Sep 2025, Ye et al., 31 Oct 2024).
- Motion Map Prediction: Models such as MotionFlow learn to predict implicit flow maps for each pixel, integrating the effect of camera and object motion into dense spatial conditioning, enabling applications with both moving camera and dynamic scenes (Lei et al., 25 Sep 2025).
- Classifier-Free and Trajectory Guidance in Attention: Guidance scales for conditioning branches and explicit trajectory, text, or object embeddings are applied via cross-attention at multiple network stages. Classifier-free guidance enables selective amplification of conditioning signals for sharper trajectory control (Kwak et al., 2023, Pan et al., 21 Jun 2025), as sketched below.
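A minimal sketch of the classifier-free guidance pattern referenced above: the denoiser is evaluated with and without the trajectory/text conditioning, and the conditional direction is amplified by a guidance scale. The single shared scale and the `cond=None` null-embedding convention are assumptions; the cited systems may apply separate scales per conditioning branch.

```python
import torch

@torch.no_grad()
def cfg_epsilon(model, x_t, t, cond, guidance_scale=5.0):
    """Classifier-free guided noise estimate.

    `cond` bundles the trajectory / text embeddings; `None` stands for the
    null embedding the model learned through condition dropout at training
    time. The interface is assumed for illustration.
    """
    eps_uncond = model(x_t, t, cond=None)   # unconditional branch
    eps_cond = model(x_t, t, cond=cond)     # trajectory/text-conditioned branch
    # Amplify the direction induced by the conditioning signal.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```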
5. Training Datasets, Procedures, and Evaluation
Trajectory-guided panoramic video diffusion models require specialized training datasets and evaluation metrics:
- Synthetic and Real-World Datasets: Large-scale datasets such as PanoExplorer (Unreal-rendered 360° videos and precise camera paths), HM3D+Habitat panoramas, and Google Scanned Objects are used for supervised learning and evaluation (Yin et al., 29 Sep 2025, Ye et al., 31 Oct 2024, Kwak et al., 2023, Yang et al., 11 Aug 2025).
- Two-Stage Training: Typical pipelines train a panorama or image generator (often with LoRA adapters) before fine-tuning a trajectory- or multi-view-aware model, with stages focused on small- and large-motion trajectory segments to enforce both local consistency and global novelty (Ye et al., 31 Oct 2024, Yang et al., 11 Aug 2025).
- Metrics: Standard metrics include PSNR, SSIM, LPIPS, FID, FVD, optical flow outlier ratios (FOR), and custom metrics such as FAED for equirectangular distortion or mTSED for trajectory alignment (Kwak et al., 2023, Ye et al., 31 Oct 2024, Liu et al., 15 Dec 2024, Yin et al., 29 Sep 2025, Ji et al., 24 Sep 2025). User studies on left-right seam continuity, image quality, and motion fidelity are also common. A minimal PSNR computation is sketched after the table below.
| Model / Metric | PSNR (↑) | SSIM (↑) | LPIPS (↓) | FID (↓) | FVD (↓) |
|---|---|---|---|---|---|
| CamPVG (Ji et al., 24 Sep 2025) | 30.05 | 0.6544 | 0.1480 | - | ~66 |
| PanoWorld-X (Yin et al., 29 Sep 2025) | 19.34 | 0.63 | 0.24 | 28.01 | 467 |
| Matrix-3D 720p (Yang et al., 11 Aug 2025) | 23.9 | 0.747 | 0.0907 | 11.3 | 140 |
| DiffPano (Ye et al., 31 Oct 2024) | - | 0.87 | - | ~48 (1-view) | - |
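For reference, the PSNR values reported above follow directly from the per-frame mean squared error; a minimal sketch, assuming video tensors normalized to [0, 1], is:

```python
import math
import torch

def video_psnr(pred, target, max_val=1.0):
    """Mean PSNR (dB) over frames for tensors of shape (T, C, H, W) in [0, max_val]."""
    mse = ((pred - target) ** 2).flatten(1).mean(dim=1).clamp_min(1e-12)
    psnr_per_frame = 20.0 * math.log10(max_val) - 10.0 * torch.log10(mse)
    return psnr_per_frame.mean()
```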
6. Representative Methods and Comparative Advances
- ViVid-1-to-3 (Kwak et al., 2023): Proposes dual diffusion with trajectory-based scan synthesis for high-fidelity, geometrically consistent view generation. Outperforms Zero-1-to-3 XL by ∼1dB PSNR and 0.002 LPIPS.
- DiffPano (Ye et al., 31 Oct 2024): Introduces spherical epipolar-aware attention and two-stage LoRA fine-tuning to robustly enforce multi-view consistency and seamlessness over arbitrary 6-DoF camera trajectories.
- CamPVG (Ji et al., 24 Sep 2025): Advances precise camera-controlled panoramic video diffusion by integrating Plücker embeddings for pose and spherical epipolar attention, yielding best-in-class PSNR and LPIPS on standard 360° datasets.
- PanoWorld-X (Yin et al., 29 Sep 2025): Delivers geometric fidelity and camera controllability in panoramic video via a sphere-aware DiT, with Plücker trajectory embeddings and spherical attention.
- DynamicScaler (Liu et al., 15 Dec 2024): Achieves resolution-invariant panoramic scene synthesis with the offset-shifting denoiser and rotating window method, substantially improving scene richness and temporal coherence in high-resolution regimes.
7. Limitations and Future Directions
While trajectory-guided panoramic video diffusion enables high-quality, controllable 360° generation, several limitations remain:
- Dataset Bias and Generalization: Many models rely on synthetic scenes or limited real benchmarks; generalization to complex real-world outdoor panoramas is underexplored (Ji et al., 24 Sep 2025).
- Sequence Length Restrictions: Inference and memory constraints often limit video length (e.g., PanoWorld-X to 49 frames), suggesting the need for more memory- and compute-efficient architectures (Yin et al., 29 Sep 2025).
- Explicit Dynamics: While object dynamics can be addressed by temporal MLLM prompt generation (e.g., DreamJourney (Pan et al., 21 Jun 2025)), the dominant direction remains static scene traversal. Expanding to dynamic-object, eventful worlds is an open area.
A plausible implication is that further integration of geometric priors (meshes, depth, semantics), expansion of large-scale real scene datasets, and improved efficiency/scalability of denoisers will drive the next advances in panoramic video diffusion and spatial intelligence.