- The paper introduces a novel 3D shape conditioning strategy using global and local latent features for improved multi-view consistency in orbital video generation.
- It employs a multi-scale adapter within a video diffusion framework and outperforms baselines on PSNR, SSIM, LPIPS, and CLIP-S.
- Experimental results demonstrate enhanced temporal stability and geometric realism, paving the way for self-supervised tasks and scalable integration with future video models.
Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
Introduction
The paper "Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors" (2604.12309) addresses a critical challenge in video generation: producing temporally consistent and geometrically plausible multi-view orbital videos from a single image of an object. Prior approaches relying primarily on pixelwise attention or 2.5D priors (e.g., depth or normal maps) have demonstrated limited effectiveness for substantial viewpoint extrapolations, notably in the synthesis of unseen object parts such as rear views. This paper introduces a novel shape conditioning strategy by integrating multi-scale latent features from large-scale 3D foundation models into diffusion-based video generation frameworks.
Methodology
Base Video Diffusion Model
The authors employ a base video diffusion model for temporal synthesis, specifically a U-Net-structured Video Diffusion Transformer akin to SVD [4]. Conditioning inputs include CLIP image embeddings, explicit camera pose encodings via sinusoidal embeddings, and noise timestep information. These inputs enable basic view control and semantic anchoring, but, as prior work shows, they are not sufficient to regularize severe geometric extrapolation or to enforce shape consistency on unseen surfaces.
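As an illustration of the camera pose encoding mentioned above, the following is a minimal sketch of a sinusoidal embedding applied to orbit angles. The function names, the embedding width, the 21-frame orbit, and the fixed 10-degree elevation are all assumptions made for the example; the summary does not specify how the paper parameterizes poses.

```python
import numpy as np

def sinusoidal_embedding(values, dim=8, max_period=10000.0):
    """Map scalars (angles in radians, timesteps, ...) to `dim` sin/cos features.

    `dim` must be even: half the channels hold sines, the other half cosines.
    """
    values = np.asarray(values, dtype=np.float64).reshape(-1, 1)      # (N, 1)
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)      # (half,)
    angles = values * freqs                                           # (N, half)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)   # (N, dim)

def camera_pose_embedding(azimuth_deg, elevation_deg, dim=8):
    """Hypothetical pose encoding: embed azimuth and elevation, then concatenate."""
    az = sinusoidal_embedding(np.deg2rad(azimuth_deg), dim)
    el = sinusoidal_embedding(np.deg2rad(elevation_deg), dim)
    return np.concatenate([az, el], axis=1)                           # (N, 2 * dim)

# Assumed setup: 21 frames evenly spaced on a full orbit at fixed elevation.
azimuths = np.linspace(0.0, 360.0, num=21, endpoint=False)
elevations = np.full_like(azimuths, 10.0)
pose_emb = camera_pose_embedding(azimuths, elevations)
print(pose_emb.shape)  # (21, 16)
```

The same `sinusoidal_embedding` helper could in principle encode the noise timestep as well, which is a common design choice in diffusion models.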
3D Foundation Priors
To bridge this gap, shape priors encoded in the latent space of a pretrained 3D generative model (specifically, Hunyuan3D [55]) are injected. Two levels of latent features are utilized:
- Global Latent Vector: Provides holistic structural guidance, capturing complete object geometry. It is extracted by denoising a Gaussian noise vector with a rectified flow.
- Local Volumetric Latents: Grid-sampled features representing fine-grained, view-dependent geometry, queried and projected into canonical camera views for precise local regularization.
These latent representations, extracted without explicit mesh or texture reconstruction, are compact and efficiently integrated, addressing shortcomings of prior 2.5D or explicit 3D methods in both inference speed and expressivity.
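The rectified-flow extraction of the global latent can be sketched as Euler integration of a velocity field from noise (t = 0) to a sample (t = 1). This is only a toy under stated assumptions: in a real system the velocity would come from the pretrained 3D model's network, whereas here `make_toy_velocity` is a hypothetical stand-in that points straight at a known target, and the latent size of 16 is arbitrary.

```python
import numpy as np

def euler_rectified_flow(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with fixed Euler steps."""
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                      # t stays in [0, 1 - dt]
        x = x + dt * velocity_fn(x, t)
    return x

def make_toy_velocity(target):
    """Hypothetical stand-in for a learned velocity network.

    On the straight path x_t = (1 - t) * x0 + t * target, this field equals
    the constant (target - x0), which is the ideal rectified-flow velocity.
    """
    def velocity(x, t):
        return (target - x) / (1.0 - t)
    return velocity

rng = np.random.default_rng(0)
target_latent = rng.normal(size=16)     # assumed latent dimensionality
noise = rng.normal(size=16)             # Gaussian starting vector
global_latent = euler_rectified_flow(make_toy_velocity(target_latent), noise)
print(np.allclose(global_latent, target_latent))
```

Because the toy trajectory is a straight line, Euler integration recovers the target regardless of the step count; a learned velocity field is only approximately straight, so the number of steps then trades speed against accuracy.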
Multi-Scale 3D Adapter
A multi-scale adapter injects the global and local 3D priors into each block of the video diffusion backbone via alternating cross-attention layers. The adapter architecture preserves spatiotemporal dependencies inherent in video backbones and maintains compatibility with parameter-efficient plug-and-play integration. The global prior is shared across frames to enforce object-level coherence, while the projected local features condition each frame to inject precise viewpoint-dependent information.
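A minimal NumPy sketch of this injection pattern follows: one cross-attention over a global latent shared by all frames, then one over per-frame local latents. Everything concrete here is an assumption for illustration, including the single-head attention, the residual connections, the tensor sizes, and the names `adapter_block` and `init_attention`; the summary does not describe the adapter at this level of detail.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(tokens, context, w_q, w_k, w_v, w_o):
    """Single-head cross-attention.

    tokens:  (F, N, C) video tokens per frame (queries)
    context: (F, M, D) conditioning tokens per frame (keys and values)
    """
    q = tokens @ w_q                                            # (F, N, H)
    k = context @ w_k                                           # (F, M, H)
    v = context @ w_v                                           # (F, M, H)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])    # (F, N, M)
    return (softmax(scores, axis=-1) @ v) @ w_o                 # (F, N, C)

def init_attention(rng, token_dim, context_dim, head_dim, scale=0.02):
    """Random projection weights (hypothetical initialization)."""
    return (
        rng.normal(scale=scale, size=(token_dim, head_dim)),    # w_q
        rng.normal(scale=scale, size=(context_dim, head_dim)),  # w_k
        rng.normal(scale=scale, size=(context_dim, head_dim)),  # w_v
        rng.normal(scale=scale, size=(head_dim, token_dim)),    # w_o
    )

def adapter_block(tokens, global_latent, local_latents, global_params, local_params):
    """Global cross-attention (shared across frames), then local (per frame)."""
    num_frames = tokens.shape[0]
    # The same global latent is broadcast to every frame.
    shared = np.broadcast_to(global_latent, (num_frames,) + global_latent.shape)
    tokens = tokens + cross_attention(tokens, shared, *global_params)
    # Each frame attends only to the local features projected into its own view.
    tokens = tokens + cross_attention(tokens, local_latents, *local_params)
    return tokens

rng = np.random.default_rng(0)
F, N, C = 4, 6, 8                  # frames, tokens per frame, channels (assumed)
D_g, D_l, M_l, H = 10, 12, 5, 8    # latent widths, local token count, head width (assumed)
video_tokens = rng.normal(size=(F, N, C))
global_latent = rng.normal(size=(1, D_g))        # one global token
local_latents = rng.normal(size=(F, M_l, D_l))   # per-frame projected features
global_params = init_attention(rng, C, D_g, H)
local_params = init_attention(rng, C, D_l, H)
out = adapter_block(video_tokens, global_latent, local_latents, global_params, local_params)
print(out.shape)  # (4, 6, 8)
```

Since the conditioning enters only through residual cross-attention terms, the backbone's own token shapes and layers are left untouched, which is one way to read the plug-and-play property described above.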
Experimental Results
Quantitative and Qualitative Evaluation
The authors evaluate on the Objaverse-XL [9] and GSO [10] benchmarks, comparing against state-of-the-art methods for orbital video generation (SV3D [42], Hi3D [47]), NVS (Wonder3D [25], Era3D [21]), and 3D asset generation (Hunyuan3D [55], Trellis [46]). The method outperforms all baselines on PSNR, SSIM, LPIPS, and CLIP-S, with the most substantial gains in multi-view consistency (lower MEt3R) and shape realism.
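For readers unfamiliar with the pixel-level metrics, PSNR is defined as 10 * log10(MAX^2 / MSE), where higher is better. The sketch below computes it per frame and averages over a video; the averaging scheme, the [0, 1] value range, and the function names are assumptions, since the summary does not state the paper's exact evaluation protocol.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

def video_psnr(pred_frames, target_frames, max_val=1.0):
    """Mean per-frame PSNR over a video (assumed aggregation)."""
    return float(np.mean([psnr(p, t, max_val) for p, t in zip(pred_frames, target_frames)]))

# Toy check: a constant error of 0.1 gives MSE = 0.01, i.e. about 20 dB.
target_video = np.zeros((3, 4, 4, 3))           # (frames, height, width, channels)
pred_video = np.full_like(target_video, 0.1)
score = video_psnr(pred_video, target_video)
```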
Qualitative analysis indicates significant reductions in geometric distortion and temporal artifacts, particularly for nontrivial viewpoint changes and occluded/rear regions. Results on in-the-wild images further confirm the model's robustness and generalization capability. Failure modes are primarily attributed to limited texture representation capacity in the base video model and the 3D prior.
Ablation Studies
Ablation experiments confirm that both the global and local priors contribute to the performance gains. Injection via cross-attention is essential: feature concatenation or input stacking severely degrades results because it misaligns feature spaces and disrupts the statistics of the pretrained backbone. Conditioning via cross-attention maintains the stochastic generative capabilities of the model while enforcing 3D consistency.
Efficiency
Inference analysis shows that the 3D prior extraction and injection modules add minimal computational overhead, so the quality improvements come at a manageable resource cost in practical deployments.
Implications and Future Directions
This work positions 3D foundation priors as a scalable and generalizable method for imposing geometric consistency across video and view synthesis tasks. The approach offers several practical and theoretical implications:
- Regularization for View Extrapolation: By utilizing shape priors in the native 3D latent space, the method resolves under-determined inverse problems inherent to single-image multi-view synthesis, mitigating hallucinations and structural inconsistencies.
- Modularity and Adaptability: The plug-and-play adapter structure allows for integration with future large-scale video diffusion models and alternative 3D priors. Future work may investigate more expressive texture priors, higher spatial resolution backbone architectures, and faster diffusion/inference via native 3D diffusion-based rendering.
- Downstream Utility: Generated orbital videos enable weakly supervised or self-supervised downstream tasks, such as object-centric video understanding, online asset creation for e-commerce/AR/VR, and more robust image-to-3D reconstruction pipelines.
- Limitations: The current pipeline does not address synthesis of high-quality unobserved textures, and is constrained by the base video model's generative resolution. Exploration of cross-modal supervision from text or high-fidelity PBR texture estimation remains open.
Conclusion
Integrating multi-scale 3D foundation priors into video diffusion models, as demonstrated in this work, substantially enhances the realism, multi-view consistency, and practical applicability of orbital video generation from single images. The approach sets a paradigm for 3D-aware video synthesis, suggesting future evolution toward full 3D- and appearance-aware generative video architectures.