Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

Published 14 Apr 2026 in cs.CV | (2604.12309v1)

Abstract: We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such a mechanism does not impose sufficient constraints for long-range extrapolation, e.g. rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised global latent vector as an overall structural guidance, and (ii) a set of latent images projected from volumetric features to provide view-dependent and fine-grained geometry details. In contrast to commonly used 2.5D representations such as depth or normal maps, these compact features can model complete object shapes, and help to improve inference efficiency by avoiding explicit mesh extraction. To achieve effective shape conditioning, we introduce a multi-scale 3D adapter to inject feature tokens into the base video model via cross-attention, which retains its capabilities from general video pretraining and enables a simple and model-agnostic fine-tuning process. Extensive experiments on multiple benchmarks show that our method achieves superior visual quality, shape realism and multi-view consistency compared to state-of-the-art methods, and robustly generalizes to complex camera trajectories and in-the-wild images.


Authors (6)

Summary

The paper introduces a novel 3D shape conditioning strategy using global and local latent features for improved multi-view consistency in orbital video generation.
It employs a multi-scale adapter within a video diffusion framework, achieving superior performance in metrics like PSNR, SSIM, LPIPS, and CLIP-S compared to baselines.
Experimental results demonstrate enhanced temporal stability and geometric realism, paving the way for self-supervised tasks and scalable integration with future video models.

Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

Introduction

The paper "Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors" (2604.12309) addresses a critical challenge in video generation: producing temporally consistent and geometrically plausible multi-view orbital videos from a single image of an object. Prior approaches relying primarily on pixelwise attention or 2.5D priors (e.g., depth or normal maps) have demonstrated limited effectiveness for substantial viewpoint extrapolations, notably in the synthesis of unseen object parts such as rear views. This paper introduces a novel shape conditioning strategy by integrating multi-scale latent features from large-scale 3D foundation models into diffusion-based video generation frameworks.

Methodology

Base Video Diffusion Model

The authors employ a base video diffusion model (specifically, a U-Net-structured Video Diffusion Transformer akin to SVD [4]) for temporal synthesis. Conditioning inputs include CLIP image embeddings, explicit camera pose encodings via sinusoidal embeddings, and noise timestep information. These inputs enable basic view control and semantic anchoring, but they are insufficient to constrain large viewpoint extrapolations or to enforce shape consistency on unseen surfaces, as evidenced in prior works.
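To make the camera conditioning concrete, here is a minimal sketch of a sinusoidal pose encoding. The summary only states that poses are encoded with sinusoidal embeddings; the azimuth/elevation parameterization, the embedding width `dim`, the `max_period` constant, and both function names are assumptions made for illustration, not details taken from the paper.

```python
import math

def sinusoidal_embedding(value, dim=8, max_period=10000.0):
    """Encode a scalar (e.g. an angle in radians) as `dim` sin/cos features."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies, starting at 1 and decaying toward 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]

def camera_pose_embedding(azimuth_deg, elevation_deg, dim=8):
    """Concatenate embeddings of azimuth and elevation (assumed pose parameterization)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return sinusoidal_embedding(az, dim) + sinusoidal_embedding(el, dim)

# One embedding per frame of a hypothetical 8-frame orbit at a fixed 10 degree elevation.
num_frames = 8
orbit = [camera_pose_embedding(360.0 * k / num_frames, 10.0) for k in range(num_frames)]
```

Each frame receives its own pose vector, which is what gives the model basic view control; the encoding carries no shape information of its own.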

3D Foundation Priors

To bridge this gap, shape priors encoded in the latent space of a pretrained 3D generative model (specifically, Hunyuan3D [55]) are injected. Two levels of latent features are utilized:

Global Latent Vector: Provides holistic structural guidance, capturing complete object geometry. It is obtained by denoising a Gaussian noise vector with a rectified flow.
Local Volumetric Latents: Grid-sampled features representing fine-grained, view-dependent geometry, queried and projected into canonical camera views for precise local regularization.
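The rectified-flow denoising behind the global latent can be sketched as plain Euler integration of a velocity field. In the sketch below, `toy_velocity` is a hypothetical stand-in for the learned network: it is the analytic velocity of a straight path toward one fixed target. The time convention (noise at t = 0, data at t = 1), the step count, and all names are assumptions rather than details from the paper or from Hunyuan3D.

```python
import random

def toy_velocity(x, t, target):
    """Stand-in for a learned velocity network (hypothetical).

    For a straight path from noise to a single target, the velocity at
    time t is (target - x) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def sample_rectified_flow(velocity_fn, dim, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (Gaussian noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # initial Gaussian vector
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy "global latent" the flow should recover from noise.
target = [0.5, -1.0, 2.0, 0.0]
latent = sample_rectified_flow(lambda x, t: toy_velocity(x, t, target), dim=4)
```

Because the toy path is exactly straight, Euler integration lands on the target for any step count. A learned velocity field is only approximately straight, so the number of steps becomes a trade-off between quality and speed.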

These latent representations, extracted without explicit mesh or texture reconstruction, are compact and efficiently integrated, addressing shortcomings of prior 2.5D or explicit 3D methods in both inference speed and expressivity.

Multi-Scale 3D Adapter

A multi-scale adapter injects the global and local 3D priors into each block of the video diffusion backbone via alternating cross-attention layers. The adapter preserves the spatiotemporal dependencies learned by the video backbone and can be attached in a parameter-efficient, plug-and-play manner. The global prior is shared across frames to enforce object-level coherence, while the projected local features condition each frame individually to supply precise viewpoint-dependent information.
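The injection pattern described above can be illustrated with a small NumPy sketch: each frame's tokens first attend to the shared global tokens and then to that frame's own local tokens. This is a single-head, illustrative version only. The tensor shapes, the function names, and the zero-initialized output projection (one common way to leave a pretrained backbone unchanged at the start of fine-tuning) are all assumptions and are not confirmed by the source.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v, w_out):
    """Residual cross-attention: queries (N, d) attend to context tokens (M, d_ctx)."""
    q = queries @ w_q                                   # (N, d_attn)
    k = context @ w_k                                   # (M, d_attn)
    v = context @ w_v                                   # (M, d_attn)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, M)
    return queries + (weights @ v) @ w_out              # residual keeps backbone features

def adapter_block(frame_tokens, global_tokens, local_tokens, params):
    """frame_tokens: (F, N, d); global_tokens: (Mg, dg), shared; local_tokens: (F, Ml, dl)."""
    out = np.empty_like(frame_tokens)
    for f in range(frame_tokens.shape[0]):
        h = cross_attention(frame_tokens[f], global_tokens, *params["global"])
        out[f] = cross_attention(h, local_tokens[f], *params["local"])
    return out

def init_params(rng, d, d_ctx, d_attn, zero_out=True):
    """Random projections; zero output projection is an assumed initialization."""
    w_q = rng.standard_normal((d, d_attn)) / np.sqrt(d)
    w_k = rng.standard_normal((d_ctx, d_attn)) / np.sqrt(d_ctx)
    w_v = rng.standard_normal((d_ctx, d_attn)) / np.sqrt(d_ctx)
    if zero_out:
        w_out = np.zeros((d_attn, d))
    else:
        w_out = rng.standard_normal((d_attn, d)) / np.sqrt(d_attn)
    return (w_q, w_k, w_v, w_out)

rng = np.random.default_rng(0)
num_frames, num_tokens, d = 4, 6, 8
frames = rng.standard_normal((num_frames, num_tokens, d))
global_tokens = rng.standard_normal((1, 16))             # one global token, for simplicity
local_tokens = rng.standard_normal((num_frames, 5, 12))  # per-frame local tokens
params = {"global": init_params(rng, d, 16, 8), "local": init_params(rng, d, 12, 8)}
out = adapter_block(frames, global_tokens, local_tokens, params)
```

With the output projections at zero, the block returns the backbone features unchanged, so fine-tuning starts from the pretrained behavior. Once those projections are trained, the same global tokens influence every frame while each frame sees only its own local tokens.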

Experimental Results

Quantitative and Qualitative Evaluation

Comprehensive evaluations on the Objaverse-XL [9] and GSO [10] benchmarks, comparing against state-of-the-art orbital video generation (SV3D [42], Hi3D [47]), NVS (Wonder3D [25], Era3D [21]), and 3D asset generation (Hunyuan3D [55], Trellis [46]) methods, demonstrate the effectiveness of the proposed conditioning scheme. The method improves on all baselines in PSNR, SSIM, LPIPS, and CLIP-S, with the most substantial gains in multi-view consistency (lower MEt3R) and shape realism.
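As a reminder of what the simplest of these metrics measures, the following sketch computes PSNR from the standard definition, 10 · log10(MAX² / MSE), on flattened pixel lists. The function name and the toy pixel values are illustrative and are not data from the paper. SSIM, LPIPS, CLIP-S, and MEt3R need dedicated implementations and are not shown.

```python
import math

def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy example: every generated pixel is off by 0.1 on a [0, 1] intensity scale.
reference = [0.2, 0.4, 0.6, 0.8]
generated = [0.3, 0.5, 0.7, 0.9]
score = psnr(reference, generated)
```

A uniform error of 0.1 gives an MSE of 0.01 and therefore a PSNR of about 20 dB; higher values indicate a closer match to the ground-truth view.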

Qualitative analysis indicates significant reductions in geometric distortion and temporal artifacts, particularly for nontrivial viewpoint changes and occluded/rear regions. Results on in-the-wild images further confirm the model's robustness and generalization capability. Failure modes are primarily attributed to limited texture representation capacity in the base video model and the 3D prior.

Ablation Studies

Ablation experiments confirm that both the global and local priors contribute to the performance gains. They also show that injecting the priors via cross-attention is essential: feature concatenation and input stacking both severely degrade results, which is attributed to misaligned feature spaces and disruption of the pretrained backbone's statistics. Conditioning via cross-attention preserves the stochastic generative capabilities of the model while enforcing 3D consistency.

Efficiency

Inference analysis shows that extracting and injecting the 3D priors adds minimal computational overhead, so the quality gains come at a manageable resource cost for practical deployment.

Implications and Future Directions

This work positions 3D foundation priors as a scalable and generalizable means of imposing geometric consistency across video and view synthesis tasks. The approach has several practical and theoretical implications:

Regularization for View Extrapolation: By using shape priors in the native 3D latent space, the method constrains the under-determined inverse problem inherent to single-image multi-view synthesis, mitigating hallucinations and structural inconsistencies.
Modularity and Adaptability: The plug-and-play adapter structure allows for integration with future large-scale video diffusion models and alternative 3D priors. Future work may investigate more expressive texture priors, higher spatial resolution backbone architectures, and faster diffusion/inference via native 3D diffusion-based rendering.
Downstream Utility: Generated orbital videos enable weakly supervised or self-supervised downstream tasks, such as object-centric video understanding, online asset creation for e-commerce/AR/VR, and more robust image-to-3D reconstruction pipelines.
Limitations: The current pipeline does not address synthesis of high-quality unobserved textures, and is constrained by the base video model's generative resolution. Exploration of cross-modal supervision from text or high-fidelity PBR texture estimation remains open.

Conclusion

The integration of multi-scale 3D foundation priors into video diffusion models, as proposed in this work, substantially enhances the realism, multi-view consistency, and practical applicability of orbital video generation from single images. The approach offers a template for 3D-aware video synthesis and points toward generative video architectures that are fully aware of both 3D structure and appearance.
