- The paper presents DimensionX, a video diffusion framework that disentangles spatial and temporal dynamics using dimension-specific LoRAs.
- It employs a training-free switch mechanism to balance structure synthesis and motion control for 3D and 4D scene generation.
- Extensive experiments demonstrate superior realism and consistency over existing methods in controllable scene reconstruction.
DimensionX: Controllable 3D and 4D Scene Generation from a Single Image
The paper under consideration presents "DimensionX," a novel framework for generating 3D and 4D scenes from a single image using controllable video diffusion models. The core contribution of DimensionX is a controllable video diffusion module named ST-Director, which separates the spatial and temporal components of the generative process. This is achieved through dimension-aware Low-Rank Adaptation (LoRA) layers, each trained on a curated dataset in which only one dimension varies: camera motion with static content for the spatial director, or dynamic content with a fixed camera for the temporal director.
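To make the dimension-aware LoRA idea concrete, the sketch below shows how two separate low-rank adapters might wrap a frozen projection layer of the video diffusion backbone. This is a minimal illustration under our own assumptions; the class name `LoRALinear`, the rank and scaling values, and the single-layer setup are not taken from the paper's implementation.

```python
# Minimal sketch (not the authors' code): dimension-specific LoRA adapters
# around one frozen projection layer of a video diffusion backbone.
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base projection plus a trainable low-rank residual."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the diffusion backbone stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as an identity residual
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# Two directors share the same frozen projection but carry separate LoRA weights:
# an S-Director trained on spatial-variant clips (camera moves, scene static) and
# a T-Director trained on temporal-variant clips (camera fixed, scene dynamic).
base_proj = nn.Linear(1024, 1024)
s_director = LoRALinear(base_proj, rank=16)
t_director = LoRALinear(base_proj, rank=16)
```

In this setup only the adapter weights differ between the two directors, which is what makes switching between them during sampling cheap.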
Summary of Key Contributions
- ST-Director Architecture: The ST-Director framework improves controllability of video diffusion models by disentangling spatial and temporal dynamics. This separation is accomplished with dimension-specific LoRAs, each trained on data exhibiting only spatial or only temporal variation.
- Hybrid-Dimension Control: Through a training-free mechanism that switches between the spatial and temporal directors early in the denoising process, DimensionX balances structure synthesis with motion control (a sketch of this switching schedule follows this list).
- Trajectory and Identity Mechanisms: To enhance generalization and realism, a trajectory-aware strategy manages varying camera motions for 3D scene synthesis, while an identity-preserving strategy ensures consistency across reference-informed 4D reconstructions.
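The following sketch illustrates one way the training-free switch could look inside a standard diffusion sampling loop: the spatial director guides the first few denoising steps, where scene layout is largely decided, and the temporal director takes over afterwards. The helper `denoise_with`, the diffusers-style `scheduler` interface, and the switch step of 5 are assumptions for illustration, not the paper's exact interface or schedule.

```python
# Hedged sketch of hybrid-dimension control: switch LoRA directors mid-sampling.
import torch

@torch.no_grad()
def hybrid_denoise(latents, scheduler, denoise_with, s_director, t_director,
                   cond, switch_step: int = 5):
    """Denoise latents, guided by the S-Director for the first `switch_step`
    steps and by the T-Director for the remaining steps."""
    for i, t in enumerate(scheduler.timesteps):
        director = s_director if i < switch_step else t_director
        # run the backbone with the chosen LoRA adapter active (hypothetical helper)
        noise_pred = denoise_with(director, latents, t, cond)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```

Because the switch only changes which adapter is active, no additional training is needed to combine spatial structure and temporal motion in one generated video.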
Experimental Validations
Extensive experiments demonstrate DimensionX's superiority over existing methods across several metrics, particularly in generating controllable, high-fidelity videos from static input images. Compared to models such as CogVideoX and Dream Machine 1.6, DimensionX better maintains subject consistency and dynamic content while preserving overall visual coherence.
On test datasets, DimensionX provides robust novel view synthesis, effectively addressing typical challenges such as generating scenes from severely limited visual inputs. The trajectory-aware methodology, combined with the identity-preserving denoising, contributes to marked improvements in spatial and temporal consistency, as verified by metrics like PSNR, SSIM, and LPIPS.
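For reference, the sketch below shows how per-frame PSNR, SSIM, and LPIPS scores could be computed against ground-truth novel views. The library choices (scikit-image and the `lpips` package) and the frame format are our assumptions, not necessarily the paper's evaluation code.

```python
# Minimal sketch of per-frame evaluation metrics for novel view synthesis.
import numpy as np
import torch
import lpips                                   # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')             # perceptual distance network

def frame_metrics(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: HxWx3 uint8 frames; returns (PSNR, SSIM, LPIPS)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    # LPIPS expects NCHW tensors scaled to [-1, 1]
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp
```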
Implications and Future Work
The potential implications of this work in synthesizing realistic and controllable 3D and 4D visual content are manifold. Practically, this could enhance applications in virtual reality, gaming, and animated filmmaking where true-to-life environmental reconstruction from minimal input data is required. Theoretically, this paper contributes a methodological leap towards improving and understanding dimension-wise controllability in generative models.
Future developments might involve scaling video diffusion models to capture finer detail or accelerating inference, potentially through more efficient model architectures or hybrid techniques that combine diffusion models with alternative generative approaches.
Overall, DimensionX sets a promising direction in the field of video diffusion models for creating dynamic and interactive visual environments from static images. The separation of spatial and temporal dynamics, combined with powerful generative capabilities, positions this framework as a noteworthy step towards more generalizable and adaptable AI-driven generation tools.