GenXD: Generating Any 3D and 4D Scenes (2411.02319v2)
Abstract: Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by leveraging camera and object movements commonly observed in daily life. Due to the lack of real-world 4D data in the community, we first propose a data curation pipeline to obtain camera poses and object motion strength from videos. Based on this pipeline, we introduce a large-scale real-world 4D scene dataset: CamVid-30K. By leveraging all the 3D and 4D data, we develop our framework, GenXD, which allows us to produce any 3D or 4D scene. We propose multiview-temporal modules, which disentangle camera and object movements, to seamlessly learn from both 3D and 4D data. Additionally, GenXD employs masked latent conditions to support a variety of conditioning views. GenXD can generate videos that follow the camera trajectory as well as consistent 3D views that can be lifted into 3D representations. We perform extensive evaluations across various real-world and synthetic datasets, demonstrating GenXD's effectiveness and versatility compared to previous methods in 3D and 4D generation.