GenXD: Generating Any 3D and 4D Scenes (2411.02319v2)

Published 4 Nov 2024 in cs.CV and cs.AI

Abstract: Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by leveraging camera and object movements commonly observed in daily life. Due to the lack of real-world 4D data in the community, we first propose a data curation pipeline to obtain camera poses and object motion strength from videos. Based on this pipeline, we introduce a large-scale real-world 4D scene dataset: CamVid-30K. By leveraging all the 3D and 4D data, we develop our framework, GenXD, which allows us to produce any 3D or 4D scene. We propose multiview-temporal modules, which disentangle camera and object movements, to seamlessly learn from both 3D and 4D data. Additionally, GenXD employs masked latent conditions to support a variety of conditioning views. GenXD can generate videos that follow the camera trajectory as well as consistent 3D views that can be lifted into 3D representations. We perform extensive evaluations across various real-world and synthetic datasets, demonstrating GenXD's effectiveness and versatility compared to previous methods in 3D and 4D generation.


Summary

  • The paper introduces a unified framework combining latent diffusion and multiview-temporal modules to generate both 3D and 4D scenes.
  • It employs a novel data curation pipeline to extract camera poses and object motions, creating the CamVid-30K dataset with 30K real-world samples.
  • Extensive experiments show that GenXD outperforms state-of-the-art methods in single-view 3D object generation and few-view 3D scene reconstruction.

Overview of "GenXD: Generating Any 3D and 4D Scenes"

The paper "GenXD: Generating Any 3D and 4D Scenes" introduces GenXD, a framework designed to address the challenges of 3D and 4D scene generation. The focus is on leveraging both existing 3D data and a newly curated 4D dataset to train a unified model capable of producing high-quality scenes from minimal conditioning images. This approach targets two limitations that have held the field back: the scarcity of large-scale real-world 4D data and the absence of effective model designs for representing dynamic scenes.

The primary contribution of this work lies in its unified framework that handles both static (3D) and dynamic (4D) generation tasks seamlessly. The authors introduce a data curation pipeline that extracts both camera poses and object motion strengths from video inputs, culminating in the creation of a new dataset referred to as CamVid-30K. This dataset addresses a significant gap by incorporating approximately 30,000 real-world 4D data samples, thus providing a foundation for enhancing 4D generation models.
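As a rough illustration of the motion-strength idea (not the authors' exact algorithm), the sketch below scores object motion by subtracting the displacement explained by camera movement from tracked point trajectories. The function name, array layout, and averaging scheme are assumptions made for illustration.

```python
import numpy as np

def motion_strength(tracks: np.ndarray, cam_flow: np.ndarray,
                    visibility: np.ndarray) -> float:
    """Score object motion in a clip, independent of camera movement.

    tracks:     (T, N, 2) 2D trajectories of points on foreground objects.
    cam_flow:   (T, N, 2) displacement each point would undergo from
                camera motion alone (e.g., derived from SfM poses + depth).
    visibility: (T, N) boolean mask of points visible in each frame.
    """
    # Residual motion = observed track displacement minus the component
    # explained by the camera, leaving only genuine object movement.
    disp = np.diff(tracks, axis=0)                   # (T-1, N, 2)
    cam = np.diff(cam_flow, axis=0)                  # (T-1, N, 2)
    residual = np.linalg.norm(disp - cam, axis=-1)   # (T-1, N)

    # A point contributes only if it is visible in both adjacent frames.
    vis = visibility[1:] & visibility[:-1]
    if vis.sum() == 0:
        return 0.0
    # Average residual pixel motion over all visible points and frames.
    return float(residual[vis].mean())
```

Under this reading, clips whose score exceeds a threshold are kept as 4D samples, while near-zero scores indicate static scenes that can serve as camera-only 3D multi-view data.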

GenXD builds on a latent diffusion model that transforms input conditions into 3D and 4D outputs. A particularly innovative aspect of the framework is its multiview-temporal modules, which disentangle spatial (camera) and temporal (object) movement, allowing the model to learn effectively from both 3D and 4D data. Furthermore, GenXD employs masked latent conditions that support varying numbers of conditioning views, offering flexibility and scalability in generating consistent outputs across applications.
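To make the disentanglement concrete, here is a minimal PyTorch sketch of what a multiview-temporal block could look like. The class name, the learned gate on the temporal branch, and the tensor layout are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiviewTemporalBlock(nn.Module):
    """Hypothetical disentangled spatial/temporal attention block.

    The multiview path attends across camera views; the temporal path
    attends across time. A learned gate (alpha) scales the temporal
    branch, so it can be suppressed for static 3D data, where only
    camera movement is present.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))  # gates temporal branch

    def forward(self, x: torch.Tensor, is_static: bool) -> torch.Tensor:
        # x: (batch, views, time, tokens, dim)
        b, v, t, n, d = x.shape

        # Multiview attention: mix information across views per time step.
        xv = x.permute(0, 2, 3, 1, 4).reshape(b * t * n, v, d)
        xv = self.view_attn(xv, xv, xv)[0].reshape(b, t, n, v, d)
        x = x + xv.permute(0, 3, 1, 2, 4)

        # Temporal attention: mix across time, gated by alpha and
        # skipped entirely for static (3D-only) samples.
        if not is_static:
            xt = x.permute(0, 1, 3, 2, 4).reshape(b * v * n, t, d)
            xt = self.time_attn(xt, xt, xt)[0].reshape(b, v, n, t, d)
            x = x + torch.tanh(self.alpha) * xt.permute(0, 1, 3, 2, 4)
        return x
```

For the masked latent conditions, one plausible implementation keeps the clean VAE latents of the conditioning views, zeroes the latents of target views, and concatenates a binary mask channel, so a single network can handle any number of condition views.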

The empirical assessments presented in the paper highlight the model's advantage over existing methods. Evaluations across a spectrum of tasks demonstrate that GenXD matches and often surpasses state-of-the-art methods in both 3D object and scene generation, as well as in 4D video generation, with notable gains in single-view 3D object generation and few-view 3D scene reconstruction.

Implications and Future Developments

GenXD's strong performance has wide-ranging implications for industries reliant on 3D content generation, such as gaming, augmented reality, and virtual reality. By providing a robust mechanism for generating high-quality 3D and 4D scenes, the framework can streamline content creation workflows, reduce resource dependencies, and enable more immersive user experiences.

Theoretically, the insight provided by the multiview-temporal modules contributes to a deeper understanding of how spatial and temporal information can be decoupled and subsequently leveraged for scene generation. This could inform future architectures targeting similar generative tasks beyond the scope of this paper.

Looking forward, the introduction of the CamVid-30K dataset opens avenues for further research into more realistic and dynamic scene generation. Future developments may focus on expanding the diversity and complexity of such datasets, enabling models like GenXD to generalize better to real-world scenarios. Additionally, refinements to the model's architecture may enhance its ability to capture finer details in highly dynamic environments, potentially integrating concepts from physics-based simulation or neural rendering.

In summary, this paper represents a substantial step toward more versatile and scalable generative models, emphasizing the importance of integrating diverse datasets and novel architectural components to push the boundaries in computer-generated scene realism.
