CAT3D: Create Anything in 3D with Multi-View Diffusion Models (2405.10314v1)
Abstract: Advances in 3D reconstruction have enabled high-quality 3D capture, but require a user to collect hundreds to thousands of images to create a 3D scene. We present CAT3D, a method for creating anything in 3D by simulating this real-world capture process with a multi-view diffusion model. Given any number of input images and a set of target novel viewpoints, our model generates highly consistent novel views of a scene. These generated views can be used as input to robust 3D reconstruction techniques to produce 3D representations that can be rendered from any viewpoint in real time. CAT3D can create entire 3D scenes in as little as one minute, and outperforms existing methods for single-image and few-view 3D scene creation. See our project page at https://cat3d.github.io for results and interactive demos.
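The abstract describes a two-stage pipeline: (1) a multi-view diffusion model samples consistent novel views conditioned on the input images and target camera poses, and (2) a robust 3D reconstruction method fits a renderable representation to those views. The sketch below illustrates only this data flow; every function name, signature, and array shape is an illustrative assumption, not the paper's actual API, and the stages are stubbed out with placeholders.

```python
# Hedged sketch of the two-stage CAT3D pipeline from the abstract.
# All names and shapes here are assumptions for illustration only.
import numpy as np


def generate_novel_views(input_images, target_poses, seed=0):
    """Stage 1 (placeholder): the multi-view diffusion model would take
    any number of input images plus target viewpoints and sample highly
    consistent novel views. Here we just emit noise images of matching
    shape, one per requested pose."""
    rng = np.random.default_rng(seed)
    h, w, c = input_images[0].shape
    return [rng.random((h, w, c)) for _ in target_poses]


def reconstruct_3d(views, poses):
    """Stage 2 (placeholder): a robust 3D reconstruction method (e.g. a
    NeRF-style optimizer) would fit a real-time-renderable representation
    to the generated views. Here we return a trivial summary."""
    return {"num_views": len(views), "poses": list(poses)}


# Usage: one input photo, eight target viewpoints around the scene.
input_images = [np.zeros((64, 64, 3))]
target_poses = [f"pose_{i}" for i in range(8)]
views = generate_novel_views(input_images, target_poses)
scene = reconstruct_3d(views, target_poses)
```

The point of the sketch is the decoupling: once stage 1 produces enough consistent views, stage 2 can be an off-the-shelf reconstruction method rather than anything diffusion-specific.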
Authors: Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron, Ben Poole