Efficient4D: Fast Dynamic 3D Object Generation from a Single-view Video (2401.08742v3)
Abstract: Generating a dynamic 3D object from a single-view video is challenging due to the lack of 4D labeled data. An intuitive approach is to extend previous image-to-3D pipelines by transferring supervision from off-the-shelf image generation models via techniques such as score distillation sampling. However, this approach is slow and expensive to scale, because it requires back-propagating information-limited supervision signals through a large pretrained model. To address this, we propose an efficient video-to-4D object generation framework called Efficient4D. It first generates high-quality, spacetime-consistent images under different camera views, and then uses them as labeled data to directly reconstruct the 4D content with a 4D Gaussian splatting model. Importantly, our method achieves real-time rendering under continuous camera trajectories. To enable robust reconstruction under sparse views, we introduce an inconsistency-aware, confidence-weighted loss design along with a lightly weighted score distillation loss. Extensive experiments on both synthetic and real videos show that Efficient4D offers a remarkable 10-fold speedup over prior-art alternatives while preserving the quality of novel view synthesis. For example, Efficient4D takes only 10 minutes to model a dynamic object, versus 120 minutes for the previous state-of-the-art model, Consistent4D.
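The abstract describes a two-stage pipeline: generate spacetime-consistent multi-view images, then fit a 4D Gaussian splatting model to them with a confidence-weighted reconstruction loss that down-weights pixels where the generated views disagree. As a rough illustration of that loss idea only (a minimal sketch, not the paper's implementation; the function name, the exponential confidence mapping, and the `sigma` scale are all assumptions), consider:

```python
import torch

def confidence_weighted_loss(rendered, pseudo_gt, neighbor_gt, sigma=0.1):
    """Sketch of an inconsistency-aware, confidence-weighted photometric loss.

    rendered:    (B, 3, H, W) images rendered from the 4D Gaussian model
    pseudo_gt:   (B, 3, H, W) generated multi-view images used as labels
    neighbor_gt: (B, 3, H, W) generated images at adjacent views/timestamps,
                 used here to estimate cross-view inconsistency (assumption)
    """
    # Per-pixel disagreement between neighbouring generated views; large
    # values mark pixels whose pseudo-labels are unreliable.
    inconsistency = (pseudo_gt - neighbor_gt).abs().mean(dim=1, keepdim=True)

    # Map inconsistency to a confidence in (0, 1]: consistent pixels keep
    # weight ~1, inconsistent ones are smoothly down-weighted.
    confidence = torch.exp(-inconsistency / sigma)

    # Confidence-weighted L2 reconstruction term against the pseudo-labels.
    per_pixel = (rendered - pseudo_gt).pow(2).mean(dim=1, keepdim=True)
    return (confidence * per_pixel).mean()
```

Per the abstract, the full objective would also add a lightly weighted score distillation term, i.e. something like `total = confidence_weighted_loss(...) + lambda_sds * sds_loss` with a small `lambda_sds` (again an assumed form, not the paper's exact formulation).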
- Mixamo. https://www.mixamo.com/, 2023.
- Sketchfab. https://sketchfab.com/, 2023.
- Hexplane: A fast representation for dynamic scenes. In CVPR, 2023.
- Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In ICCV, 2023.
- Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint, 2023a.
- Objaverse: A universe of annotated 3d objects. In CVPR, 2023b.
- K-planes: Explicit radiance fields in space, time, and appearance. In CVPR, 2023.
- Dreamtime: An improved optimization strategy for text-to-3d content creation. arXiv preprint, 2023.
- Real-time intermediate flow estimation for video frame interpolation. In ECCV, 2022.
- Consistent4d: Consistent 360° dynamic object generation from monocular video. arXiv preprint, 2023.
- Shap-e: Generating conditional 3d implicit functions. arXiv preprint, 2023.
- 3d gaussian splatting for real-time radiance field rendering. ACM TOG, 2023.
- Segment anything. In ICCV, 2023.
- Neural 3d video synthesis from multi-view video. In CVPR, 2022.
- Focaldreamer: Text-driven 3d editing via focal-fusion assembly. arXiv preprint, 2023a.
- Dynibar: Neural dynamic image-based rendering. In CVPR, 2023b.
- Magic3d: High-resolution text-to-3d content creation. In CVPR, 2023.
- Devrf: Fast deformable voxel radiance fields for dynamic scenes. In NeurIPS, 2022.
- One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. arXiv preprint, 2023a.
- Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023b.
- Syncdreamer: Learning to generate multiview-consistent images from a single-view image. arXiv preprint, 2023c.
- Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint, 2023.
- Realfusion: 360° reconstruction of any object from a single image. In CVPR, 2023.
- Latent-nerf for shape-guided generation of 3d shapes and textures. In CVPR, 2023.
- Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 2021.
- Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint, 2022.
- Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. ACM TOG, 2021.
- Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023.
- D-nerf: Neural radiance fields for dynamic scenes. In CVPR, 2021.
- Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint, 2023.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Let 2d diffusion model know 3d-consistency for robust text-to-3d generation. arXiv preprint, 2023.
- Tensor4d: Efficient neural 4d decomposition for high-fidelity dynamic reconstruction and rendering. In CVPR, 2023.
- Mvdream: Multi-view diffusion for 3d generation. arXiv preprint, 2023.
- Make-a-video: Text-to-video generation without text-video data. arXiv preprint, 2022.
- Text-to-4d dynamic scene generation. In ICML, 2023.
- Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. arXiv preprint, 2023.
- Textmesh: Generation of realistic 3d meshes from text prompts. In 3DV, 2024.
- Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In NeurIPS, 2021.
- Image quality assessment: from error visibility to structural similarity. IEEE TIP, 2004.
- Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In NeurIPS, 2023.
- Hd-fusion: Detailed text-to-3d generation leveraging multiple noise estimation. arXiv preprint, 2023.
- Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360° views. In CVPR, 2023.
- Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint, 2023.
- Hifa: High-fidelity text-to-3d with advanced diffusion guidance. arXiv preprint, 2023.
- High-quality video view interpolation using a layered representation. ACM TOG, 2004.
- Ewa volume splatting. In IEEE Visualization, 2001.