Vox-E: Text-guided Voxel Editing of 3D Objects (2303.12048v3)
Abstract: Large-scale text-guided diffusion models have garnered significant attention due to their ability to synthesize diverse images that convey complex visual concepts. This generative power has more recently been leveraged to perform text-to-3D synthesis. In this work, we present a technique that harnesses the power of latent diffusion models for editing existing 3D objects. Our method takes oriented 2D images of a 3D object as input and learns a grid-based volumetric representation of it. To guide the volumetric representation to conform to a target text prompt, we follow unconditional text-to-3D methods and optimize a Score Distillation Sampling (SDS) loss. However, we observe that combining this diffusion-guided loss with an image-based regularization loss that encourages the representation not to deviate too strongly from the input object is challenging, as it requires achieving two conflicting goals while viewing only structure-and-appearance coupled 2D projections. Thus, we introduce a novel volumetric regularization loss that operates directly in 3D space, utilizing the explicit nature of our 3D representation to enforce correlation between the global structure of the original and edited object. Furthermore, we present a technique that optimizes cross-attention volumetric grids to refine the spatial extent of the edits. Extensive experiments and comparisons demonstrate the effectiveness of our approach in creating a myriad of edits that cannot be achieved by prior works.
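The abstract describes a regularizer computed directly on the explicit 3D voxel grids rather than on 2D renderings. Below is a minimal sketch of one plausible form of such a term, assuming the original and edited objects are stored as dense density grids and that "correlation between the global structure" is measured as a Pearson-style correlation over voxel densities; the function name, the weight `lambda_reg`, and the grid resolution are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def volumetric_correlation_loss(density_orig: torch.Tensor,
                                density_edit: torch.Tensor,
                                eps: float = 1e-8) -> torch.Tensor:
    """Regularizer operating directly on explicit voxel grids.

    density_orig: (D, H, W) density grid of the frozen, original object.
    density_edit: (D, H, W) density grid being optimized toward the edit prompt.
    Returns 1 - Pearson correlation over all voxels, so the loss is small
    when the edited grid's global 3D structure tracks the original.
    """
    x = density_orig.detach().flatten()
    y = density_edit.flatten()
    x = x - x.mean()
    y = y - y.mean()
    corr = (x * y).sum() / (x.norm() * y.norm() + eps)
    return 1.0 - corr


if __name__ == "__main__":
    # Toy usage: in a full pipeline this 3D regularizer would be combined with
    # the diffusion-guided SDS term computed on rendered 2D views.
    sigma_orig = torch.rand(128, 128, 128)            # frozen input grid (illustrative size)
    sigma_edit = sigma_orig.clone().requires_grad_()  # grid being edited
    lambda_reg = 0.1                                  # illustrative weight, not from the paper
    loss = lambda_reg * volumetric_correlation_loss(sigma_orig, sigma_edit)
    loss.backward()
    print(float(loss))
```

Because this term sees the full grids rather than per-view projections, it can preserve global structure without directly constraining the 2D appearance that the SDS loss is free to edit.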
- Etai Sella
- Gal Fiebelman
- Peter Hedman
- Hadar Averbuch-Elor