Free-Editor: Zero-shot Text-driven 3D Scene Editing (2312.13663v2)
Abstract: Text-to-Image (T2I) diffusion models have recently gained traction for their versatility and user-friendliness in 2D content generation and editing. However, training a diffusion model specifically for 3D scene editing is challenging due to the scarcity of large-scale 3D datasets. Currently, editing a 3D scene requires either retraining the model to accommodate various 3D edits or developing a method tailored to each editing type. Moreover, state-of-the-art (SOTA) techniques require multiple synchronized edited images of the same scene to perform scene editing effectively. Given the current limitations of T2I models, achieving consistent edits across multiple images remains difficult, leading to multi-view inconsistency that undermines 3D scene editing when these images are used. In this study, we introduce Free-Editor, a novel training-free 3D scene editing technique that lets users edit 3D scenes without model retraining at test time. Our method addresses the multi-view style inconsistency of SOTA methods through a single-view editing scheme: we show that a 3D scene can be edited by modifying only a single view. To facilitate this, we present an Edit Transformer that enforces intra-view consistency and inter-view style transfer via self-view and cross-view attention, respectively. By eliminating model retraining and multi-view editing, our approach significantly reduces editing time and memory requirements, achieving runtimes roughly 20 times faster than SOTA methods. Extensive experiments on benchmark datasets showcase the diverse editing capabilities of our technique.
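The Edit Transformer couples two attention paths: self-view attention among the tokens of one view for intra-view consistency, and cross-view attention from target-view tokens to the single edited view for inter-view style transfer. The abstract gives no code, so the following is a minimal PyTorch sketch of that idea only; the module name, feature shapes, and layer sizes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the self-view / cross-view attention scheme described
# in the abstract. Token features are assumed to have shape
# (batch, tokens, dim); all names here are hypothetical.
import torch
import torch.nn as nn

class EditTransformerBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Self-view attention: tokens of the target view attend to each
        # other, promoting intra-view consistency.
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-view attention: target-view queries attend to keys/values
        # from the single edited view, transferring its style.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_self = nn.LayerNorm(dim)
        self.norm_cross = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, target_tokens: torch.Tensor,
                edited_tokens: torch.Tensor) -> torch.Tensor:
        # Intra-view consistency via self-attention with a residual path.
        x = self.norm_self(target_tokens)
        x = target_tokens + self.self_attn(x, x, x)[0]
        # Inter-view style transfer: queries from the target view,
        # keys/values from the edited starting view.
        q, kv = self.norm_cross(x), self.norm_cross(edited_tokens)
        x = x + self.cross_attn(q, kv, kv)[0]
        return x + self.ffn(self.norm_ffn(x))

# Usage: render features of a novel target view conditioned on the one
# edited view, so only a single view ever needs to be edited.
block = EditTransformerBlock()
target = torch.randn(1, 1024, 256)   # tokens of a novel target view
edited = torch.randn(1, 1024, 256)   # tokens of the single edited view
out = block(target, edited)          # -> (1, 1024, 256)
```

Because every target view draws its style from the same edited view, the style source is fixed, which is how a single-view scheme can sidestep the multi-view inconsistency of editing each view independently.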