Consolidating Attention Features for Multi-view Image Editing (2402.14792v1)
Abstract: Large-scale text-to-image models enable a wide range of image editing techniques, using text prompts or even spatial controls. However, applying these editing methods to multi-view images depicting a single scene leads to 3D-inconsistent results. In this work, we focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views. We build on two insights: (1) maintaining consistent features throughout the generative process helps attain consistency in multi-view editing, and (2) the queries in self-attention layers significantly influence the image structure. Hence, we propose to improve the geometric consistency of the edited images by enforcing the consistency of the queries. To do so, we introduce QNeRF, a neural radiance field trained on the internal query features of the edited images. Once trained, QNeRF can render 3D-consistent queries, which are then softly injected back into the self-attention layers during generation, greatly improving multi-view consistency. We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps. We compare our method to a range of existing techniques and demonstrate that it achieves better multi-view consistency and higher fidelity to the input scene. These advantages allow us to train NeRFs that have fewer visual artifacts and are better aligned with the target geometry.
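The idea of steering self-attention queries toward 3D-consistent ones can be illustrated with a small NumPy sketch. The abstract does not say how the soft injection is carried out, so the linear blend below is only an assumed stand-in, not the authors' mechanism. The function names, the `alpha` blending weight, and the array shapes are all hypothetical, and `q_rendered` merely plays the role of queries that a trained QNeRF might render for one view.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_soft_query(q_gen, k, v, q_rendered, alpha=0.5):
    """Single-head self-attention with queries pulled toward consolidated ones.

    q_gen:      (tokens, dim) queries produced by the diffusion model for one view
    k, v:       (tokens, dim) keys and values of the same layer
    q_rendered: (tokens, dim) stand-in for 3D-consistent queries for that view
    alpha:      hypothetical blend weight; 0 keeps q_gen, 1 fully replaces it

    ASSUMPTION: a linear blend is used here purely for illustration; the
    source does not specify the form of the soft injection.
    """
    q = (1.0 - alpha) * q_gen + alpha * q_rendered
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    return softmax(scores, axis=-1) @ v      # attention-weighted values


# Toy usage with random features: 6 tokens, 4 channels.
rng = np.random.default_rng(0)
q_gen, k, v, q_rendered = (rng.standard_normal((6, 4)) for _ in range(4))
out = self_attention_soft_query(q_gen, k, v, q_rendered, alpha=0.3)
```

With `alpha=0` the layer behaves like ordinary self-attention, and with `alpha=1` the generated queries are fully replaced. Intermediate values let the keys and values of each view stay untouched while the structure-defining queries move toward a shared 3D-consistent target.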