Consolidating Attention Features for Multi-view Image Editing (2402.14792v1)

Published 22 Feb 2024 in cs.CV, cs.GR, and cs.LG

Abstract: Large-scale text-to-image models enable a wide range of image editing techniques, using text prompts or even spatial controls. However, applying these editing methods to multi-view images depicting a single scene leads to 3D-inconsistent results. In this work, we focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views. We build on two insights: (1) maintaining consistent features throughout the generative process helps attain consistency in multi-view editing, and (2) the queries in self-attention layers significantly influence the image structure. Hence, we propose to improve the geometric consistency of the edited images by enforcing the consistency of the queries. To do so, we introduce QNeRF, a neural radiance field trained on the internal query features of the edited images. Once trained, QNeRF can render 3D-consistent queries, which are then softly injected back into the self-attention layers during generation, greatly improving multi-view consistency. We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps. We compare our method to a range of existing techniques and demonstrate that it can achieve better multi-view consistency and higher fidelity to the input scene. These advantages allow us to train NeRFs with fewer visual artifacts, that are better aligned with the target geometry.

References (60)
  1. Cross-image attention for zero-shot appearance transfer, 2023.
  2. SINE: Semantic-driven image-based NeRF editing with prior-guided editing field. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  3. NeuMesh: Learning disentangled neural mesh-based implicit field for geometry and texture editing. In European Conference on Computer Vision (ECCV), 2022.
  4. LooseControl: Lifting ControlNet for generalized depth conditioning, 2023.
  5. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.
  6. SEGA: Instructing text-to-image models using semantic guidance, 2023.
  7. InstructPix2Pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022.
  8. MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing, 2023a.
  9. TexFusion: Synthesizing 3D textures with text-guided image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023b.
  10. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  11. Efficient geometry-aware 3D generative adversarial networks. arXiv preprint, 2021.
  12. Attend-and-Excite: Attention-based semantic guidance for text-to-image diffusion models, 2023.
  13. Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018.
  14. Depth-supervised NeRF: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  15. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  16. Diffusion self-guidance for controllable image generation, 2023.
  17. PIE-NeRF: Physics-based interactive elastodynamics with NeRF, 2023.
  18. Expressive text-to-image generation with rich text. In IEEE International Conference on Computer Vision (ICCV), 2023.
  19. TokenFlow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.
  20. Instruct-NeRF2NeRF: Editing 3D scenes with instructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  21. Prompt-to-Prompt image editing with cross attention control, 2022.
  22. Style aligned image generation via shared attention, 2023.
  23. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
  24. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  25. An edit friendly DDPM noise space: Inversion and manipulations, 2023.
  26. MAS: Multi-view ancestral sampling for 3D motion generation using 2D diffusion, 2023.
  27. Imagic: Text-based real image editing with diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  28. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
  29. LERF: Language embedded radiance fields. In International Conference on Computer Vision (ICCV), 2023.
  30. LatentEditor: Text driven local editing of 3D scenes, 2023.
  31. Collaborative score distillation for consistent visual synthesis, 2023.
  32. Segment Anything. arXiv preprint arXiv:2304.02643, 2023.
  33. Decomposing NeRF for editing via feature field distillation. In Advances in Neural Information Processing Systems, 2022.
  34. Posterior distillation sampling. arXiv preprint arXiv:2311.13831, 2023.
  35. Faster Diffusion: Rethinking the role of the UNet encoder in diffusion models, 2023.
  36. Editing conditional radiance fields. In Proceedings of the International Conference on Computer Vision (ICCV), 2021.
  37. Nelson Max. Optical models for direct volume rendering. IEEE Transactions on Visualization and Computer Graphics, 1(2):99–108, 1995.
  38. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
  39. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  40. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics, 41(4):102:1–102:15, 2022.
  41. GIRAFFE: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  42. Zero-shot image-to-image translation. In ACM SIGGRAPH Conference Proceedings, 2023.
  43. Localizing object-level shape variations with text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  44. TEXTure: Text-guided texturing of 3D shapes. In ACM SIGGRAPH Conference Proceedings, 2023.
  45. High-resolution image synthesis with latent diffusion models, 2021.
  46. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241. Springer, 2015.
  47. Vox-E: Text-guided voxel editing of 3D objects, 2023.
  48. Language-driven object fusion into neural radiance fields with pose-conditioned dataset updates, 2023.
  49. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
  50. Efficient-NeRF2NeRF: Streamlining text-driven 3D editing with multiview correspondence-enhanced diffusion models. arXiv preprint arXiv:2312.08563, 2023.
  51. Nerfstudio: A modular framework for neural radiance field development. In ACM SIGGRAPH Conference Proceedings, 2023.
  52. Neural Feature Fusion Fields: 3D distillation of self-supervised 2D image representations. In Proceedings of the International Conference on 3D Vision (3DV), 2022.
  53. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1921–1930, 2023.
  54. CLIP-NeRF: Text-and-image driven manipulation of neural radiance fields. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022a.
  55. NeRF-Art: Text-driven neural radiance fields stylization. arXiv preprint arXiv:2212.08070, 2022b.
  56. ReconFusion: 3D reconstruction with diffusion priors. arXiv preprint, 2023.
  57. Deforming radiance fields with cages. In ECCV, 2022.
  58. NeRF-Editing: Geometry editing of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18353–18364, 2022.
  59. Adding conditional control to text-to-image diffusion models, 2023.
  60. DreamEditor: Text-driven 3D scene editing with neural fields. arXiv preprint arXiv:2306.13455, 2023.

Summary

  • The paper introduces QNeRF, which consolidates query features during diffusion to ensure consistent multi-view edits.
  • The method leverages progressive refinement of self-attention features for precise 3D geometric control.
  • Benchmark tests reveal lower KID and FID scores and improved user ratings, demonstrating superior visual fidelity.

Consolidating Attention Features for Multi-view Image Editing

The paper "Consolidating Attention Features for Multi-view Image Editing" addresses the challenge of achieving consistent multi-view image editing under 3D geometric control. Editing techniques built on large-scale text-to-image models often fail to produce consistent results when applied to multiple images of the same scene, especially when the edits involve complex geometric adjustments. The authors propose an approach that consolidates attention features in diffusion models to maintain consistency across views during spatial control-based geometric manipulations.

Summary of Methodology

The method builds on two key insights: maintaining consistent features throughout the generative process promotes consistency in multi-view editing, and the queries in self-attention layers strongly influence image structure. The authors therefore introduce QNeRF (Query Neural Radiance Field), a neural radiance field trained on the internal query features of the self-attention layers across views. Once trained, QNeRF renders 3D-consistent queries, which are softly injected back into the self-attention layers during generation. A progressive, iterative refinement scheme alternates between denoising the views and refitting QNeRF on the freshly produced queries, consolidating the query features across diffusion timesteps so that a single consistent representation guides all edited images.
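
To make the iterative consolidation concrete, the sketch below outlines the control flow only; it is not the authors' implementation. `denoise_step`, `fit_qnerf`, and `render_queries` are hypothetical stand-ins for a diffusion denoising step that captures self-attention queries, QNeRF optimization, and QNeRF rendering, and the blending weight `alpha` for soft injection is likewise an assumption.

```python
import torch
from typing import Callable, Dict, List

def soft_inject(q_gen: torch.Tensor, q_rendered: torch.Tensor,
                alpha: float) -> torch.Tensor:
    """Softly blend the denoiser's own self-attention queries with
    the 3D-consistent queries rendered by QNeRF."""
    return (1.0 - alpha) * q_gen + alpha * q_rendered

def consolidate(views: List, timesteps: List[int],
                denoise_step: Callable, fit_qnerf: Callable,
                render_queries: Callable, alpha: float = 0.5) -> Dict:
    """One progressive pass over the diffusion timesteps.

    At each timestep: render 3D-consistent queries for every view,
    inject them into the self-attention layers while denoising, then
    refit QNeRF on the queries the denoiser actually produced.
    """
    qnerf = None  # no query field exists yet at the first timestep
    latents = {v.id: v.noisy_latent for v in views}
    for t in timesteps:
        captured = {}  # query features captured at this timestep
        for v in views:
            q_ref = render_queries(qnerf, v.camera) if qnerf else None
            # The hook runs inside the self-attention layers; with no
            # field yet, the queries pass through unchanged.
            hook = (lambda q, r=q_ref: soft_inject(q, r, alpha)
                    if r is not None else q)
            latents[v.id], captured[v.id] = denoise_step(
                latents[v.id], t, inject=hook)
        # Refit the query field so the next timestep sees a single,
        # multi-view-consistent source of queries.
        qnerf = fit_qnerf(qnerf, captured, [v.camera for v in views])
    return latents
```

The design choice mirrored here is that the radiance field is fit to query features rather than colors, so it consolidates the structure of the edit across views rather than its appearance.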

Key Numerical Results and Comparisons

The authors compare their method against several baselines: Instruct-NeRF2NeRF (IN2N) enhanced with ControlNet, collaborative score distillation (CSD), and the video-editing method TokenFlow, each adapted to incorporate ControlNet-style spatial controls. Their approach achieves better multi-view consistency and higher fidelity to the input scene. Quantitatively, it yields lower Kernel Inception Distance (KID) and Fréchet Inception Distance (FID) scores than the alternatives, and in user studies it was the preferred choice for both alignment with the target geometry and visual quality of the resulting 3D representations.
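
The metrics themselves are standard. As an illustration only, the sketch below shows one way to compute FID and KID with the torchmetrics library, treating renderings of the input scene as the reference distribution and the edited views as the generated one; the function name, tensor shapes, and subset size are assumptions, not details from the paper.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

def fidelity_scores(reference: torch.Tensor, edited: torch.Tensor):
    """Compute FID and KID between two uint8 image batches of shape
    (N, 3, H, W), e.g. input-scene views vs. edited views."""
    fid = FrechetInceptionDistance(feature=2048)
    # KID is estimated over random subsets; keep subset_size <= N.
    kid = KernelInceptionDistance(subset_size=min(50, reference.shape[0]))
    for metric in (fid, kid):
        metric.update(reference, real=True)
        metric.update(edited, real=False)
    kid_mean, kid_std = kid.compute()
    return fid.compute().item(), kid_mean.item(), kid_std.item()
```

Lower values of both metrics indicate that the edited views stay closer to the image statistics of the input scene, which is how the paper operationalizes fidelity.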

Implications and Future Directions

The proposed method has significant implications for 3D consistency in multi-view image editing, particularly for applications requiring precise geometric alterations. By enforcing consistency of attention-based features, the approach addresses a fundamental limitation of current multi-view editing techniques and opens avenues for applying 3D-consistent editing in areas such as virtual reality content creation, scene modeling, and interactive design.

Future research could explore further refinement of feature consistency, possibly integrating higher-resolution feature alignment to better handle detailed textures and backgrounds. Alternative 3D representations, such as Gaussian splats, could offer different consolidation mechanisms and improved computational efficiency. Additionally, extending the framework to dynamic scenes or real-time editing could broaden the applicability and robustness of multi-view image editing models.

Overall, this paper presents a substantial contribution to the field of multi-view image editing, offering a path forward in achieving consistent geometric transformations across multiple views through innovative attention feature consolidation techniques.