
DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing (2403.14487v1)

Published 21 Mar 2024 in cs.CV

Abstract: Precise image editing has recently attracted increasing attention, especially given the remarkable success of text-to-image generation models. To unify various spatial-aware image editing abilities in one framework, we adopt the concept of layers from the design domain to manipulate objects flexibly with various operations. The key insight is to decompose the spatial-aware image editing task into two sub-tasks: multi-layered latent decomposition and multi-layered latent fusion. First, we segment the latent representations of the source images into multiple layers, comprising several object layers and one incomplete background layer that requires reliable inpainting. To avoid extra tuning, we exploit the inpainting ability inherent in the self-attention mechanism: we introduce a key-masking self-attention scheme that propagates surrounding context into the masked region while mitigating its influence on the regions outside the mask. Second, we propose an instruction-guided latent fusion that pastes the multi-layered latent representations onto a canvas latent, together with an artifact-suppression scheme in the latent space that enhances inpainting quality. Owing to the inherent modularity of such multi-layered representations, we achieve accurate image editing and consistently surpass the latest spatial editing methods, including Self-Guidance and DiffEditor. Finally, we show that our approach is a unified framework supporting more than six different accurate image editing tasks.
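The two sub-tasks described in the abstract can be illustrated with a compact sketch. The snippet below is a hypothetical, heavily simplified NumPy illustration, not the paper's implementation: `key_masking_attention` blocks queries outside the mask from attending to keys inside the hole, so surrounding context can flow into the masked region without the unfinished region leaking back out, and `fuse_layers` pastes object-layer latents onto a background canvas. All function names, shapes, and the single-head formulation are assumptions made for this sketch.

```python
import numpy as np

def key_masking_attention(q, k, v, hole):
    """Single-head attention sketch of a key-masking scheme (hypothetical
    simplification). `hole` is a boolean vector marking tokens in the
    region to be inpainted. Queries inside the hole attend everywhere,
    pulling in surrounding context; queries outside the hole cannot
    attend to keys inside it, so the unfinished region does not
    contaminate the preserved background."""
    d = q.shape[-1]
    logits = (q @ k.T) / np.sqrt(d)           # (n, n) token-to-token scores
    logits[np.ix_(~hole, hole)] = -np.inf     # hide hole keys from outside queries
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                              # (n, d) attended values

def fuse_layers(background, layers):
    """Paste object-layer latents (h, w, c) onto a copy of the background
    canvas, back to front, using each layer's boolean (h, w) mask."""
    canvas = background.copy()
    for latent, mask in layers:
        canvas[mask] = latent[mask]
    return canvas
```

In this toy setting, perturbing the values inside the hole leaves the outputs of all outside tokens unchanged, which is the one-directional context flow the key-masking scheme is meant to enforce.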

References (39)
  1. Blended diffusion for text-driven editing of natural images. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022.
  2. Blended latent diffusion. ACM Transactions on Graphics (TOG), 42(4):1–11, 2023.
  3. Improving image generation with better captions. Online, 2023. Accessed: 2024-01-03.
  4. InstructPix2Pix: Learning to follow image editing instructions, 2023.
  5. MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. arXiv preprint arXiv:2304.08465, 2023.
  6. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models, 2023.
  7. Training-free layout control with cross-attention guidance. arXiv preprint arXiv:2304.03373, 2023a.
  8. Subject-driven text-to-image generation via apprenticeship learning, 2023b.
  9. Diffusion self-guidance for controllable image generation. arXiv preprint arXiv:2306.00986, 2023.
  10. Vector quantized diffusion model for text-to-image synthesis, 2022.
  11. Improving tuning-free real image editing with proximal guidance. CoRR, 2023.
  12. Prompt-to-prompt image editing with cross attention control, 2022.
  13. Style aligned image generation via shared attention, 2024.
  14. GANs trained by a two time-scale update rule converge to a local Nash equilibrium, 2018.
  15. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
  16. DragonDiffusion: Enabling drag-style manipulation on diffusion models. arXiv preprint arXiv:2307.02421, 2023.
  17. DiffEditor: Boosting accuracy and flexibility on diffusion-based image editing. arXiv preprint arXiv:2402.02583, 2024.
  18. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models, 2022.
  19. OpenAI. DALL·E 3 System Card. Online, 2023. Accessed: 2024-01-03.
  20. Drag your gan: Interactive point-based manipulation on the generative image manifold. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
  21. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  22. Learning transferable visual models from natural language supervision, 2021.
  23. Hierarchical text-conditional image generation with clip latents, 2022.
  24. High-resolution image synthesis with latent diffusion models, 2022.
  25. U-Net: Convolutional networks for biomedical image segmentation, 2015.
  26. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2023.
  27. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  28. DragDiffusion: Harnessing diffusion models for interactive point-based image editing. arXiv preprint arXiv:2306.14435, 2023.
  29. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  30. Resolution-robust large mask inpainting with fourier convolutions. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2149–2159, 2022.
  31. Attention is all you need, 2023.
  32. BoxDiff: Text-to-image synthesis with training-free box-constrained diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7452–7461, 2023.
  33. Uni-paint: A unified framework for multimodal image inpainting with pretrained diffusion model. In Proceedings of the 31st ACM International Conference on Multimedia. ACM, 2023a.
  34. The dawn of LMMs: Preliminary explorations with GPT-4V(ision), 2023b.
  35. FreeDoM: Training-free energy-guided conditional diffusion model, 2023.
  36. MagicBrush: A manually annotated dataset for instruction-guided image editing, 2023a.
  37. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023b.
  38. The unreasonable effectiveness of deep features as a perceptual metric, 2018.
  39. EGSDE: Unpaired image-to-image translation via energy-guided stochastic differential equations, 2022.
Authors (7)
  1. Yueru Jia
  2. Yuhui Yuan
  3. Aosong Cheng
  4. Chuke Wang
  5. Ji Li
  6. Huizhu Jia
  7. Shanghang Zhang
Citations (3)