ZONE: Zero-Shot Instruction-Guided Local Editing (2312.16794v2)

Published 28 Dec 2023 in cs.CV

Abstract: Recent advances in vision-language models like Stable Diffusion have shown remarkable power in creative image synthesis and editing. However, most existing text-to-image editing methods encounter two obstacles: first, the text prompt needs to be carefully crafted to achieve good results, which is not intuitive or user-friendly; second, they are insensitive to local edits and can irreversibly affect non-edited regions, leaving obvious editing traces. To tackle these problems, we propose a Zero-shot instructiON-guided local image Editing approach, termed ZONE. We first convert the editing intent from the user-provided instruction (e.g., "make his tie blue") into specific image editing regions through InstructPix2Pix. We then propose a Region-IoU scheme for precise image layer extraction from an off-the-shelf segmentation model. We further develop an FFT-based edge smoother for seamless blending between the layer and the image. Our method allows for arbitrary manipulation of a specific region with a single instruction while preserving the rest. Extensive experiments demonstrate that ZONE achieves remarkable local editing results and user-friendliness, outperforming state-of-the-art methods. Code is available at https://github.com/lsl001006/ZONE.
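
The first stage described in the abstract, turning an instruction into candidate edit regions, can be approximated with the public InstructPix2Pix checkpoint. Below is a minimal sketch using the real diffusers `StableDiffusionInstructPix2PixPipeline`; the input path and the pixel-difference heuristic for locating the edit are illustrative assumptions, not ZONE's actual localization mechanism (see the linked repository for that).

```python
# Sketch: instruction-guided editing plus a crude edit-region proxy.
# The pipeline class and checkpoint are real; the difference-based
# mask below is an illustrative stand-in, not ZONE's own method.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.png").convert("RGB")  # hypothetical input path
edited = pipe("make his tie blue", image=image).images[0]
edited = edited.resize(image.size)  # guard: output size can differ slightly

# Proxy for "where did the instruction take effect": pixels whose mean
# channel difference is well above the image-wide average.
diff = np.abs(
    np.asarray(edited, np.float32) - np.asarray(image, np.float32)
).mean(axis=-1)
edit_mask = diff > diff.mean() + 2 * diff.std()
```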

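Given candidate masks from an off-the-shelf segmenter such as SAM, the remaining two stages can be sketched as below. Both helpers, `region_iou` and `fft_smooth`, and the `cutoff` value are hypothetical reconstructions of the abstract's description (a Region-IoU layer selection followed by FFT-based low-pass smoothing of the mask edge), not the paper's exact formulas.

```python
# Sketch: Region-IoU layer selection and FFT-based edge smoothing,
# reconstructed from the abstract's description (hypothetical helpers).
import numpy as np

def region_iou(seg_mask: np.ndarray, edit_mask: np.ndarray) -> float:
    """Overlap score between one segmentation mask and the edit region."""
    inter = np.logical_and(seg_mask, edit_mask).sum()
    union = np.logical_or(seg_mask, edit_mask).sum()
    return float(inter) / float(union) if union else 0.0

def fft_smooth(mask: np.ndarray, cutoff: float = 0.05) -> np.ndarray:
    """Low-pass the binary mask in the frequency domain to soften its edge."""
    f = np.fft.fftshift(np.fft.fft2(mask.astype(np.float32)))
    h, w = mask.shape
    yy, xx = np.mgrid[-(h // 2):h - h // 2, -(w // 2):w - w // 2]
    keep = (np.hypot(yy / h, xx / w) <= cutoff).astype(np.float32)
    soft = np.fft.ifft2(np.fft.ifftshift(f * keep)).real
    return np.clip(soft, 0.0, 1.0)

# seg_masks: list of boolean masks, e.g. from Segment Anything.
# best = max(seg_masks, key=lambda m: region_iou(m, edit_mask))
# alpha = fft_smooth(best)[..., None]
# result = alpha * np.asarray(edited, np.float32) \
#        + (1 - alpha) * np.asarray(image, np.float32)
```
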
References (63)
  1. Blended diffusion for text-driven editing of natural images. In CVPR, 2022.
  2. Blended latent diffusion. TOG, 2023.
  3. Text2live: Text-driven layered image and video editing. In ECCV, 2022.
  4. End-to-end conditional gan-based architectures for image colourisation. In MMSPW, 2019.
  5. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023.
  6. Language models are few-shot learners. In NeurIPS, 2020.
  7. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
  8. Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022.
  9. Vqgan-clip: Open domain image generation and editing with natural language guidance. In ECCV, 2022.
  10. Tuning-free inversion-enhanced control for consistent image editing. arXiv preprint arXiv:2312.14611, 2023.
  11. Tell, draw, and repeat: Generating and modifying images based on continual linguistic instruction. In CVPR, 2019.
  12. Stylegan-nada: Clip-guided domain adaptation of image generators. TOG, 2022.
  13. Implicit diffusion models for continuous super-resolution. In CVPR, 2023.
  14. Pair-diffusion: Object-level image editing with structure-and-appearance paired diffusion models. arXiv preprint arXiv:2303.17546, 2023.
  15. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  16. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  17. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  18. Cascaded diffusion models for high fidelity image generation. JMLR, 2022.
  19. Globally and locally consistent image completion. TOG, 2017.
  20. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  21. Scaling up gans for text-to-image synthesis. In CVPR, 2023.
  22. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
  23. Analyzing and improving the image quality of stylegan. In CVPR, 2020.
  24. Imagic: Text-based real image editing with diffusion models. In CVPR, 2023.
  25. Diffusionclip: Text-guided diffusion models for robust image manipulation. In CVPR, 2022.
  26. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.
  27. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  28. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  29. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  30. Clipstyler: Image style transfer with a single text condition. In CVPR, 2022.
  31. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
  32. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.
  33. Diffcolor: Toward high fidelity text-guided image colorization with diffusion models. arXiv preprint arXiv:2308.01655, 2023.
  34. Repaint: Inpainting using denoising diffusion probabilistic models. In CVPR, 2022.
  35. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
  36. Self-distilled stylegan: Towards generation from internet photos. In SIGGRAPH, 2022.
  37. Image colorization using generative adversarial networks. In AMDO, 2018.
  38. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022.
  39. Training language models to follow instructions with human feedback. In NeurIPS, 2022.
  40. Zero-shot image-to-image translation. In SIGGRAPH, 2023.
  41. Context encoders: Feature learning by inpainting. In CVPR, 2016.
  42. Learning transferable visual models from natural language supervision. In ICML, 2021.
  43. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  44. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  45. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.
  46. Palette: Image-to-image diffusion models. In SIGGRAPH, 2022.
  47. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
  48. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
  49. Styledrop: Text-to-image generation in any style. arXiv preprint arXiv:2306.00983, 2023.
  50. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  51. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019.
  52. Plug-and-play diffusion features for text-driven image-to-image translation. In CVPR, 2023.
  53. Unsupervised deep exemplar colorization via pyramid dual non-local attention. TIP, 2023.
  54. Pretraining is all you need for image-to-image translation. arXiv preprint arXiv:2205.12952, 2022.
  55. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
  56. Resshift: Efficient diffusion model for image super-resolution by residual shifting. arXiv preprint arXiv:2307.12348, 2023.
  57. Ipdreamer: Appearance-controllable 3d object generation with image prompts. arXiv preprint arXiv:2310.05375, 2023.
  58. Controllable mind visual diffusion model. arXiv preprint arXiv:2305.10135, 2023.
  59. Magicbrush: A manually annotated dataset for instruction-guided image editing. arXiv preprint arXiv:2306.10012, 2023.
  60. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  61. Hive: Harnessing human feedback for instructional visual editing. arXiv preprint arXiv:2303.09618, 2023.
  62. Text as neural operator: Image manipulation by text instruction. In ACMMM, 2021.
  63. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
Authors (11)
  1. Shanglin Li (7 papers)
  2. Bohan Zeng (19 papers)
  3. Yutang Feng (2 papers)
  4. Sicheng Gao (5 papers)
  5. Xuhui Liu (17 papers)
  6. Jiaming Liu (156 papers)
  7. Li Lin (91 papers)
  8. Xu Tang (48 papers)
  9. Yao Hu (106 papers)
  10. Jianzhuang Liu (90 papers)
  11. Baochang Zhang (113 papers)
Citations (18)