ZONE: Zero-Shot Instruction-Guided Local Editing (2312.16794v2)
Abstract: Recent advances in vision-language models like Stable Diffusion have shown remarkable power in creative image synthesis and editing. However, most existing text-to-image editing methods encounter two obstacles: first, the text prompt needs to be carefully crafted to achieve good results, which is neither intuitive nor user-friendly; second, they are insensitive to local edits and can irreversibly affect non-edited regions, leaving obvious editing traces. To tackle these problems, we propose a Zero-shot instructiON-guided local image Editing approach, termed ZONE. We first convert the editing intent from the user-provided instruction (e.g., "make his tie blue") into specific image editing regions through InstructPix2Pix. We then propose a Region-IoU scheme for precise image layer extraction with an off-the-shelf segmentation model. We further develop an FFT-based edge smoother for seamless blending between the layer and the image. Our method allows for arbitrary manipulation of a specific region with a single instruction while preserving the rest. Extensive experiments demonstrate that our ZONE achieves remarkable local editing results and user-friendliness, outperforming state-of-the-art methods. Code is available at https://github.com/lsl001006/ZONE.
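The abstract compresses a three-stage pipeline: localize the edit region from the instruction, extract a precise image layer by matching that region against segmentation masks, and blend the edited layer back with softened edges. The sketch below is a minimal NumPy illustration of those stages under stated assumptions, not the released implementation (see the GitHub link above): the function names (`region_iou`, `pick_layer`, `fft_smooth`, `blend`), the overlap normalization, the 0.5 selection threshold, and the `keep_frac` frequency cutoff are all hypothetical. In practice the coarse edit mask would come from InstructPix2Pix and the candidate segments from Segment Anything.

```python
import numpy as np

def region_iou(edit_mask: np.ndarray, seg_mask: np.ndarray) -> float:
    """Score a candidate segment against the coarse edit region.
    Assumption: overlap normalized by the segment's own area, so small
    segments fully inside the edit region score highly."""
    inter = np.logical_and(edit_mask, seg_mask).sum()
    return float(inter) / max(int(seg_mask.sum()), 1)

def pick_layer(edit_mask: np.ndarray, sam_masks: list,
               thresh: float = 0.5) -> np.ndarray:
    """Union of all segments whose score clears a (hypothetical)
    threshold; this union forms the editable image layer."""
    layer = np.zeros_like(edit_mask, dtype=bool)
    for m in sam_masks:
        if region_iou(edit_mask, m) > thresh:
            layer |= m
    return layer

def fft_smooth(mask: np.ndarray, keep_frac: float = 0.1) -> np.ndarray:
    """Soften the binary layer mask by zeroing high frequencies in the
    2D FFT, yielding a smooth alpha matte for seamless blending."""
    F = np.fft.fftshift(np.fft.fft2(mask.astype(np.float64)))
    h, w = mask.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    F[radius > keep_frac * min(h, w)] = 0.0   # low-pass filter
    soft = np.fft.ifft2(np.fft.ifftshift(F)).real
    return np.clip(soft, 0.0, 1.0)

def blend(edited: np.ndarray, original: np.ndarray,
          soft_mask: np.ndarray) -> np.ndarray:
    """Alpha-composite the edited layer over the untouched original,
    so pixels far from the layer are preserved exactly."""
    a = soft_mask[..., None]
    return a * edited + (1.0 - a) * original
```

A toy run wires the stages together; here the "SAM" masks are random stand-ins rather than real segmentation output:

```python
h = w = 64
edit = np.zeros((h, w), dtype=bool)
edit[20:40, 20:40] = True                     # coarse edit region
sam_masks = [edit.copy(), ~edit]              # stand-in segmentation
soft = fft_smooth(pick_layer(edit, sam_masks))
rng = np.random.default_rng(0)
out = blend(rng.random((h, w, 3)), rng.random((h, w, 3)), soft)
```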
- Blended diffusion for text-driven editing of natural images. In CVPR, 2022.
- Blended latent diffusion. TOG, 2023.
- Text2LIVE: Text-driven layered image and video editing. In ECCV, 2022.
- End-to-end conditional GAN-based architectures for image colourisation. In MMSPW, 2019.
- InstructPix2Pix: Learning to follow image editing instructions. In CVPR, 2023.
- Language models are few-shot learners. In NeurIPS, 2020.
- StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
- DiffEdit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022.
- VQGAN-CLIP: Open domain image generation and editing with natural language guidance. In ECCV, 2022.
- Tuning-free inversion-enhanced control for consistent image editing. arXiv preprint arXiv:2312.14611, 2023.
- Tell, draw, and repeat: Generating and modifying images based on continual linguistic instruction. In CVPR, 2019.
- StyleGAN-NADA: CLIP-guided domain adaptation of image generators. TOG, 2022.
- Implicit diffusion models for continuous super-resolution. In CVPR, 2023.
- PAIR-Diffusion: Object-level image editing with structure-and-appearance paired diffusion models. arXiv preprint arXiv:2303.17546, 2023.
- Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
- Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- Denoising diffusion probabilistic models. In NeurIPS, 2020.
- Cascaded diffusion models for high fidelity image generation. JMLR, 2022.
- Globally and locally consistent image completion. TOG, 2017.
- Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
- Scaling up GANs for text-to-image synthesis. In CVPR, 2023.
- A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
- Analyzing and improving the image quality of StyleGAN. In CVPR, 2020.
- Imagic: Text-based real image editing with diffusion models. In CVPR, 2023.
- DiffusionCLIP: Text-guided diffusion models for robust image manipulation. In CVPR, 2022.
- Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- Segment anything. arXiv preprint arXiv:2304.02643, 2023.
- CLIPstyler: Image style transfer with a single text condition. In CVPR, 2022.
- Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
- BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.
- DiffColor: Toward high fidelity text-guided image colorization with diffusion models. arXiv preprint arXiv:2308.01655, 2023.
- RePaint: Inpainting using denoising diffusion probabilistic models. In CVPR, 2022.
- SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
- Self-distilled StyleGAN: Towards generation from internet photos. In SIGGRAPH, 2022.
- Image colorization using generative adversarial networks. In AMDO, 2018.
- GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022.
- Training language models to follow instructions with human feedback. In NeurIPS, 2022.
- Zero-shot image-to-image translation. In SIGGRAPH, 2023.
- Context encoders: Feature learning by inpainting. In CVPR, 2016.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
- High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.
- Palette: Image-to-image diffusion models. In SIGGRAPH, 2022.
- Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
- Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
- StyleDrop: Text-to-image generation in any style. arXiv preprint arXiv:2306.00983, 2023.
- Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
- Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019.
- Plug-and-play diffusion features for text-driven image-to-image translation. In CVPR, 2023.
- Unsupervised deep exemplar colorization via pyramid dual non-local attention. TIP, 2023.
- Pretraining is all you need for image-to-image translation. arXiv preprint arXiv:2205.12952, 2022.
- Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
- ResShift: Efficient diffusion model for image super-resolution by residual shifting. arXiv preprint arXiv:2307.12348, 2023.
- IPDreamer: Appearance-controllable 3D object generation with image prompts. arXiv preprint arXiv:2310.05375, 2023.
- Controllable mind visual diffusion model. arXiv preprint arXiv:2305.10135, 2023.
- MagicBrush: A manually annotated dataset for instruction-guided image editing. arXiv preprint arXiv:2306.10012, 2023.
- The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
- HIVE: Harnessing human feedback for instructional visual editing. arXiv preprint arXiv:2303.09618, 2023.
- Text as neural operator: Image manipulation by text instruction. In ACMMM, 2021.
- Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
Authors: Shanglin Li, Bohan Zeng, Yutang Feng, Sicheng Gao, Xuhui Liu, Jiaming Liu, Li Lin, Xu Tang, Yao Hu, Jianzhuang Liu, Baochang Zhang