This paper introduces the Region-Aware Diffusion Model (RDM), a novel framework for zero-shot, text-driven image editing at the entity level. Unlike previous methods, which often require manual masks, RDM automatically identifies the region of interest from a positioning text prompt and modifies it according to a target text prompt.
The core components of RDM are:
- Intensive Diffusion Model: To balance fidelity and speed, RDM runs the diffusion process in the latent space of a pre-trained autoencoder (VAE), similar to Latent Diffusion Models (LDMs). It augments this process with Enhanced Directional Guidance, a modified classifier-free guidance mechanism that steers generation more strongly towards the target text by amplifying the difference between the conditional (text-guided) and unconditional noise predictions, improving image realism and text-image semantic consistency (a guidance sketch follows this list).
- Regional-aware Entity Editing: This module handles the spatial aspects of the edit.
- Cross-modal Entity-level Calibration: Uses a pre-trained CLIP model (ViT-B/16) and a lightweight segmentation decoder. Given the positioning text, it processes the input image's visual features together with the text embedding to generate a binary segmentation mask that identifies the entity to be edited.
- Region of Interest Synthesizing: Guides the diffusion process within the generated mask using a CLIP-based loss that minimizes the cosine distance between the CLIP embedding of the masked generated region and the CLIP embedding of the target text (see the loss sketch after this list).
- Region out of Interest Preserving (NERP, i.e., non-edited region preservation): To prevent unwanted changes to the background, RDM incorporates two strategies:
- Latent Blending: At each denoising step t, the diffusion output z_t is blended with a noised version of the original image's latent outside the latent mask m: z_t ← m ⊙ z_t + (1 − m) ⊙ z_t^orig, where z_t^orig is the original latent forward-diffused to step t. This enforces preservation of the non-edited regions.
- NERP Loss: A loss function is added to penalize deviations in the non-edited regions. It combines LPIPS and MSE terms between the original and generated non-edited regions (a sketch of both preservation strategies follows this list).
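A minimal sketch of the guidance step referenced above: the combination below is the standard classifier-free guidance form, which matches the description of Enhanced Directional Guidance as amplifying the conditional/unconditional difference; the paper's exact formulation and default scale are not reproduced here, and the `unet` denoiser in the usage comment is a placeholder.

```python
def enhanced_directional_guidance(eps_uncond, eps_cond, guidance_scale=7.5):
    """Combine unconditional and text-conditional noise predictions.

    The difference (eps_cond - eps_uncond) points toward the target text;
    a guidance_scale > 1 amplifies that direction. Standard CFG form,
    used here as a stand-in for RDM's Enhanced Directional Guidance.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


# Hypothetical usage inside a denoising loop (the `unet` call and the text
# embeddings are illustrative placeholders, not RDM's actual interfaces):
# eps_uncond = unet(z_t, t, null_text_embedding)
# eps_cond   = unet(z_t, t, target_text_embedding)
# eps        = enhanced_directional_guidance(eps_uncond, eps_cond, guidance_scale=7.5)
```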
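Next, a sketch of the region-of-interest loss, assuming OpenAI's `clip` package. The encoder choice, tensor shapes, and preprocessing are illustrative assumptions rather than the paper's exact pipeline.

```python
import torch.nn.functional as F
import clip  # OpenAI CLIP (pip install git+https://github.com/openai/CLIP.git)

# Encoder choice is illustrative; loaded on CPU to keep the sketch device-agnostic.
clip_model, _ = clip.load("ViT-B/16", device="cpu")

def roi_clip_loss(generated_image, mask, target_text):
    """Cosine-distance loss between the masked generated region and the target text.

    Assumed shapes: generated_image (B, 3, H, W) in [0, 1], mask (B, 1, H, W) binary,
    target_text a list of strings. CLIP's own input normalization is omitted for brevity.
    """
    masked = generated_image * mask  # keep only the region of interest
    masked = F.interpolate(masked, size=(224, 224), mode="bilinear", align_corners=False)
    image_feat = clip_model.encode_image(masked)
    text_feat = clip_model.encode_text(clip.tokenize(target_text))
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (1.0 - (image_feat * text_feat).sum(dim=-1)).mean()  # 1 - cosine similarity
```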
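Finally, a sketch of the two background-preservation strategies (latent blending and the NERP loss), using the `lpips` package for the perceptual term. The forward-diffusion coefficients, loss weights, and variable names are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips (LPIPS perceptual metric)

lpips_fn = lpips.LPIPS(net="vgg")  # backbone choice is illustrative

def blend_latents(z_t_edited, z0_orig, mask, alphas_cumprod, t):
    """Keep the diffusion output inside the mask; outside it, substitute the
    original latent forward-diffused to step t."""
    noise = torch.randn_like(z0_orig)
    a_bar = alphas_cumprod[t]
    z_t_orig = a_bar.sqrt() * z0_orig + (1.0 - a_bar).sqrt() * noise
    return mask * z_t_edited + (1.0 - mask) * z_t_orig

def nerp_loss(x_orig, x_gen, mask, mse_weight=1.0, lpips_weight=1.0):
    """Penalize deviations outside the edited region with MSE + LPIPS.

    Loss weights are illustrative; LPIPS expects images scaled to [-1, 1]."""
    bg_orig = x_orig * (1.0 - mask)
    bg_gen = x_gen * (1.0 - mask)
    mse = F.mse_loss(bg_gen, bg_orig)
    perceptual = lpips_fn(bg_gen, bg_orig).mean()
    return mse_weight * mse + lpips_weight * perceptual
```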
Implementation and Experiments:
- RDM uses a pre-trained LDM (1.45B parameters, trained on LAION-400M) and CLIP ViT-L/14.
- It generates images in ~3 seconds on an RTX 3090 GPU.
- Qualitative results demonstrate high-quality, diverse edits on various images, preserving background details and creating natural transitions.
- Quantitative comparisons against Latent Diffusion, GLIDE, Blended Diffusion, and CLIP-guided Diffusion, using CLIP score (semantic alignment), SFID (image quality), and Image Harmonization (IH) score, show that RDM performs competitively with or outperforms the baselines; RDM achieves the best CLIP and IH scores (a CLIP-score sketch follows this list).
- A user study indicates a preference for RDM's results in terms of quality, harmony, and text consistency.
- Ablation studies confirm the positive impact of the NERP component (improving LPIPS significantly) and the classifier-free guidance scale. They also explore the effect of the mask generation threshold.
- Failure cases are noted, particularly when CLIP exhibits strong biases (e.g., associating "water" with transparent cups) or when the source and target objects have vastly different shapes.
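For reference, CLIP score is typically computed as the cosine similarity between a CLIP model's image and text embeddings. The sketch below uses OpenAI's `clip` package with ViT-L/14 on CPU; the paper's exact protocol (prompt templates, scaling, averaging) may differ.

```python
import torch
import clip  # OpenAI CLIP

model, preprocess = clip.load("ViT-L/14", device="cpu")

@torch.no_grad()
def clip_score(pil_image, text):
    """Cosine similarity between an edited image and its target prompt."""
    image = preprocess(pil_image).unsqueeze(0)
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(clip.tokenize([text]))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat * txt_feat).sum().item()
```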
In conclusion, RDM presents an effective method for zero-shot, text-driven regional image editing that automatically handles object localization via text, performs high-fidelity synthesis using guided latent diffusion, and preserves content outside the edited region.