This paper introduces the Region-Aware Diffusion Model (RDM), a novel framework for zero-shot, text-driven image editing at the entity level. Unlike previous methods, which often require manual masks, RDM automatically identifies the region of interest from a positioning text prompt and modifies it according to a target text prompt.
The core components of RDM are:
- Intensive Diffusion Model: To balance fidelity and speed, RDM runs the diffusion process in the latent space of a pre-trained autoencoder (VAE), similar to Latent Diffusion Models (LDMs). It augments this process with Enhanced Directional Guidance, a modified classifier-free guidance mechanism that steers generation more strongly towards the target text by amplifying the difference between the conditional (text-guided) and unconditional noise predictions, improving image realism and text-image semantic consistency (a guidance sketch follows this list).
- Regional-aware Entity Editing: This module handles the spatial aspects of the edit.
- Cross-modal Entity-level Calibration: Uses a pre-trained CLIP model (ViT-B/16) and a lightweight segmentation decoder. Given the positioning text, it processes the input image's visual features together with the text embedding to generate a binary segmentation mask that identifies the entity to be edited.
- Region of Interest Synthesizing: Guides the diffusion process within the generated mask using a CLIP-based loss that minimizes the cosine distance between the CLIP embedding of the masked generated region and the CLIP embedding of the target text (see the loss sketch after this list).
- Region out of Interest Preserving (NERP, i.e., non-edited region preservation): To prevent unwanted changes to the background, RDM incorporates two strategies:
- Latent Blending: At each denoising step t, the diffusion output z_t is blended with a noised version of the original image's latent outside the latent mask m: z_t ← m ⊙ z_t + (1 − m) ⊙ z_t^orig, where z_t^orig is the original latent forward-diffused to step t. This enforces preservation of the non-edited regions.
- NERP Loss: A loss function is added to penalize deviations in the non-edited regions. It combines LPIPS and MSE terms between the original and generated non-edited regions (a sketch of both preservation strategies follows this list).
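A minimal sketch of the guidance step referenced above: the combination below is the standard classifier-free guidance form, which matches the description of Enhanced Directional Guidance as amplifying the conditional/unconditional difference; the paper's exact formulation and default scale are not reproduced here, and the `unet` denoiser in the usage comment is a placeholder.

```python
def enhanced_directional_guidance(eps_uncond, eps_cond, guidance_scale=7.5):
    """Combine unconditional and text-conditional noise predictions.

    The difference (eps_cond - eps_uncond) points toward the target text;
    a guidance_scale > 1 amplifies that direction. Standard CFG form,
    used here as a stand-in for RDM's Enhanced Directional Guidance.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


# Hypothetical usage inside a denoising loop (the `unet` call and the text
# embeddings are illustrative placeholders, not RDM's actual interfaces):
# eps_uncond = unet(z_t, t, null_text_embedding)
# eps_cond   = unet(z_t, t, target_text_embedding)
# eps        = enhanced_directional_guidance(eps_uncond, eps_cond, guidance_scale=7.5)
```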
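Next, a sketch of the region-of-interest loss, assuming OpenAI's `clip` package. The encoder choice, tensor shapes, and preprocessing are illustrative assumptions rather than the paper's exact pipeline.

```python
import torch.nn.functional as F
import clip  # OpenAI CLIP (pip install git+https://github.com/openai/CLIP.git)

# Encoder choice is illustrative; loaded on CPU to keep the sketch device-agnostic.
clip_model, _ = clip.load("ViT-B/16", device="cpu")

def roi_clip_loss(generated_image, mask, target_text):
    """Cosine-distance loss between the masked generated region and the target text.

    Assumed shapes: generated_image (B, 3, H, W) in [0, 1], mask (B, 1, H, W) binary,
    target_text a list of strings. CLIP's own input normalization is omitted for brevity.
    """
    masked = generated_image * mask  # keep only the region of interest
    masked = F.interpolate(masked, size=(224, 224), mode="bilinear", align_corners=False)
    image_feat = clip_model.encode_image(masked)
    text_feat = clip_model.encode_text(clip.tokenize(target_text))
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (1.0 - (image_feat * text_feat).sum(dim=-1)).mean()  # 1 - cosine similarity
```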
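Finally, a sketch of the two background-preservation strategies (latent blending and the NERP loss), using the `lpips` package for the perceptual term. The forward-diffusion coefficients, loss weights, and variable names are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips (LPIPS perceptual metric)

lpips_fn = lpips.LPIPS(net="vgg")  # backbone choice is illustrative

def blend_latents(z_t_edited, z0_orig, mask, alphas_cumprod, t):
    """Keep the diffusion output inside the mask; outside it, substitute the
    original latent forward-diffused to step t."""
    noise = torch.randn_like(z0_orig)
    a_bar = alphas_cumprod[t]
    z_t_orig = a_bar.sqrt() * z0_orig + (1.0 - a_bar).sqrt() * noise
    return mask * z_t_edited + (1.0 - mask) * z_t_orig

def nerp_loss(x_orig, x_gen, mask, mse_weight=1.0, lpips_weight=1.0):
    """Penalize deviations outside the edited region with MSE + LPIPS.

    Loss weights are illustrative; LPIPS expects images scaled to [-1, 1]."""
    bg_orig = x_orig * (1.0 - mask)
    bg_gen = x_gen * (1.0 - mask)
    mse = F.mse_loss(bg_gen, bg_orig)
    perceptual = lpips_fn(bg_gen, bg_orig).mean()
    return mse_weight * mse + lpips_weight * perceptual
```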
Implementation and Experiments:
- RDM uses a pre-trained LDM (1.45B parameters, trained on LAION-400M) and CLIP ViT-L/14.
- It generates images in ~3 seconds on an RTX 3090 GPU.
- Qualitative results demonstrate high-quality, diverse edits on various images, preserving background details and creating natural transitions.
- Quantitative comparisons against Latent Diffusion, GLIDE, Blended Diffusion, and CLIP-guided Diffusion, using CLIP score (semantic alignment), SFID (image quality), and Image Harmonization (IH) score, show that RDM performs competitively with or outperforms the baselines; RDM achieves the best CLIP and IH scores (a CLIP-score sketch follows this list).
- A user study indicates a preference for RDM's results in terms of quality, harmony, and text consistency.
- Ablation studies confirm the positive impact of the NERP component (improving LPIPS significantly) and the classifier-free guidance scale. They also explore the effect of the mask generation threshold.
- Failure cases are noted, particularly when CLIP exhibits strong biases (e.g., associating "water" with transparent cups) or when the source and target objects have vastly different shapes.
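For reference, CLIP score is typically computed as the cosine similarity between a CLIP model's image and text embeddings. The sketch below uses OpenAI's `clip` package with ViT-L/14 on CPU; the paper's exact protocol (prompt templates, scaling, averaging) may differ.

```python
import torch
import clip  # OpenAI CLIP

model, preprocess = clip.load("ViT-L/14", device="cpu")

@torch.no_grad()
def clip_score(pil_image, text):
    """Cosine similarity between an edited image and its target prompt."""
    image = preprocess(pil_image).unsqueeze(0)
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(clip.tokenize([text]))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat * txt_feat).sum().item()
```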
In conclusion, RDM presents an effective method for zero-shot, text-driven regional image editing that automatically handles object localization via text, performs high-fidelity synthesis using guided latent diffusion, and preserves content outside the edited region.