- The paper introduces an automatic dataset construction method producing ~200K image, edit prompt, and pseudo-target triplets for weak supervision.
- It leverages latent diffusion models fine-tuned with paired, masked, and perceptual losses to ensure localized edits that preserve background fidelity.
- Quantitative evaluation shows iEdit outperforming baselines such as SDEdit and DALL-E 2 in text alignment and image-quality metrics.
This paper introduces iEdit, a method for localized, text-guided image editing using Latent Diffusion Models (LDMs). The core problem addressed is the difficulty in controlling diffusion models to make specific edits based on text prompts while preserving the rest of the image, especially given the lack of large-scale datasets containing source images, edit prompts, and corresponding target images.
Key Contributions and Implementation Details:
- Automatic Paired Dataset Construction:
- To overcome the lack of supervised data, the authors propose a method to automatically generate a dataset of (source image, edit prompt, pseudo-target image) triplets.
- Process:
1. Start with image-caption pairs from a large dataset (LAION-5B).
2. Generate a simpler caption for the source image using BLIP.
3. Create an "edit prompt" by programmatically modifying the BLIP caption (e.g., replacing nouns or adjectives with antonyms or co-hyponyms using WordNet).
4. Find a "pseudo-target" image by retrieving the nearest neighbor image from the dataset using the mean of the CLIP embeddings of the source image and the edit prompt.
- Outcome: This yields a dataset of approximately 200K samples for weakly-supervised training, avoiding manual annotation costs (a sketch of the retrieval step follows below).
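A minimal sketch of the retrieval step (step 4), assuming the CLIP embeddings of the candidate pool are precomputed and L2-normalized; the function and variable names are illustrative, not from the paper:

```python
import torch

def find_pseudo_target(src_image_embed: torch.Tensor,
                       edit_prompt_embed: torch.Tensor,
                       pool_image_embeds: torch.Tensor) -> int:
    """Return the index of the nearest-neighbor pseudo-target in the candidate pool.

    src_image_embed:   (d,)   CLIP embedding of the source image (L2-normalized)
    edit_prompt_embed: (d,)   CLIP embedding of the edit prompt (L2-normalized)
    pool_image_embeds: (N, d) CLIP embeddings of candidate images (L2-normalized)
    """
    # Query = mean of the source-image and edit-prompt embeddings, re-normalized.
    query = (src_image_embed + edit_prompt_embed) / 2
    query = query / query.norm()

    # Cosine similarity against the pool; the nearest neighbor is the pseudo-target.
    sims = pool_image_embeds @ query  # (N,)
    return int(sims.argmax())
```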
- Weakly-Supervised Editing Framework:
- Foundation: Built upon Latent Diffusion Models (LDMs), which operate in a compressed latent space for efficiency.
- Training Objective: The model is fine-tuned to predict the noise required to transform a noisy version of the source image's latent ($z_t$) into the pseudo-target image's latent ($z_2$), guided by the edit prompt ($y_2$).
- Loss Function ($\mathcal{L}_{paired}$): Minimizes the L2 distance between the noise predicted by the U-Net ($\epsilon_\theta$) and the computed ground-truth noise ($\epsilon_2$) needed to obtain $z_2$ from $z_t$:
$\mathcal{L}_{paired} = \mathbb{E}\big[\, \lVert \epsilon_2 - \epsilon_\theta(z_t, t, \tau_\theta(y_2)) \rVert^2 \,\big]$
where $z_t$ is the forward-diffused source latent $z_1$, and $\epsilon_2$ is computed from the target latent $z_2$ (Eq. \ref{eqn:ief_gt_noise}).
- CLIP Guidance: A global CLIP loss ($\mathcal{L}_{global}$) is added so that the final generated image aligns semantically with the edit prompt. The weights of $\mathcal{L}_{paired}$ and $\mathcal{L}_{global}$ are adjusted based on the diffusion timestep $t$ (a training-step sketch follows below).
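A sketch of a single training step for the paired objective, assuming the standard DDPM forward process ($z_t = \sqrt{\bar{\alpha}_t}\, z_1 + \sqrt{1-\bar{\alpha}_t}\, \epsilon_1$) and deriving $\epsilon_2$ so that denoising $z_t$ with it would land on $z_2$; the `unet` call follows the diffusers convention, the CLIP guidance term is omitted, and this is one consistent reading of Eq. \ref{eqn:ief_gt_noise}, not the paper's exact code:

```python
import torch
import torch.nn.functional as F

def paired_loss_step(unet, z1, z2, text_emb_y2, alphas_cumprod):
    """One weakly-supervised step of L_paired.

    z1, z2:         source / pseudo-target latents, shape (B, C, H, W)
    text_emb_y2:    conditioning embedding of the edit prompt y2
    alphas_cumprod: 1-D tensor of cumulative alphas from the noise schedule
    """
    B = z1.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=z1.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)

    # Forward-diffuse the *source* latent z1.
    eps1 = torch.randn_like(z1)
    z_t = a_bar.sqrt() * z1 + (1 - a_bar).sqrt() * eps1

    # Noise that would map z_t onto the *pseudo-target* latent z2 (assumed form of eps2).
    eps2 = (z_t - a_bar.sqrt() * z2) / (1 - a_bar).sqrt()

    # Predict noise conditioned on the edit prompt and regress onto eps2.
    eps_pred = unet(z_t, t, encoder_hidden_states=text_emb_y2).sample
    return F.mse_loss(eps_pred, eps2)
```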
- Location Awareness with Masks:
- To encourage edits only in relevant regions and preserve background fidelity, the method incorporates segmentation masks during training (and optionally inference).
- Mask Generation: Masks ($M_1$, $M_2$) for the source and pseudo-target images are generated automatically with CLIPSeg, conditioned on the textual difference between the source caption and the edit prompt (a CLIPSeg sketch appears after this list).
- Masked Loss ($\mathcal{L}_{mask}$): Replaces $\mathcal{L}_{paired}$, computing the loss separately for foreground (masked) and background (inverse-masked) regions.
- Foreground: Predicts the target noise ($\epsilon_2$) within the target mask ($M_2$).
- Background: Predicts the source noise ($\epsilon_1$) within the inverse of the source mask ($M_1$).
$\mathcal{L}_{mask} = \mathbb{E}\big[\mathcal{L}_{mask}^{fg}\big] + \mathbb{E}\big[\mathcal{L}_{mask}^{bg}\big]$ (see Eq. \ref{eqn:ldm_simple_loss_mask_main} and \ref{eqn:ldm_simple_loss_mask})
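A sketch of the automatic mask generation described above, using the publicly available CLIPSeg checkpoint via HuggingFace transformers; the threshold and the choice of conditioning text (the word-level difference between caption and edit prompt) are illustrative assumptions:

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
clipseg = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

def text_conditioned_mask(image: Image.Image, diff_text: str, thresh: float = 0.4) -> torch.Tensor:
    """Binary mask for the region referred to by `diff_text` (e.g. the replaced noun)."""
    inputs = processor(text=[diff_text], images=[image], return_tensors="pt")
    with torch.no_grad():
        logits = clipseg(**inputs).logits      # low-resolution mask logits (~352x352)
    probs = torch.sigmoid(logits).squeeze()
    return (probs > thresh).float()            # resize to the latent resolution before use
```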
- Additional Masked Losses:
- Perceptual Loss ($\mathcal{L}_{perc}$): Encourages visual similarity between the masked regions of the generated image and the pseudo-target image using VGG features.
- Localized CLIP Loss ($\mathcal{L}_{loc}$): Enforces semantic alignment between the masked region of the generated image and the textual difference prompt ($y_2^{diff}$).
- Final Training Loss: Combines these components, weighting them by timestep $t$ (Eq. \ref{eqn:final_loss}); a sketch of the masked loss and an illustrative weighting follows below.
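A sketch of the masked noise loss and one way the terms could be combined; the masks are assumed to be resized to the latent resolution, and the timestep-dependent schedule below is a placeholder illustrating the idea, not the paper's exact weighting:

```python
import torch
import torch.nn.functional as F

def masked_noise_loss(eps_pred, eps1, eps2, m1, m2):
    """Foreground: match target noise eps2 inside the target mask M2.
    Background: match source noise eps1 outside the source mask M1."""
    fg = F.mse_loss(m2 * eps_pred, m2 * eps2)
    bg = F.mse_loss((1 - m1) * eps_pred, (1 - m1) * eps1)
    return fg + bg

def total_loss(t, num_timesteps, l_mask, l_global, l_loc, l_perc):
    # Placeholder schedule: lean on the noise-matching term at high noise levels
    # and on the CLIP / perceptual terms at low noise levels.
    w = t.float().mean() / num_timesteps
    return w * l_mask + (1 - w) * (l_global + l_loc + l_perc)
```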
- Masked Inference:
- An optional inference technique is proposed to improve background preservation.
- At each DDIM sampling step $t$, the predicted denoised latent ($\tilde{z}_t$) is blended with the noisy source latent ($z_t$) using the mask:
$\tilde{z}_t = \bar{M} \odot z_t + M \odot \tilde{z}_t$ (Eq. \ref{eqn:mask_inf}), where $\bar{M} = 1 - M$ is the inverse mask.
- This progressively re-introduces source-image details into the unedited regions (see the sketch below).
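A sketch of the masked blending at a single DDIM step, following the diffusers scheduler/U-Net conventions; the timestep bookkeeping for noising the source latent is simplified, and `M` is assumed to be a latent-resolution mask:

```python
import torch

@torch.no_grad()
def masked_ddim_step(unet, scheduler, z_t, t, text_emb, z_src, M):
    """One DDIM step followed by mask blending: keep the edit inside M,
    re-inject the (noised) source latent outside M."""
    eps = unet(z_t, t, encoder_hidden_states=text_emb).sample
    z_pred = scheduler.step(eps, t, z_t).prev_sample     # predicted (partially) denoised latent
    noise = torch.randn_like(z_src)
    z_src_t = scheduler.add_noise(z_src, noise, t)        # source latent at a matching noise level
    return M * z_pred + (1 - M) * z_src_t
```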
Implementation Considerations:
- Base Model: Fine-tunes a pre-trained Stable Diffusion v1.4 LDM.
- Computational Cost: Training was feasible on constrained hardware (2x 16GB V100 GPUs) by alternately fine-tuning only the input and middle U-Net layers (see the sketch after this list); fine-tuning took ~10,000 steps.
- Inference Speed: ~10 seconds on a V100 to produce 4 edited results per input image.
- Dependencies: Requires models like BLIP (captioning) and CLIPSeg (mask generation), along with standard diffusion model libraries.
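A sketch of restricting fine-tuning to the input (down) and middle U-Net blocks, assuming a diffusers UNet2DConditionModel; the per-step alternation shown is an illustration of the idea, not necessarily the paper's exact schedule:

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)

# Freeze everything, then selectively re-enable the blocks being fine-tuned.
for p in unet.parameters():
    p.requires_grad_(False)

def set_trainable_blocks(step: int) -> None:
    """Alternate between the input (down) blocks and the middle block."""
    train_down = (step % 2 == 0)
    for p in unet.down_blocks.parameters():
        p.requires_grad_(train_down)
    for p in unet.mid_block.parameters():
        p.requires_grad_(not train_down)
```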
Evaluation:
- Compared against SDEdit, DALL-E 2 (with manual masks), and InstructPix2Pix on both generated and real images.
- Metrics included CLIPScore (text alignment), FID (image quality), SSIM-M (similarity inside the edit mask), and SSIM-$\bar{M}$ (similarity outside the mask, i.e. background preservation).
- Results show that iEdit (especially with masked inference, iEdit-M) achieves a strong balance between edit accuracy (high CLIPScore) and background preservation (high SSIM-$\bar{M}$), outperforming the baselines on several metrics. Ablation studies confirmed the value of the proposed dataset, the mask-based losses, and masked inference.
In essence, iEdit provides a practical framework for text-guided image editing by leveraging weak supervision through automatically generated paired data and incorporating semantic masks to achieve localized control, all while being trainable with moderate computational resources.