- The paper introduces an automatic dataset construction method producing ~200K image, edit prompt, and pseudo-target triplets for weak supervision.
- It leverages latent diffusion models fine-tuned with paired, masked, and perceptual losses to ensure localized edits that preserve background fidelity.
- Quantitative evaluation shows iEdit outperforming baselines such as SDEdit and DALL-E 2 in text alignment and image-quality metrics.
This paper introduces iEdit, a method for localized, text-guided image editing using Latent Diffusion Models (LDMs). The core problem addressed is the difficulty in controlling diffusion models to make specific edits based on text prompts while preserving the rest of the image, especially given the lack of large-scale datasets containing source images, edit prompts, and corresponding target images.
Key Contributions and Implementation Details:
- Automatic Paired Dataset Construction:
- To overcome the lack of supervised data, the authors propose a method to automatically generate a dataset of (source image, edit prompt, pseudo-target image) triplets.
- Process:
1. Start with image-caption pairs from a large dataset (LAION-5B).
2. Generate a simpler caption for the source image using BLIP.
3. Create an "edit prompt" by programmatically modifying the BLIP caption (e.g., replacing nouns or adjectives with antonyms or co-hyponyms using WordNet).
4. Find a "pseudo-target" image by retrieving the nearest neighbor image from the dataset using the mean of the CLIP embeddings of the source image and the edit prompt.
- Outcome: This yields a dataset of approximately 200K samples for weakly-supervised training, avoiding manual annotation costs (a sketch of the retrieval step follows below).
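A minimal sketch of the retrieval step (step 4), assuming the CLIP embeddings of the candidate pool are precomputed and L2-normalized; the function and variable names are illustrative, not from the paper:

```python
import torch

def find_pseudo_target(src_image_embed: torch.Tensor,
                       edit_prompt_embed: torch.Tensor,
                       pool_image_embeds: torch.Tensor) -> int:
    """Return the index of the nearest-neighbor pseudo-target in the candidate pool.

    src_image_embed:   (d,)   CLIP embedding of the source image (L2-normalized)
    edit_prompt_embed: (d,)   CLIP embedding of the edit prompt (L2-normalized)
    pool_image_embeds: (N, d) CLIP embeddings of candidate images (L2-normalized)
    """
    # Query = mean of the source-image and edit-prompt embeddings, re-normalized.
    query = (src_image_embed + edit_prompt_embed) / 2
    query = query / query.norm()

    # Cosine similarity against the pool; the nearest neighbor is the pseudo-target.
    sims = pool_image_embeds @ query  # (N,)
    return int(sims.argmax())
```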
- Weakly-Supervised Editing Framework:
- Foundation: Built upon Latent Diffusion Models (LDMs), which operate in a compressed latent space for efficiency.
- Training Objective: The model is fine-tuned to predict the noise required to transform a noisy version of the source image's latent ($z_t$) into the pseudo-target image's latent ($z_2$), guided by the edit prompt ($y_2$).
- Loss Function ($\mathcal{L}_{paired}$): Minimizes the L2 distance between the noise predicted by the U-Net ($\epsilon_\theta$) and the computed ground-truth noise ($\epsilon_2$) needed to obtain $z_2$ from $z_t$:
$\mathcal{L}_{paired} = \mathbb{E}\big[\, \lVert \epsilon_2 - \epsilon_\theta(z_t, t, \tau_\theta(y_2)) \rVert^2 \,\big]$
where $z_t$ is the forward-diffused source latent $z_1$, and $\epsilon_2$ is computed from the target latent $z_2$ (Eq. \ref{eqn:ief_gt_noise}).
- CLIP Guidance: A global CLIP loss ($\mathcal{L}_{global}$) is added so that the final generated image aligns semantically with the edit prompt. The weights of $\mathcal{L}_{paired}$ and $\mathcal{L}_{global}$ are adjusted based on the diffusion timestep $t$ (a training-step sketch follows below).
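A sketch of a single training step for the paired objective, assuming the standard DDPM forward process ($z_t = \sqrt{\bar{\alpha}_t}\, z_1 + \sqrt{1-\bar{\alpha}_t}\, \epsilon_1$) and deriving $\epsilon_2$ so that denoising $z_t$ with it would land on $z_2$; the `unet` call follows the diffusers convention, the CLIP guidance term is omitted, and this is one consistent reading of Eq. \ref{eqn:ief_gt_noise}, not the paper's exact code:

```python
import torch
import torch.nn.functional as F

def paired_loss_step(unet, z1, z2, text_emb_y2, alphas_cumprod):
    """One weakly-supervised step of L_paired.

    z1, z2:         source / pseudo-target latents, shape (B, C, H, W)
    text_emb_y2:    conditioning embedding of the edit prompt y2
    alphas_cumprod: 1-D tensor of cumulative alphas from the noise schedule
    """
    B = z1.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=z1.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)

    # Forward-diffuse the *source* latent z1.
    eps1 = torch.randn_like(z1)
    z_t = a_bar.sqrt() * z1 + (1 - a_bar).sqrt() * eps1

    # Noise that would map z_t onto the *pseudo-target* latent z2 (assumed form of eps2).
    eps2 = (z_t - a_bar.sqrt() * z2) / (1 - a_bar).sqrt()

    # Predict noise conditioned on the edit prompt and regress onto eps2.
    eps_pred = unet(z_t, t, encoder_hidden_states=text_emb_y2).sample
    return F.mse_loss(eps_pred, eps2)
```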
- Location Awareness with Masks:
- To encourage edits only in relevant regions and preserve background fidelity, the method incorporates segmentation masks during training (and optionally inference).
- Mask Generation: Masks ($M_1$, $M_2$) for the source and pseudo-target images are generated automatically with CLIPSeg, conditioned on the textual difference between the source caption and the edit prompt (a CLIPSeg sketch appears after this list).
- Masked Loss ($\mathcal{L}_{mask}$): Replaces $\mathcal{L}_{paired}$, computing the loss separately for foreground (masked) and background (inverse-masked) regions.
- Foreground: Predicts the target noise ($\epsilon_2$) within the target mask ($M_2$).
- Background: Predicts the source noise ($\epsilon_1$) within the inverse of the source mask ($M_1$).
$\mathcal{L}_{mask} = \mathbb{E}\big[\mathcal{L}_{mask}^{fg}\big] + \mathbb{E}\big[\mathcal{L}_{mask}^{bg}\big]$ (see Eq. \ref{eqn:ldm_simple_loss_mask_main} and \ref{eqn:ldm_simple_loss_mask})
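A sketch of the automatic mask generation described above, using the publicly available CLIPSeg checkpoint via HuggingFace transformers; the threshold and the choice of conditioning text (the word-level difference between caption and edit prompt) are illustrative assumptions:

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
clipseg = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

def text_conditioned_mask(image: Image.Image, diff_text: str, thresh: float = 0.4) -> torch.Tensor:
    """Binary mask for the region referred to by `diff_text` (e.g. the replaced noun)."""
    inputs = processor(text=[diff_text], images=[image], return_tensors="pt")
    with torch.no_grad():
        logits = clipseg(**inputs).logits      # low-resolution mask logits (~352x352)
    probs = torch.sigmoid(logits).squeeze()
    return (probs > thresh).float()            # resize to the latent resolution before use
```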
- Additional Masked Losses:
- Perceptual Loss ($\mathcal{L}_{perc}$): Encourages visual similarity between the masked regions of the generated image and the pseudo-target image using VGG features.
- Localized CLIP Loss ($\mathcal{L}_{loc}$): Enforces semantic alignment between the masked region of the generated image and the textual difference prompt ($y_2^{diff}$).
- Final Training Loss: Combines these components, weighting them by timestep $t$ (Eq. \ref{eqn:final_loss}); a sketch of the masked loss and an illustrative weighting follows below.
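A sketch of the masked noise loss and one way the terms could be combined; the masks are assumed to be resized to the latent resolution, and the timestep-dependent schedule below is a placeholder illustrating the idea, not the paper's exact weighting:

```python
import torch
import torch.nn.functional as F

def masked_noise_loss(eps_pred, eps1, eps2, m1, m2):
    """Foreground: match target noise eps2 inside the target mask M2.
    Background: match source noise eps1 outside the source mask M1."""
    fg = F.mse_loss(m2 * eps_pred, m2 * eps2)
    bg = F.mse_loss((1 - m1) * eps_pred, (1 - m1) * eps1)
    return fg + bg

def total_loss(t, num_timesteps, l_mask, l_global, l_loc, l_perc):
    # Placeholder schedule: lean on the noise-matching term at high noise levels
    # and on the CLIP / perceptual terms at low noise levels.
    w = t.float().mean() / num_timesteps
    return w * l_mask + (1 - w) * (l_global + l_loc + l_perc)
```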
- Masked Inference:
- An optional inference technique is proposed to improve background preservation.
- At each DDIM sampling step $t$, the predicted denoised latent ($\tilde{z}_t$) is blended with the noisy source latent ($z_t$) using the mask:
$\tilde{z}_t = \bar{M} \odot z_t + M \odot \tilde{z}_t$ (Eq. \ref{eqn:mask_inf}), where $\bar{M} = 1 - M$ is the inverse mask.
- This progressively re-introduces source-image details into the unedited regions (see the sketch below).
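A sketch of the masked blending at a single DDIM step, following the diffusers scheduler/U-Net conventions; the timestep bookkeeping for noising the source latent is simplified, and `M` is assumed to be a latent-resolution mask:

```python
import torch

@torch.no_grad()
def masked_ddim_step(unet, scheduler, z_t, t, text_emb, z_src, M):
    """One DDIM step followed by mask blending: keep the edit inside M,
    re-inject the (noised) source latent outside M."""
    eps = unet(z_t, t, encoder_hidden_states=text_emb).sample
    z_pred = scheduler.step(eps, t, z_t).prev_sample     # predicted (partially) denoised latent
    noise = torch.randn_like(z_src)
    z_src_t = scheduler.add_noise(z_src, noise, t)        # source latent at a matching noise level
    return M * z_pred + (1 - M) * z_src_t
```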
Implementation Considerations:
- Base Model: Fine-tunes a pre-trained Stable Diffusion v1.4 LDM.
- Computational Cost: Training was feasible on constrained hardware (2x 16GB V100 GPUs) by alternately fine-tuning only the input and middle U-Net layers (see the sketch after this list); fine-tuning took ~10,000 steps.
- Inference Speed: ~10 seconds on a V100 to produce 4 edited results per input image.
- Dependencies: Requires models like BLIP (captioning) and CLIPSeg (mask generation), along with standard diffusion model libraries.
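A sketch of restricting fine-tuning to the input (down) and middle U-Net blocks, assuming a diffusers UNet2DConditionModel; the per-step alternation shown is an illustration of the idea, not necessarily the paper's exact schedule:

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)

# Freeze everything, then selectively re-enable the blocks being fine-tuned.
for p in unet.parameters():
    p.requires_grad_(False)

def set_trainable_blocks(step: int) -> None:
    """Alternate between the input (down) blocks and the middle block."""
    train_down = (step % 2 == 0)
    for p in unet.down_blocks.parameters():
        p.requires_grad_(train_down)
    for p in unet.mid_block.parameters():
        p.requires_grad_(not train_down)
```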
Evaluation:
- Compared against SDEdit, DALL-E 2 (with manual masks), and InstructPix2Pix on both generated and real images.
- Metrics included CLIPScore (text alignment), FID (image quality), SSIM-M (similarity inside the edit mask), and SSIM-$\bar{M}$ (similarity outside the mask, i.e. background preservation).
- Results show that iEdit (especially with masked inference, iEdit-M) achieves a strong balance between edit accuracy (high CLIPScore) and background preservation (high SSIM-$\bar{M}$), outperforming the baselines on several metrics. Ablation studies confirmed the value of the proposed dataset, the mask-based losses, and masked inference.
In essence, iEdit provides a practical framework for text-guided image editing by leveraging weak supervision through automatically generated paired data and incorporating semantic masks to achieve localized control, all while being trainable with moderate computational resources.