Diffree: Deep Text-Guided Object Inpainting
- Diffree is a deep generative image editing architecture that enables text-guided, shape-free object inpainting without manual masks or bounding boxes.
- It extends a pre-trained Stable Diffusion model with an Object Mask Predictor in latent space to seamlessly integrate new objects based solely on text prompts.
- Validation on OABench, COCO, and OpenImages shows high success rates and low LPIPS, supporting robust applications in content creation and iterative image editing.
Diffree is a deep generative image editing architecture designed specifically for text-guided, shape-free object inpainting: users can add new objects to images solely by text prompts, without relying on bounding boxes, user-marked masks, or other manual spatial inputs. Built upon a modified Stable Diffusion framework and trained on a curated large-scale object-addition benchmark (OABench), it achieves seamless synthesis of novel objects integrated with consistent scene context.
1. Architectural Innovations
Diffree extends a pre-trained Stable Diffusion model by incorporating a dedicated Object Mask Predictor (OMP) module. The architecture operates in the latent space:
- Latent-conditional denoising process: Given an inpainted input image $I_{bg}$ (with the object removed), its latent $z_{bg} = \mathcal{E}(I_{bg})$ (where $\mathcal{E}$ is the VAE encoder) is concatenated with the noisy target latent $z_t$ at each diffusion step.
- Text control: The model receives an encoded text prompt describing the object to be added.
- Modified U-Net: The first convolutional layer is extended to absorb both the background latent $z_{bg}$ and the diffusion noise pathway, providing joint context for image reconstruction and object synthesis (see the sketch after this list).
- Object Mask Predictor (OMP): At an early reverse diffusion stage, Diffree uses both the denoised estimate $\hat{z}_0$ and $z_{bg}$ to predict a binary mask indicating where the new object should be generated, entirely from text and image context. The OMP is a stack of convolutional, residual, and attention blocks generating spatial probability maps.
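A minimal PyTorch sketch of these two pieces appears below. The names (`OMPHead`, `extend_conv_in`), the hidden width, the attention configuration, and the channel counts are illustrative assumptions rather than the released Diffree implementation; the sketch only captures the idea of (i) widening the U-Net's first convolution to accept the background latent and (ii) predicting a spatial object-probability map from the denoised estimate and the background latent.

```python
import torch
import torch.nn as nn

class OMPHead(nn.Module):
    """Illustrative Object Mask Predictor: conv + residual + attention blocks over the
    concatenated denoised estimate and background latent, producing a per-pixel
    object-probability map at latent resolution."""

    def __init__(self, latent_channels: int = 4, hidden: int = 128, heads: int = 4):
        super().__init__()
        self.stem = nn.Conv2d(2 * latent_channels, hidden, kernel_size=3, padding=1)
        self.res = nn.Sequential(                       # simple residual block
            nn.GroupNorm(8, hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.GroupNorm(8, hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1),
        )
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.out = nn.Conv2d(hidden, 1, kernel_size=1)  # 1-channel mask logits

    def forward(self, z0_hat: torch.Tensor, z_bg: torch.Tensor) -> torch.Tensor:
        h = self.stem(torch.cat([z0_hat, z_bg], dim=1))
        h = h + self.res(h)
        b, c, height, width = h.shape
        seq = h.flatten(2).transpose(1, 2)              # (B, H*W, C) tokens for self-attention
        seq, _ = self.attn(seq, seq, seq)
        h = h + seq.transpose(1, 2).reshape(b, c, height, width)
        return torch.sigmoid(self.out(h))               # spatial probability map in [0, 1]

def extend_conv_in(conv_in: nn.Conv2d, extra_channels: int = 4) -> nn.Conv2d:
    """Widen the U-Net's first convolution so it also accepts the background latent;
    the new input channels are zero-initialized so training starts from the
    pre-trained Stable Diffusion behavior."""
    new = nn.Conv2d(conv_in.in_channels + extra_channels, conv_in.out_channels,
                    conv_in.kernel_size, conv_in.stride, conv_in.padding)
    with torch.no_grad():
        new.weight.zero_()
        new.weight[:, :conv_in.in_channels] = conv_in.weight
        new.bias.copy_(conv_in.bias)
    return new
```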
Joint loss function: $\mathcal{L} = \mathcal{L}_{\text{LDM}} + \lambda \, \mathcal{L}_{\text{mask}}$, where $\mathcal{L}_{\text{LDM}}$ is the diffusion mean-squared error on noise prediction and $\mathcal{L}_{\text{mask}}$ is the mask prediction loss; the weight $\lambda$ is set empirically (e.g., $\lambda = 2$). Classifier-free guidance is employed to balance unconditional and conditional sampling.
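Under the notation above, the joint objective can be written as a short function; the binary-cross-entropy choice for the mask term is an assumption used here for illustration.

```python
import torch
import torch.nn.functional as F

def diffree_loss(noise_pred: torch.Tensor, noise: torch.Tensor,
                 mask_pred: torch.Tensor, mask_gt: torch.Tensor,
                 lam: float = 2.0) -> torch.Tensor:
    """L = L_LDM + lam * L_mask: MSE on the predicted noise plus a mask-prediction
    loss on the OMP output (BCE assumed; mask_pred is expected to lie in [0, 1])."""
    l_ldm = F.mse_loss(noise_pred, noise)
    l_mask = F.binary_cross_entropy(mask_pred, mask_gt)
    return l_ldm + lam * l_mask
```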
2. The OABench Dataset
OABench is a corpus of 74K synthesized object-addition tuples curated specifically for training and benchmarking text-only object inpainting:
- Samples: Each instance consists of (i) an original image with an object, (ii) a high-quality inpainted image (object removed using a method such as PowerPaint), (iii) an object mask (from segmentation datasets like COCO/LVIS), and (iv) a natural language object description.
- Construction: Rigorous filtering selects objects with reasonable size, completeness, and aspect ratio; inpainting and CLIP-based filtering ensure background realism and semantic alignment.
- Supervision: During training, the model learns to conditionally map from the inpainted image + text → original image, optimizing both for correct object placement and preservation of background context.
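For concreteness, a single OABench training tuple can be pictured as the record below; the field names are illustrative and not the dataset's actual schema.

```python
from dataclasses import dataclass
from PIL import Image

@dataclass
class OABenchSample:
    """One object-addition tuple: the model learns to map
    (inpainted_image, description) -> original_image, with object_mask
    supervising the Object Mask Predictor."""
    original_image: Image.Image   # image that still contains the object
    inpainted_image: Image.Image  # background with the object removed (e.g., via PowerPaint)
    object_mask: Image.Image      # binary mask of the removed object's region
    description: str              # natural-language description of the object
```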
3. Training Protocol and Dynamics
Diffree is initialized from Stable Diffusion (v1.5) checkpoints and then fine-tuned end-to-end, jointly with the OMP module, on OABench tuples:
- At each iteration, the model is presented with an inpainted image and a text prompt; for classifier-free guidance, the conditions are randomly blanked with 5% probability.
- The OMP receives the current denoised estimate and background latent, outputting a soft mask; supervision is against ground truth object masks.
- The diffusion model receives both the noisy latent and the concatenated background latent at each step, guiding denoising with both context and predicted mask.
- Training explicitly encourages compatibility between the generated object, its location, the text description, and global visual consistency.
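A heavily simplified training iteration under these rules is sketched below. The component objects (`vae`, `unet`, `text_encoder`, `omp`, `scheduler`) and their call signatures are placeholders standing in for the corresponding latent-diffusion pieces, not a verbatim implementation.

```python
import torch
import torch.nn.functional as F

def training_step(batch, vae, unet, text_encoder, omp, scheduler,
                  lam: float = 2.0, drop_prob: float = 0.05) -> torch.Tensor:
    """One illustrative Diffree training iteration (component interfaces are assumed)."""
    z0 = vae.encode(batch["original_image"])       # target latent (image with object)
    z_bg = vae.encode(batch["inpainted_image"])    # background latent (object removed)
    text_emb = text_encoder(batch["description"])

    # Classifier-free guidance: randomly blank the conditions (~5% of iterations).
    if torch.rand(()) < drop_prob:
        text_emb = torch.zeros_like(text_emb)
    if torch.rand(()) < drop_prob:
        z_bg = torch.zeros_like(z_bg)

    # Forward diffusion: corrupt the target latent at a random timestep.
    t = torch.randint(0, scheduler.num_train_timesteps, (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    z_t = scheduler.add_noise(z0, noise, t)

    # Denoise with the background latent concatenated along the channel axis.
    noise_pred = unet(torch.cat([z_t, z_bg], dim=1), t, text_emb)

    # OMP supervision: predict the object mask from the denoised estimate + background.
    z0_hat = scheduler.predict_clean(z_t, noise_pred, t)   # estimate of the clean latent
    mask_pred = omp(z0_hat, z_bg)
    mask_gt = F.interpolate(batch["object_mask"].float(), size=mask_pred.shape[-2:])

    # Joint loss (same form as the objective in Section 1).
    return F.mse_loss(noise_pred, noise) + lam * F.binary_cross_entropy(mask_pred, mask_gt)
```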
4. Quantitative Evaluation and Metrics
Experiments on COCO and OpenImages datasets demonstrate clear advantages:
| Metric | Diffree (COCO) | InstructPix2Pix (COCO) | Diffree (OpenImages) |
|---|---|---|---|
| Success Rate (%) | 98.5 | 59.0 | 98.0 |
| LPIPS (lower is better) | lower | higher | lower |
| Local CLIP Score (higher is better) | higher | lower | higher |
| Local FID (lower is better) | lower | higher | lower |
- LPIPS [perceptual similarity]: Quantifies preservation of the original background.
- Local CLIP Score: Cosine similarity between text and generated object's visual embedding.
- Local FID: Assesses realism and diversity of added object content.
- Location Reasonableness: Automated GPT-4V-based rating of whether the model’s object placement is plausible given the input context and text.
- Unified Metric: Aggregates normalized scores to facilitate holistic comparison.
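As an example of how the localized metrics operate, a Local CLIP Score can be computed as a crop-then-embed similarity; the Hugging Face CLIP classes and the bounding-box cropping rule below are common-practice assumptions, not necessarily the paper's exact protocol.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def local_clip_score(image: Image.Image, mask_box: tuple, text: str,
                     model_name: str = "openai/clip-vit-base-patch32") -> float:
    """Cosine similarity between the text and the region cropped around the
    predicted object mask (mask_box = (left, upper, right, lower))."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    crop = image.crop(mask_box)
    inputs = processor(text=[text], images=[crop], return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())
```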
Diffree achieves top or highly competitive scores across all benchmarks, maintaining high rates of both visual consistency and semantic relevance.
5. Methodological Comparison
| Method | Text Input | Manual Mask Needed | Automatic Object Placement | Background Preservation | Object-Context Consistency |
|---|---|---|---|---|---|
| Diffree | ✓ | × | ✓ | ✓ | ✓ |
| Text-Guided Inpainting | ✓ | ✓ | × | variable | variable |
| InstructPix2Pix | ✓ | × | × | lower | lower |
Diffree eliminates the need for bounding boxes, scribbled masks, or spatial hints by integrating the OMP into the denoising process. Competing methods typically either require manual annotation or fail to resolve background/object context without artifacts.
6. Use Cases and Implications
Applications include:
- Content creation: Advertisement, e-commerce photography, interior design, and entertainment—where seamless object insertion is crucial and minimal user intervention is desirable.
- Iterative image editing: The mask predictor allows multi-step editing (sequential object additions) without excessive background degradation; see the loop sketch after this list.
- Foundational models: By decoupling the inpainting problem from explicit shape guidance, Diffree enables integration with broader planning or multimodal reasoning frameworks.
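Referring back to the iterative-editing use case, a thin wrapper for sequential additions could look like the following; `AddObjectFn` is a hypothetical stand-in for a single Diffree inference call and not a published API.

```python
from typing import Callable, List, Tuple
from PIL import Image

# Hypothetical signature of one Diffree call: (current image, prompt) -> (edited image, predicted mask).
AddObjectFn = Callable[[Image.Image, str], Tuple[Image.Image, Image.Image]]

def iterative_edit(image: Image.Image, prompts: List[str],
                   add_object: AddObjectFn) -> Tuple[Image.Image, List[Image.Image]]:
    """Apply a sequence of text-driven object additions, feeding each output back in as the
    next input; the predicted masks are collected for inspection or downstream compositing."""
    masks: List[Image.Image] = []
    for prompt in prompts:
        image, mask = add_object(image, prompt)
        masks.append(mask)
    return image, masks
```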
Broader methodological implications include:
- Demonstrating the feasibility of spatial reasoning via mask prediction inside diffusion models using only text as external input.
- Providing a scalable solution for learning-based spatial control in generative modeling pipelines.
7. Prospective Directions
Natural extensions include:
- Enhanced composition: Adding multiple objects or generating scene layouts purely from text narratives.
- User-guided refinement: Optionally allowing textual constraints on object count, size, or grouping.
- Cross-modal interaction: Leveraging vision-language foundation models (e.g., GPT-4V, CLIP) to further improve location reasonableness, object specificity, or scene understanding.
In summary, Diffree establishes a new paradigm for text-driven, shape-free object inpainting—shifting the research focus towards integrated, context-aware generative architectures with practical relevance for image composition workflows (Zhao et al., 24 Jul 2024).