OABench: Text-Guided Inpainting Dataset
- OABench is a synthetic dataset comprising approximately 74,000 quadruples, each consisting of an original image, an inpainted image, an object mask, and a textual label.
- It employs rigorous curation—including filtering, advanced inpainting, and CLIP-based quality control—to ensure realistic background fidelity and semantic alignment.
- It underpins training of diffusion-based inpainting models like Diffree, achieving high success rates and precise, context-aware object synthesis.
OABench is a synthetic dataset tailored for text-guided object addition in images, targeting the inpainting task where only a textual description provides control over the object to be added. Developed as the central resource for training the Diffree model, OABench enables high-fidelity object synthesis and seamless integration of new objects into complex image backgrounds while preserving context such as lighting, texture, and spatial arrangement.
1. Dataset Curation and Construction
OABench is sourced from large-scale, real-world instance segmentation datasets—including COCO and LVIS—which provide annotated images, object masks, and semantic labels. The curation follows a rigorous three-step procedure:
- Collection and Filtering: Objects are selected from the instance segmentation datasets with strict filtering rules. Objects that are too small or large are eliminated, mitigating inclusion of irrelevant instances (e.g., tiny buttons or dominating backgrounds). Additional filters—such as edge detection and cavity examination—remove incomplete or partially obscured objects, yielding a corpus of high-quality, well-defined masks.
- Data Synthesis: Advanced inpainting (using PowerPaint) removes the chosen object from each image, synthesizing a realistic "background-only" image. The original image, the inpainted image, the object’s mask, and its description constitute a compact tuple.
- Post-Processing: A CLIP score is computed for the inpainted region using the object’s description to verify background consistency and semantic alignment. Tuples failing to meet the quality threshold are discarded. This process results in a final dataset of approximately 74,000 quadruples, each composed of:
| Original Image | Inpainted Image | Object Mask | Object Description |
|----------------|-----------------|-------------|--------------------|
| with object    | object removed  | binary mask | text label         |
This systematic approach ensures that OABench achieves a high level of background fidelity and semantic relevance, essential for robust inpainting model training.
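The CLIP-based post-processing filter can be illustrated with a short sketch. The snippet below is a hypothetical reconstruction using the Hugging Face `transformers` CLIP implementation; the model variant, the crop strategy, and the threshold are assumptions, and one plausible reading of the filter is that a tuple is discarded when the inpainted region still matches the object description (i.e., the removal failed).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical threshold; the actual OABench cutoff is not documented here.
CLIP_SCORE_THRESHOLD = 0.25

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_region_score(image: Image.Image, bbox: tuple, label: str) -> float:
    """Cosine similarity between a cropped image region and a text label."""
    region = image.crop(bbox)  # bbox = (left, upper, right, lower) around the object mask
    inputs = processor(text=[label], images=region, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def keep_tuple(inpainted: Image.Image, bbox: tuple, label: str) -> bool:
    """Discard tuples whose inpainted region still resembles the removed object,
    i.e. cases where the inpainting did not cleanly remove it (assumed criterion)."""
    return clip_region_score(inpainted, bbox, label) < CLIP_SCORE_THRESHOLD
```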
2. Structure of Each Dataset Tuple
Each tuple comprises four elements:
- The complete, original image,
- A synthetic background generated via object removal,
- A binary mask indicating the location of the removed object,
- A textual object description (e.g., "monitor", "pizza slice").
This compositional structure allows a model to learn the correspondence between textual inputs, spatial placement, and photorealistic generation of objects within diverse background contexts.
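For concreteness, one possible in-memory representation of such a tuple is sketched below; the field names and array conventions are illustrative, not an official schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class OABenchTuple:
    """One OABench example (field names are illustrative, not an official schema)."""
    original_image: np.ndarray    # H x W x 3, scene with the object present
    inpainted_image: np.ndarray   # H x W x 3, the same scene with the object removed
    object_mask: np.ndarray       # H x W, binary mask marking the removed object
    object_description: str       # e.g. "monitor", "pizza slice"
```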
3. Role in Training Text-Guided Inpainting Models
OABench is instrumental in training diffusion-based inpainting architectures. The dataset is used in a paired manner:
- Input: Background-only image and text prompt.
- Output: The original image (i.e., with the object present) plus ground-truth object mask.
Diffree incorporates a specialized Object Mask Predictor (OMP) module, utilizing the binary masks from OABench as supervision to learn accurate object placement. The model architecture integrates Stable Diffusion with mask conditioning, optimizing the following loss:
$$\mathcal{L}_{\mathrm{LDM}} = \mathbb{E}_{z,\, c_I,\, c_T,\, \epsilon \sim \mathcal{N}(0,1),\, t}\Big[\, \big\lVert \epsilon - \epsilon_\theta(z_t,\, t,\, c_I,\, c_T) \big\rVert_2^2 \,\Big]$$

where $z_t$ is the noisy latent derived from the inpainted image, $c_I$ is the encoded image latent, $c_T$ encodes the textual description, and $t$ is the noise timestep. The overall training objective combines this diffusion loss with the mask-prediction loss of the OMP module:

$$\mathcal{L} = \mathcal{L}_{\mathrm{LDM}} + \lambda\, \mathcal{L}_{\mathrm{OMP}}$$
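To make this paired usage concrete, the sketch below outlines one plausible training step. It is not the authors' implementation: the diffusers-style VAE interface, channel-wise concatenation for image conditioning, the simplified noising schedule, the binary cross-entropy mask loss, and the weighting factor `lambda_omp` are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def training_step(unet, omp, vae, text_encoder, batch, lambda_omp=1.0):
    """Illustrative Diffree-style step: the background-only image and text prompt
    condition the model; the original image and OABench mask supply supervision."""
    # Latents of the target (object-present) image and the background-only image.
    z = vae.encode(batch["original_image"]).latent_dist.sample()
    c_img = vae.encode(batch["inpainted_image"]).latent_dist.sample()
    c_txt = text_encoder(batch["text_tokens"])

    # Simplified DDPM-style noising of the target latent (schedule is a placeholder).
    t = torch.randint(0, 1000, (z.shape[0],), device=z.device)
    noise = torch.randn_like(z)
    alpha = (1.0 - t.float() / 1000.0).view(-1, 1, 1, 1)
    z_t = alpha.sqrt() * z + (1.0 - alpha).sqrt() * noise

    # Predict the noise conditioned on the background latent and the text embedding;
    # `hidden` stands for intermediate features fed to the mask predictor (assumed).
    noise_pred, hidden = unet(torch.cat([z_t, c_img], dim=1), t, c_txt)
    loss_ldm = F.mse_loss(noise_pred, noise)

    # Object Mask Predictor supervised by the OABench ground-truth binary mask
    # (expected as a float tensor with the same shape as the predicted logits).
    mask_logits = omp(hidden)
    loss_omp = F.binary_cross_entropy_with_logits(mask_logits, batch["object_mask"])

    return loss_ldm + lambda_omp * loss_omp
```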
The contribution of OABench is foundational: it provides aligned image–text–mask tuples that directly supply both the supervised training signal and a broad variety of synthetic examples.
4. Dataset Quality Control and Rational Filtering
Quality assurance within OABench is performed by:
- Excluding poorly inpainted examples via quantitative CLIP scoring,
- Enforcing object size and completeness constraints,
- Using advanced background-preserving inpainting (PowerPaint) to minimize semantic artifacts.
This meticulous selection ensures that object addition models trained on OABench do not learn spurious correlations or introduce background distortions, resulting in greater generalization to arbitrary natural scenes.
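A minimal sketch of the size and completeness checks is given below; the area thresholds and the border-contact heuristic are assumptions for illustration, not OABench's documented rules.

```python
import numpy as np

def passes_size_and_completeness_filters(mask: np.ndarray,
                                         min_area_ratio: float = 0.01,
                                         max_area_ratio: float = 0.5) -> bool:
    """Reject objects that are too small, too large, or cut off at the image edge.
    `mask` is an H x W binary array; the thresholds are illustrative assumptions."""
    h, w = mask.shape
    area_ratio = mask.sum() / float(h * w)
    if not (min_area_ratio <= area_ratio <= max_area_ratio):
        return False
    # Objects touching the image border are likely truncated or partially obscured.
    touches_border = (mask[0, :].any() or mask[-1, :].any()
                      or mask[:, 0].any() or mask[:, -1].any())
    return not touches_border
```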
5. Technical Impact and Metric Evaluation
The reliability and diversity of OABench are reflected in technical metrics assessed on models trained with it. Performance is evaluated using:
- LPIPS (Learned Perceptual Image Patch Similarity) for background consistency,
- Local CLIP Score for textual–visual alignment within the region of the added object,
- Success Rate (reported at approximately 98.5% for Diffree) and a unified metric score compared against baseline approaches.
High success rates and strong metric scores attributed to Diffree are directly traceable to OABench’s well-curated, exhaustive pairing of images, masks, and textual descriptors, which inform both mask localization and final image synthesis.
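The background-consistency check can be approximated with the `lpips` package, as in the sketch below; masking out the added-object region and the choice of the AlexNet backbone are assumptions rather than the authors' exact evaluation protocol.

```python
import lpips
import torch

# LPIPS measures perceptual distance; lower values indicate that the background
# outside the added object changed less. Backbone choice ('alex') is an assumption.
lpips_fn = lpips.LPIPS(net="alex")

def background_lpips(original: torch.Tensor, edited: torch.Tensor,
                     object_mask: torch.Tensor) -> float:
    """Compare only the region outside the added object.
    Images are N x 3 x H x W tensors scaled to [-1, 1]; mask is N x 1 x H x W in {0, 1}."""
    keep = 1.0 - object_mask  # 1 where the background should remain untouched
    with torch.no_grad():
        dist = lpips_fn(original * keep, edited * keep)
    return float(dist.mean())
```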
6. Contextual Significance and Research Implications
OABench enables generative models to add objects into images guided solely by text, without requiring bounding boxes or manual scribble masks. The dataset’s structure—rooted in natural spatial context and verified through semantically meaningful filters—facilitates accurate, context-aware object insertion. Its coverage of object categories and background types empowers iterative inpainting: multiple objects may be added, with each step leveraging priors present in the underlying data.
A plausible implication is that such paired, contextually grounded datasets are critical for advancing text-to-image and image editing models that must reason jointly about semantics, spatial context, and photorealism.
7. Summary Table
| Property | Value/Description | Significance |
|---|---|---|
| Source datasets | COCO, LVIS (with instance masks) | Ensures diversity and realism |
| Curation steps | Object selection, advanced inpainting, CLIP-based filtering | Quality and consistency |
| Final size | ~74,000 quadruples | Sufficient for robust learning |
| Components | Original image, inpainted image, object mask, object label | Enables end-to-end training |
| Application | Training Diffree for text-guided, shape-free object addition | Advances T2I editing methods |
OABench establishes a technical foundation for text-guided image inpainting by combining high-quality segmentation, rigorous post-processing, and context-sensitive pairing, thereby facilitating research and deployment of controlled object addition in generative vision models.