
BrushData-v2: Object-Driven Inpainting Dataset

Updated 5 January 2026
  • BrushData-v2 is a large-scale, object-driven image inpainting dataset that uses real-object semantic masks generated via Grounded-SAM and stringent filtering criteria.
  • The dataset comprises 200,000 image/mask pairs across diverse resolutions and object categories, enabling robust segmentation-based training of diffusion models.
  • BrushData-v2 supports rigorous evaluation with multiple perceptual and reconstruction metrics, offering practical insights for advancing generative inpainting research.

BrushData-v2 is a large-scale, object-driven image inpainting dataset introduced to support segmentation-based training and evaluation of diffusion models for image restoration tasks. Constructed as part of the BrushNet framework, BrushData-v2 provides high-quality, open-world semantic masks derived from natural images, enabling rigorous inpainting benchmarks and methodological advances in generative modeling (Ju et al., 2024).

1. Construction Methodology

BrushData-v2 utilizes the LAION-Aesthetic subset of the LAION-5B corpus, emphasizing high-quality (“aesthetic”) image-text pairs as input. Semantic mask generation is performed using Grounded-SAM (Grounded Segment Anything), which processes each image and its associated LAION caption to produce open-world object segments. Each mask includes a confidence score, enabling automated filtering: only masks with confidence ≥ 0.8 are retained. Additional rejection criteria include exclusion of masks occupying less than 1% of the image area and masks exhibiting disconnected “islands.” The retained masks are resized via cubic interpolation to match the latent resolution used in Stable Diffusion (typically $64\times64$). RGB images are center-cropped (or padded) with long edges capped at $1024$ px and subsequently encoded into $4\times$ downsampled VAE latents ($256\times256$ to $64\times64$). Random subsets undergo manual spot checks to further improve dataset quality.
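
The filtering stage can be summarized in a short sketch. This is an illustrative reconstruction under the stated criteria, not the released pipeline; the helper names keep_mask and to_latent_resolution are assumptions.

import numpy as np
from PIL import Image
from scipy import ndimage

MIN_CONFIDENCE = 0.8       # Grounded-SAM masks below this score are rejected
MIN_AREA_FRACTION = 0.01   # masks covering less than 1% of the image are rejected
LATENT_SIZE = (64, 64)     # Stable Diffusion latent grid

def keep_mask(mask: np.ndarray, confidence: float) -> bool:
    """Apply the confidence, area, and connectivity criteria to one binary (0/1) mask."""
    if confidence < MIN_CONFIDENCE:
        return False
    if mask.mean() < MIN_AREA_FRACTION:   # fraction of pixels covered by the mask
        return False
    _, num_islands = ndimage.label(mask)  # count connected components
    return num_islands == 1               # drop masks with disconnected "islands"

def to_latent_resolution(mask: np.ndarray) -> np.ndarray:
    """Resize a binary mask to the latent grid with cubic interpolation, then re-binarize."""
    resized = Image.fromarray((mask * 255).astype(np.uint8)).resize(LATENT_SIZE, Image.BICUBIC)
    return (np.asarray(resized) > 127).astype(np.uint8)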

2. Dataset Scale, Annotation, and Versioning

BrushData-v2 contains 200,000 image/mask pairs, with original image resolutions distributed approximately uniformly across $256\times256$, $512\times512$, $768\times768$, and $1024\times1024$ pixels, ultimately mapped to $64\times64$ VAE latent grids. The semantic categories, derived from LAION captions and Grounded-SAM outputs, include approximately 25,000 humans/faces, 50,000 animals, 60,000 indoor scenes, and 65,000 outdoor scenes. Each image typically contains 1–3 single-channel binary PNG masks, distinguishing object interiors, exteriors, and multi-part segmentations.

Compared to BrushData-v1—which used synthetic, randomly drawn brush masks over LAION images—v2 introduces realistic object masks and expands coverage to thousands of object classes. Statistical comparisons are summarized below:

Version  #Images  Avg. mask area  #Categories  Avg. mask islands
v1       100,000  23%             1            1
v2       200,000  29%             >1,000       1.1

BrushData-v2 employs automated confidence filtering, area/continuity-based rejection, and manual spot checks to achieve higher annotation accuracy.

3. Data Organization and Access

The dataset is organized for efficient usage and compatibility with generative inpainting workflows:

  • Folder structure:
    • images/: Cropped RGB images in .jpg or .png format.
    • masks/: Downsampled single-channel PNG masks.
    • latents/: $4\times$ downsampled VAE-encoded image latents as .pt or .npy files.
    • annotations.jsonl: JSON Lines file, each object encapsulating metadata.
  • Example schema:

{
  "image_id": "00012345",
  "file_name": "images/00012345.jpg",
  "latent_file": "latents/00012345.npy",
  "mask_file": "masks/00012345.png",
  "caption": "a tabby cat sitting on a wooden floor",
  "objects": [
    { "label": "cat", "confidence": 0.92 },
    { "label": "floor", "confidence": 0.85 }
  ]
}
Each record includes LAION captions, object labels, and confidence scores, facilitating fine-grained retrieval and analysis.
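
Given this layout, a minimal loading loop might look like the following sketch; the field names follow the schema above, while the use of numpy and PIL for .npy latents and PNG masks is an assumption.

import json
import numpy as np
from PIL import Image

def iter_records(root: str):
    """Yield (latent, mask, caption, objects) tuples indexed by annotations.jsonl."""
    with open(f"{root}/annotations.jsonl") as fh:
        for line in fh:
            rec = json.loads(line)
            latent = np.load(f"{root}/{rec['latent_file']}")             # VAE latent, e.g. 4x64x64
            mask = np.asarray(Image.open(f"{root}/{rec['mask_file']}"))  # single-channel PNG mask
            yield latent, (mask > 0).astype(np.float32), rec["caption"], rec["objects"]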

4. Segmentation-Guided Inpainting Training Protocols

BrushData-v2 is optimized for segmentation-driven inpainting using dual-branch diffusion models. Model input includes the VAE latent of the masked image $z_0^{mask}$ (masked-out pixels zeroed), the binary mask resized to the latent grid $m^{resize}$, and the noise latent $z_t$, constructed as $[z_t; z_0^{mask}; m^{resize}] \in \mathbb{R}^{C\times H \times W}$. Training employs feature injection via dual branching [Eq. 6 in the cited paper]:

$$\epsilon_\theta^{main}(z_t, t, C)_i \leftarrow \epsilon_\theta^{main}(z_t, t, C)_i + w \cdot \mathcal{Z}\!\left(\epsilon_\theta^{BrushNet}([z_t, z_0^{mask}, m^{resize}], t)_i\right)$$

where $\mathcal{Z}$ denotes a zero-initialized $1 \times 1$ convolution and $w$ is the preservation scale. The training objective follows the standard diffusion reconstruction loss:

$$L(\theta) = \mathbb{E}_{z_0,\ \epsilon \sim \mathcal{N}(0, I),\ t \sim U[1, T]} \left\|\epsilon - \epsilon_\theta(z_t, t, C)\right\|_2^2$$
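
A schematic PyTorch rendering of the input construction, feature injection, and loss is sketched below. The module interfaces (main_unet, brushnet, zero_convs) and the noise-schedule handling are illustrative assumptions, not the reference implementation.

import torch
import torch.nn.functional as F

def training_step(main_unet, brushnet, zero_convs, z0, mask_latent, text_emb,
                  alphas_cumprod, w=1.0):
    """One schematic step: build [z_t ; z_0^mask ; m^resize], inject BrushNet features
    through zero-initialized 1x1 convolutions, and compute the diffusion loss."""
    b = z0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z0.device)
    noise = torch.randn_like(z0)
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * noise        # forward diffusion q(z_t | z_0)

    z0_masked = z0 * (1.0 - mask_latent)                  # masked-out latents zeroed
    brush_in = torch.cat([z_t, z0_masked, mask_latent], dim=1)   # [z_t ; z_0^mask ; m^resize]

    brush_feats = brushnet(brush_in, t)                   # per-layer features from the added branch
    injected = [w * zc(f) for zc, f in zip(zero_convs, brush_feats)]  # Z(.) as zero-init 1x1 convs

    eps_pred = main_unet(z_t, t, text_emb, injected)      # frozen main branch consumes the residuals
    return F.mse_loss(eps_pred, noise)                    # ||eps - eps_theta(z_t, t, C)||_2^2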

Data augmentation includes random horizontal flips, color jittering pre-encoding, and random mask dilation/erosion (±2 pixels) during early epochs.
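
The mask dilation/erosion augmentation can be sketched as follows; the ±2 pixel bound comes from the text, while the coin-flip choice between dilation and erosion is an assumption.

import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def jitter_mask(mask: np.ndarray, max_pixels: int = 2) -> np.ndarray:
    """Randomly dilate or erode a binary mask by up to max_pixels (early-epoch augmentation)."""
    k = np.random.randint(0, max_pixels + 1)
    if k == 0:
        return mask
    op = binary_dilation if np.random.rand() < 0.5 else binary_erosion
    return op(mask.astype(bool), iterations=k).astype(mask.dtype)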

5. Evaluation Protocol on BrushBench

Performance is quantitatively evaluated using BrushBench, which employs seven complementary metrics:

  • PSNR (unmasked region): $\mathrm{PSNR} = 10 \cdot \log_{10}(\mathrm{MAX}_I^2 / \mathrm{MSE})$, with $\mathrm{MSE} = \frac{1}{N} \sum_n (I_n - \hat{Y}_n)^2$ (see the sketch after this list)
  • LPIPS: $\mathrm{LPIPS}(x, y) = \sum_l \| \phi_l(x) - \phi_l(y) \|_2$ with pretrained VGG features
  • CLIP Similarity: $\mathrm{CLIPSim}(I, T) = \frac{\langle f_{img}(I), f_{text}(T) \rangle}{\|f_{img}(I)\| \cdot \|f_{text}(T)\|}$
  • ImageReward (IR), Human Preference Score (HPS), Aesthetic Score (AS): learned perceptual metrics
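
As referenced in the PSNR item above, a minimal sketch of the region-restricted PSNR and CLIP similarity computations is given below; it assumes the binary mask marks the inpainted area and that CLIP embeddings are extracted upstream.

import numpy as np

def unmasked_psnr(original: np.ndarray, generated: np.ndarray, mask: np.ndarray,
                  max_val: float = 255.0) -> float:
    """PSNR restricted to the unmasked region (mask == 1 marks the inpainted area)."""
    keep = mask == 0
    mse = np.mean((original[keep].astype(np.float64) - generated[keep]) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def clip_similarity(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between precomputed CLIP image and text embeddings."""
    return float(image_emb @ text_emb /
                 (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))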

Average performance (Stable Diffusion v1.5 backbone):

Inpainting Mode  IR      HPS     AS     PSNR    LPIPS    MSE     CLIP
Inside           12.64↑  27.78↑  6.51↑  31.94↑  0.0080↓  18.67↓  26.39↑
Outside          10.88↑  28.09↑  6.64↑  27.82↑  0.0225↓  4.63↓   27.22↑

Head-to-head comparisons with baselines (e.g., BLD, SDI, HDP, PP, CNI) show that models trained on BrushData-v2 consistently rank first or second across all seven metrics.

6. Practical Guidance and Best Practices

The recommended training regime comprises 430,000 iterations on 8× V100 GPUs (~3 days), a batch size of 16, a learning rate of $10^{-5}$ (no weight decay), a text guidance scale of 7.5 at inference with 50 DDIM steps, and a preservation scale $w$ initialized at 1.0 and reduced to 0.2–0.5 for coarser style control.
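
These settings can be collected into a small configuration sketch; the key names are illustrative rather than taken from any released training script.

# Hyperparameters mirroring the regime above; key names are illustrative.
TRAIN_CONFIG = {
    "iterations": 430_000,
    "gpus": "8x V100",          # roughly 3 days of training
    "batch_size": 16,
    "learning_rate": 1e-5,
    "weight_decay": 0.0,
}
INFERENCE_CONFIG = {
    "guidance_scale": 7.5,      # text guidance scale
    "ddim_steps": 50,
    "preservation_scale": 1.0,  # lower to 0.2-0.5 for coarser style control
}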

Common pitfalls include mask resizing artifacts (mitigated by cubic interpolation and a post-resize blur), loss of high-frequency detail in VAE reconstruction (mitigated by blurred-mask pixel-space copy-paste), and text-image mismatch (text cross-attention is disabled in the BrushNet branch so that masked features remain purely image-conditioned). Domain-shift issues can arise when switching the base model checkpoint; for example, adapting toward an anime-style backbone can alter the style of the unmasked region.
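
The blurred-mask pixel-space copy-paste mitigation mentioned above can be sketched as follows; the Gaussian blur radius is an assumption.

import numpy as np
from scipy.ndimage import gaussian_filter

def blended_paste(original: np.ndarray, generated: np.ndarray, mask: np.ndarray,
                  sigma: float = 3.0) -> np.ndarray:
    """Copy unmasked pixels from the original image back onto the generated result,
    using a Gaussian-blurred mask to avoid hard seams (mask == 1 marks the inpainted area)."""
    soft = gaussian_filter(mask.astype(np.float32), sigma)[..., None]  # (H, W, 1) soft mask
    return soft * generated.astype(np.float32) + (1.0 - soft) * original.astype(np.float32)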

A plausible implication is that BrushData-v2, through its emphasis on real-object segmentation and rigorous annotation protocols, addresses critical shortcomings of synthetic mask datasets, catalyzing robust inpainting research and more reliable evaluation (Ju et al., 2024).

References

Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., & Xu, Q. (2024). BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion. arXiv:2403.06976.
