
BrushData-v2: Object-Driven Inpainting Dataset

Updated 5 January 2026
  • BrushData-v2 is a large-scale, object-driven image inpainting dataset that uses real-object semantic masks generated via Grounded-SAM and stringent filtering criteria.
  • The dataset comprises 200,000 image/mask pairs across diverse resolutions and object categories, enabling robust segmentation-based training of diffusion models.
  • BrushData-v2 supports rigorous evaluation with multiple perceptual and reconstruction metrics, offering practical insights for advancing generative inpainting research.

BrushData-v2 is a large-scale, object-driven image inpainting dataset introduced to support segmentation-based training and evaluation of diffusion models for image restoration tasks. Constructed as part of the BrushNet framework, BrushData-v2 provides high-quality, open-world semantic masks derived from natural images, enabling rigorous inpainting benchmarks and methodological advances in generative modeling (Ju et al., 2024).

1. Construction Methodology

BrushData-v2 utilizes the LAION-Aesthetic subset of the LAION-5B corpus, emphasizing high-quality (“aesthetic”) image-text pairs as input. Semantic mask generation is performed using Grounded-SAM (Grounded Segment Anything), which processes each image and its associated LAION caption to produce open-world object segments. Each mask includes a confidence score, enabling automated filtering: only masks with confidence ≥ 0.8 are retained. Additional rejection criteria include exclusion of masks occupying less than 1% of the image area and masks exhibiting disconnected “islands.” The retained masks are resized via cubic interpolation to match the latent resolution used in Stable Diffusion (typically $64\times64$). RGB images are center-cropped (or padded) with long edges capped at $1024$ px and subsequently encoded into $4\times$ downsampled VAE latents ($256\times256$ to $64\times64$). Random subsets undergo manual spot checks to further improve dataset quality.
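
The filtering stage can be summarized in a short sketch. This is an illustrative reconstruction under the stated criteria, not the released pipeline; the helper names keep_mask and to_latent_resolution are assumptions.

import numpy as np
from PIL import Image
from scipy import ndimage

MIN_CONFIDENCE = 0.8       # Grounded-SAM masks below this score are rejected
MIN_AREA_FRACTION = 0.01   # masks covering less than 1% of the image are rejected
LATENT_SIZE = (64, 64)     # Stable Diffusion latent grid

def keep_mask(mask: np.ndarray, confidence: float) -> bool:
    """Apply the confidence, area, and connectivity criteria to one binary (0/1) mask."""
    if confidence < MIN_CONFIDENCE:
        return False
    if mask.mean() < MIN_AREA_FRACTION:   # fraction of pixels covered by the mask
        return False
    _, num_islands = ndimage.label(mask)  # count connected components
    return num_islands == 1               # drop masks with disconnected "islands"

def to_latent_resolution(mask: np.ndarray) -> np.ndarray:
    """Resize a binary mask to the latent grid with cubic interpolation, then re-binarize."""
    resized = Image.fromarray((mask * 255).astype(np.uint8)).resize(LATENT_SIZE, Image.BICUBIC)
    return (np.asarray(resized) > 127).astype(np.uint8)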

2. Dataset Scale, Annotation, and Versioning

BrushData-v2 contains 200,000 image/mask pairs, with original image resolutions distributed approximately uniformly across $256\times256$, $512\times512$, $768\times768$, and $1024\times1024$ pixels, ultimately mapped to $64\times64$ VAE latent grids. The semantic categories, derived from LAION captions and Grounded-SAM outputs, include approximately 25,000 humans/faces, 50,000 animals, 60,000 indoor scenes, and 65,000 outdoor scenes. Each image typically contains 1–3 single-channel binary PNG masks, distinguishing object interiors, exteriors, and multi-part segmentations.

Compared to BrushData-v1—which used synthetic, randomly drawn brush masks over LAION images—v2 introduces realistic object masks and expands coverage to thousands of object classes. Statistical comparisons are summarized below:

Version  #Images  Avg. mask area  #Categories  Avg. mask islands
v1       100,000  23%             1            1
v2       200,000  29%             >1,000       1.1

BrushData-v2 employs automated confidence filtering, area/continuity-based rejection, and manual spot checks to achieve higher annotation accuracy.

3. Data Organization and Access

The dataset is organized for efficient usage and compatibility with generative inpainting workflows:

  • Folder structure:
    • images/: Cropped RGB images in .jpg or .png format.
    • masks/: Downsampled single-channel PNG masks.
    • latents/: $4\times$ downsampled VAE-encoded image latents as .pt or .npy files.
    • annotations.jsonl: JSON Lines file, each object encapsulating metadata.
  • Example schema:

{
  "image_id": "00012345",
  "file_name": "images/00012345.jpg",
  "latent_file": "latents/00012345.npy",
  "mask_file": "masks/00012345.png",
  "caption": "a tabby cat sitting on a wooden floor",
  "objects": [
    { "label": "cat", "confidence": 0.92 },
    { "label": "floor", "confidence": 0.85 }
  ]
}
Each record includes LAION captions, object labels, and confidence scores, facilitating fine-grained retrieval and analysis.
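
Given this layout, a minimal loading loop might look like the following sketch; the field names follow the schema above, while the use of numpy and PIL for .npy latents and PNG masks is an assumption.

import json
import numpy as np
from PIL import Image

def iter_records(root: str):
    """Yield (latent, mask, caption, objects) tuples indexed by annotations.jsonl."""
    with open(f"{root}/annotations.jsonl") as fh:
        for line in fh:
            rec = json.loads(line)
            latent = np.load(f"{root}/{rec['latent_file']}")             # VAE latent, e.g. 4x64x64
            mask = np.asarray(Image.open(f"{root}/{rec['mask_file']}"))  # single-channel PNG mask
            yield latent, (mask > 0).astype(np.float32), rec["caption"], rec["objects"]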

4. Segmentation-Guided Inpainting Training Protocols

BrushData-v2 is optimized for segmentation-driven inpainting using dual-branch diffusion models. Model input includes the VAE latent of the masked image $z_0^{mask}$ (masked-out pixels zeroed), the binary mask resized to the latent grid $m^{resize}$, and the noise latent $z_t$, constructed as $[z_t; z_0^{mask}; m^{resize}] \in \mathbb{R}^{C\times H \times W}$. Training employs feature injection via dual branching [Eq. 6 in the cited paper]:

$$\epsilon_\theta^{main}(z_t, t, C)_i \leftarrow \epsilon_\theta^{main}(z_t, t, C)_i + w \cdot \mathcal{Z}\!\left(\epsilon_\theta^{BrushNet}([z_t, z_0^{mask}, m^{resize}], t)_i\right)$$

where $\mathcal{Z}$ denotes a zero-initialized $1 \times 1$ convolution and $w$ is the preservation scale. The training objective follows the standard diffusion reconstruction loss:

$$L(\theta) = \mathbb{E}_{z_0,\ \epsilon \sim \mathcal{N}(0, I),\ t \sim U[1, T]} \left\|\epsilon - \epsilon_\theta(z_t, t, C)\right\|_2^2$$
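
A schematic PyTorch rendering of the input construction, feature injection, and loss is sketched below. The module interfaces (main_unet, brushnet, zero_convs) and the noise-schedule handling are illustrative assumptions, not the reference implementation.

import torch
import torch.nn.functional as F

def training_step(main_unet, brushnet, zero_convs, z0, mask_latent, text_emb,
                  alphas_cumprod, w=1.0):
    """One schematic step: build [z_t ; z_0^mask ; m^resize], inject BrushNet features
    through zero-initialized 1x1 convolutions, and compute the diffusion loss."""
    b = z0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z0.device)
    noise = torch.randn_like(z0)
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * noise        # forward diffusion q(z_t | z_0)

    z0_masked = z0 * (1.0 - mask_latent)                  # masked-out latents zeroed
    brush_in = torch.cat([z_t, z0_masked, mask_latent], dim=1)   # [z_t ; z_0^mask ; m^resize]

    brush_feats = brushnet(brush_in, t)                   # per-layer features from the added branch
    injected = [w * zc(f) for zc, f in zip(zero_convs, brush_feats)]  # Z(.) as zero-init 1x1 convs

    eps_pred = main_unet(z_t, t, text_emb, injected)      # frozen main branch consumes the residuals
    return F.mse_loss(eps_pred, noise)                    # ||eps - eps_theta(z_t, t, C)||_2^2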

Data augmentation includes random horizontal flips, color jittering pre-encoding, and random mask dilation/erosion (±2 pixels) during early epochs.
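
The mask dilation/erosion augmentation can be sketched as follows; the ±2 pixel bound comes from the text, while the coin-flip choice between dilation and erosion is an assumption.

import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def jitter_mask(mask: np.ndarray, max_pixels: int = 2) -> np.ndarray:
    """Randomly dilate or erode a binary mask by up to max_pixels (early-epoch augmentation)."""
    k = np.random.randint(0, max_pixels + 1)
    if k == 0:
        return mask
    op = binary_dilation if np.random.rand() < 0.5 else binary_erosion
    return op(mask.astype(bool), iterations=k).astype(mask.dtype)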

5. Evaluation Protocol on BrushBench

Performance is quantitatively evaluated using BrushBench, which employs seven complementary metrics:

  • PSNR (unmasked region): $\mathrm{PSNR} = 10 \cdot \log_{10}(\mathrm{MAX}_I^2 / \mathrm{MSE})$, with $\mathrm{MSE} = \frac{1}{N} \sum_n (I_n - \hat{Y}_n)^2$ (see the sketch after this list)
  • LPIPS: $\mathrm{LPIPS}(x, y) = \sum_l \| \phi_l(x) - \phi_l(y) \|_2$ with pretrained VGG features
  • CLIP Similarity: $\mathrm{CLIPSim}(I, T) = \frac{\langle f_{img}(I), f_{text}(T) \rangle}{\|f_{img}(I)\| \cdot \|f_{text}(T)\|}$
  • ImageReward (IR), Human Preference Score (HPS), Aesthetic Score (AS): learned perceptual metrics
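
As referenced in the PSNR item above, a minimal sketch of the region-restricted PSNR and CLIP similarity computations is given below; it assumes the binary mask marks the inpainted area and that CLIP embeddings are extracted upstream.

import numpy as np

def unmasked_psnr(original: np.ndarray, generated: np.ndarray, mask: np.ndarray,
                  max_val: float = 255.0) -> float:
    """PSNR restricted to the unmasked region (mask == 1 marks the inpainted area)."""
    keep = mask == 0
    mse = np.mean((original[keep].astype(np.float64) - generated[keep]) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def clip_similarity(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between precomputed CLIP image and text embeddings."""
    return float(image_emb @ text_emb /
                 (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))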

Average performance (Stable Diffusion v1.5 backbone):

Inpainting Mode  IR      HPS     AS     PSNR    LPIPS    MSE     CLIP
Inside           12.64↑  27.78↑  6.51↑  31.94↑  0.0080↓  18.67↓  26.39↑
Outside          10.88↑  28.09↑  6.64↑  27.82↑  0.0225↓  4.63↓   27.22↑

Head-to-head comparisons with baselines (e.g., BLD, SDI, HDP, PP, CNI) show that models trained on BrushData-v2 consistently rank first or second across all seven metrics.

6. Practical Guidance and Best Practices

The recommended training regime comprises 430,000 iterations on 8× V100 GPUs (~3 days), a batch size of 16, a learning rate of $10^{-5}$ (no weight decay), a text guidance scale of 7.5 at inference with 50 DDIM steps, and a preservation scale $w$ initialized at 1.0 and reduced to 0.2–0.5 for coarser style control.
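
These settings can be collected into a small configuration sketch; the key names are illustrative rather than taken from any released training script.

# Hyperparameters mirroring the regime above; key names are illustrative.
TRAIN_CONFIG = {
    "iterations": 430_000,
    "gpus": "8x V100",          # roughly 3 days of training
    "batch_size": 16,
    "learning_rate": 1e-5,
    "weight_decay": 0.0,
}
INFERENCE_CONFIG = {
    "guidance_scale": 7.5,      # text guidance scale
    "ddim_steps": 50,
    "preservation_scale": 1.0,  # lower to 0.2-0.5 for coarser style control
}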

Common pitfalls include mask resizing artifacts (mitigated by cubic interpolation and a post-resize blur), loss of high-frequency detail in VAE reconstruction (mitigated by blurred-mask pixel-space copy-paste), and text-image mismatch (text cross-attention is disabled in the BrushNet branch so that masked features remain purely image-conditioned). Domain-shift issues can arise when switching the base model checkpoint; for example, adapting toward an anime-style backbone can alter the style of the unmasked region.
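
The blurred-mask pixel-space copy-paste mitigation mentioned above can be sketched as follows; the Gaussian blur radius is an assumption.

import numpy as np
from scipy.ndimage import gaussian_filter

def blended_paste(original: np.ndarray, generated: np.ndarray, mask: np.ndarray,
                  sigma: float = 3.0) -> np.ndarray:
    """Copy unmasked pixels from the original image back onto the generated result,
    using a Gaussian-blurred mask to avoid hard seams (mask == 1 marks the inpainted area)."""
    soft = gaussian_filter(mask.astype(np.float32), sigma)[..., None]  # (H, W, 1) soft mask
    return soft * generated.astype(np.float32) + (1.0 - soft) * original.astype(np.float32)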

A plausible implication is that BrushData-v2, through its emphasis on real-object segmentation and rigorous annotation protocols, addresses critical shortcomings of synthetic mask datasets, catalyzing robust inpainting research and more reliable evaluation (Ju et al., 2024).

References

Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., & Xu, Q. (2024). BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion. arXiv:2403.06976.
