BrushBench: Inpainting Evaluation Suite
- BrushBench is a comprehensive benchmark suite designed to assess inpainting algorithms across image quality, masked region reconstruction, and text-image semantic alignment.
- It employs specialized metrics such as IR, PSNR, LPIPS, and CLIP similarity to quantify both qualitative and quantitative aspects of generative performance.
- Results on the benchmark have been used to validate multi-task training strategies and dual-branch architectures that enhance semantic, aesthetic, and structural fidelity in inpainted images.
BrushBench is a comprehensive benchmark suite for evaluating object inpainting models, with particular emphasis on the semantic alignment between generated image content and textual prompts, as well as structural and stylistic consistency. In diffusion-based inpainting research especially, BrushBench has established itself as a pivotal resource for quantifying both qualitative and quantitative aspects of performance.
1. Benchmark Scope and Assessment Criteria
BrushBench is designed to dissect the capabilities of inpainting algorithms across three principal axes: image quality, masked region reconstruction, and text-image semantic alignment. Evaluation protocols mandate rigorous measurement using specialized metrics tailored for each facet of inpainting:
- Image Quality
- Image Reward (IR): A learned measure of human aesthetic preference, reported scaled by a factor of 10 to accentuate differences in the perceived quality of outputs.
- Aesthetic Score (AS): Tracks the overall visual appeal, reflecting higher-level perceptual judgments of image fidelity.
- Masked Region Preservation
- PSNR (Peak Signal-to-Noise Ratio): Quantifies reconstruction quality of the inpainted region; higher PSNR signifies improved signal fidelity.
- LPIPS (Learned Perceptual Image Patch Similarity): Assesses perceptual similarity to the original image, reported in scaled form for emphasis; lower values indicate less perceptual deviation.
- MSE (Mean Squared Error): Captures mean difference from ground truth within the masked area; lower MSE reflects higher precision.
- Semantic Consistency (Text Alignment)
- CLIP Similarity: Quantifies alignment between the generated image content and the textual prompt in CLIP embedding space; higher values indicate closer text-image correspondence.
- Local VQA: Scores region-level semantic congruence of the inpainted content against the prompt.
This multifactorial evaluation is integral for models seeking not only to produce realistic images but also to ensure semantic and stylistic harmonization within edits.
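The sketch below illustrates how the masked-region preservation and text-alignment metrics above can be computed with off-the-shelf packages (torch, lpips, and the Hugging Face transformers CLIP model). The tensor conventions, masking strategy, and helper names are assumptions for illustration, not BrushBench's reference implementation.

```python
import torch
import lpips
from transformers import CLIPModel, CLIPProcessor

# Perceptual-similarity network (VGG backbone) used for LPIPS.
lpips_net = lpips.LPIPS(net="vgg")

# CLIP model/processor for text-image alignment scoring.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def masked_region_metrics(pred, target, mask):
    """pred, target: float tensors in [0, 1], shape (1, 3, H, W); mask: (1, 1, H, W), binary."""
    diff = (pred - target) * mask
    mse = diff.pow(2).sum() / (mask.sum() * 3 + 1e-8)      # mean over masked pixels
    psnr = 10.0 * torch.log10(1.0 / (mse + 1e-12))          # peak signal value is 1.0
    # LPIPS expects inputs in [-1, 1]; here the mask is applied before rescaling.
    lp = lpips_net(pred * mask * 2 - 1, target * mask * 2 - 1)
    return mse.item(), psnr.item(), lp.item()


def clip_similarity(pil_image, prompt):
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = clip_proc(text=[prompt], images=pil_image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()
```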
2. Quantitative Performance and Comparative Results
Empirical results on BrushBench are foundational for establishing state-of-the-art inpainting. For example, in the evaluation of MTADiffusion, the following quantitative outcomes were recorded:
| Model | IR (×10) | AS | PSNR (dB) | LPIPS | MSE |
|---|---|---|---|---|---|
| MTADiffusion | 12.69 | 6.50 | 31.87 | 18.94 | 0.80 |
| SDI | – | – | – | – | – |
| CNI | – | – | – | – | – |
| PP | – | – | – | – | – |
| BrushNet | – | – | – | – | – |
MTADiffusion attained superior scores in all BrushBench metrics compared to SDI (Stable Diffusion Inpainting), CNI (ControlNet Inpainting), PP (PowerPaint), and BrushNet, whose scores are not reproduced in the table above. The local VQA and CLIP similarity measures further underscored the model's ability to produce semantically congruent inpainted regions with high perceptual quality. The explicit inclusion of IR and AS facilitated evaluation of both purely visual and higher-level generative capabilities.
3. Advances Enabled by MTAPipeline and MTADataset
The architecture and evaluation protocol of MTADiffusion highlight BrushBench's suitability for systematically assessing nuanced model advances based on data construction and annotation depth:
- The MTAPipeline leverages Grounded-SAM to extract masks, labels, and bounding boxes, followed by LLaVA for mask-wise content and style annotation (a schematic sketch follows this list). This pipeline produces mask-text pairs with high semantic density, exceeding the descriptive fidelity of whole-image captions or simple semantic labels.
- The resulting MTADataset (5 million images, 25 million mask-text pairs) equips models trained and tested on BrushBench with richer supervision, allowing for more robust generalization and sharper semantic alignment. This suggests BrushBench, when used in conjunction with such datasets, will particularly accentuate differences arising from annotation granularity.
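A minimal sketch of such an annotation loop is shown below. The wrapper functions `segment_with_grounded_sam` and `caption_mask_with_llava` are hypothetical stand-ins for Grounded-SAM and LLaVA inference, and the `MaskTextPair` record is illustrative; the section does not specify the MTAPipeline's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

import numpy as np


@dataclass
class MaskTextPair:
    mask: np.ndarray                      # binary mask for one object instance
    bbox: Tuple[int, int, int, int]       # (x0, y0, x1, y1) bounding box
    label: str                            # open-vocabulary label from the detector
    caption: str                          # mask-wise content and style description


def annotate_image(
    image: np.ndarray,
    segment_with_grounded_sam: Callable,  # image -> list of (mask, bbox, label)
    caption_mask_with_llava: Callable,    # (image, mask) -> str
) -> List[MaskTextPair]:
    """Produce dense mask-text pairs for a single image."""
    pairs = []
    for mask, bbox, label in segment_with_grounded_sam(image):
        # The captioner describes the content and style of the masked region only,
        # which yields denser supervision than a single whole-image caption.
        caption = caption_mask_with_llava(image, mask)
        pairs.append(MaskTextPair(mask=mask, bbox=bbox, label=label, caption=caption))
    return pairs
```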
4. Model Architecture and Loss Formulations in Context
When evaluated on BrushBench, architectures such as MTADiffusion employ dual-branch designs:
- Standard UNet Branch: Handles the core inpainting process.
- Brush Branch: Incorporates multi-resolution self-attention blocks for contextualized reconstruction, with global image information tightly integrated.
Their interaction is mathematically encoded via a “zero convolution” operation:
$\epsilon_\theta(z_t, t, C)_j \leftarrow \epsilon_\theta(z_t, t, C)_j + w \cdot \mathcal{Z}\left(\epsilon_\theta^{\text{brush}}\left([z_t, z_0^{\text{masked}}, m^{\text{resized}}], t\right)_j\right)$
where $\mathcal{Z}(\cdot)$ denotes the zero convolution, $w$ a weighting hyperparameter, $z_t$ the noisy latent, $z_0^{\text{masked}}$ the latent of the masked image, $m^{\text{resized}}$ the resized mask, and the subscript $j$ indexes the UNet layer receiving the injected brush-branch features.
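A minimal PyTorch sketch of this feature-injection step is given below, assuming the brush-branch features have already been computed at the same resolution as the corresponding UNet layer; the module and argument names are illustrative rather than MTADiffusion's actual implementation.

```python
import torch
import torch.nn as nn


class ZeroConvInjection(nn.Module):
    """Adds brush-branch features into a UNet layer through a zero-initialized 1x1 conv."""

    def __init__(self, channels: int, w: float = 1.0):
        super().__init__()
        self.w = w  # weighting hyperparameter
        self.zero_conv = nn.Conv2d(channels, channels, kernel_size=1)
        # Zero initialization: at the start of training the brush branch contributes
        # nothing, so the pretrained UNet's behavior is preserved.
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, unet_feat: torch.Tensor, brush_feat: torch.Tensor) -> torch.Tensor:
        return unet_feat + self.w * self.zero_conv(brush_feat)


# Usage: inject features layer-wise during the denoising forward pass.
inject = ZeroConvInjection(channels=320)
unet_feat = torch.randn(1, 320, 64, 64)
brush_feat = torch.randn(1, 320, 64, 64)
fused = inject(unet_feat, brush_feat)
```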
Style Consistency Loss:
A VGG network extracts hierarchical style features, with loss enforced in Gram matrix space:
$\mathcal{L}_{\text{style}} = \sum_{l} \left\| \mathcal{G}\big(\phi_l(\hat{x})\big) - \mathcal{G}\big(\phi_l(x)\big) \right\|_2^2$
where $\mathcal{G}(\cdot)$ computes the Gram matrix, and $\phi_l(\hat{x})$ and $\phi_l(x)$ are the layer-$l$ VGG style embeddings of the generated and ground-truth images, respectively. This loss penalizes stylistic incongruity in the output.
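A sketch of such a Gram-matrix style loss in PyTorch, using torchvision's pretrained VGG-16 as the feature extractor, is shown below; the choice of layers and the VGG variant are assumptions, since the section does not specify them.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen VGG feature extractor providing hierarchical style features.
vgg = vgg16(weights=VGG16_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

STYLE_LAYERS = {3, 8, 15, 22}  # relu1_2, relu2_2, relu3_3, relu4_3 (illustrative choice)


def gram(feat: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a feature map, normalized by its size."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)


def style_loss(generated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Sum of squared Gram-matrix differences over the selected VGG layers."""
    loss, x, y = 0.0, generated, target
    for i, layer in enumerate(vgg):
        x, y = layer(x), layer(y)
        if i in STYLE_LAYERS:
            loss = loss + F.mse_loss(gram(x), gram(y))
    return loss
```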
5. Multi-Task Training Strategy and Structural Stability
BrushBench is particularly germane for testing inpainting models emphasizing structure preservation. MTADiffusion adopts joint training on inpainting and edge prediction, extending the brush branch for edge map output and optimizing the structural objective:
$\mathcal{L}_{\text{struct}} = \left\| \hat{e} - e_{\text{gt}} \right\|_2^2$
with $\hat{e}$ as the network's edge prediction and $e_{\text{gt}}$ as a downsampled ground-truth edge map generated by a Sobel operator. This setup encourages models to retain object boundaries and content integrity under significant transformations. The resulting improvements in BrushBench VQA and structural metrics reflect the efficacy of this dual-objective paradigm.
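Below is a sketch of this structural objective under the assumption that it is a simple mean-squared error between the predicted edge map and a Sobel-derived, downsampled ground-truth edge map; the exact loss form and downsampling factor are assumptions.

```python
import torch
import torch.nn.functional as F

# Sobel kernels for horizontal and vertical gradients.
_SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_SOBEL_Y = _SOBEL_X.transpose(2, 3)


def sobel_edges(gray: torch.Tensor) -> torch.Tensor:
    """Edge-magnitude map of a grayscale image, shape (N, 1, H, W)."""
    gx = F.conv2d(gray, _SOBEL_X, padding=1)
    gy = F.conv2d(gray, _SOBEL_Y, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)


def structural_loss(edge_pred: torch.Tensor, gt_gray: torch.Tensor, scale: int = 8) -> torch.Tensor:
    """MSE between the brush branch's edge prediction and a downsampled Sobel edge map."""
    e_gt = sobel_edges(gt_gray)
    e_gt = F.interpolate(e_gt, scale_factor=1.0 / scale, mode="bilinear", align_corners=False)
    return F.mse_loss(edge_pred, e_gt)
```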
6. Context, Significance, and Interpretive Considerations
BrushBench serves as a robust evaluation environment exposing the limits and advances of object inpainting models. Its comprehensive metric set—encompassing visual qualities, reconstruction fidelity, and semantic alignment—enables precise attribution of a method's capabilities and limitations. The benchmark's design aligns with contemporary research imperatives around controllable, prompt-driven generative editing; models tested on BrushBench are compelled to demonstrate not only reconstruction skill but rigorous semantic and stylistic fidelity. A plausible implication is that further model advances—particularly those enabled by more granular annotation protocols or multi-objective optimization—will be detectable and quantifiable through BrushBench benchmarks.
BrushBench's integration within the evaluation stack of generative models such as MTADiffusion has established it as a standard for state-of-the-art claims concerning inpainting quality, semantic congruence, and structural realism (Huang et al., 30 Jun 2025).