3D PixBrush: Localized Texture Synthesis
- 3D PixBrush is a method for image-driven, localized texture synthesis on 3D meshes that automatically identifies mesh regions and applies reference styles without user input.
- It utilizes a modified score distillation sampling framework that integrates both text- and image-driven loss terms with an MLP-based mask predictor for precise localization.
- The system achieves high-fidelity texture placement with global coherence and local precision, outperforming baselines in metrics such as CLIP R-Precision and LPIPS.
3D PixBrush is a method for image-driven, localized texture synthesis on 3D meshes. It introduces a framework that predicts both a localization mask and a region-specific color field on the mesh surface, enabling automatic placement and high-fidelity transfer of visually complex objects or styles from a reference image to a specified mesh region, all without user marking or region selection. Its architecture relies on modifications to score distillation sampling (SDS) that handle both text- and image-driven loss terms, culminating in a system that is globally contextualized and locally precise (Decatur et al., 4 Jul 2025).
1. Problem Formulation and Objectives
3D PixBrush is designed to address the challenge of transplanting an object—defined jointly by appearance and geometric extent—from a 2D reference image onto a 3D mesh in a localized manner. The system operates on:
- Inputs
- A 3D mesh $\mathcal{M}$ with UV parameterization; points on the surface are denoted $p \in \mathcal{M}$.
- A reference image $I_{\text{ref}}$ depicting the target object/style.
- A text prompt $y$ (e.g., “a cow with sunglasses”) specifying coarse semantics.
- Outputs
- A probability-valued localization mask $m : \mathcal{M} \to [0,1]$ identifying mesh regions for synthesis.
- A synthesized color field $c : \mathcal{M} \to [0,1]^3$ that matches $I_{\text{ref}}$ within the region selected by $m$.
The compound objective is to autonomously select, localize, and texture mesh regions so that $m$ identifies the reference object’s geometry and $c$ replicates the style/structure of $I_{\text{ref}}$. No user-provided region hints are required. Optimization is performed jointly over a localization network and a texture network using mixed text- and image-guided SDS losses (Decatur et al., 4 Jul 2025).
2. Mask Prediction Architecture and Training
The localization mask is parameterized by a compact multilayer perceptron (MLP) coupled to a periodic positional encoder $\gamma(\cdot)$ as used in NeRF, forming $m(p) = \sigma\big(\mathrm{MLP}_{\theta}(\gamma(p))\big)$, with $\sigma$ being the sigmoid nonlinearity.
- Network details
- Six MLP blocks (fully connected → BatchNorm → ReLU), each of width $256$.
- Evaluated per-vertex, per-face, or per-texel as needed.
- Losses
- Text-driven SDS Loss ($\mathcal{L}_{\text{SDS}}$): the mask is rendered via a differentiable renderer $R$, producing an image $R(m)$, to which SDS is applied with a pretrained diffusion noise predictor $\epsilon_\phi$. The objective iteratively nudges mask renders toward the semantic content inferred from the text prompt $y$.
- Image-modulated Texture Loss: propagates gradients through the masked texture render to penalize mismatches between the synthesized texture $c$ and the reference image $I_{\text{ref}}$ in the selected region.
- Smoothing Regularization: a penalty on differences between mask values at neighboring surface points, enforcing local spatial coherence in the mask and suppressing spurious regions. A schematic sketch of the mask predictor follows this list.
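A minimal PyTorch sketch of such a mask predictor is shown below. It follows the stated architecture (NeRF-style sinusoidal positional encoding, six Linear → BatchNorm → ReLU blocks of width $256$, sigmoid head); the number of encoding frequencies is an assumption, and this is an illustration rather than the released implementation.

```python
# Sketch of the mask predictor: NeRF-style positional encoding feeding
# six Linear -> BatchNorm -> ReLU blocks of width 256 with a sigmoid head.
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, num_freqs: int = 10):
        super().__init__()
        self.freqs = 2.0 ** torch.arange(num_freqs) * torch.pi  # periodic frequencies

    def forward(self, p: torch.Tensor) -> torch.Tensor:
        # p: (N, 3) surface points -> (N, 3 * 2 * num_freqs) encoding.
        angles = p[..., None] * self.freqs.to(p.device)          # (N, 3, F)
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

class MaskMLP(nn.Module):
    def __init__(self, num_freqs: int = 10, width: int = 256, blocks: int = 6):
        super().__init__()
        self.encode = PositionalEncoding(num_freqs)
        dims = [3 * 2 * num_freqs] + [width] * blocks
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.ReLU()]
        self.backbone = nn.Sequential(*layers)
        self.head = nn.Linear(width, 1)

    def forward(self, p: torch.Tensor) -> torch.Tensor:
        # Per-point mask probability m(p) in [0, 1].
        return torch.sigmoid(self.head(self.backbone(self.encode(p)))).squeeze(-1)
```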
3. Localization-Modulated Image Guidance Mechanism
The core innovation is "localization-modulated image guidance" (LMIG), which restricts image-conditioned diffusion guidance to only those surface areas selected by the predicted mask:
- SDS is generalized to support both text ($y$) and image ($I_{\text{ref}}$) conditions via a jointly conditioned diffusion backbone with IP-Adapter.
- The mask $m$ is rendered and binarized to generate a screen-space mask $M$.
- At every decoder layer $l$, the attention contribution from the reference-image tokens is multiplied with $M_l$ (the mask downsampled to that layer’s resolution), ensuring spatial correspondence.
- Formally, the modified gradient is
$$\nabla_\theta \mathcal{L}_{\text{LMIG}} = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, I_{\text{ref}},\, M,\, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right],$$
where $\hat{\epsilon}_\phi$ denotes the noise prediction with image cross-attention gated by $M$: only regions inside $M$ are updated with image guidance; outside, only text conditioning applies.
This approach produces globally coherent positioning and locally accurate segmentation of complex objects without explicit interaction.
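The layer-wise gating can be illustrated with a short PyTorch sketch. The function below is a hypothetical stand-in (the name `masked_image_cross_attention`, the tensor layout, and nearest-neighbor downsampling are assumptions, not the IP-Adapter integration itself); it shows how the image-token attention output at one decoder layer would be multiplied by the rendered mask.

```python
# Sketch: gate the image-conditioned cross-attention output with the
# rendered localization mask so image guidance only affects masked pixels.
import torch
import torch.nn.functional as F

def masked_image_cross_attention(q, k_img, v_img, screen_mask, spatial_hw):
    """
    q:            (B, HW, C) query tokens at one decoder layer
    k_img, v_img: (B, T, C)  key/value tokens from the reference image branch
    screen_mask:  (B, 1, H0, W0) binarized render of the localization mask
    spatial_hw:   (H, W) spatial size of this layer, with H * W == HW
    """
    h, w = spatial_hw
    # Standard scaled dot-product cross-attention against the image tokens.
    attn = torch.softmax(q @ k_img.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    img_out = attn @ v_img                              # (B, HW, C)

    # Downsample the screen-space mask to this layer's resolution and gate the
    # image branch: pixels outside the mask receive no image guidance.
    m = F.interpolate(screen_mask, size=(h, w), mode="nearest")
    m = m.flatten(2).transpose(1, 2)                    # (B, HW, 1)
    return img_out * m
```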
4. Texture Field Synthesis and Rendering Pipeline
The color field $c$ is parameterized by an MLP with architecture identical to that of the mask network, except for the output channels ($3$ per surface point). The network is trained to match $I_{\text{ref}}$ within the masked region using the LMIG loss propagated through the rasterized, masked texture field:
- Texture Atlas Evaluation:
- For UV-mapped surfaces, colors $c$ are synthesized at each texel center, multiplied by the mask value $m$, and rendered to screen.
- Additional Losses (both sketched below):
- Total Variation (TV): $\mathcal{L}_{\text{TV}} = \sum_{i,j} \big( \lVert c_{i+1,j} - c_{i,j} \rVert_1 + \lVert c_{i,j+1} - c_{i,j} \rVert_1 \big)$ over texels, promoting local patch smoothness.
- Perceptual (VGG) Loss: $\mathcal{L}_{\text{VGG}} = \sum_k \big\lVert \phi_k\big(R(m \cdot c)\big) - \phi_k(I_{\text{ref}}) \big\rVert_2^2$, where $\phi_k$ denote VGG features at layer $k$.
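A minimal sketch of the two regularizers, using the standard formulations above and torchvision's VGG-16 features as an assumed stand-in for the perceptual backbone (ImageNet normalization and layer choices are illustrative, not the paper's exact configuration):

```python
# Sketch of the TV and perceptual regularizers on the synthesized texture.
import torch
import torch.nn.functional as F
import torchvision

def total_variation(texture: torch.Tensor) -> torch.Tensor:
    """Anisotropic TV over a (3, H, W) texel atlas, penalizing neighbor differences."""
    dh = (texture[:, 1:, :] - texture[:, :-1, :]).abs().mean()
    dw = (texture[:, :, 1:] - texture[:, :, :-1]).abs().mean()
    return dh + dw

# Frozen VGG-16 feature extractor (first few conv blocks) for the perceptual term.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(render: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
    """render, reference: (1, 3, H, W) images in [0, 1]; compares VGG feature maps."""
    return F.mse_loss(vgg(render), vgg(reference))
```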
5. Implementation Protocol
- Data:
- Meshes include VOC 3D models, real human scans, busts, animals.
- Reference images feature high-resolution photos of clothing, accessories, and decorative patterns.
- Prompts are composed as, e.g., “a <mesh-class> with <object>.”
- Optimization:
- Warm-up ($1$k iters): optimizes the localization network $m$ only, using text-driven SDS.
- Joint phase ($10$k iters): optimizes both $m$ and $c$ jointly.
- Adam optimizer with learning-rate decay.
- Each iteration renders one random view.
- Full training takes on the order of hours on an A40 GPU, with reasonable quality achieved within about an hour; a schematic training loop is sketched below.
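The two-phase schedule can be summarized with the schematic loop below. The loss functions are hypothetical placeholders standing in for the text-driven SDS and LMIG losses of Sections 2–4, the learning rate is illustrative, and `MaskMLP` refers to the earlier sketch (in practice the texture network would use a 3-channel output head).

```python
# Schematic two-phase optimization loop (placeholder losses, illustrative lr).
import torch

def sds_text_loss(mask_net, points):
    return mask_net(points).mean()                        # placeholder objective

def lmig_loss(mask_net, tex_net, points):
    return (mask_net(points) * tex_net(points)).mean()    # placeholder objective

mask_net, tex_net = MaskMLP(), MaskMLP()                  # localization and texture networks
opt = torch.optim.Adam(list(mask_net.parameters()) + list(tex_net.parameters()), lr=1e-4)

for it in range(11_000):
    points = torch.rand(1024, 3)                          # stand-in for one random view render
    if it < 1_000:
        loss = sds_text_loss(mask_net, points)            # warm-up: localization only
    else:
        loss = sds_text_loss(mask_net, points) + lmig_loss(mask_net, tex_net, points)
    opt.zero_grad()
    loss.backward()
    opt.step()
```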
6. Evaluation, Benchmarks, and Ablations
- Quantitative Metrics:
- CLIP R-Precision (textual and image-based) between the reference (text prompt or $I_{\text{ref}}$) and mesh renders.
- Cosine similarity in CLIP embedding space.
- LPIPS (AlexNet/VGG) for perceptual similarity.
- User ratings (1–5) on texture and structure quality.
- Intersection-over-Union (IoU) of the predicted mask against ground truth in synthetic settings (a minimal computation sketch follows this section’s lists).
- Performance:
- Exceeds a text-only editor (3D Paintbrush) by more than $15$ percentage points in CLIP R-Precision.
- Achieves at least $4\times$ lower LPIPS.
- User studies report mean scores: $4.1$ (3D PixBrush) vs. $2.7$/$2.0$ (baselines) for structure/texture.
- Qualitative Ablations:
- Without cross-attention masking, image information leaks outside the mask region.
- Removing warm-up leads to poor mask separability and over-globalized texture.
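For the synthetic localization evaluation above, the IoU computation reduces to the standard definition; below is a minimal sketch (the threshold and per-vertex tensor layout are assumptions, not the evaluation code):

```python
# Sketch: IoU between a predicted mask and a ground-truth mask.
import torch

def mask_iou(pred: torch.Tensor, gt: torch.Tensor, thresh: float = 0.5) -> float:
    """pred: per-vertex mask probabilities; gt: binary ground-truth labels."""
    p = pred > thresh
    g = gt.bool()
    inter = (p & g).sum().item()
    union = (p | g).sum().item()
    return inter / max(union, 1)
```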
7. Limitations and Prospective Extensions
- Failure Modes:
- Fine detail (e.g., logos, small text) in $I_{\text{ref}}$ is rendered illegibly (e.g., “adadds”).
- Semantic bleed: co-texturing in related regions (makeup applied to lips).
- Janus effect: mirrored/symmetric texture propagation front to back.
- Future Directions:
- Extension of LMIG to NeRFs, point clouds, volumetric grids, or video.
- Integration of geometric deformation within the active region selected by $m$.
- Support for simultaneous multi-object editing, stacking disjoint fields.
A plausible implication is that LMIG and its multi-modal guidance framework are generally applicable to broader classes of 3D editing, where geometry-aware, high-fidelity, and user-free local synthesis are sought (Decatur et al., 4 Jul 2025).