
3D PixBrush: Localized Texture Synthesis

Updated 20 January 2026
  • 3D PixBrush is a method for image-driven, localized texture synthesis on 3D meshes that automatically identifies mesh regions and applies reference styles without user input.
  • It utilizes a modified score distillation sampling framework that integrates both text- and image-driven loss terms with an MLP-based mask predictor for precise localization.
  • The system achieves high-fidelity texture placement with global coherence and local precision, outperforming baselines in metrics such as CLIP R-Precision and LPIPS.

3D PixBrush is a method for image-driven, localized texture synthesis on 3D meshes. It introduces a framework that predicts both a localization mask and a region-specific color field on the mesh surface, enabling automatic placement and high-fidelity transfer of visually complex objects or styles from a reference image to a specified mesh region, all without user marking or region selection. Its architecture relies on modifications to score distillation sampling (SDS) that handle both text- and image-driven loss terms, culminating in a system that is globally contextualized and locally precise (Decatur et al., 4 Jul 2025).

1. Problem Formulation and Objectives

3D PixBrush is designed to address the challenge of transplanting an object—defined jointly by appearance and geometric extent—from a 2D reference image onto a 3D mesh in a localized manner. The system operates on:

  • Inputs
    • A 3D mesh $M$ with UV parameterization; points on the surface are $x \in \mathbb{R}^3$.
    • A reference image $I_{\text{ref}} \in [0,1]^{H \times W \times 3}$ depicting the target object/style.
    • A text prompt $y$ (e.g., “a cow with sunglasses”) specifying coarse semantics.
  • Outputs
    • A probability-valued localization mask $p(x) \in [0,1]$ identifying mesh regions for synthesis.
    • A synthesized color field $c(x) \in [0,1]^3$ that matches $I_{\text{ref}}$ within the region $p(x) > 0$.

The compound objective is to autonomously select, localize, and texture mesh regions so that $p(x)$ identifies the reference object’s geometry and $c(x)$ replicates the style/structure of $I_{\text{ref}}$. No user-provided region hints are required. Optimization is performed jointly over a localization network $F_\theta$ and a texture network $F_\phi$ using mixed text- and image-guided SDS losses (Decatur et al., 4 Jul 2025).
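
As a concrete picture of this interface, the minimal PyTorch sketch below treats $F_\theta$ and $F_\phi$ as black-box fields over surface points; the stand-in networks, the sampled points, and the compositing with a base color are illustrative assumptions, not the paper's implementation.

```python
import torch

# Stand-in fields (assumptions): map surface points x in R^3 to p(x) and c(x).
f_theta = lambda x: torch.sigmoid(torch.randn(x.shape[0], 1))  # mask p(x) in [0,1]
f_phi   = lambda x: torch.sigmoid(torch.randn(x.shape[0], 3))  # color c(x) in [0,1]^3

x = torch.rand(1024, 3)              # sampled points on the mesh surface M
base_color = torch.rand(1024, 3)     # original mesh appearance (assumption)

p, c = f_theta(x), f_phi(x)

# One plausible compositing for visualization: synthesized color inside the mask,
# original appearance outside it (illustrative, not the paper's exact rendering).
surface_color = p * c + (1.0 - p) * base_color
```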

2. Mask Prediction Architecture and Training

The localization mask $p(x) = F_\theta(x)$ is parameterized by a compact multilayer perceptron (MLP) coupled to a periodic positional encoder $\gamma : \mathbb{R}^3 \to \mathbb{R}^{2D}$, as used in NeRF, forming $F_\theta(x) = \sigma(\text{MLP}_\theta(\gamma(x)))$ with $\sigma$ the sigmoid nonlinearity.

  • Network details
    • Six MLP blocks (fully connected → BatchNorm → ReLU), each of width 256.
    • Evaluated per vertex, per face, or per texel as needed.
  • Losses

    1. Text-driven SDS Loss ($L_{\text{loc}}$): based on rendering the mask via a differentiable renderer $r(\cdot; p)$, producing an image $m$, and applying SDS with a pretrained diffusion noise predictor $\epsilon_\psi$. The objective iteratively nudges mask renders toward the semantic content inferred from the text prompt.
    2. Image-modulated Texture Loss: propagates gradients through the masked $c(x)$ to penalize mismatches between the synthesized texture and the reference image in the selected region.
    3. Smoothing Regularization:

    $$L_{\text{smooth}} = \lambda_s \int_M \|\nabla_x p(x)\|_1\, dx, \quad \lambda_s \approx 10^{-2}$$

    enforcing local spatial coherence in the mask to avoid spurious regions. (A minimal sketch of the mask predictor and this regularizer follows this list.)
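
The sketch below (PyTorch) assumes a standard NeRF-style frequency encoding and discretizes the smoothing integral over mesh edges; the frequency count, the edge-based discretization, and all helper names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """NeRF-style periodic encoding gamma: R^3 -> R^{2D} (here D = 3 * num_freqs)."""
    def __init__(self, num_freqs: int = 6):
        super().__init__()
        self.freqs = 2.0 ** torch.arange(num_freqs) * torch.pi

    def forward(self, x):                        # x: (N, 3)
        angles = x[..., None] * self.freqs       # (N, 3, num_freqs)
        enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return enc.flatten(start_dim=-2)         # (N, 2 * 3 * num_freqs)

class MaskPredictor(nn.Module):
    """Six (Linear -> BatchNorm -> ReLU) blocks of width 256, sigmoid output p(x)."""
    def __init__(self, num_freqs: int = 6, width: int = 256, depth: int = 6):
        super().__init__()
        self.encode = PositionalEncoding(num_freqs)
        dims = [2 * 3 * num_freqs] + [width] * depth
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.ReLU()]
        self.mlp = nn.Sequential(*layers, nn.Linear(width, 1))

    def forward(self, x):
        return torch.sigmoid(self.mlp(self.encode(x)))   # p(x) in [0, 1]

# Discrete stand-in for L_smooth: penalize |p(x_i) - p(x_j)| over mesh edges (assumption).
def smoothness_loss(p, edges, lambda_s: float = 1e-2):
    # p: (V, 1) per-vertex mask values; edges: (E, 2) vertex-index pairs
    return lambda_s * (p[edges[:, 0]] - p[edges[:, 1]]).abs().mean()
```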

3. Localization-Modulated Image Guidance Mechanism

The core innovation is "localization-modulated image guidance" (LMIG), which restricts image-conditioned diffusion guidance to only those surface areas selected by the predicted mask:

  • SDS is generalized to support both text ($y$) and image ($I_{\text{ref}}$) conditions via a jointly conditioned diffusion backbone with IP-Adapter.
  • The mask $p(x)$ is rendered and binarized to obtain $M = 1_{m > \tau}$.
  • At every decoder layer $l$, the attention from the reference-image tokens, $CA_I$, is multiplied by $M_l$ (the downsampled mask), ensuring spatial correspondence.
  • Formally, the modified gradient is:

$$\nabla_x L_{\text{SDS}_\text{loc}}(x, y, I, M) = \mathbb{E}_{t, \epsilon}\left[ w(t)\, \big(\epsilon_\phi(z_t, t, y, I, M) - \epsilon\big) \right]$$

where only regions inside $M$ are updated with image guidance; outside, only text conditioning applies.
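
The gating step can be pictured with the short PyTorch sketch below, which multiplies the image-token attention at one decoder layer by the downsampled binary mask; tensor names, shapes, and the exact point of multiplication are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def masked_image_cross_attention(q, k_img, v_img, mask, hw):
    """Illustrative gating of image-token cross-attention by the rendered mask.
    q:     (B, HW, C)     spatial queries at one decoder layer
    k_img: (B, T, C)      keys from the reference-image tokens (e.g., IP-Adapter branch)
    v_img: (B, T, C)      values from the reference-image tokens
    mask:  (B, 1, H0, W0) binarized mask render, M = 1[m > tau]
    hw:    (h, w)         spatial resolution of this layer, with h * w == HW
    """
    h, w = hw
    attn = torch.softmax(q @ k_img.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)  # (B, HW, T)
    m_l = F.interpolate(mask, size=(h, w), mode="nearest")                        # downsampled M_l
    m_l = m_l.flatten(2).transpose(1, 2)                                          # (B, HW, 1)
    return (attn * m_l) @ v_img   # image guidance contributes only inside the mask
```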

This approach produces globally coherent positioning and locally accurate segmentation of complex objects without explicit interaction.

4. Texture Field Synthesis and Rendering Pipeline

The color field $c(x) = F_\phi(x)$ is parameterized by an MLP with architecture identical to $F_\theta$, except for the output channels (3 per surface point). The network is trained to match $I_{\text{ref}}$ within $p(x)$ using the LMIG loss propagated through the rasterized, masked texture field:

  • Texture Atlas Evaluation:
    • For UV-mapped surfaces, colors are synthesized at each texel center $(u,v)$, multiplied by $p(x)$, and rendered to screen.
  • Additional Losses:
    • Total Variation (TV):

    $$L_{\text{TV}} = \lambda_{\text{TV}} \sum_{u,v} \left( \|c(u+1,v) - c(u,v)\|_1 + \|c(u,v+1) - c(u,v)\|_1 \right)$$

    promoting local patch smoothness, with $\lambda_{\text{TV}} \approx 10^{-4}$.
    • Perceptual (VGG) Loss:

    $$L_{\text{perc}} = \lambda_p \sum_l \|\Phi_l(x_{\text{loc}}) - \Phi_l(I_{\text{ref}})\|^2_2$$

    with $\lambda_p \approx 1$, where $\Phi_l(\cdot)$ denotes the VGG feature map at layer $l$. (Sketches of both terms follow this list.)
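
The two auxiliary terms can be written compactly as below (PyTorch); the texture tensor layout and the `vgg_features` helper are assumptions for illustration.

```python
import torch

def tv_loss(texture, lambda_tv: float = 1e-4):
    """Total variation over a UV texture atlas; texture: (3, H, W) in [0, 1]."""
    du = (texture[:, :, 1:] - texture[:, :, :-1]).abs().sum()   # neighbors along u
    dv = (texture[:, 1:, :] - texture[:, :-1, :]).abs().sum()   # neighbors along v
    return lambda_tv * (du + dv)

def perceptual_loss(render_loc, ref_image, vgg_features, lambda_p: float = 1.0):
    """Squared distance between VGG feature maps Phi_l of the localized render and
    the reference image; `vgg_features` is an assumed callable returning a list of
    feature maps for an image tensor."""
    loss = 0.0
    for phi_x, phi_ref in zip(vgg_features(render_loc), vgg_features(ref_image)):
        loss = loss + (phi_x - phi_ref).pow(2).sum()
    return lambda_p * loss
```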

5. Implementation Protocol

  • Data:

    • Meshes include VOC 3D models, real human scans, busts, animals.
    • Reference images feature high-resolution photos of clothing, accessories, and decorative patterns.
    • Prompts composed as, e.g., “a <mesh-class> with <object>.”
  • Optimization:
    • Warm-up (1k iterations): $L_{\text{loc}}$ only, localizing $F_\theta$ with text-driven SDS.
    • Joint phase (10k iterations): $L_{\text{loc}} + L_{\text{SDS}_\text{loc}}$ on both $F_\theta$ and $F_\phi$.
    • Adam optimizer; learning rate decayed from $1 \times 10^{-3}$ to $1 \times 10^{-4}$; $(\beta_1, \beta_2) = (0.9, 0.999)$.
    • Each iteration uses one random view rendered at $512 \times 512$.
    • Full training takes $\approx 4$ hours on an A40 GPU, with reasonable quality reached in $\approx 1$ hour. (A schematic of this schedule follows this list.)
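
The schedule can be sketched as below (PyTorch); the stub networks and placeholder loss/view functions are assumptions standing in for the actual SDS losses and renderer, and the linear learning-rate decay is only one way to realize the stated $10^{-3} \to 10^{-4}$ schedule.

```python
import torch
import torch.nn as nn

# Stand-ins (assumptions) for the mask/texture networks, losses, and view sampler.
f_theta, f_phi = nn.Linear(3, 1), nn.Linear(3, 3)
def loc_loss(*_):         return f_theta(torch.rand(8, 3)).mean()   # placeholder for L_loc
def sds_loc_loss(*_):     return f_phi(torch.rand(8, 3)).mean()     # placeholder for L_SDS_loc
def sample_random_view(): return None                               # placeholder camera (512x512 render)

opt = torch.optim.Adam(list(f_theta.parameters()) + list(f_phi.parameters()),
                       lr=1e-3, betas=(0.9, 0.999))
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda it: max(0.1, 1.0 - 0.9 * it / 11_000))

for it in range(11_000):
    view = sample_random_view()                  # one random view per iteration
    loss = loc_loss(f_theta, view)               # warm-up (first 1k iters): L_loc only
    if it >= 1_000:                              # joint phase: add L_SDS_loc on both networks
        loss = loss + sds_loc_loss(f_theta, f_phi, view)
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
```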

6. Evaluation, Benchmarks, and Ablations

  • Quantitative Metrics:
    • CLIP R-Precision (textual and image-based) between $I_{\text{ref}}$ and mesh renders.
    • Cosine similarity in CLIP embedding space.
    • LPIPS (AlexNet/VGG) for perceptual similarity.
    • User ratings (1–5) on texture and structure quality.
    • Intersection-over-Union (IoU) of the predicted mask $p(x)$ in synthetic settings (see the metric sketch after this list).
  • Performance:
    • Exceeds a text-only editor (3D Paintbrush) by more than 15 percentage points in CLIP R-Precision.
    • Achieves 4–8% lower LPIPS.
    • User studies report mean scores of 4.1 (3D PixBrush) vs. 2.7/2.0 (baselines) for structure/texture.
  • Qualitative Ablations:
    • Without cross-attention masking, image information leaks outside the mask region.
    • Removing warm-up leads to poor mask separability and over-globalized texture.
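
For the synthetic-mask comparison, IoU can be computed as in the short sketch below; the threshold value and the per-vertex layout are assumptions.

```python
import torch

def mask_iou(pred, gt, threshold: float = 0.5):
    """IoU between a predicted soft mask p(x) and a binary ground-truth mask,
    both given per vertex (or per texel) with matching shapes."""
    pred_bin = pred > threshold
    gt_bin = gt > 0.5
    inter = (pred_bin & gt_bin).sum().float()
    union = (pred_bin | gt_bin).sum().float()
    return (inter / union.clamp(min=1)).item()

# Example: mask_iou(torch.rand(5000), (torch.rand(5000) > 0.7).float())
```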

7. Limitations and Prospective Extensions

  • Failure Modes:
    • Fine details (e.g., logos, small text) in $I_{\text{ref}}$ are rendered illegibly (e.g., “adadds”).
    • Semantic bleed: co-texturing of semantically related regions (e.g., makeup applied to the lips).
    • Janus effect: mirrored/symmetric texture propagation from front to back.
  • Future Directions:
    • Extension of LMIG to NeRFs, point clouds, volumetric grids, or video.
    • Integration of geometric deformation within the active region $p(x)$.
    • Support for simultaneous multi-object editing by stacking disjoint $(p_i, c_i)$ fields.

A plausible implication is that LMIG and its multi-modal guidance framework are generally applicable to broader classes of 3D editing, where geometry-aware, high-fidelity, and user-free local synthesis are sought (Decatur et al., 4 Jul 2025).
