
FlexPainter: 3D Texture Generation Pipeline

Updated 30 June 2025
  • FlexPainter is a 3D texture generation pipeline that integrates multi-modal conditioning with a diffusion-based framework.
  • It achieves high multi-view consistency and resolution by leveraging a shared embedding space, adaptive view synchronization, and image-based classifier-free guidance.
  • The system supports artist-controllable workflows for stylization and texture completion, enabling practical, high-fidelity outputs for 3D modeling applications.

FlexPainter is a texture generation pipeline designed for 3D modeling applications that emphasizes both flexible, multi-modal conditional control and high multi-view consistency. The approach integrates advanced diffusion models, a shared conditional embedding space, adaptive prompt guidance, and robust modules for multi-view fusion and texture completion. FlexPainter targets the production of seamless, high-resolution texture maps that accurately reflect diverse input prompts and maintain geometric and stylistic coherence across 3D object surfaces.

1. System Architecture and Objectives

FlexPainter is structured to address three core challenges in modern 3D-aware texture synthesis:

  • The need for flexible multi-modal conditioning, allowing creators to guide generation by any combination of textual prompts and image references.
  • The problem of multi-view inconsistency, where textures generated separately for different viewpoints can be misaligned or contain artifacts.
  • The requirement for a practical, high-resolution, artist-controllable workflow, including prompt blending, stylization, and real-world applicability at texture resolutions up to 4K.

The architecture achieves these objectives by combining several innovations: a shared embedding space for modalities, a diffusion-based multi-view generator, image-based classifier-free guidance (CFG), view synchronization and adaptive weighting during the diffusion process, and specialized modules for 3D-aware texture completion and enhancement.

2. Diffusion-Based Generation Framework

FlexPainter uses a latent-space diffusion generative process, specifically a flow matching model. For a data instance $x_0$ and Gaussian noise $\epsilon$:

$$x_t = (1 - t)x_0 + t\epsilon, \qquad \frac{dx}{dt} = v_t(x_t, \Theta)$$

where $v_t$ is a neural velocity predictor parameterized by $\Theta$.
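
As a concrete illustration of this sampling process, the sketch below integrates the learned velocity field with Euler steps from noise ($t = 1$) to data ($t = 0$). This is a minimal sketch, not the authors' implementation; `velocity_model` is a hypothetical stand-in for the trained predictor $v_t$.

```python
import torch

@torch.no_grad()
def flow_matching_sample(velocity_model, shape, num_steps=50, device="cpu"):
    """Euler integration of dx/dt = v_t(x_t, Θ) from t=1 (noise) to t=0 (data).

    Hypothetical sketch: velocity_model(x, t) is assumed to return the
    predicted velocity for a batch of latents x at scalar times t.
    """
    x = torch.randn(shape, device=device)          # x_1 ~ N(0, I)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt                           # current time, walking 1 -> 0
        t_batch = torch.full((shape[0],), t, device=device)
        v = velocity_model(x, t_batch)             # predicted velocity v_t(x_t, Θ)
        x = x - dt * v                             # Euler step toward the data end
    return x
```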

Key advantages for texture generation are:

  • High-fidelity sample quality, leveraging large-scale diffusion pretraining.
  • Strong controllability via conditional embeddings.
  • Fine-grained multi-view consistency through joint attention and fusion mechanisms during sampling.

3. Multi-Modal Conditional Guidance and Shared Embedding Space

FlexPainter constructs a conditioning embedding $\mathbf{v}$ that serves as a unified representation for text and images:

$$\mathbf{v} = \mathcal{T}(w_{cond}) \bigoplus \sum_{i=1}^{n} \alpha_i \mathcal{I}(I^i_{cond})$$

Here, $\mathcal{T}$ and $\mathcal{I}$ are text and image embedders, $w_{cond}$ is the textual prompt, $I^i_{cond}$ are the reference images, and $\alpha_i$ are combination weights. This configuration allows:

  • Any mix of textual and visual guidance—pure text, pure image, or any blend—in a linearly mixable space.
  • Prompt manipulation, such as text-guided refinement of image references, reference-based stylization, or even prompt interpolation for nuanced effects.
  • Direct feeding of this embedding into the diffusion process through cross-attention layers, enriching both semantic and structural detail transfer into texture synthesis.

This mechanism supports more versatile and creative texture control than single-modality pipelines or pipelines with hard-coded prompt types.
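
A minimal sketch of this linear mixing follows, assuming the text and image encoders already project into the same $d$-dimensional shared space and reading the $\bigoplus$ in the formula as concatenation (both are illustrative assumptions, not the confirmed design):

```python
import torch

def build_condition_embedding(text_emb, image_embs, alphas):
    """Blend a text embedding with weighted image embeddings in a shared space.

    Hypothetical sketch: assumes T(w_cond) and the I(I^i_cond) already live in
    the same d-dimensional space, and treats ⊕ as concatenation.
    """
    image_mix = sum(a * e for a, e in zip(alphas, image_embs))  # Σ_i α_i I(I^i_cond)
    return torch.cat([text_emb, image_mix], dim=-1)             # T(w_cond) ⊕ mix

# Example: blend two reference images 70/30 alongside a text prompt embedding.
d = 768
cond = build_condition_embedding(
    torch.randn(1, d), [torch.randn(1, d), torch.randn(1, d)], alphas=[0.7, 0.3]
)
```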

4. Multi-View Generation and 3D Consistency

To leverage object geometry and ensure cross-view consistency, FlexPainter generates a $2 \times 2$ grid of images, each corresponding to a perspective view of the 3D object, augmented with per-view depth maps. All views are processed by the model simultaneously, enabling:

  • Implicit 3D understanding through joint modeling of all perspectives.
  • Feature sharing and global attention, reducing typical "Janus" (double-faced) artifacts seen in naïve per-view generation.
  • Depth information to reinforce geometric alignment and plausible occlusions.

This view grid representation forms the foundation for subsequent consistency and completion procedures in the pipeline.
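
A minimal sketch of how four per-view latents and depth maps could be tiled into such a grid, assuming each is a `(C, H, W)` tensor (the exact layout and depth-conditioning channel arrangement here are illustrative assumptions):

```python
import torch

def make_view_grid(view_latents, depth_maps):
    """Tile four per-view latents (and depths) into a single 2x2 grid tensor.

    Hypothetical sketch: view_latents and depth_maps are lists of four
    (C, H, W) tensors, one per camera. The joint grid is what the multi-view
    diffusion model denoises, so attention spans all four views at once.
    """
    assert len(view_latents) == 4 and len(depth_maps) == 4

    def tile(maps):
        top = torch.cat(maps[0:2], dim=-1)      # views 0,1 side by side
        bottom = torch.cat(maps[2:4], dim=-1)   # views 2,3 side by side
        return torch.cat([top, bottom], dim=-2)

    latent_grid = tile(view_latents)            # (C, 2H, 2W)
    depth_grid = tile(depth_maps)               # geometric conditioning signal
    return torch.cat([latent_grid, depth_grid], dim=0)
```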

5. View Synchronization and Adaptive Weighting

Traditional multi-view fusion relies on heuristic weighting and remains vulnerable to local imperfections. FlexPainter introduces:

  • Reprojection-based view synchronization: At each diffusion step $t$, partial results from each camera are reprojected into UV space and aggregated, forming an updated, fused UV texture for the next round.
  • Adaptive Weighting via WeighterNet: A neural module predicts fusion weights for each partial texture, considering UV content, rendered rays, normals, 3D positions, and diffusion time $t$. The architecture comprises Vision Transformers for UV-space blocks and Point Transformers for geometric features.

Training loss functions include:

$$L_{perc} = e^{-\alpha t} \sum_{i=1}^{4} \left\| \mathrm{vgg}\left(\mathcal{R}(\bar{T}, c_i)\right) - \mathrm{vgg}\left(\mathcal{R}(\mathcal{W}(T^t_{1:4}, \Theta), c_i)\right) \right\|$$

$$L_{cyc} = \sum_{i=1}^{4} \left\| \mathcal{R}(T^t_i, c_i) - \mathcal{R}\left(\mathcal{W}(T^t_{1:4}, \Theta), c_i\right) \right\|$$

where $\mathcal{R}$ denotes the (back-)rendering operation, $c_i$ are camera poses, $\bar{T}$ is the ground-truth texture, and $\mathcal{W}$ is the adaptive fusion function.

The result is robust, locally and globally consistent textures across all viewpoints—critical for downstream seamless UV mapping.
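
A minimal sketch of the weighted UV-space fusion step, assuming the per-view textures have already been reprojected into UV space and a WeighterNet-like module has predicted per-texel weights (the NaN masking convention for unobserved texels is an assumption):

```python
import torch

def synchronize_views(partial_uv_textures, weights):
    """Fuse per-view partial UV textures into one texture via predicted weights.

    Hypothetical sketch: partial_uv_textures is a (V, C, H, W) stack of
    textures reprojected from V camera views into UV space; weights is a
    (V, 1, H, W) map from a WeighterNet-like module. NaNs mark texels a view
    does not observe.
    """
    valid = ~torch.isnan(partial_uv_textures)              # per-view visibility
    tex = torch.nan_to_num(partial_uv_textures, nan=0.0)
    w = weights * valid[:, :1].float()                     # zero weight if unseen
    w = w / w.sum(dim=0, keepdim=True).clamp_min(1e-8)     # normalize over views
    return (w * tex).sum(dim=0)                            # weighted texel blend
```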

6. Image-Based Classifier-Free Guidance (CFG)

FlexPainter extends traditional CFG by integrating image prompts alongside text prompts, allowing:

  • Use of both text- and image-derived embeddings as positive and negative prompts.
  • For stylization, a reference image can serve as a positive guide, with a grayscaled (structure-only) version as negative guidance, enabling selective transfer of style without enforcing global structure.
  • Prompt-based separation and recombination of structure and style, orchestrated entirely within the shared embedding space, facilitating direct practical stylization and artifact suppression.

This design supports advanced creator workflows, including negative guidance for removing unwanted visual traits from references without requiring additional disentanglement models.
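
A minimal sketch of this image-based CFG update in the flow-matching setting, reusing the hypothetical `velocity_model` from Section 2 with an assumed `cond` keyword interface:

```python
import torch

def image_based_cfg(velocity_model, x_t, t, pos_emb, neg_emb, guidance_scale=5.0):
    """Classifier-free guidance with image-derived positive/negative embeddings.

    Hypothetical sketch: pos_emb may embed a reference image (style +
    structure) while neg_emb embeds its grayscaled version (structure only),
    so guidance pushes the sample toward style while cancelling the shared
    structural content.
    """
    v_pos = velocity_model(x_t, t, cond=pos_emb)   # conditioned on the reference
    v_neg = velocity_model(x_t, t, cond=neg_emb)   # conditioned on the gray copy
    return v_neg + guidance_scale * (v_pos - v_neg)
```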

7. 3D-Aware Texture Completion and Enhancement

Occlusions and view-dependent gaps in the UV map are resolved through:

  • Texture Completion: A 3D-aware completion model (finetuned TEXGen) that receives the aggregated, partial multi-view UV texture and inpaints missing areas by leveraging both UV-space and local 3D point features. This ensures semantic and geometric coherence, even across complex mesh boundaries.
  • Texture Enhancement: Super-resolution of the completed UV texture (e.g., Real-ESRGAN) for output up to 4K, with preservation of style and low-level detail.

This combination yields seamless, high-fidelity, and artifact-free textures for rendering.
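
A minimal sketch of how these two stages compose, with `complete_fn` standing in for a finetuned TEXGen-style completion model and `upscale_fn` for a Real-ESRGAN-style super-resolution model (both hypothetical interfaces, not the libraries' actual APIs):

```python
def finalize_texture(partial_uv, mesh, complete_fn, upscale_fn):
    """Complete occluded UV regions, then super-resolve to the target size.

    Hypothetical sketch: complete_fn consumes the partial UV texture plus mesh
    point features and inpaints the gaps; upscale_fn raises resolution
    (e.g., 1K -> 4K) while preserving style and low-level detail.
    """
    completed = complete_fn(partial_uv, mesh)   # fill view-dependent gaps
    return upscale_fn(completed)                # super-resolve the final map
```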

8. Benchmarks and Empirical Results

FlexPainter demonstrates superior performance on quantitative and human evaluation metrics:

| Method   | FID (↓) | KID (×10⁻⁴, ↓) | User Preference (%) |
|----------|---------|----------------|---------------------|
| TEXTure  | 78.27   | 107.06         | 4.4                 |
| Text2Tex | 89.01   | 94.42          | 11.3                |
| SyncMVD  | 73.38   | 41.74          | 22.5                |
| Paint3D  | 77.80   | 123.61         | 17.8                |
| TEXGen   | 72.90   | 61.73          | 15.7                |
| Ours     | 71.62   | 58.47          | 28.3                |

For image-to-texture tasks, FlexPainter achieves FID 59.49 and KID 62.09, with 71.4% user preference, outperforming Paint3D. Ablation studies confirm the necessity of view synchronization, WeighterNet, and image-based CFG for optimal performance.

Qualitative assessments indicate that FlexPainter's outputs align more closely with the input prompts and are more consistent across surfaces and viewpoints than those of prior methods.

9. Applications and Impact

FlexPainter is well suited for:

  • High-quality text-to-texture and image-to-texture generation in real-world 3D content creation pipelines.
  • Flexible stylization tasks combining custom artwork with descriptive semantics.
  • Creative workflows requiring prompt mixing, structural or style decomposition, negative guidance, and seamless integration with 3D rendering engines.
  • Scenarios demanding high resolution and robust consistency for both artistic and physically accurate rendering.

The system’s advances suggest significant impact on digital content creation, allowing both technical and non-technical users to author sophisticated, consistent textures for a wide variety of 3D models.


Key FlexPainter Components

| Component | Technical Role | Formula/Detail |
|-----------|----------------|----------------|
| Shared embedding space | Multi-modal, linearly mixable conditioning | $\mathbf{v} = \mathcal{T}(w_{cond}) \bigoplus \sum_i \alpha_i \mathcal{I}(I^i_{cond})$ |
| Image-based CFG | Structural/style separation & stylization | Reference/gray images as positive/negative prompts |
| Multi-view image grid | Global 3D consistency | $2 \times 2$ grid, four camera perspectives |
| View synchronization & WeighterNet | Local cross-view alignment | Reprojection & adaptive neural fusion |
| 3D-aware completion/enhancement | Filling & super-resolving UV texture | TEXGen, Real-ESRGAN |

FlexPainter, by integrating a unified prompt embedding space, a robust diffusion backbone, adaptive multi-view and stylization modules, and high-resolution completion, sets a new standard for flexible, consistent, and user-controllable texture generation in 3D modeling and content creation.