
FlexPainter: 3D Texture Generation Pipeline

Updated 30 June 2025
  • FlexPainter is a 3D texture generation pipeline that integrates multi-modal conditioning with a diffusion-based framework.
  • It achieves high multi-view consistency and resolution by leveraging a shared embedding space, adaptive view synchronization, and image-based classifier-free guidance.
  • The system supports artist-controllable workflows for stylization and texture completion, enabling practical, high-fidelity outputs for 3D modeling applications.

FlexPainter is a texture generation pipeline designed for 3D modeling applications that emphasizes both flexible, multi-modal conditional control and high multi-view consistency. The approach integrates advanced diffusion models, a shared conditional embedding space, adaptive prompt guidance, and robust modules for multi-view fusion and texture completion. FlexPainter targets the production of seamless, high-resolution texture maps that accurately reflect diverse input prompts and maintain geometric and stylistic coherence across 3D object surfaces.

1. System Architecture and Objectives

FlexPainter is structured to address three core challenges in modern 3D-aware texture synthesis:

  • The need for flexible multi-modal conditioning, allowing creators to guide generation by any combination of textual prompts and image references.
  • The problem of multi-view inconsistency, where textures generated separately for different viewpoints can be misaligned or contain artifacts.
  • The requirement for a practical, high-resolution, artist-controllable workflow, including prompt blending, stylization, and real-world applicability at texture resolutions up to 4K.

The architecture achieves these objectives by combining several innovations: a shared embedding space for modalities, a diffusion-based multi-view generator, image-based classifier-free guidance (CFG), view synchronization and adaptive weighting during the diffusion process, and specialized modules for 3D-aware texture completion and enhancement.

2. Diffusion-Based Generation Framework

FlexPainter uses a latent-space diffusion generative process, specifically a flow matching model. For a data instance $x_0$ and Gaussian noise $\epsilon$:

$$x_t = (1 - t)x_0 + t\epsilon, \qquad \frac{dx}{dt} = v_t(x_t, \Theta)$$

where $v_t$ is a neural velocity predictor parameterized by $\Theta$.
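
As a concrete illustration of this sampling process, the sketch below integrates the learned velocity field with Euler steps from noise ($t = 1$) to data ($t = 0$). This is a minimal sketch, not the authors' implementation; `velocity_model` is a hypothetical stand-in for the trained predictor $v_t$.

```python
import torch

@torch.no_grad()
def flow_matching_sample(velocity_model, shape, num_steps=50, device="cpu"):
    """Euler integration of dx/dt = v_t(x_t, Θ) from t=1 (noise) to t=0 (data).

    Hypothetical sketch: velocity_model(x, t) is assumed to return the
    predicted velocity for a batch of latents x at scalar times t.
    """
    x = torch.randn(shape, device=device)          # x_1 ~ N(0, I)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt                           # current time, walking 1 -> 0
        t_batch = torch.full((shape[0],), t, device=device)
        v = velocity_model(x, t_batch)             # predicted velocity v_t(x_t, Θ)
        x = x - dt * v                             # Euler step toward the data end
    return x
```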

Key advantages for texture generation are:

  • High-fidelity sample quality, leveraging large-scale diffusion pretraining.
  • Strong controllability via conditional embeddings.
  • Fine-grained multi-view consistency through joint attention and fusion mechanisms during sampling.

3. Multi-Modal Conditional Guidance and Shared Embedding Space

FlexPainter constructs a conditioning embedding $\mathbf{v}$ that serves as a unified representation for text and images:

$$\mathbf{v} = \mathcal{T}(w_{cond}) \bigoplus \sum_{i=1}^{n} \alpha_i \mathcal{I}(I^i_{cond})$$

Here, $\mathcal{T}$ and $\mathcal{I}$ are text and image embedders, $w_{cond}$ is the textual prompt, $I^i_{cond}$ are the reference images, and $\alpha_i$ are combination weights. This configuration allows:

  • Any mix of textual and visual guidance—pure text, pure image, or any blend—in a linearly mixable space.
  • Prompt manipulation, such as text-guided refinement of image references, reference-based stylization, or even prompt interpolation for nuanced effects.
  • Direct feeding of this embedding into the diffusion process through cross-attention layers, enriching both semantic and structural detail transfer into texture synthesis.

This mechanism supports more versatile and creative texture control than single-modality pipelines or pipelines with hard-coded prompt types.
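
A minimal sketch of this linear mixing follows, assuming the text and image encoders already project into the same $d$-dimensional shared space and reading the $\bigoplus$ in the formula as concatenation (both are illustrative assumptions, not the confirmed design):

```python
import torch

def build_condition_embedding(text_emb, image_embs, alphas):
    """Blend a text embedding with weighted image embeddings in a shared space.

    Hypothetical sketch: assumes T(w_cond) and the I(I^i_cond) already live in
    the same d-dimensional space, and treats ⊕ as concatenation.
    """
    image_mix = sum(a * e for a, e in zip(alphas, image_embs))  # Σ_i α_i I(I^i_cond)
    return torch.cat([text_emb, image_mix], dim=-1)             # T(w_cond) ⊕ mix

# Example: blend two reference images 70/30 alongside a text prompt embedding.
d = 768
cond = build_condition_embedding(
    torch.randn(1, d), [torch.randn(1, d), torch.randn(1, d)], alphas=[0.7, 0.3]
)
```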

4. Multi-View Generation and 3D Consistency

To leverage object geometry and ensure cross-view consistency, FlexPainter generates a $2 \times 2$ grid of images, each corresponding to a perspective view of the 3D object, augmented with per-view depth maps. All views are processed by the model simultaneously, enabling:

  • Implicit 3D understanding through joint modeling of all perspectives.
  • Feature sharing and global attention, reducing typical "Janus" (double-faced) artifacts seen in naïve per-view generation.
  • Depth information to reinforce geometric alignment and plausible occlusions.

This view grid representation forms the foundation for subsequent consistency and completion procedures in the pipeline.
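
A minimal sketch of how four per-view latents and depth maps could be tiled into such a grid, assuming each is a `(C, H, W)` tensor (the exact layout and depth-conditioning channel arrangement here are illustrative assumptions):

```python
import torch

def make_view_grid(view_latents, depth_maps):
    """Tile four per-view latents (and depths) into a single 2x2 grid tensor.

    Hypothetical sketch: view_latents and depth_maps are lists of four
    (C, H, W) tensors, one per camera. The joint grid is what the multi-view
    diffusion model denoises, so attention spans all four views at once.
    """
    assert len(view_latents) == 4 and len(depth_maps) == 4

    def tile(maps):
        top = torch.cat(maps[0:2], dim=-1)      # views 0,1 side by side
        bottom = torch.cat(maps[2:4], dim=-1)   # views 2,3 side by side
        return torch.cat([top, bottom], dim=-2)

    latent_grid = tile(view_latents)            # (C, 2H, 2W)
    depth_grid = tile(depth_maps)               # geometric conditioning signal
    return torch.cat([latent_grid, depth_grid], dim=0)
```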

5. View Synchronization and Adaptive Weighting

Traditional multi-view fusion relies on heuristic weighting and remains vulnerable to local imperfections. FlexPainter introduces:

  • Reprojection-based view synchronization: At each diffusion step $t$, partial results from each camera are reprojected into UV space and aggregated, forming an updated, fused UV texture for the next round.
  • Adaptive Weighting via WeighterNet: A neural module predicts fusion weights for each partial texture, considering UV content, rendered rays, normals, 3D positions, and diffusion time $t$. The architecture comprises Vision Transformers for UV-space blocks and Point Transformers for geometric features.

Training loss functions include:

$$L_{perc} = e^{-\alpha t} \sum_{i=1}^{4} \left\| \mathrm{vgg}\left(\mathcal{R}(\bar{T}, c_i)\right) - \mathrm{vgg}\left(\mathcal{R}(\mathcal{W}(T^t_{1:4}, \Theta), c_i)\right) \right\|$$

$$L_{cyc} = \sum_{i=1}^{4} \left\| \mathcal{R}(T^t_i, c_i) - \mathcal{R}\left(\mathcal{W}(T^t_{1:4}, \Theta), c_i\right) \right\|$$

where $\mathcal{R}$ denotes the (back-)rendering operation, $c_i$ are camera poses, $\bar{T}$ is the ground-truth texture, and $\mathcal{W}$ is the adaptive fusion function.

The result is robust, locally and globally consistent textures across all viewpoints—critical for downstream seamless UV mapping.
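
A minimal sketch of the weighted UV-space fusion step, assuming the per-view textures have already been reprojected into UV space and a WeighterNet-like module has predicted per-texel weights (the NaN masking convention for unobserved texels is an assumption):

```python
import torch

def synchronize_views(partial_uv_textures, weights):
    """Fuse per-view partial UV textures into one texture via predicted weights.

    Hypothetical sketch: partial_uv_textures is a (V, C, H, W) stack of
    textures reprojected from V camera views into UV space; weights is a
    (V, 1, H, W) map from a WeighterNet-like module. NaNs mark texels a view
    does not observe.
    """
    valid = ~torch.isnan(partial_uv_textures)              # per-view visibility
    tex = torch.nan_to_num(partial_uv_textures, nan=0.0)
    w = weights * valid[:, :1].float()                     # zero weight if unseen
    w = w / w.sum(dim=0, keepdim=True).clamp_min(1e-8)     # normalize over views
    return (w * tex).sum(dim=0)                            # weighted texel blend
```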

6. Image-Based Classifier-Free Guidance (CFG)

FlexPainter extends traditional CFG by integrating image prompts alongside text prompts, allowing:

  • Use of both text- and image-derived embeddings as positive and negative prompts.
  • For stylization, a reference image can serve as a positive guide, with a grayscaled (structure-only) version as negative guidance, enabling selective transfer of style without enforcing global structure.
  • Prompt-based separation and recombination of structure and style, orchestrated entirely within the shared embedding space, facilitating direct practical stylization and artifact suppression.

This design supports advanced creator workflows, including negative guidance for removing unwanted visual traits from references without requiring additional disentanglement models.
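
A minimal sketch of this image-based CFG update in the flow-matching setting, reusing the hypothetical `velocity_model` from Section 2 with an assumed `cond` keyword interface:

```python
import torch

def image_based_cfg(velocity_model, x_t, t, pos_emb, neg_emb, guidance_scale=5.0):
    """Classifier-free guidance with image-derived positive/negative embeddings.

    Hypothetical sketch: pos_emb may embed a reference image (style +
    structure) while neg_emb embeds its grayscaled version (structure only),
    so guidance pushes the sample toward style while cancelling the shared
    structural content.
    """
    v_pos = velocity_model(x_t, t, cond=pos_emb)   # conditioned on the reference
    v_neg = velocity_model(x_t, t, cond=neg_emb)   # conditioned on the gray copy
    return v_neg + guidance_scale * (v_pos - v_neg)
```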

7. 3D-Aware Texture Completion and Enhancement

Occlusions and view-dependent gaps in the UV map are resolved through:

  • Texture Completion: A 3D-aware completion model (finetuned TEXGen) that receives the aggregated, partial multi-view UV texture and inpaints missing areas by leveraging both UV-space and local 3D point features. This ensures semantic and geometric coherence, even across complex mesh boundaries.
  • Texture Enhancement: Super-resolution of the completed UV texture (e.g., Real-ESRGAN) for output up to 4K, with preservation of style and low-level detail.

This combination yields seamless, high-fidelity, and artifact-free textures for rendering.
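
A minimal sketch of how these two stages compose, with `complete_fn` standing in for a finetuned TEXGen-style completion model and `upscale_fn` for a Real-ESRGAN-style super-resolution model (both hypothetical interfaces, not the libraries' actual APIs):

```python
def finalize_texture(partial_uv, mesh, complete_fn, upscale_fn):
    """Complete occluded UV regions, then super-resolve to the target size.

    Hypothetical sketch: complete_fn consumes the partial UV texture plus mesh
    point features and inpaints the gaps; upscale_fn raises resolution
    (e.g., 1K -> 4K) while preserving style and low-level detail.
    """
    completed = complete_fn(partial_uv, mesh)   # fill view-dependent gaps
    return upscale_fn(completed)                # super-resolve the final map
```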

8. Benchmarks and Empirical Results

FlexPainter demonstrates superior performance on quantitative and human evaluation metrics:

| Method   | FID (↓) | KID (×10⁻⁴, ↓) | User Preference (%) |
|----------|---------|----------------|---------------------|
| TEXTure  | 78.27   | 107.06         | 4.4                 |
| Text2Tex | 89.01   | 94.42          | 11.3                |
| SyncMVD  | 73.38   | 41.74          | 22.5                |
| Paint3D  | 77.80   | 123.61         | 17.8                |
| TEXGen   | 72.90   | 61.73          | 15.7                |
| Ours     | 71.62   | 58.47          | 28.3                |

For image-to-texture tasks, FlexPainter achieves FID 59.49 and KID 62.09, with 71.4% user preference, outperforming Paint3D. Ablation studies confirm the necessity of view synchronization, WeighterNet, and image-based CFG for optimal performance.

Qualitative assessments indicate that FlexPainter's outputs align more closely with the input prompts and are more consistent across surfaces and viewpoints than those of prior methods.

9. Applications and Impact

FlexPainter is well suited for:

  • High-quality text-to-texture and image-to-texture generation in real-world 3D content creation pipelines.
  • Flexible stylization tasks combining custom artwork with descriptive semantics.
  • Creative workflows requiring prompt mixing, structural or style decomposition, negative guidance, and seamless integration with 3D rendering engines.
  • Scenarios demanding high resolution and robust consistency for both artistic and physically accurate rendering.

The system’s advances suggest significant impact on digital content creation, allowing both technical and non-technical users to author sophisticated, consistent textures for a wide variety of 3D models.


Key FlexPainter Components

| Component | Technical Role | Formula/Detail |
|-----------|----------------|----------------|
| Shared embedding space | Multi-modal, linearly mixable conditioning | $\mathbf{v} = \mathcal{T}(w_{cond}) \bigoplus \sum_i \alpha_i \mathcal{I}(I^i_{cond})$ |
| Image-based CFG | Structural/style separation & stylization | Reference/gray images as positive/negative prompts |
| Multi-view image grid | Global 3D consistency | $2 \times 2$ grid, four camera perspectives |
| View synchronization & WeighterNet | Local cross-view alignment | Reprojection & adaptive neural fusion |
| 3D-aware completion/enhancement | Filling & super-resolving UV texture | TEXGen, Real-ESRGAN |

FlexPainter, by integrating a unified prompt embedding space, a robust diffusion backbone, adaptive multi-view and stylization modules, and high-resolution completion, sets a new standard for flexible, consistent, and user-controllable texture generation in 3D modeling and content creation.