Material Embeddings in CLIP-Space

Updated 30 June 2025
  • Material embeddings in CLIP-space are vector representations that capture visual and semantic properties from exemplar images for material editing.
  • They enable precise material blending and parametric control in diffusion models, allowing smooth interpolation and independent adjustment of material attributes.
  • MARBLE injects embeddings at a targeted UNet layer to achieve fine-grained material recomposition while preserving image geometry and overall semantics.

Material embeddings in CLIP-space are vector representations that capture the appearance and semantics of materials—such as wood, metal, glass, or plastic—within the joint image-text space defined by CLIP encoders. The MARBLE framework utilizes these embeddings to perform material recomposition, blending, and fine-grained control of material properties in images by directly operating in CLIP-space and integrating with pre-trained diffusion models.

1. Material Embeddings in CLIP-Space

In MARBLE, material embeddings are constructed by encoding exemplar images of materials with the CLIP image encoder, resulting in high-dimensional vectors that implicitly encode the visual and semantic aspects of the material. Given two or more exemplar images representing different materials, their CLIP feature vectors (e.g., $z_{m_1}$, $z_{m_2}$) serve as canonical material representations. These embeddings are used to guide downstream generative processes by injecting them into a pre-trained diffusion model at a specific location responsible for material attribution.
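
As an illustration, the exemplar-encoding step can be sketched with an off-the-shelf CLIP checkpoint (a minimal sketch assuming the Hugging Face transformers ViT-L/14 model and hypothetical file names; MARBLE's exact encoder and preprocessing may differ):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint for illustration; not necessarily the one used by MARBLE.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def material_embedding(image_path: str) -> torch.Tensor:
    """Encode one material exemplar image into a CLIP-space vector z_m."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    return model.get_image_features(**inputs)   # shape: (1, 768) for ViT-L/14

z_m1 = material_embedding("exemplar_wood.jpg")    # hypothetical exemplar files
z_m2 = material_embedding("exemplar_metal.jpg")
```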

For material blending, MARBLE computes convex combinations in CLIP-space: $z_{\text{blend}} = \alpha z_{m_1} + (1 - \alpha) z_{m_2}$, where $\alpha \in [0, 1]$ sets the blend ratio between the two source materials. The blended embedding is used during image generation as the material condition.
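
In code, the blend is a single convex combination (a sketch; z_m1 and z_m2 are the exemplar embeddings computed above):

```python
import torch

def blend_materials(z_m1: torch.Tensor, z_m2: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Convex combination of two CLIP material embeddings, with alpha in [0, 1]."""
    return alpha * z_m1 + (1.0 - alpha) * z_m2

z_blend = blend_materials(z_m1, z_m2, alpha=0.3)   # 30% material 1, 70% material 2
```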

2. Methodology: Material Attribution and Editing via CLIP-Space

MARBLE identifies, through ablation, a specific block within the denoising UNet architecture of diffusion models that is predominantly responsible for encoding material information. Rather than injecting material embeddings at all layers (which can introduce geometric or lighting artifacts), MARBLE restricts injection to this material block—found near the bottleneck—yielding improved material transfer without disrupting other semantic aspects of the image.

Formally, image generation is realized as $I_{\text{gen}} = \mathcal{S}(I_{\text{init}}, F_I, D_I, f(z_{\text{blend}}))$, where $\mathcal{S}$ is the diffusion sampler, $I_{\text{init}}$ is the input image or sketch, $F_I$ is a foreground mask, $D_I$ is a depth map, and $f(z_{\text{blend}})$ injects the blended material embedding at the chosen layer in the UNet.
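
The targeted-injection mechanism $f(\cdot)$ can be illustrated with a PyTorch forward pre-hook that swaps the conditioning fed to a single block (a toy sketch with a dummy module; the actual UNet block identified by MARBLE's ablation is not reproduced here, and all names are hypothetical):

```python
import torch
from torch import nn

class DummyMaterialBlock(nn.Module):
    """Toy stand-in for the UNet cross-attention block that carries material info."""
    def forward(self, x, encoder_hidden_states=None):
        # A real block would attend from image features x to the conditioning;
        # here we simply return both so the swap performed by the hook is visible.
        return x, encoder_hidden_states

def material_injection_hook(z_material):
    """Forward pre-hook that replaces this block's conditioning with z_material."""
    def hook(module, args, kwargs):
        kwargs = dict(kwargs)
        kwargs["encoder_hidden_states"] = z_material
        return args, kwargs
    return hook

block = DummyMaterialBlock()
z_blend = torch.randn(1, 1, 768)                      # blended CLIP material embedding
handle = block.register_forward_pre_hook(             # requires PyTorch >= 2.0
    material_injection_hook(z_blend), with_kwargs=True)
_, cond = block(torch.randn(1, 64, 320),
                encoder_hidden_states=torch.randn(1, 77, 768))
print(cond.shape)                                     # torch.Size([1, 1, 768]): swapped
handle.remove()
```

Restricting the hook to one block, rather than every cross-attention layer, mirrors the paper's observation that all-layer injection can introduce geometric or lighting artifacts.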

To achieve robust material recomposition, MARBLE supports both binary blending (between two exemplars) and parametric control using learned attribute directions.

3. Material Attributes: Parametric and Fine-Grained Control

MARBLE introduces parametric, slider-like control over fine-grained material attributes, specifically roughness, metallic, transparency, and glow. To enable this, a shallow MLP network $p_\theta$ is trained to predict a direction in CLIP-space for each desired attribute adjustment $\delta$ given an input image $I_{m_a}$ with material $m$ and attribute $a$: $p_\theta(I_{m_a}, \delta) \approx s_{m_{a+\delta}}$, where $s_{m_{a+\delta}}$ is a denoised, low-rank approximation of the CLIP embedding of an image with the target attribute.
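
Such a predictor might be sketched as follows (an illustrative assumption: the hidden width, the 768-dimensional embedding size, and the use of the image's CLIP embedding as input are not taken from the paper):

```python
import torch
from torch import nn

class AttributeDirectionMLP(nn.Module):
    """Sketch of p_theta: predicts a CLIP-space offset for one material
    attribute (e.g. roughness) from the image's CLIP embedding and a
    scalar edit strength delta. Dimensions are illustrative assumptions."""
    def __init__(self, clip_dim: int = 768, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, clip_dim),
        )

    def forward(self, z_image: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
        # z_image: (B, clip_dim) CLIP embedding of the input image I_{m_a}
        # delta:   (B, 1) signed attribute adjustment (slider value)
        return self.net(torch.cat([z_image, delta], dim=-1))
```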

At inference,

$z_{m_{a+\delta}} = \text{CLIP}(I_{m_a}) + p_\theta(I_{m_a}, \delta)$

produces a material embedding corresponding to the attribute change. SVD is used to denoise and reduce the dimensionality of attribute directions, and the approach allows multiple attribute controls to be composed for multi-attribute editing.
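
The SVD step can be sketched as a projection of attribute-target embeddings onto their top principal directions (a minimal sketch; the rank k and the use of torch.pca_lowrank are assumptions, not the paper's exact procedure):

```python
import torch

def low_rank_denoise(embeddings: torch.Tensor, k: int = 32) -> torch.Tensor:
    """Project a batch of CLIP embeddings (N, D) onto their top-k principal
    directions, discarding the residual variance as noise."""
    mean = embeddings.mean(dim=0, keepdim=True)
    _, _, V = torch.pca_lowrank(embeddings, q=k)   # V: (D, k), orthonormal columns
    return (embeddings - mean) @ V @ V.T + mean
```

Adding the predicted offset to the input's CLIP embedding then realizes the slider edit, matching the inference formula above.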

4. Qualitative and Quantitative Results

MARBLE is evaluated on synthetic and real images, demonstrating:

  • Material blending: Smooth interpolation between disparate material exemplars, even across objects, with precise control via $\alpha$ (Fig. 5).
  • Fine-grained parametric editing: Control sliders for roughness, metallic, transparency, and glow, enabling independent adjustment of each property (Figs. 6, 8).
  • Disentanglement: Material and shape, color, or geometry remain well separated due to the careful selection of injection location and learned attribute directions.
  • Comparison to baselines: MARBLE outperforms alternative editing approaches (InstructPix2Pix, Concept Slider) on PSNR, LPIPS, CLIP Score, and DreamSim for intended attribute changes (Table 1).
  • User preference: 87.5% of human raters preferred MARBLE outputs for material control tasks on real images.

MARBLE demonstrates high data efficiency: as few as 8–16 synthetic training objects suffice for effective learning of attribute control directions.

5. Multiple Edits and Domain Adaptation

MARBLE supports simultaneous adjustment of multiple material attributes in a single diffusion model forward pass by summing respective attribute directions in CLIP-space, enabling efficient, combinatorial material recomposition. The method is robust across domains and styles, including photographs and diverse artwork such as paintings and anime, due to its reliance solely on modification in CLIP-space and not on explicit retraining of the diffusion backbone.
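
A single-pass multi-attribute edit then amounts to summing the predicted offsets before conditioning the sampler (a sketch reusing the hypothetical AttributeDirectionMLP above; names and signatures are assumptions):

```python
import torch

def compose_attribute_edits(z_image: torch.Tensor,
                            mlps: dict,
                            deltas: dict) -> torch.Tensor:
    """Sum per-attribute CLIP-space offsets (e.g. roughness, metallic, glow)
    onto the base embedding for a single-pass multi-attribute edit."""
    z_edit = z_image.clone()
    for name, mlp in mlps.items():
        delta = torch.tensor([[deltas[name]]], dtype=z_image.dtype)
        z_edit = z_edit + mlp(z_image, delta)
    return z_edit
```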

6. Applications and Broader Significance

MARBLE provides a framework for practical and expressive material editing applicable to product design, industrial visualization, AR/VR asset generation, and digital art creation. Its ability to operate on real photographs, synthetic images, and stylized artwork without model retraining makes it suitable for creative pipelines. The method generalizes to unseen domains (e.g., non-photorealistic painting), and its disentangled parametric controls enable compositional material edits for complex editing scenarios.

Key components and their benefits:

  • Material Embeddings: CLIP feature vectors from exemplar images. Benefit: exemplar-based editing and linear blending.
  • Material Blending: linear interpolation in CLIP-space. Benefit: smooth, continuous, controllable transitions.
  • Parametric Control: attribute-specific directions via a shallow MLP in CLIP-space. Benefit: slider-like fine-grained property editing.
  • Attribution Block: targeted injection into the denoising UNet at the material-relevant layer. Benefit: preserves geometry and prevents artifacts.
  • Multi-Attribute Editing: summed embedding offsets for joint editing. Benefit: flexible, single-pass multi-property edits.
  • Domain Robustness: no backbone retraining; operates entirely in CLIP-space. Benefit: wide applicability (photos, paintings, etc.).

7. Limitations and Future Directions

The method may exhibit loss of textural details or minor artifacts for large or compounded edits, particularly when attribute directionality in CLIP-space is ambiguous or insufficiently captured by the synthetic attribute dataset. Future work may focus on extending the vocabulary of editable material attributes, refining the local linearity assumption of CLIP-space for rare materials, or integrating more sophisticated disentanglement mechanisms for even greater attribute specificity.


MARBLE establishes material embeddings in CLIP-space as a practical and highly effective substrate for fine-grained, compositional, and interpretable material editing in images. By isolating and manipulating material directions in CLIP-space and mapping them into generative models, it enables new applications in exemplar-driven manipulation, parametric control, and domain-agnostic image editing while maintaining strong semantic fidelity and user controllability. For detailed figures and further information, see the project's resource: https://marblecontrol.github.io/.