MATTE: Multi-Attribute Inversion
- Multi-Attribute Inversion (MATTE) is a framework that learns separate token embeddings for color, object, style, and layout from a single reference image.
- It overcomes previous limitations by optimizing tokens across both cross-attention layers and denoising timesteps, ensuring clean disentanglement of visual attributes.
- Empirical results show MATTE's superior controllability and quality, with improved CLIP similarity scores and favorable user-study outcomes.
Multi-Attribute Inversion (MATTE) is a multi-token inversion framework for diffusion models, designed to disentangle and separately control distinct visual attributes (specifically color, object, style, and layout) when constraining text-to-image synthesis with a reference image. In MATTE, four distinct token embeddings are learned from a single reference image via targeted optimization across both the cross-attention layers and denoising timesteps of a latent diffusion model, thereby overcoming limitations of prior single-token and axis-aligned multi-token inversion approaches. MATTE enables independent recombination and transfer of visual attributes for controllable image generation, with substantially improved attribute disentanglement as demonstrated in both quantitative and qualitative experiments (Agarwal et al., 2023).
1. Foundations: Latent Diffusion and Textual Inversion
MATTE builds on Denoising Diffusion Probabilistic Models (DDPMs) operating in latent space. Given an autoencoder with encoder $\mathcal{E}$ and decoder $\mathcal{D}$ mapping images to latents and back, forward diffusion adds Gaussian noise via $q(z_t \mid z_{t-1}) = \mathcal{N}\big(z_t;\, \sqrt{1-\beta_t}\, z_{t-1},\, \beta_t \mathbf{I}\big)$ over $T$ timesteps, while a U-Net $\epsilon_\theta(z_t, t, c)$ predicts the noise at each reverse step. Generation samples $z_T \sim \mathcal{N}(0, \mathbf{I})$ and denoises for $t = T, \dots, 1$, decoding with $\mathcal{D}$.
Textual conditioning is realized via cross-attention in the U-Net: a text encoder $\tau$ produces token embeddings $c = \tau(y)$ for a prompt $y$; at each cross-attention layer $l$, queries derived from U-Net features attend over key and value projections of $c$ to inject semantic context. Standard textual inversion methods learn a single new token embedding $v_*$ for a reference image, optimizing $v_*$ such that $\epsilon_\theta$ reconstructs the image when the learned token is prepended to the prompt.
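The single-token scheme can be illustrated with a toy gradient-descent sketch: only the new embedding is optimized against a noise-reconstruction loss while the model weights stay frozen. The linear "model" below is a stand-in for the real U-Net, purely for illustration.

```python
# Toy illustration of textual inversion: optimize only the new token
# embedding v_star against a reconstruction (MSE) objective while the
# "model" weights W remain frozen. The linear model is a stand-in for
# eps_theta, not the actual diffusion U-Net.
import random

random.seed(0)
DIM = 4
W = [random.gauss(0, 1) for _ in range(DIM)]             # frozen model weights
target_noise = [random.gauss(0, 1) for _ in range(DIM)]  # stand-in for the true noise
v_star = [0.0] * DIM                                     # learnable token embedding

def predict(v):
    # stand-in for eps_theta conditioned on the learned token
    return [w * x for w, x in zip(W, v)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

lr = 0.1
for _ in range(500):
    pred = predict(v_star)
    # gradient of the MSE w.r.t. v_star only; W is never touched
    grad = [2 * (p - t) * w / DIM for p, t, w in zip(pred, target_noise, W)]
    v_star = [v - lr * g for v, g in zip(v_star, grad)]

final_loss = mse(predict(v_star), target_noise)
```

The key property mirrored here is that the reconstruction objective shapes only the token embedding, leaving the generative model untouched.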
2. Motivation and Limitations of Prior Inversion Approaches
Single-token inversion approaches, such as Textual Inversion (TI) and DreamBooth, encode all reference-image attributes (color, object, layout, style) into one vector, enabling only the recreation of "images like" the reference without attribute-specific control.
Subsequent works have attempted to expand the inversion space:
- P+ (Voynov et al.): Learns one token per cross-attention layer (16 tokens).
- ProSpect (Zhang et al.): Learns one token for each denoising timestep stage (e.g., 10 tokens).
However, these methods align disentanglement solely along one axis (layer or timestep). Empirical analysis shows that different attributes (e.g., color and style, layout and color) are captured in overlapping sets of layers and/or timesteps, rendering single-axis decompositions insufficient for cleanly disentangling attributes.
3. Attribute Localization in Layers and Timesteps
Ablation experiments assess where attributes are encoded within the DDPM U-Net's architecture. Layers are grouped as Fine, Moderate, and Coarse; timesteps are split into four stages (beginning approximately at $t \approx 1000$, $800$, $600$, and $200$).
Findings:
- Color & Style: Captured primarily in the moderate layers and the early (high-noise) timestep stages.
- Object Semantics: Captured in the coarse layers and the middle timestep stages.
- Layout: Encoded in the coarse layers, mainly at the very earliest stage.
- Fine layers and late timesteps: Carry little semantic content for the target attributes.
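These localization findings can be summarized as a lookup from (layer group, timestep stage) to active attributes. The sketch below is a hedged paraphrase: the resolution thresholds and stage boundaries are illustrative assumptions, not the paper's exact splits.

```python
# Illustrative attribute-localization map; thresholds and stage boundaries
# are assumptions for this sketch, not the paper's exact values.
def layer_group(resolution):
    # assumption: fine = highest-resolution blocks, coarse = lowest
    if resolution >= 64:
        return "fine"
    if resolution >= 16:
        return "moderate"
    return "coarse"

def timestep_stage(t, T=1000):
    # four stages, from high noise (large t) down to low noise
    if t > 0.8 * T:
        return "stage1"
    if t > 0.6 * T:
        return "stage2"
    if t > 0.2 * T:
        return "stage3"
    return "stage4"

# Findings above: color/style in moderate layers at early stages; object in
# coarse layers at middle stages; layout in coarse layers at the earliest stage.
ACTIVE = {
    ("moderate", "stage1"): {"color", "style"},
    ("moderate", "stage2"): {"color", "style"},
    ("coarse", "stage1"): {"layout"},
    ("coarse", "stage2"): {"object"},
    ("coarse", "stage3"): {"object"},
}

def active_tokens(resolution, t):
    return ACTIVE.get((layer_group(resolution), timestep_stage(t)), set())
```

Note how fine layers and the last stage map to no attribute tokens, matching the finding that they carry little target-attribute content.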
A diagnostic procedure quantifies attribute entanglement: one attribute's token is held fixed while another is varied, and CLIP-based retrieval scores measure how well the fixed attribute is preserved, providing an objective evaluation of disentanglement strength.
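As a rough sketch of that diagnostic, with synthetic two-dimensional stand-ins for CLIP embeddings, a disentanglement score can be computed as the mean similarity between generations (in which one attribute is varied) and the reference embedding of the attribute held fixed:

```python
# Hedged sketch of the entanglement diagnostic; the vectors are synthetic
# stand-ins for CLIP embeddings, not real model outputs.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# "Generations" where the color token varies (first dim) but style (second
# dim) is held fixed; good disentanglement leaves the style dimension intact.
generations = [[0.2, 1.0], [0.8, 1.0], [-0.5, 1.0]]
style_reference = [0.0, 1.0]  # embedding of the fixed style

# high mean similarity to the fixed attribute => varying the other token
# did not disturb it, i.e. the two attributes are disentangled
score = sum(cosine(g, style_reference) for g in generations) / len(generations)
```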
4. Multi-Attribute Inversion Scheme and Algorithm
MATTE learns four separate tokens ($v_c$, $v_o$, $v_s$, $v_l$ for color, object, style, layout) from a single reference image with model weights fixed. The attribute-token map specifies which tokens are active at which layer group and timestep stage, as established via the disentanglement analysis.
For each diffusion iteration:
- The reconstruction loss $\mathcal{L}_{\text{rec}}$ minimizes the distance between the predicted and actual noise.
- The disentanglement loss $\mathcal{L}_{\text{dis}}$ encourages the color token $v_c$ to align with the ground-truth color (obtained via a palette extractor) and to be orthogonal to the style token $v_s$, penalizing mutual information.
- The object regularization loss $\mathcal{L}_{\text{obj}}$ aligns $v_o$ to the reference object's CLIP embedding.
The combined inversion objective is
$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{dis}}\,\mathcal{L}_{\text{dis}} + \lambda_{\text{obj}}\,\mathcal{L}_{\text{obj}},$$
with $\lambda_{\text{dis}} = \lambda_{\text{obj}} = 0.1$.
At each step, only tokens active at the relevant (layer group, timestep stage) are updated, enforcing sparsity and specificity in token learning.
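A minimal sketch of the combined objective and the masked update rule, using toy Python lists in place of real embeddings; helper names such as `masked_step` are illustrative, and the $0.1$ weights follow the values quoted in Section 5:

```python
# Toy sketch of the combined loss and masked token updates; embeddings are
# small Python lists and the function names are illustrative, not the paper's.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def combined_loss(l_rec, tok, gt_color, clip_obj, lam_dis=0.1, lam_obj=0.1):
    # disentanglement: align the color token to the extracted palette and
    # keep it orthogonal to the style token
    l_dis = (sum((c - g) ** 2 for c, g in zip(tok["color"], gt_color))
             + dot(tok["color"], tok["style"]) ** 2)
    # object regularization: align the object token to a reference embedding
    l_obj = sum((o - e) ** 2 for o, e in zip(tok["object"], clip_obj))
    return l_rec + lam_dis * l_dis + lam_obj * l_obj

def masked_step(tokens, grads, active, lr=0.01):
    # only tokens active at the current (layer group, timestep stage) move
    for name in active:
        tokens[name] = [v - lr * g for v, g in zip(tokens[name], grads[name])]
    return tokens

tokens = {n: [0.5, 0.5] for n in ("color", "object", "style", "layout")}
grads = {n: [1.0, 1.0] for n in tokens}
tokens = masked_step(tokens, grads, active={"color", "style"})
loss = combined_loss(1.0, tokens, gt_color=[0.0, 0.0], clip_obj=[0.0, 0.0])
```

After the step, only the color and style tokens have moved; the object and layout tokens are untouched, which is exactly the sparsity the masking enforces.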
MATTE Inversion Pseudocode Overview
Inputs: reference image $I$, a static text prompt, the number of diffusion steps $T$, the layer groups, the timestep-stage ranges, and the attribute-token mapping.
For each optimization iteration:
- Sample diffusion timestep and determine stage.
- Encode $I$ to latent $z_0 = \mathcal{E}(I)$; add noise to obtain $z_t$.
- Construct prompt with static words and active learned tokens per the attribute-conditioning map.
- Predict noise, compute all loss terms.
- Compute gradients w.r.t. the four tokens and apply AdamW updates.
- Return the learned tokens once the optimization iterations complete.
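The loop above can be sketched as a skeleton in which the diffusion model itself is stubbed out; the stage boundaries, token spellings, and prompt template here are illustrative assumptions:

```python
# Skeleton of the inversion loop; the diffusion model, encoder, and loss
# computations are stubbed out, and all concrete values are illustrative.
import random

random.seed(0)
T = 1000

def timestep_stage(t):
    # four stages with illustrative boundaries
    return 0 if t > 800 else 1 if t > 600 else 2 if t > 200 else 3

# which learned tokens enter the prompt at each stage (illustrative map)
STAGE_TOKENS = {
    0: ["<layout>", "<color>", "<style>"],
    1: ["<color>", "<style>", "<object>"],
    2: ["<object>"],
    3: [],
}

def build_prompt(stage):
    # static words plus the learned tokens active at this stage
    return ("a photo of " + " ".join(STAGE_TOKENS[stage])).strip()

history = []
for it in range(5):
    t = random.randrange(1, T + 1)     # sample a diffusion timestep
    stage = timestep_stage(t)          # determine its stage
    prompt = build_prompt(stage)       # conditioning for this iteration
    # (here: encode the image, noise the latent, predict noise, compute the
    #  losses, and update only the tokens listed in STAGE_TOKENS[stage])
    history.append((t, stage, prompt))
```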
5. Implementation and Hyperparameters
MATTE uses AdamW with no weight decay on the tokens and a batch size of 1 (single-image inversion). Layer and timestep splits match the empirical attribute localization: moderate, coarse, and fine layer groups, and four diffusion-stage splits. Regularizer weights are $0.1$ for the disentanglement losses. Ground-truth attribute values are extracted with ColorThief (for color) and CLIP-based nearest neighbors (for object). Convergence is typically reached within 200 iterations, though 500 to 1000 are used in practice.
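As a self-contained stand-in for the ColorThief step (the real library extracts a dominant-color palette from an image file), the sketch below quantizes RGB pixels and returns the most frequent colors:

```python
# Pure-Python stand-in for palette extraction (the paper uses ColorThief);
# the bucket size and pixel data here are illustrative.
from collections import Counter

def extract_palette(pixels, n_colors=3, bucket=32):
    # quantize each RGB channel into coarse buckets, then take the modes
    quantized = [tuple((c // bucket) * bucket for c in p) for p in pixels]
    return [color for color, _ in Counter(quantized).most_common(n_colors)]

# toy "image": mostly red, some green, a little blue
pixels = [(250, 10, 10)] * 6 + [(10, 250, 10)] * 3 + [(10, 10, 250)] * 1
palette = extract_palette(pixels, n_colors=2)
```

The returned palette serves as the ground-truth color target that $\mathcal{L}_{\text{dis}}$ aligns the color token against.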
6. Empirical Results and Comparative Evaluation
Qualitative Transfer and Mixing
MATTE achieves multi-attribute transfer by recombining learned tokens: e.g., color of the reference, object of the prompt, style of the reference, layout of the prompt (cf. Figures 1, 4–6 in Agarwal et al., 2023). Layer-only and timestep-only baselines (P+, ProSpect) struggle when target attributes co-occur in the same architectural regions.
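Conceptually, recombination amounts to choosing, per attribute, whether the token comes from the inverted reference or from the plain text prompt. The sketch below uses hypothetical token spellings (`<c*>`, `<s*>`, etc.):

```python
# Illustration of attribute recombination at generation time; token strings
# and the helper name are hypothetical placeholders.
def recombine(reference_tokens, prompt_words, take_from_ref):
    # take_from_ref: attributes sourced from the inverted reference image;
    # all other attributes fall back to the text prompt
    return {attr: reference_tokens[attr] if attr in take_from_ref
            else prompt_words.get(attr)
            for attr in ("color", "object", "style", "layout")}

ref = {"color": "<c*>", "object": "<o*>", "style": "<s*>", "layout": "<l*>"}
txt = {"object": "a dog", "layout": "centered"}
mix = recombine(ref, txt, take_from_ref={"color", "style"})
```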
Quantitative Metrics
- Token correctness: CLIP similarity between MATTE generations and ground truth for color/object/style: $0.71/0.72/0.92$ (image–image), $0.74/0.73/0.87$ (text–text).
- Attribute disentanglement: MATTE outperforms P+ and ProSpect on all 6 attribute pairs (mean CLIP similarity for layout–color: $0.26$ vs. $0.24$/$0.19$).
- Ablation: Removing the disentanglement and object regularization losses reduces token correctness as measured by CLIP similarity.
- User study: In a forced-choice design, 24 participants preferred MATTE samples over both baselines in the majority of pairings.
7. Limitations and Future Research
MATTE's inversion procedure remains computationally intensive, typically requiring hundreds of gradient steps per image. The approach is constrained by the representational capacity of the fixed base diffusion model, so it cannot guarantee that every semantic concept in a prompt can be realized. Only four attributes are disentangled; subtler properties (e.g., lighting, nuanced semantics) are not addressed. Future work envisions combining MATTE with lightweight fine-tuning, extending to a richer attribute set, and improving inversion efficiency via meta-learning (Agarwal et al., 2023).
MATTE establishes that simultaneous optimization over both cross-attention layer and denoising-timestep dimensions is necessary for true multi-attribute disentanglement from a reference image. It introduces dedicated, disentangled tokens for color, object, style, and layout, enabling precise, compositional text-to-image synthesis that surpasses prior axis-aligned approaches (Agarwal et al., 2023).