MATTE: Multi-Attribute Inversion
- Multi-Attribute Inversion (MATTE) is a framework that learns separate token embeddings for color, object, style, and layout from a single reference image.
- It overcomes previous limitations by optimizing tokens across both cross-attention layers and denoising timesteps, ensuring clean disentanglement of visual attributes.
- Empirical results show MATTE's superior controllability and quality, with improved CLIP similarity scores and favorable user-study outcomes.
Multi-Attribute Inversion (MATTE) is a multi-token inversion framework for diffusion models, designed to disentangle and separately control distinct visual attributes (specifically color, object, style, and layout) when constraining text-to-image synthesis with a reference image. In MATTE, four distinct token embeddings are learned from a single reference image via targeted optimization across both the cross-attention layers and denoising timesteps of a latent diffusion model, thereby overcoming limitations of prior single-token and axis-aligned multi-token inversion approaches. MATTE enables independent recombination and transfer of visual attributes for controllable image generation, with substantially improved attribute disentanglement as demonstrated in both quantitative and qualitative experiments (Agarwal et al., 2023).
1. Foundations: Latent Diffusion and Textual Inversion
MATTE builds on Denoising Diffusion Probabilistic Models (DDPMs) operating in latent space. Given an autoencoder with encoder $\mathcal{E}$ and decoder $\mathcal{D}$ mapping images to latents and back, forward diffusion adds Gaussian noise via $q(z_t \mid z_{t-1}) = \mathcal{N}\big(z_t;\, \sqrt{1-\beta_t}\, z_{t-1},\, \beta_t \mathbf{I}\big)$ over $T$ timesteps, while a U-Net $\epsilon_\theta(z_t, t, c)$ predicts the noise at each reverse step. Generation samples $z_T \sim \mathcal{N}(0, \mathbf{I})$ and denoises for $t = T, \dots, 1$, decoding with $\mathcal{D}$.
Textual conditioning is realized via cross-attention in the U-Net: a text encoder $\tau$ produces token embeddings $c = \tau(y)$ for a prompt $y$; at each cross-attention layer $l$, queries derived from U-Net features attend over key and value projections of $c$ to inject semantic context. Standard textual inversion methods learn a single new token embedding $v_*$ for a reference image, optimizing $v_*$ such that $\epsilon_\theta$ reconstructs the image when the learned token is prepended to the prompt.
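The single-token scheme can be illustrated with a toy gradient-descent sketch: only the new embedding is optimized against a noise-reconstruction loss while the model weights stay frozen. The linear "model" below is a stand-in for the real U-Net, purely for illustration.

```python
# Toy illustration of textual inversion: optimize only the new token
# embedding v_star against a reconstruction (MSE) objective while the
# "model" weights W remain frozen. The linear model is a stand-in for
# eps_theta, not the actual diffusion U-Net.
import random

random.seed(0)
DIM = 4
W = [random.gauss(0, 1) for _ in range(DIM)]             # frozen model weights
target_noise = [random.gauss(0, 1) for _ in range(DIM)]  # stand-in for the true noise
v_star = [0.0] * DIM                                     # learnable token embedding

def predict(v):
    # stand-in for eps_theta conditioned on the learned token
    return [w * x for w, x in zip(W, v)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

lr = 0.1
for _ in range(500):
    pred = predict(v_star)
    # gradient of the MSE w.r.t. v_star only; W is never touched
    grad = [2 * (p - t) * w / DIM for p, t, w in zip(pred, target_noise, W)]
    v_star = [v - lr * g for v, g in zip(v_star, grad)]

final_loss = mse(predict(v_star), target_noise)
```

The key property mirrored here is that the reconstruction objective shapes only the token embedding, leaving the generative model untouched.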
2. Motivation and Limitations of Prior Inversion Approaches
Single-token inversion approaches, such as Textual Inversion (TI) and DreamBooth, encode all reference-image attributes (color, object, layout, style) into one vector, enabling only the recreation of "images like" the reference without attribute-specific control.
Subsequent works have attempted to expand the inversion space:
- P+ (Voynov et al.): Learns one token per cross-attention layer (16 tokens).
- ProSpect (Zhang et al.): Learns one token for each denoising timestep stage (e.g., 10 tokens).
However, these methods align disentanglement solely along one axis (layer or timestep). Empirical analysis shows that different attributes (e.g., color and style, layout and color) are captured in overlapping sets of layers and/or timesteps, rendering single-axis decompositions insufficient for cleanly disentangling attributes.
3. Attribute Localization in Layers and Timesteps
Ablation experiments assess where attributes are encoded within the DDPM U-Net's architecture. Layers are grouped as Fine, Moderate, and Coarse; timesteps are split into four stages (beginning approximately at $t \approx 1000$, $800$, $600$, and $200$).
Findings:
- Color & Style: Captured primarily in the moderate layers and the early (high-noise) timestep stages.
- Object Semantics: Captured in the coarse layers and the middle timestep stages.
- Layout: Encoded in the coarse layers, mainly at the very earliest stage.
- Fine layers and late timesteps: Carry little semantic content for the target attributes.
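These localization findings can be summarized as a lookup from (layer group, timestep stage) to active attributes. The sketch below is a hedged paraphrase: the resolution thresholds and stage boundaries are illustrative assumptions, not the paper's exact splits.

```python
# Illustrative attribute-localization map; thresholds and stage boundaries
# are assumptions for this sketch, not the paper's exact values.
def layer_group(resolution):
    # assumption: fine = highest-resolution blocks, coarse = lowest
    if resolution >= 64:
        return "fine"
    if resolution >= 16:
        return "moderate"
    return "coarse"

def timestep_stage(t, T=1000):
    # four stages, from high noise (large t) down to low noise
    if t > 0.8 * T:
        return "stage1"
    if t > 0.6 * T:
        return "stage2"
    if t > 0.2 * T:
        return "stage3"
    return "stage4"

# Findings above: color/style in moderate layers at early stages; object in
# coarse layers at middle stages; layout in coarse layers at the earliest stage.
ACTIVE = {
    ("moderate", "stage1"): {"color", "style"},
    ("moderate", "stage2"): {"color", "style"},
    ("coarse", "stage1"): {"layout"},
    ("coarse", "stage2"): {"object"},
    ("coarse", "stage3"): {"object"},
}

def active_tokens(resolution, t):
    return ACTIVE.get((layer_group(resolution), timestep_stage(t)), set())
```

Note how fine layers and the last stage map to no attribute tokens, matching the finding that they carry little target-attribute content.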
A diagnostic procedure quantifies attribute entanglement: one attribute's token is held fixed while another is varied, and CLIP-based retrieval scores measure how well the fixed attribute is preserved, providing an objective evaluation of disentanglement strength.
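As a rough sketch of that diagnostic, with synthetic two-dimensional stand-ins for CLIP embeddings, a disentanglement score can be computed as the mean similarity between generations (in which one attribute is varied) and the reference embedding of the attribute held fixed:

```python
# Hedged sketch of the entanglement diagnostic; the vectors are synthetic
# stand-ins for CLIP embeddings, not real model outputs.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# "Generations" where the color token varies (first dim) but style (second
# dim) is held fixed; good disentanglement leaves the style dimension intact.
generations = [[0.2, 1.0], [0.8, 1.0], [-0.5, 1.0]]
style_reference = [0.0, 1.0]  # embedding of the fixed style

# high mean similarity to the fixed attribute => varying the other token
# did not disturb it, i.e. the two attributes are disentangled
score = sum(cosine(g, style_reference) for g in generations) / len(generations)
```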
4. Multi-Attribute Inversion Scheme and Algorithm
MATTE learns four separate tokens ($v_c$, $v_o$, $v_s$, $v_l$ for color, object, style, layout) from a single reference image with model weights fixed. The attribute-token map specifies which tokens are active at which layer group and timestep stage, as established via the disentanglement analysis.
For each diffusion iteration:
- The reconstruction loss $\mathcal{L}_{\text{rec}}$ minimizes the distance between the predicted and actual noise.
- The disentanglement loss $\mathcal{L}_{\text{dis}}$ encourages the color token $v_c$ to align with the ground-truth color (obtained via a palette extractor) and to be orthogonal to the style token $v_s$, penalizing mutual information.
- The object regularization loss $\mathcal{L}_{\text{obj}}$ aligns $v_o$ to the reference object's CLIP embedding.
The combined inversion objective is
$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{dis}}\,\mathcal{L}_{\text{dis}} + \lambda_{\text{obj}}\,\mathcal{L}_{\text{obj}},$$
with $\lambda_{\text{dis}} = \lambda_{\text{obj}} = 0.1$.
At each step, only tokens active at the relevant (layer group, timestep stage) are updated, enforcing sparsity and specificity in token learning.
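A minimal sketch of the combined objective and the masked update rule, using toy Python lists in place of real embeddings; helper names such as `masked_step` are illustrative, and the $0.1$ weights follow the values quoted in Section 5:

```python
# Toy sketch of the combined loss and masked token updates; embeddings are
# small Python lists and the function names are illustrative, not the paper's.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def combined_loss(l_rec, tok, gt_color, clip_obj, lam_dis=0.1, lam_obj=0.1):
    # disentanglement: align the color token to the extracted palette and
    # keep it orthogonal to the style token
    l_dis = (sum((c - g) ** 2 for c, g in zip(tok["color"], gt_color))
             + dot(tok["color"], tok["style"]) ** 2)
    # object regularization: align the object token to a reference embedding
    l_obj = sum((o - e) ** 2 for o, e in zip(tok["object"], clip_obj))
    return l_rec + lam_dis * l_dis + lam_obj * l_obj

def masked_step(tokens, grads, active, lr=0.01):
    # only tokens active at the current (layer group, timestep stage) move
    for name in active:
        tokens[name] = [v - lr * g for v, g in zip(tokens[name], grads[name])]
    return tokens

tokens = {n: [0.5, 0.5] for n in ("color", "object", "style", "layout")}
grads = {n: [1.0, 1.0] for n in tokens}
tokens = masked_step(tokens, grads, active={"color", "style"})
loss = combined_loss(1.0, tokens, gt_color=[0.0, 0.0], clip_obj=[0.0, 0.0])
```

After the step, only the color and style tokens have moved; the object and layout tokens are untouched, which is exactly the sparsity the masking enforces.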
MATTE Inversion Pseudocode Overview
Inputs: reference image $I$, a static text prompt, the number of diffusion steps $T$, the layer groups, the timestep-stage ranges, and the attribute-token mapping.
For each optimization iteration:
- Sample diffusion timestep and determine stage.
- Encode $I$ to latent $z_0 = \mathcal{E}(I)$; add noise to obtain $z_t$.
- Construct prompt with static words and active learned tokens per the attribute-conditioning map.
- Predict noise, compute all loss terms.
- Compute gradients w.r.t. the four tokens and apply AdamW updates.
- Return the learned tokens once the optimization iterations complete.
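The loop above can be sketched as a skeleton in which the diffusion model itself is stubbed out; the stage boundaries, token spellings, and prompt template here are illustrative assumptions:

```python
# Skeleton of the inversion loop; the diffusion model, encoder, and loss
# computations are stubbed out, and all concrete values are illustrative.
import random

random.seed(0)
T = 1000

def timestep_stage(t):
    # four stages with illustrative boundaries
    return 0 if t > 800 else 1 if t > 600 else 2 if t > 200 else 3

# which learned tokens enter the prompt at each stage (illustrative map)
STAGE_TOKENS = {
    0: ["<layout>", "<color>", "<style>"],
    1: ["<color>", "<style>", "<object>"],
    2: ["<object>"],
    3: [],
}

def build_prompt(stage):
    # static words plus the learned tokens active at this stage
    return ("a photo of " + " ".join(STAGE_TOKENS[stage])).strip()

history = []
for it in range(5):
    t = random.randrange(1, T + 1)     # sample a diffusion timestep
    stage = timestep_stage(t)          # determine its stage
    prompt = build_prompt(stage)       # conditioning for this iteration
    # (here: encode the image, noise the latent, predict noise, compute the
    #  losses, and update only the tokens listed in STAGE_TOKENS[stage])
    history.append((t, stage, prompt))
```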
5. Implementation and Hyperparameters
MATTE uses AdamW with no weight decay on the tokens and a batch size of 1 (single-image inversion). Layer and timestep splits match the empirical attribute localization: moderate, coarse, and fine layer groups, and four diffusion-stage splits. Regularizer weights are $0.1$ for the disentanglement losses. Ground-truth attribute values are extracted with ColorThief (for color) and CLIP-based nearest neighbors (for object). Convergence is typically reached within 200 iterations, though 500 to 1000 are used in practice.
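As a self-contained stand-in for the ColorThief step (the real library extracts a dominant-color palette from an image file), the sketch below quantizes RGB pixels and returns the most frequent colors:

```python
# Pure-Python stand-in for palette extraction (the paper uses ColorThief);
# the bucket size and pixel data here are illustrative.
from collections import Counter

def extract_palette(pixels, n_colors=3, bucket=32):
    # quantize each RGB channel into coarse buckets, then take the modes
    quantized = [tuple((c // bucket) * bucket for c in p) for p in pixels]
    return [color for color, _ in Counter(quantized).most_common(n_colors)]

# toy "image": mostly red, some green, a little blue
pixels = [(250, 10, 10)] * 6 + [(10, 250, 10)] * 3 + [(10, 10, 250)] * 1
palette = extract_palette(pixels, n_colors=2)
```

The returned palette serves as the ground-truth color target that $\mathcal{L}_{\text{dis}}$ aligns the color token against.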
6. Empirical Results and Comparative Evaluation
Qualitative Transfer and Mixing
MATTE achieves multi-attribute transfer by recombining learned tokens: e.g., color of the reference, object of the prompt, style of the reference, layout of the prompt (cf. Figures 1, 4–6 in Agarwal et al., 2023). Layer-only and timestep-only baselines (P+, ProSpect) struggle when target attributes co-occur in the same architectural regions.
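Conceptually, recombination amounts to choosing, per attribute, whether the token comes from the inverted reference or from the plain text prompt. The sketch below uses hypothetical token spellings (`<c*>`, `<s*>`, etc.):

```python
# Illustration of attribute recombination at generation time; token strings
# and the helper name are hypothetical placeholders.
def recombine(reference_tokens, prompt_words, take_from_ref):
    # take_from_ref: attributes sourced from the inverted reference image;
    # all other attributes fall back to the text prompt
    return {attr: reference_tokens[attr] if attr in take_from_ref
            else prompt_words.get(attr)
            for attr in ("color", "object", "style", "layout")}

ref = {"color": "<c*>", "object": "<o*>", "style": "<s*>", "layout": "<l*>"}
txt = {"object": "a dog", "layout": "centered"}
mix = recombine(ref, txt, take_from_ref={"color", "style"})
```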
Quantitative Metrics
- Token correctness: CLIP similarity between MATTE generations and ground truth for color/object/style: $0.71/0.72/0.92$ (image–image), $0.74/0.73/0.87$ (text–text).
- Attribute disentanglement: MATTE outperforms P+ and ProSpect on all 6 attribute pairs (mean CLIP similarity for layout–color: $0.26$ vs. $0.24$/$0.19$).
- Ablation: Removing the disentanglement and object regularization losses reduces token correctness as measured by CLIP similarity.
- User study: In a forced-choice design, 24 participants preferred MATTE samples over both baselines in the majority of pairings.
7. Limitations and Future Research
MATTE's inversion procedure remains computationally intensive, typically requiring hundreds of gradient steps per image. The approach is constrained by the representational capacity of the fixed base diffusion model, so it cannot guarantee that every semantic concept in a prompt can be realized. Only four attributes are disentangled; subtler properties (e.g., lighting, nuanced semantics) are not addressed. Future work envisions combining MATTE with lightweight fine-tuning, extending to a richer attribute set, and improving inversion efficiency via meta-learning (Agarwal et al., 2023).
MATTE establishes that simultaneous optimization over both cross-attention layer and denoising-timestep dimensions is necessary for true multi-attribute disentanglement from a reference image. It introduces dedicated, disentangled tokens for color, object, style, and layout, enabling precise, compositional text-to-image synthesis that surpasses prior axis-aligned approaches (Agarwal et al., 2023).