MATTE: Multi-Attribute Inversion

Updated 29 January 2026
  • Multi-Attribute Inversion (MATTE) is a framework that learns separate token embeddings for color, object, style, and layout from a single reference image.
  • It overcomes previous limitations by optimizing tokens across both cross-attention layers and denoising timesteps, ensuring clean disentanglement of visual attributes.
  • Empirical results show MATTE’s superior controllability and quality with improved CLIP similarity scores and favorable user study outcomes.

Multi-Attribute Inversion (MATTE) is a multi-token inversion framework for diffusion models, designed to disentangle and separately control distinct visual attributes—specifically color, object, style, and layout—when constraining text-to-image synthesis with a reference image. In MATTE, four distinct token embeddings are learned from a single reference image via targeted optimization across both the cross-attention layers and denoising timesteps of a latent diffusion model, thereby overcoming limitations of prior single-token and axis-aligned multi-token inversion approaches. MATTE enables independent recombination and transfer of visual attributes for controllable image generation, with substantially improved attribute disentanglement as demonstrated in both quantitative and qualitative experiments (Agarwal et al., 2023).

1. Foundations: Latent Diffusion and Textual Inversion

MATTE builds on Denoising Diffusion Probabilistic Models (DDPMs) operating in latent space. Given an autoencoder $E$/$Dec$ mapping images $I$ to $z = E(I)$ and back, forward diffusion adds Gaussian noise via $q(z_t|z_{t-1})$ over $T$ timesteps, while a U-Net $\epsilon_\Theta$ predicts the noise at each reverse step. Generation samples $z_T \sim \mathcal{N}(0, I)$ and denoises for $T \to 0$, decoding with $Dec(z_0)$.
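The forward process above has a well-known closed form, $q(z_t|z_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\, z_0, (1-\bar\alpha_t) I)$, which lets any noised latent be sampled in one step. A minimal numpy sketch (the linear noise schedule and all names are illustrative, not taken from the paper):

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, alpha_bar, rng):
    """Closed-form forward diffusion: z_t ~ N(sqrt(abar_t) z0, (1 - abar_t) I)."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

rng = np.random.default_rng(0)
abar = make_alpha_bar()
z0 = rng.standard_normal((4, 64, 64))   # toy latent in place of E(I)
zt, eps = q_sample(z0, t=999, alpha_bar=abar, rng=rng)
```

During inversion, the pair `(zt, eps)` supplies the target for the noise-prediction loss at timestep `t`.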

Textual conditioning is realized via cross-attention in the U-Net: a text encoder produces token embeddings $c = \{c_1, ..., c_L\}$; at each cross-attention layer $\ell$, queries $Q_\ell(z_t)$ attend over key and value projections of $c$ to inject semantic context. Standard textual inversion methods learn a single new token embedding for a reference image, optimizing it such that $\epsilon_\Theta$ reconstructs the image when the learned token is prepended to the prompt.
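The cross-attention mechanism described above can be sketched in a few lines of numpy; shapes and names here are illustrative stand-ins (single-head, no output projection), not the U-Net's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(z, c, Wq, Wk, Wv):
    """Spatial latent features z (N, d_z) attend over token embeddings c (L, d_c)."""
    Q = z @ Wq                                    # queries from the latent
    K = c @ Wk                                    # keys from the text tokens
    V = c @ Wv                                    # values from the text tokens
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (N, L) attention over tokens
    return A @ V                                  # text-conditioned features

rng = np.random.default_rng(1)
d_z, d_c, d = 8, 16, 4
z = rng.standard_normal((10, d_z))    # 10 spatial positions
c = rng.standard_normal((5, d_c))     # 5 prompt tokens
out = cross_attention(z, c,
                      rng.standard_normal((d_z, d)),
                      rng.standard_normal((d_c, d)),
                      rng.standard_normal((d_c, d)))
```

A learned inversion token simply becomes one more row of `c`, so its gradient flows through exactly this attention path.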

2. Motivation and Limitations of Prior Inversion Approaches

Single-token inversion approaches, such as Textual Inversion (TI) and DreamBooth, encode all reference image attributes—color, object, layout, style—into one vector, enabling only the recreation of "images like" the reference, without attribute-specific control.

Subsequent works have attempted to expand the inversion space:

  • P+ (Voynov et al.): Learns one token per cross-attention layer (16 tokens).
  • ProSpect (Zhang et al.): Learns one token for each denoising timestep stage (e.g., 10 tokens).

However, these methods align disentanglement solely along one axis—layer or timestep. Empirical analysis shows that different attributes (e.g., color and style, layout and color) are captured in overlapping sets of layers and/or timesteps, rendering single-axis decompositions insufficient for cleanly disentangling attributes.

3. Attribute Localization in Layers and Timesteps

Ablation experiments assess where attributes are encoded within the DDPM U-Net's architecture. Layers are grouped as Fine ($L_{1-2}$, $L_{14-16}$), Moderate ($L_{3-5}$, $L_{10-13}$), and Coarse ($L_{6-9}$); timesteps are split into four stages ($t_1'$: 800–1000, $t_2'$: 600–800, $t_3'$: 200–600, $t_4'$: 0–200).

Findings:

  • Color & Style: Captured primarily in moderate layers ($L_{3-5}$, $L_{10-13}$) and early timesteps ($t_1'$, $t_2'$).
  • Object Semantics: Captured in coarse layers ($L_{6-9}$) and middle timesteps ($t_2'$, $t_3'$).
  • Layout: Encoded in coarse layers ($L_{6-9}$), mainly at the very earliest stage ($t_1'$).
  • Fine layers and late timesteps ($t_4'$): Little semantic content for target attributes.
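The groupings above amount to two small lookup functions. A sketch, assuming 16 U-Net cross-attention layers indexed 1–16 and timesteps in [0, 1000] (the function names and dictionary layout are hypothetical):

```python
# Layer groups and timestep stages as reported in the localization analysis.
LAYER_GROUPS = {
    "fine":     set(range(1, 3)) | set(range(14, 17)),   # L1-2, L14-16
    "moderate": set(range(3, 6)) | set(range(10, 14)),   # L3-5, L10-13
    "coarse":   set(range(6, 10)),                       # L6-9
}

def layer_group(layer):
    """Map a cross-attention layer index (1..16) to its group name."""
    for name, members in LAYER_GROUPS.items():
        if layer in members:
            return name
    raise ValueError(f"unknown layer {layer}")

def timestep_stage(t):
    """Map a diffusion timestep t in [0, 1000] to its stage t1'..t4'."""
    if t > 800: return "t1"   # 800-1000
    if t > 600: return "t2"   # 600-800
    if t > 200: return "t3"   # 200-600
    return "t4"               # 0-200
```

During inversion, the pair `(layer_group(l), timestep_stage(t))` determines which attribute tokens receive gradient updates.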

A diagnostic procedure quantifies attribute entanglement: one attribute's token is held fixed while another is varied, and CLIP-based retrieval scores measure how well the fixed attribute is preserved, yielding an objective evaluation of disentanglement strength.
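The retrieval scoring behind this diagnostic reduces to nearest-neighbor matching in CLIP embedding space. A toy sketch with plain numpy vectors standing in for CLIP embeddings (the function names are illustrative):

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def retrieval_score(gen_embs, attr_embs, target_idx):
    """Fraction of generations whose nearest attribute embedding is the target."""
    sims = cosine_sim(gen_embs, attr_embs)
    return float(np.mean(sims.argmax(axis=1) == target_idx))
```

A high score when varying one token while holding another fixed indicates the fixed attribute survived the change, i.e. the two tokens are disentangled.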

4. Multi-Attribute Inversion Scheme and Algorithm

MATTE learns four separate tokens ($\langle c \rangle$, $\langle o \rangle$, $\langle s \rangle$, $\langle l \rangle$ for color, object, style, layout) from a single reference image with model weights fixed. The attribute-token map specifies which tokens are active at which layer group and timestep stage, as established via the disentanglement analysis.

For each diffusion iteration:

  • The reconstruction loss $\mathcal{L}_R$ minimizes the $\ell_2$ distance between predicted and actual noise.
  • The disentanglement loss $\mathcal{L}_{CS}$ encourages $\langle c \rangle$ (color) to align with the ground-truth color $c_{gt}$ (obtained via a palette extractor) and to be orthogonal to style ($\langle s \rangle$), penalizing mutual information.
  • The object regularization loss $\mathcal{L}_O$ aligns $\langle o \rangle$ to the reference object's CLIP embedding $o_{gt}$.

The combined inversion objective is

$\mathcal{L}_{inv} = \mathcal{L}_R + \lambda_{CS} \mathcal{L}_{CS} + \lambda_O \mathcal{L}_O$

with $\lambda_{CS} = \lambda_O = 0.1$.

At each step, only tokens active at the relevant (layer group, timestep stage) are updated, enforcing sparsity and specificity in token learning.
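The combined objective can be sketched as a single function. This is a minimal numpy stand-in: the cosine-alignment and squared-cosine orthogonality terms are assumed functional forms for $\mathcal{L}_{CS}$ and $\mathcal{L}_O$, not the paper's exact losses:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def matte_inversion_loss(eps_pred, eps, c_tok, s_tok, o_tok, c_gt, o_gt,
                         lam_cs=0.1, lam_o=0.1):
    """Sketch of L_inv = L_R + lam_cs * L_CS + lam_o * L_O (assumed forms)."""
    L_R = float(np.mean((eps_pred - eps) ** 2))               # noise reconstruction
    L_CS = (1.0 - cos(c_tok, c_gt)) + cos(c_tok, s_tok) ** 2  # color align + color/style orthogonality
    L_O = 1.0 - cos(o_tok, o_gt)                              # object alignment to CLIP embedding
    return L_R + lam_cs * L_CS + lam_o * L_O
```

With perfect noise prediction, a color token parallel to $c_{gt}$ and orthogonal to $\langle s \rangle$, and an object token parallel to $o_{gt}$, every term vanishes.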

MATTE Inversion Pseudocode Overview

Inputs: reference image $I_{ref}$, static text prompt, diffusion steps $T = 1000$, layer groups, stage ranges, and token mapping.

For each optimization iteration:

  1. Sample diffusion timestep and determine stage.
  2. Encode $I_{ref}$ to latent $z_0$; noise it to $z_t$.
  3. Construct prompt with static words and active learned tokens per the attribute-conditioning map.
  4. Predict noise, compute all loss terms.
  5. Compute gradients w.r.t. the four tokens and apply Adam updates.
  6. After $N_{steps} \approx 500$–$1000$ iterations, return the learned tokens.
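The selective-update structure of the loop above can be sketched as follows. This is a toy simulation: the conditioning map, the decay-style "update", and all names are illustrative, standing in for the real attribute-token map and Adam steps on the inversion loss:

```python
import numpy as np

# Hypothetical attribute-conditioning map: which tokens are active at each
# (layer group, timestep stage), loosely following the Section 3 analysis.
ACTIVE_TOKENS = {
    ("moderate", "t1"): {"c", "s"},
    ("moderate", "t2"): {"c", "s"},
    ("coarse",   "t1"): {"l"},
    ("coarse",   "t2"): {"o"},
    ("coarse",   "t3"): {"o"},
}

def timestep_stage(t):
    if t > 800: return "t1"
    if t > 600: return "t2"
    if t > 200: return "t3"
    return "t4"

def inversion_loop(n_steps=500, dim=8, seed=0):
    """Toy sketch: only tokens active at the sampled (group, stage) are updated."""
    rng = np.random.default_rng(seed)
    tokens = {k: rng.standard_normal(dim) for k in "cosl"}  # <c>, <o>, <s>, <l>
    updates = {k: 0 for k in tokens}
    for _ in range(n_steps):
        t = rng.integers(0, 1000)             # sample a diffusion timestep
        stage = timestep_stage(t)
        for group in ("fine", "moderate", "coarse"):
            for tok in ACTIVE_TOKENS.get((group, stage), ()):
                tokens[tok] -= 1e-2 * tokens[tok]  # stand-in for an Adam step
                updates[tok] += 1
    return tokens, updates
```

Because inactive (group, stage) pairs contribute nothing, each token only accumulates gradient where its attribute is actually encoded.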

5. Implementation and Hyperparameters

MATTE uses AdamW ($\mathrm{lr} = 10^{-2}$, $\beta = (0.9, 0.999)$) with no weight decay for tokens and a batch size of 1 (single-image inversion). Layer and timestep splits match the empirical attribute localization: moderate, coarse, and fine layer groups; four diffusion-stage splits. Regularizer weights are $0.1$ for the disentanglement losses. Ground-truth attribute values are extracted with ColorThief (for color) and CLIP-based nearest neighbors (for object). Convergence is typically reached within 200 iterations, but 500–1000 are used.

6. Empirical Results and Comparative Evaluation

Qualitative Transfer and Mixing

MATTE achieves multi-attribute transfer by recombining learned tokens: color of reference, object of prompt, style of reference, layout of prompt (cf. Figures 1, 4–6 in (Agarwal et al., 2023)). Layer-only and timestep-only baselines (P+, ProSpect) struggle when target attributes co-occur in the same architectural regions.

Quantitative Metrics

  • Token correctness: CLIP similarity between MATTE generations and ground-truth for color, object, style: $0.71/0.72/0.92$ (image–image), $0.74/0.73/0.87$ (text–text).
  • Attribute disentanglement: MATTE outperforms P+ and ProSpect on all 6 attribute pairs (mean CLIP sim for layout–color $0.26$ vs $0.24/0.19$).
  • Ablation: Removing $\mathcal{L}_{CS}$ and $\mathcal{L}_O$ reduces token correctness by $\sim 0.1$ CLIP sim.
  • User study: In a forced-choice design, 24 participants preferred MATTE samples in $74.6\%$ of pairings, versus $12.2\%$/$13.2\%$ for baselines.

7. Limitations and Future Research

MATTE’s inversion procedure remains computationally intensive, typically requiring hundreds of gradient steps per image. The approach is constrained by the representational capacity of the fixed base diffusion model, so it cannot guarantee the inclusion of every possible semantic concept in prompts. Only four attributes are disentangled; more subtle properties (e.g., lighting, nuanced semantics) are not addressed. Future work envisions combining MATTE with lightweight fine-tuning, extending to a richer attribute set, and improving inversion efficiency via meta-learning (Agarwal et al., 2023).


MATTE establishes that simultaneous optimization over both cross-attention layer and denoising-timestep dimensions is necessary for true multi-attribute disentanglement from a reference image. It introduces dedicated, disentangled tokens for color, object, style, and layout, enabling precise, compositional text-to-image synthesis that surpasses prior axis-aligned approaches (Agarwal et al., 2023).
