
Contrastive Object-Centric Diffusion Alignment

Updated 6 January 2026
  • The paper introduces CODA, which integrates slot attention with a frozen diffusion backbone and contrastive alignment loss to address slot entanglement and improve object-to-slot correspondence.
  • Its architecture leverages DINOv2 for feature extraction and register slots to absorb ambiguous background attention, enabling efficient fine-tuning of cross-attention projections.
  • Empirical results show significant improvements in object discovery and compositional image generation across synthetic and real-world benchmarks, with notable gains in FG-ARI and reconstruction metrics.

Contrastive Object-centric Diffusion Alignment (CODA) is an augmentation to object-centric learning (OCL) frameworks that integrates slot attention mechanisms with pretrained diffusion models. CODA addresses critical challenges in OCL, specifically slot entanglement and weak slot-image correspondence, by introducing register slots to capture residual attention and applying a contrastive alignment loss to promote explicit object-to-slot assignments. This joint strategy strengthens mutual information between slot representations and input images, leading to improved object discovery, property prediction, and compositional generation performance across both synthetic and real-world visual domains (Nguyen et al., 3 Jan 2026).

1. System Architecture

CODA is constructed atop a frozen Stable Diffusion v1.5 denoising backbone and a DINOv2 (ViT-B/14) vision encoder. The pipeline can be described as:

  • An input image $x \in \mathbb{R}^{H \times W \times 3}$ is encoded via DINOv2 into $M$ feature vectors $\{h_1, \ldots, h_M\} \in \mathbb{R}^{M \times D_{in}}$.
  • Slot Attention (SA) iteratively refines $N$ randomly initialized slot queries $S = [s_1, \ldots, s_N] \in \mathbb{R}^{N \times D_{slot}}$, yielding $N$ object-centric vectors.
  • $R$ register slots $\bar{R} \in \mathbb{R}^{R \times D_{slot}}$, obtained by encoding only padding tokens with the frozen CLIP text encoder from Stable Diffusion, are prepended to the slots.
  • At each U-Net cross-attention layer, the key/value set is $[S; \bar{R}] \in \mathbb{R}^{(N+R) \times D_{slot}}$.
  • The softmax over all $N+R$ slots channels ambiguous/background attention to the register slots, insulating the semantic slots from interference.
  • With all U-Net weights frozen except for the key, value, and output projections in every cross-attention layer, the fine-tuning is limited and computationally efficient.
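The key/value concatenation and the softmax over $N+R$ slots can be sketched as follows. This is a minimal numpy illustration of the mechanism, not the paper's implementation; shapes and function names are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_with_registers(queries, semantic_slots, register_slots):
    """Cross-attention where keys/values are the concatenation [S; R_bar].

    queries:        (M, D) U-Net spatial tokens
    semantic_slots: (N, D) object slots from Slot Attention
    register_slots: (R, D) frozen register slots (padding-token embeddings)
    """
    kv = np.concatenate([semantic_slots, register_slots], axis=0)  # (N+R, D)
    d = queries.shape[-1]
    # The softmax runs over all N+R slots, so ambiguous/background queries
    # can place their mass on registers instead of polluting semantic slots.
    attn = softmax(queries @ kv.T / np.sqrt(d), axis=-1)           # (M, N+R)
    return attn @ kv, attn

rng = np.random.default_rng(0)
M, N, R, D = 16, 4, 3, 8
out, attn = cross_attention_with_registers(rng.normal(size=(M, D)),
                                           rng.normal(size=(N, D)),
                                           rng.normal(size=(R, D)))
assert out.shape == (M, D) and attn.shape == (M, N + R)
assert np.allclose(attn.sum(axis=-1), 1.0)  # normalized over N+R slots
```

In the actual model only the key/value/output projections around this operation are trainable; the sketch omits those projections for brevity.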

This architecture is visualized as: $x \rightarrow$ DINOv2 $\rightarrow$ Slot Attention $\rightarrow S$; then $S + \bar{R} \rightarrow$ cross-attention in U-Net $\rightarrow$ denoising prediction $\rightarrow \mathcal{L}_{dm}$, while negative slot sets $\tilde{S}$ yield the contrastive loss $\mathcal{L}_{CA}$.

2. Training Objective and Mathematical Formulation

CODA’s objective combines diffusion reconstruction and contrastive alignment losses:

2.1 Diffusion Reconstruction Loss

Given SD latents $z = E_{\text{vae}}(x)$ and noisy latents $z_\gamma = \sqrt{\sigma(\gamma)}\, z + \sqrt{\sigma(-\gamma)}\, \epsilon$ at log-SNR $\gamma$, the U-Net predicts the noise $\hat{\epsilon}_\theta(z_\gamma, \gamma, S, \bar{R})$:

$$\mathcal{L}_{\mathrm{dm}}(x) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I),\, \gamma} \left[ \| \epsilon - \hat{\epsilon}_\theta(z_\gamma, \gamma, S, \bar{R}) \|_2^2 \right].$$

Only the Slot Attention parameters $\theta_{SA}$ and the cross-attention projections are updated.
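A one-sample sketch of this loss, assuming a sigmoid $\sigma$ applied to the log-SNR for the noising coefficients and a stand-in callable in place of the conditioned U-Net:

```python
import numpy as np

def sigmoid(g):
    return 1.0 / (1.0 + np.exp(-g))

def diffusion_loss(z, gamma, eps, predict_noise):
    """One-sample estimate of L_dm. `predict_noise` stands in for the
    conditioned U-Net eps_hat(z_gamma, gamma, S, R_bar) (hypothetical)."""
    # Noising at log-SNR gamma: z_gamma = sqrt(sigma(gamma)) z + sqrt(sigma(-gamma)) eps
    z_gamma = np.sqrt(sigmoid(gamma)) * z + np.sqrt(sigmoid(-gamma)) * eps
    eps_hat = predict_noise(z_gamma, gamma)
    return np.sum((eps - eps_hat) ** 2)

rng = np.random.default_rng(1)
z, eps = rng.normal(size=64), rng.normal(size=64)
loss = diffusion_loss(z, gamma=0.5, eps=eps,
                      predict_noise=lambda zg, g: np.zeros_like(zg))
# With a zero predictor, the loss reduces to ||eps||^2.
assert np.isclose(loss, np.sum(eps ** 2))
```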

2.2 Contrastive Alignment Loss

Slot-image compatibility is quantified via negative prediction error:

$$f(s_i, x) := -\mathbb{E}_{\epsilon, \gamma}\, \| \epsilon - \hat{\epsilon}_\theta(z_\gamma, \gamma, \{s_i\}, \bar{R}) \|_2^2$$

An InfoNCE-style contrastive loss is computed over the $K = N$ semantic and $R$ register slots:

$$\mathcal{L}_{\mathrm{CA}} = - \sum_{i=1}^N \log \frac{\exp(f(s_i, x)/\tau)}{\sum_{j=1}^{N+R} \exp(f(s_j, x)/\tau)}$$

with temperature $\tau > 0$. Register slots serve as negatives, absorbing background attention.
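Given precomputed per-slot compatibility scores $f(s_i, x)$, the loss reduces to a numerically stable log-softmax over the $N+R$ slots. A small sketch (the score computation via the U-Net is mocked out; only the loss arithmetic is shown):

```python
import numpy as np

def contrastive_alignment_loss(scores, n_semantic, tau=0.1):
    """InfoNCE-style L_CA from per-slot compatibility scores f(s_i, x).

    scores: (N+R,) array; the first `n_semantic` entries belong to semantic
    slots, the remainder to register slots acting as negatives.
    """
    logits = scores / tau
    m = logits.max()
    log_denom = np.log(np.sum(np.exp(logits - m))) + m  # stable logsumexp
    # Sum of negative log-softmax terms over the semantic slots only.
    return -np.sum(logits[:n_semantic] - log_denom)

# When semantic slots score much higher than registers, the loss is small;
# when registers dominate, the loss is large.
scores = np.array([0.0, 0.0, 0.0, -10.0, -10.0])  # 3 semantic, 2 registers
loss_good = contrastive_alignment_loss(scores, n_semantic=3, tau=1.0)
loss_bad = contrastive_alignment_loss(-scores, n_semantic=3, tau=1.0)
assert loss_good < loss_bad
```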

2.3 Joint Objective

$$\mathcal{L}(x) = \mathcal{L}_{\mathrm{dm}}(x) + \lambda_{\mathrm{CA}}\, \mathcal{L}_{\mathrm{CA}}(x)$$

where $\lambda_{\mathrm{CA}} \in \{0.03, 0.05\}$ sets the trade-off (0.03 on COCO, 0.05 on VOC/MOVi).

2.4 Mutual Information Surrogate

Let aligned slots $S$ and mismatched slots $\tilde{S}$ define:

$$\Delta = \frac{1}{2} \int_{-\infty}^{\infty} \left[ \mathbb{E}_S \| \epsilon - \hat{\epsilon}(z_\gamma, \gamma, S) \|^2 - \mathbb{E}_{\tilde{S}} \| \epsilon - \hat{\epsilon}(z_\gamma, \gamma, \tilde{S}) \|^2 \right] d\gamma$$

Theorem 1 relates this to mutual information (MI):

$$-I(S; X) = \Delta + \mathbb{E}\left[ D_{KL}(q(\tilde{S}|S)\,\|\,p(\tilde{S}|S)) - D_{KL}(q(\tilde{S}|S)\,\|\,p(\tilde{S})) \right]$$

Choosing $q(\tilde{S}|S) = p(\tilde{S})$ reduces the KL terms (Corollary 1):

$$\Delta = -I(S; X) - D_{KL}(p(S)\,p(X)\,\|\,p(S, X))$$

Minimizing $\Delta$ approximates maximizing $I(S; X)$ with an additional reverse-KL regularizer. Thus the CODA objective $\mathcal{L}_{dm} + \lambda_{CA}\,\mathcal{L}_{CA}$ is a practical, sample-based estimator for mutual-information maximization.

3. Algorithmic Details

The implementation follows a clear sequence of operations:

Input: image x
Hyperparams: N slots, R register slots, temperature τ, weight λ_CA
Pretrained: DINOv2 encoder E_v, SD auto-encoder E_vae/D_vae, SD U-Net (frozen except cross-attn projections)

1.  z ← E_vae(x)                          # latent
2.  h ← E_v(x)                            # DINOv2 features
3.  S ← SlotAttention(h; N)               # semantic slots
4.  C = concat(S, R̄)                      # conditioning set
5.  Sample ε ~ N(0, I), γ ~ uniform over log-SNR
6.  z_γ = √σ(γ)·z + √σ(−γ)·ε
7.  ε̂_cond = U-Net(z_γ, γ; keys/vals from C)
8.  L_dm = ‖ε − ε̂_cond‖²
9.  # Hard negatives: replace half of S with slots from another image x′
10. S′ ← sample slots from x′
11. S̃ ← combine(S, S′)  (shared initialization)
12. C_neg = concat(S̃, R̄)
13. ε̂_neg = U-Net(z_γ, γ; keys/vals from C_neg)
14. f_i = −‖ε − ε̂({s_i}, R̄)‖²           # for i = 1…N+R
15. L_CA = −Σ_{i=1}^N log[ exp(f_i/τ) / Σ_{j=1}^{N+R} exp(f_j/τ) ]
16. L = L_dm + λ_CA · L_CA
17. Backpropagate L; update Slot Attention and cross-attn projections only
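The hard-negative construction in steps 9–11 can be sketched as follows; the function name and the exact slot-mixing details are illustrative assumptions, not the paper's code:

```python
import numpy as np

def make_hard_negatives(S, S_other, rng):
    """Hypothetical hard-negative set S_tilde: replace half of the current
    image's slots with slots sampled from another image x'."""
    N = S.shape[0]
    idx = rng.choice(N, size=N // 2, replace=False)  # slots to overwrite
    S_tilde = S.copy()
    S_tilde[idx] = S_other[rng.choice(S_other.shape[0], size=N // 2,
                                      replace=False)]
    return S_tilde

rng = np.random.default_rng(0)
S = np.ones((4, 8))        # stand-in semantic slots of image x
S_other = np.zeros((6, 8)) # stand-in slots of another image x'
S_tilde = make_hard_negatives(S, S_other, rng)
assert S_tilde.shape == S.shape
assert (S_tilde == 0).all(axis=1).sum() == 2  # exactly half replaced
```

Because the replaced slots come from a real image, they are plausible but mismatched, making them informative negatives for the contrastive term.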

4. Empirical Performance and Ablation

CODA demonstrates measurable improvements over strong baselines on diverse object-centric benchmarks. Results are summarized as follows.

4.1 Unsupervised Object Discovery

| Dataset | Metric | SlotAdapt | CODA | Δ |
|---|---|---|---|---|
| VOC | FG-ARI | 29.6% | 32.23% | +2.63 |
| VOC | mBOᶦ | 51.5% | 55.38% | +3.88 |
| VOC | mIOUᶦ | 46.80% | 50.77% | +3.97 |
| VOC | mBOᶜ | 51.9% | 61.32% | +9.42 |
| VOC | mIOUᶜ | 49.30% | 56.30% | +7.00 |
| COCO | FG-ARI | 41.4% | 47.54% | +6.14 |
| COCO | mBOᶦ | 35.1% | 36.61% | +1.51 |
| COCO | mIOUᶦ | 36.1% | 36.41% | +0.31 |

Synthetic datasets (MOVi-C, MOVi-E):

  • MOVi-C: FG-ARI=59.19% vs. best baseline 52.04% (+7.15%), mIoU=51.94% vs. 44.19% (+7.75%)
  • MOVi-E: FG-ARI=59.04% vs. SlotAdapt 56.45% (+2.59%), mIoU=45.21% vs. 41.85% (+3.36%)
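For reference, FG-ARI is the adjusted Rand index computed only over ground-truth foreground pixels, so over-segmenting the background is not penalized. A self-contained sketch (plain numpy, no sklearn; the `bg_label` convention is an assumption):

```python
import numpy as np

def adjusted_rand_index(a, b):
    """Plain ARI from the contingency table of two labelings."""
    labels_a, labels_b = np.unique(a), np.unique(b)
    table = np.array([[np.sum((a == i) & (b == j)) for j in labels_b]
                      for i in labels_a])
    comb2 = lambda x: x * (x - 1) / 2  # pairs within each count
    sum_ij = comb2(table).sum()
    sum_a = comb2(table.sum(axis=1)).sum()
    sum_b = comb2(table.sum(axis=0)).sum()
    n = comb2(table.sum())
    expected = sum_a * sum_b / n
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

def fg_ari(pred, true, bg_label=0):
    """FG-ARI: ARI restricted to ground-truth foreground pixels."""
    mask = true != bg_label
    return adjusted_rand_index(pred[mask], true[mask])

true = np.array([0, 0, 1, 1, 2, 2])  # 0 = background
pred = np.array([5, 3, 7, 7, 9, 9])  # arbitrary slot ids; background split is ignored
assert fg_ari(pred, true) == 1.0     # perfect foreground grouping
```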

4.2 Compositional Image Generation

| Setting | LSD | SlotDiff. | SlotAdapt | CODA |
|---|---|---|---|---|
| Reconstruction FID | 35.54 | 19.45 | 10.86 | 10.65 |
| Reconstruction KID×10³ | 19.09 | 5.85 | 0.39 | 0.35 |
| Composition FID | 167.23 | 64.21 | 40.57 | 31.03 |
| Composition KID×10³ | 103.48 | 57.31 | 34.38 | 30.44 |

4.3 Ablation Analysis (VOC FG-ARI)

Configurations combining the contrastive alignment loss (CA), register slots (Reg), and the CO component were ablated on VOC FG-ARI. Partial configurations score between roughly 11.96% and 19.62%, while the full model (CA + Reg + CO, i.e., CODA) reaches 32.23%.

Register slots alone (+Reg) produce an FG-ARI increase of +3.9% over frozen-U-Net baselines; addition of contrastive loss (+CO) provides a further +1.6% improvement.

5. Practical Considerations, Scalability, and Limitations

  • Computational Overhead: R=77 register slots add roughly 0.02% per-step GPU time. Only Slot Attention (a few million parameters) and the cross-attention projections are updated; most of SD remains frozen.
  • Scalability: Register slots and contrastive term generalize to larger diffusion backbones (e.g., SDXL, DiT) with no required architectural changes. Semantic slot count NN is user-controlled; register slots absorb residuals.
  • Limitations: Slot count NN must be selected a priori; future work could include adaptive slot numbers. Reliance on DINOv2 and SD v1.5 may entail dataset bias and challenges for out-of-domain generalization. High-quality pixel-level reconstruction is hindered by the slot bottleneck. Extensions to larger diffusion/transformer models (SDXL, FLUX, DiTs) are suggested as promising directions.

In sum, CODA’s performance gains stem from three main innovations: register slots to isolate background attention, lightweight cross-attention fine-tuning to reduce text bias, and contrastive loss as a mutual information maximization surrogate. Collectively, CODA achieves state-of-the-art object-centric segmentation and compositional generation in both synthetic and real-world settings (Nguyen et al., 3 Jan 2026).
