Contrastive Object-Centric Diffusion Alignment
- The paper introduces CODA, which integrates slot attention with a frozen diffusion backbone and contrastive alignment loss to address slot entanglement and improve object-to-slot correspondence.
- Its architecture leverages DINOv2 for feature extraction and register slots to absorb ambiguous background attention, enabling efficient fine-tuning of cross-attention projections.
- Empirical results show significant improvements in object discovery and compositional image generation across synthetic and real-world benchmarks, with notable gains in FG-ARI and reconstruction metrics.
Contrastive Object-centric Diffusion Alignment (CODA) is an augmentation to object-centric learning (OCL) frameworks that integrates slot attention mechanisms with pretrained diffusion models. CODA addresses critical challenges in OCL, specifically slot entanglement and weak slot-image correspondence, by introducing register slots to capture residual attention and applying a contrastive alignment loss to promote explicit object-to-slot assignments. This joint strategy strengthens mutual information between slot representations and input images, leading to improved object discovery, property prediction, and compositional generation performance across both synthetic and real-world visual domains (Nguyen et al., 3 Jan 2026).
1. System Architecture
CODA is constructed atop a frozen Stable Diffusion v1.5 denoising backbone and a DINOv2 (ViT-B/14) vision encoder. The pipeline can be described as:
- An input image $x$ is encoded via DINOv2 into feature vectors $h$.
- Slot Attention (SA) iteratively refines randomly initialized slot queries, yielding $N$ object-centric vectors $S = \{s_1, \dots, s_N\}$.
- $R$ register slots $\bar{R}$, obtained by encoding only padding tokens with the frozen CLIP text encoder from Stable Diffusion, are prepended to the slots.
- At each U-Net cross-attention layer, the key/value set is the concatenation $C = [S; \bar{R}]$.
- Because the softmax is taken over this joint slot set, ambiguous/background attention is channeled to the register slots, insulating the semantic slots from interference.
- With all U-Net weights frozen except for the key, value, and output projections in every cross-attention layer, the fine-tuning is limited and computationally efficient.
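The conditioning path above can be sketched in a few lines; the shapes are illustrative, and the trainable key/value/output projections are replaced by identity maps to keep the sketch minimal:

```python
import numpy as np

def cross_attend(queries, context, d_k):
    """Single-head cross-attention: U-Net spatial tokens attend over slots.

    In CODA only the key/value/output projections of each such layer are
    trained; here they are identity maps for illustration.
    """
    logits = queries @ context.T / np.sqrt(d_k)     # (P, N+R) scores
    logits -= logits.max(axis=-1, keepdims=True)    # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)        # softmax over ALL slots
    return attn, attn @ context                     # weights, attended values

rng = np.random.default_rng(0)
N, R, d, P = 4, 2, 64, 16           # semantic slots, register slots, dim, tokens
S = rng.normal(size=(N, d))         # semantic slots from Slot Attention
R_bar = rng.normal(size=(R, d))     # register slots (encoded CLIP padding tokens)
C = np.concatenate([S, R_bar])      # conditioning set C = [S; R̄]
Q = rng.normal(size=(P, d))         # U-Net query tokens

attn, out = cross_attend(Q, C, d)
# Each query's unit attention mass is shared between semantic and register
# slots, so ambiguous/background mass can land on R̄ instead of polluting S.
assert np.allclose(attn.sum(axis=-1), 1.0)
```

Since values come from the same joint set, any mass the softmax assigns to register slots is mass that semantic slots never have to absorb.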
(Architecture figure: the conditioning set feeds the frozen U-Net's cross-attention layers; the conditioned denoising prediction $\hat{\epsilon}$ supplies the reconstruction loss, while negative slot sets yield the contrastive loss $\mathcal{L}_{CA}$.)
2. Training Objective and Mathematical Formulation
CODA’s objective combines diffusion reconstruction and contrastive alignment losses:
2.1 Diffusion Reconstruction Loss
Given SD latents $z$ and noisy latents $z_\gamma$ at log-SNR $\gamma$, the U-Net predicts the noise $\hat{\epsilon}$:

$\mathcal{L}_{dm} = \mathbb{E}_{\epsilon, \gamma} \left[ \| \epsilon - \hat{\epsilon}(z_\gamma, \gamma, C) \|^2 \right]$

Only the Slot Attention parameters and the cross-attention key/value/output projections are updated.
2.2 Contrastive Alignment Loss
Slot-image compatibility is quantified via the negative prediction error obtained when conditioning on a single slot together with the registers:

$f_i = -\| \epsilon - \hat{\epsilon}(z_\gamma, \gamma, \{s_i\} \cup \bar{R}) \|^2, \quad i = 1, \dots, N+R$

An InfoNCE-style contrastive loss is then taken over semantic and register slots:

$\mathcal{L}_{CA} = -\sum_{i=1}^{N} \log \frac{\exp(f_i/\tau)}{\sum_{j=1}^{N+R} \exp(f_j/\tau)}$

with $\tau$ as the temperature. Register slots serve as negatives, absorbing background attention.
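A minimal numeric sketch of this loss (the per-slot errors below are stand-ins for the single-slot denoising errors, and the function name is ours):

```python
import numpy as np

def contrastive_alignment_loss(errors, n_semantic, tau=0.1):
    """InfoNCE over slot compatibilities f_i = -error_i.

    errors: length N+R array of per-slot denoising errors; only the N
    semantic slots appear in the numerator, while register slots act as
    always-available negatives in the denominator.
    """
    f = -np.asarray(errors, dtype=float)      # compatibility scores
    logits = f / tau
    logits -= logits.max()                    # numerical stability
    log_den = np.log(np.exp(logits).sum())    # partition over all N+R slots
    return -(logits[:n_semantic] - log_den).sum()

N, R = 4, 2
well_aligned = [0.1, 0.1, 0.1, 0.1, 2.0, 2.0]    # semantic slots fit; registers don't
poorly_aligned = [2.0, 2.0, 2.0, 2.0, 0.1, 0.1]  # image explained by registers instead

# The loss is lower when the image is best explained by its own semantic slots.
assert contrastive_alignment_loss(well_aligned, N) < contrastive_alignment_loss(poorly_aligned, N)
```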
2.3 Joint Objective
$\mathcal{L} = \mathcal{L}_{dm} + \lambda_{CA} \, \mathcal{L}_{CA}$

where $\lambda_{CA}$ sets the trade-off (0.03 on COCO, 0.05 on VOC/MOVi).
2.4 Mutual Information Surrogate
Let aligned slots $S$ and mismatched slots $\tilde{S}$ define:
$\Delta = \frac{1}{2} \int_{-\infty}^{\infty} \left[ \mathbb{E}_S \| \epsilon - \hat{\epsilon}(z_\gamma, \gamma, S) \|^2 - \mathbb{E}_{\tilde{S}} \| \epsilon - \hat{\epsilon}(z_\gamma, \gamma, \tilde{S}) \|^2 \right] d\gamma$
Theorem 1 relates this to mutual information (MI):
$-I(S; X) = \Delta + \mathbb{E}\left[ D_{KL}(q(\tilde{S}|S)\|p(\tilde{S}|S)) - D_{KL}(q(\tilde{S}|S)\|p(\tilde{S})) \right]$
Choosing the negative distribution $q(\tilde{S} \mid S)$ close to the marginal $p(\tilde{S})$, as the hard-negative sampling scheme does, shrinks the second KL term (Corollary 1). Minimizing $\Delta$ then approximates maximizing $I(S; X)$ with an additional reverse-KL regularizer. Thus, the CODA objective is a practical, sample-based estimator for mutual-information maximization.
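A toy Monte-Carlo estimate of $\Delta$ under a one-dimensional linear-Gaussian model of our own construction (the clean latent equals the slot content, and an oracle denoiser inverts the noising given its conditioning): aligned conditioning yields lower denoising error than mismatched conditioning, so the estimate comes out negative, consistent with $\Delta$ lower-bounding $-I(S; X)$ for informative slots.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda g: 1.0 / (1.0 + np.exp(-g))

def eps_hat(z_g, g, slot):
    # Oracle denoiser: inverts z_γ = √σ(γ)·z + √σ(−γ)·ε assuming clean z = slot.
    return (z_g - np.sqrt(sigmoid(g)) * slot) / np.sqrt(sigmoid(-g))

gammas = np.linspace(-4, 4, 9)        # crude quadrature over log-SNR
n = 2000
delta = 0.0
for g in gammas:
    s = rng.normal(size=n)            # aligned slots = clean latents
    s_tilde = rng.permutation(s)      # mismatched slots from other samples
    eps = rng.normal(size=n)
    z_g = np.sqrt(sigmoid(g)) * s + np.sqrt(sigmoid(-g)) * eps
    aligned = np.mean((eps - eps_hat(z_g, g, s)) ** 2)         # exactly 0 here
    mismatch = np.mean((eps - eps_hat(z_g, g, s_tilde)) ** 2)  # strictly > 0
    delta += 0.5 * (aligned - mismatch)

# Informative conditioning drives Δ below zero.
assert delta < 0
```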
3. Algorithmic Details
The implementation follows a clear sequence of operations:
```
Input: image x
Hyperparams: N slots, R register slots, τ, λ_CA
Pretrained: DINOv2 encoder E_v, SD auto-encoder E_vae/D_vae,
            SD U-Net (frozen except cross-attn projections)

1.  z ← E_vae(x)                          # latent
2.  h ← E_v(x)                            # DINOv2 features
3.  S ← SlotAttention(h; N)               # semantic slots
4.  C ← concat(S, R̄)                      # conditioning set
5.  Sample ε ~ N(0, I), γ ~ Uniform over log-SNR
6.  z_γ ← √σ(γ)·z + √σ(−γ)·ε
7.  ε̂_cond ← U-Net(z_γ, γ; keys/values from C)
8.  L_dm ← ‖ε − ε̂_cond‖²
9.  # Hard negatives: replace half of S with slots from another image x′
10. S′ ← sample slots from x′
11. S̃ ← combine(S, S′)                    # shared initialization
12. C_neg ← concat(S̃, R̄)
13. ε̂_neg ← U-Net(z_γ, γ; keys/values from C_neg)
14. f_i ← −‖ε − ε̂({s_i}, R̄)‖²            # for i = 1…N+R
15. L_CA ← −Σ_{i=1}^{N} log( exp(f_i/τ) / Σ_{j=1}^{N+R} exp(f_j/τ) )
16. L ← L_dm + λ_CA·L_CA
17. Backpropagate L; update Slot Attention and cross-attn projections only
```
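Step 6 of the listing is a standard variance-preserving noising in log-SNR parameterization; a quick check, assuming $\sigma$ denotes the logistic sigmoid (as the $\sqrt{\sigma(\gamma)}$/$\sqrt{\sigma(-\gamma)}$ pairing suggests):

```python
import numpy as np

sigmoid = lambda g: 1.0 / (1.0 + np.exp(-g))

def noise_latent(z, gamma, eps):
    """Variance-preserving noising at log-SNR γ.

    σ(γ) + σ(−γ) = 1, so if z and ε have unit variance, z_γ does too.
    """
    return np.sqrt(sigmoid(gamma)) * z + np.sqrt(sigmoid(-gamma)) * eps

rng = np.random.default_rng(0)
z, eps = rng.normal(size=10_000), rng.normal(size=10_000)
z_g = noise_latent(z, gamma=0.5, eps=eps)

assert np.isclose(sigmoid(0.5) + sigmoid(-0.5), 1.0)   # signal/noise weights sum to 1
assert abs(z_g.var() - 1.0) < 0.05                     # variance preserved (sampling noise)
```

Large $\gamma$ keeps mostly signal, large negative $\gamma$ mostly noise, with total variance constant throughout.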
4. Empirical Performance and Ablation
CODA demonstrates measurable improvements over strong baselines on diverse object-centric benchmarks. Results are summarized as follows.
4.1 Unsupervised Object Discovery
| Dataset | Metric | SlotAdapt | CODA | Δ |
|---|---|---|---|---|
| VOC | FG-ARI | 29.6% | 32.23% | +2.63 |
| VOC | mBOᶦ | 51.5% | 55.38% | +3.88 |
| VOC | mIoUᶦ | — | 50.77% | +3.97 |
| VOC | mBOᶜ | 51.9% | 61.32% | +9.42 |
| VOC | mIoUᶜ | — | 56.30% | +7.00 |
| COCO | FG-ARI | 41.4% | 47.54% | +6.14 |
| COCO | mBOᶦ | 35.1% | 36.61% | +1.51 |
| COCO | mIoUᶦ | 36.1% | 36.41% | +0.31 |
Synthetic datasets (MOVi-C, MOVi-E):
- MOVi-C: FG-ARI=59.19% vs. best baseline 52.04% (+7.15%), mIoU=51.94% vs. 44.19% (+7.75%)
- MOVi-E: FG-ARI=59.04% vs. SlotAdapt 56.45% (+2.59%), mIoU=45.21% vs. 41.85% (+3.36%)
4.2 Compositional Image Generation
| Setting | LSD | SlotDiff. | SlotAdapt | CODA |
|---|---|---|---|---|
| Reconstruction FID | 35.54 | 19.45 | 10.86 | 10.65 |
| Reconstruction KID×1e3 | 19.09 | 5.85 | 0.39 | 0.35 |
| Composition FID | 167.23 | 64.21 | 40.57 | 31.03 |
| Composition KID×1e3 | 103.48 | 57.31 | 34.38 | 30.44 |
4.3 Ablation Analysis (VOC FG-ARI)
| CA | Reg | CA + Reg | CA+Reg+CO (CODA) |
|---|---|---|---|
| 15.44% | — | 19.21% | 32.23% |
| — | 19.21% | 19.62% | — |
| 11.96% | — | 15.48% | — |
| 19.62% | 47.03% | — | 32.23% |
Register slots alone (+Reg) produce an FG-ARI increase of +3.9% over frozen-U-Net baselines; addition of contrastive loss (+CO) provides a further +1.6% improvement.
5. Practical Considerations, Scalability, and Limitations
- Computational Overhead: R=77 register slots add only ~0.02% to per-step GPU time. Only Slot Attention (a few million parameters) and the cross-attention projections are updated; the rest of SD remains frozen.
- Scalability: Register slots and contrastive term generalize to larger diffusion backbones (e.g., SDXL, DiT) with no required architectural changes. Semantic slot count is user-controlled; register slots absorb residuals.
- Limitations: Slot count must be selected a priori; future work could include adaptive slot numbers. Reliance on DINOv2 and SD v1.5 may entail dataset bias and challenges for out-of-domain generalization. High-quality pixel-level reconstruction is hindered by the slot bottleneck. Extensions to larger diffusion/transformer models (SDXL, FLUX, DiTs) are suggested as promising directions.
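The overhead claim above can be made concrete with a back-of-the-envelope count; the cross-attention layer widths and the ~860M U-Net total below are ballpark assumptions of ours, not figures from the paper:

```python
# Rough count of CODA's trainable U-Net parameters: only the key, value, and
# output projections of each cross-attention layer. Widths are illustrative
# stand-ins for an SD v1.5-like U-Net, not exact values.
context_dim = 768                                    # slot / CLIP embedding width
xattn_widths = [320] * 4 + [640] * 4 + [1280] * 8    # hypothetical per-layer dims

# Per layer: key (ctx -> d), value (ctx -> d), output (d -> d) projections.
trainable = sum(2 * context_dim * d + d * d for d in xattn_widths)
total_unet = 860_000_000                             # ballpark SD v1.5 U-Net size

frac = trainable / total_unet
print(f"trainable ≈ {trainable / 1e6:.1f}M of ~860M ({100 * frac:.2f}%)")
```

Under these assumptions only a few percent of the U-Net is trained, which is why fine-tuning stays lightweight even with a large frozen backbone.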
In sum, CODA’s performance gains stem from three main innovations: register slots to isolate background attention, lightweight cross-attention fine-tuning to reduce text bias, and contrastive loss as a mutual information maximization surrogate. Collectively, CODA achieves state-of-the-art object-centric segmentation and compositional generation in both synthetic and real-world settings (Nguyen et al., 3 Jan 2026).