Contrastive Object-Centric Diffusion Alignment
- The paper introduces CODA, which integrates slot attention with a frozen diffusion backbone and contrastive alignment loss to address slot entanglement and improve object-to-slot correspondence.
- Its architecture leverages DINOv2 for feature extraction and register slots to absorb ambiguous background attention, enabling efficient fine-tuning of cross-attention projections.
- Empirical results show significant improvements in object discovery and compositional image generation across synthetic and real-world benchmarks, with notable gains in FG-ARI and reconstruction metrics.
Contrastive Object-centric Diffusion Alignment (CODA) is an augmentation to object-centric learning (OCL) frameworks that integrates slot attention mechanisms with pretrained diffusion models. CODA addresses critical challenges in OCL, specifically slot entanglement and weak slot-image correspondence, by introducing register slots to capture residual attention and applying a contrastive alignment loss to promote explicit object-to-slot assignments. This joint strategy strengthens mutual information between slot representations and input images, leading to improved object discovery, property prediction, and compositional generation performance across both synthetic and real-world visual domains (Nguyen et al., 3 Jan 2026).
1. System Architecture
CODA is constructed atop a frozen Stable Diffusion v1.5 denoising backbone and a DINOv2 (ViT-B/14) vision encoder. The pipeline can be described as:
- An input image $x$ is encoded via DINOv2 into feature vectors $h$.
- Slot Attention (SA) iteratively refines randomly initialized slot queries, yielding $N$ object-centric vectors $S = \{s_1, \dots, s_N\}$.
- $R$ register slots $\bar{R}$, obtained by encoding only padding tokens with the frozen CLIP text encoder from Stable Diffusion, are prepended to the slots.
- At each U-Net cross-attention layer, the key/value set is the concatenation $C = [S; \bar{R}]$.
- Because the softmax is taken over this joint slot set, ambiguous/background attention is channeled to the register slots, insulating the semantic slots from interference.
- With all U-Net weights frozen except for the key, value, and output projections in every cross-attention layer, the fine-tuning is limited and computationally efficient.
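The conditioning path above can be sketched in a few lines; the shapes are illustrative, and the trainable key/value/output projections are replaced by identity maps to keep the sketch minimal:

```python
import numpy as np

def cross_attend(queries, context, d_k):
    """Single-head cross-attention: U-Net spatial tokens attend over slots.

    In CODA only the key/value/output projections of each such layer are
    trained; here they are identity maps for illustration.
    """
    logits = queries @ context.T / np.sqrt(d_k)     # (P, N+R) scores
    logits -= logits.max(axis=-1, keepdims=True)    # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)        # softmax over ALL slots
    return attn, attn @ context                     # weights, attended values

rng = np.random.default_rng(0)
N, R, d, P = 4, 2, 64, 16           # semantic slots, register slots, dim, tokens
S = rng.normal(size=(N, d))         # semantic slots from Slot Attention
R_bar = rng.normal(size=(R, d))     # register slots (encoded CLIP padding tokens)
C = np.concatenate([S, R_bar])      # conditioning set C = [S; R̄]
Q = rng.normal(size=(P, d))         # U-Net query tokens

attn, out = cross_attend(Q, C, d)
# Each query's unit attention mass is shared between semantic and register
# slots, so ambiguous/background mass can land on R̄ instead of polluting S.
assert np.allclose(attn.sum(axis=-1), 1.0)
```

Since values come from the same joint set, any mass the softmax assigns to register slots is mass that semantic slots never have to absorb.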
(Architecture figure: the conditioning set feeds the frozen U-Net's cross-attention layers; the conditioned denoising prediction $\hat{\epsilon}$ supplies the reconstruction loss, while negative slot sets yield the contrastive loss $\mathcal{L}_{CA}$.)
2. Training Objective and Mathematical Formulation
CODA’s objective combines diffusion reconstruction and contrastive alignment losses:
2.1 Diffusion Reconstruction Loss
Given SD latents $z$ and noisy latents $z_\gamma$ at log-SNR $\gamma$, the U-Net predicts the noise $\hat{\epsilon}$:

$\mathcal{L}_{dm} = \mathbb{E}_{\epsilon, \gamma} \left[ \| \epsilon - \hat{\epsilon}(z_\gamma, \gamma, C) \|^2 \right]$

Only the Slot Attention parameters and the cross-attention key/value/output projections are updated.
2.2 Contrastive Alignment Loss
Slot-image compatibility is quantified via the negative prediction error obtained when conditioning on a single slot together with the registers:

$f_i = -\| \epsilon - \hat{\epsilon}(z_\gamma, \gamma, \{s_i\} \cup \bar{R}) \|^2, \quad i = 1, \dots, N+R$

An InfoNCE-style contrastive loss is then taken over semantic and register slots:

$\mathcal{L}_{CA} = -\sum_{i=1}^{N} \log \frac{\exp(f_i/\tau)}{\sum_{j=1}^{N+R} \exp(f_j/\tau)}$

with $\tau$ as the temperature. Register slots serve as negatives, absorbing background attention.
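A minimal numeric sketch of this loss (the per-slot errors below are stand-ins for the single-slot denoising errors, and the function name is ours):

```python
import numpy as np

def contrastive_alignment_loss(errors, n_semantic, tau=0.1):
    """InfoNCE over slot compatibilities f_i = -error_i.

    errors: length N+R array of per-slot denoising errors; only the N
    semantic slots appear in the numerator, while register slots act as
    always-available negatives in the denominator.
    """
    f = -np.asarray(errors, dtype=float)      # compatibility scores
    logits = f / tau
    logits -= logits.max()                    # numerical stability
    log_den = np.log(np.exp(logits).sum())    # partition over all N+R slots
    return -(logits[:n_semantic] - log_den).sum()

N, R = 4, 2
well_aligned = [0.1, 0.1, 0.1, 0.1, 2.0, 2.0]    # semantic slots fit; registers don't
poorly_aligned = [2.0, 2.0, 2.0, 2.0, 0.1, 0.1]  # image explained by registers instead

# The loss is lower when the image is best explained by its own semantic slots.
assert contrastive_alignment_loss(well_aligned, N) < contrastive_alignment_loss(poorly_aligned, N)
```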
2.3 Joint Objective
$\mathcal{L} = \mathcal{L}_{dm} + \lambda_{CA} \, \mathcal{L}_{CA}$

where $\lambda_{CA}$ sets the trade-off (0.03 on COCO, 0.05 on VOC/MOVi).
2.4 Mutual Information Surrogate
Let aligned slots $S$ and mismatched slots $\tilde{S}$ define:
$\Delta = \frac{1}{2} \int_{-\infty}^{\infty} \left[ \mathbb{E}_S \| \epsilon - \hat{\epsilon}(z_\gamma, \gamma, S) \|^2 - \mathbb{E}_{\tilde{S}} \| \epsilon - \hat{\epsilon}(z_\gamma, \gamma, \tilde{S}) \|^2 \right] d\gamma$
Theorem 1 relates this to mutual information (MI):
$-I(S; X) = \Delta + \mathbb{E}\left[ D_{KL}(q(\tilde{S}|S)\|p(\tilde{S}|S)) - D_{KL}(q(\tilde{S}|S)\|p(\tilde{S})) \right]$
Choosing the negative distribution $q(\tilde{S} \mid S)$ close to the marginal $p(\tilde{S})$, as the hard-negative sampling scheme does, shrinks the second KL term (Corollary 1). Minimizing $\Delta$ then approximates maximizing $I(S; X)$ with an additional reverse-KL regularizer. Thus, the CODA objective is a practical, sample-based estimator for mutual-information maximization.
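A toy Monte-Carlo estimate of $\Delta$ under a one-dimensional linear-Gaussian model of our own construction (the clean latent equals the slot content, and an oracle denoiser inverts the noising given its conditioning): aligned conditioning yields lower denoising error than mismatched conditioning, so the estimate comes out negative, consistent with $\Delta$ lower-bounding $-I(S; X)$ for informative slots.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda g: 1.0 / (1.0 + np.exp(-g))

def eps_hat(z_g, g, slot):
    # Oracle denoiser: inverts z_γ = √σ(γ)·z + √σ(−γ)·ε assuming clean z = slot.
    return (z_g - np.sqrt(sigmoid(g)) * slot) / np.sqrt(sigmoid(-g))

gammas = np.linspace(-4, 4, 9)        # crude quadrature over log-SNR
n = 2000
delta = 0.0
for g in gammas:
    s = rng.normal(size=n)            # aligned slots = clean latents
    s_tilde = rng.permutation(s)      # mismatched slots from other samples
    eps = rng.normal(size=n)
    z_g = np.sqrt(sigmoid(g)) * s + np.sqrt(sigmoid(-g)) * eps
    aligned = np.mean((eps - eps_hat(z_g, g, s)) ** 2)         # exactly 0 here
    mismatch = np.mean((eps - eps_hat(z_g, g, s_tilde)) ** 2)  # strictly > 0
    delta += 0.5 * (aligned - mismatch)

# Informative conditioning drives Δ below zero.
assert delta < 0
```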
3. Algorithmic Details
The implementation follows a clear sequence of operations:
```
Input: image x
Hyperparams: N slots, R register slots, τ, λ_CA
Pretrained: DINOv2 encoder E_v, SD auto-encoder E_vae/D_vae,
            SD U-Net (frozen except cross-attn projections)

1.  z ← E_vae(x)                          # latent
2.  h ← E_v(x)                            # DINOv2 features
3.  S ← SlotAttention(h; N)               # semantic slots
4.  C ← concat(S, R̄)                      # conditioning set
5.  Sample ε ~ N(0, I), γ ~ Uniform over log-SNR
6.  z_γ ← √σ(γ)·z + √σ(−γ)·ε
7.  ε̂_cond ← U-Net(z_γ, γ; keys/values from C)
8.  L_dm ← ‖ε − ε̂_cond‖²
9.  # Hard negatives: replace half of S with slots from another image x′
10. S′ ← sample slots from x′
11. S̃ ← combine(S, S′)                    # shared initialization
12. C_neg ← concat(S̃, R̄)
13. ε̂_neg ← U-Net(z_γ, γ; keys/values from C_neg)
14. f_i ← −‖ε − ε̂({s_i}, R̄)‖²            # for i = 1…N+R
15. L_CA ← −Σ_{i=1}^{N} log( exp(f_i/τ) / Σ_{j=1}^{N+R} exp(f_j/τ) )
16. L ← L_dm + λ_CA·L_CA
17. Backpropagate L; update Slot Attention and cross-attn projections only
```
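Step 6 of the listing is a standard variance-preserving noising in log-SNR parameterization; a quick check, assuming $\sigma$ denotes the logistic sigmoid (as the $\sqrt{\sigma(\gamma)}$/$\sqrt{\sigma(-\gamma)}$ pairing suggests):

```python
import numpy as np

sigmoid = lambda g: 1.0 / (1.0 + np.exp(-g))

def noise_latent(z, gamma, eps):
    """Variance-preserving noising at log-SNR γ.

    σ(γ) + σ(−γ) = 1, so if z and ε have unit variance, z_γ does too.
    """
    return np.sqrt(sigmoid(gamma)) * z + np.sqrt(sigmoid(-gamma)) * eps

rng = np.random.default_rng(0)
z, eps = rng.normal(size=10_000), rng.normal(size=10_000)
z_g = noise_latent(z, gamma=0.5, eps=eps)

assert np.isclose(sigmoid(0.5) + sigmoid(-0.5), 1.0)   # signal/noise weights sum to 1
assert abs(z_g.var() - 1.0) < 0.05                     # variance preserved (sampling noise)
```

Large $\gamma$ keeps mostly signal, large negative $\gamma$ mostly noise, with total variance constant throughout.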
4. Empirical Performance and Ablation
CODA demonstrates measurable improvements over strong baselines on diverse object-centric benchmarks. Results are summarized as follows.
4.1 Unsupervised Object Discovery
| Dataset | Metric | SlotAdapt | CODA | Δ |
|---|---|---|---|---|
| VOC | FG-ARI | 29.6% | 32.23% | +2.63 |
| VOC | mBOᶦ | 51.5% | 55.38% | +3.88 |
| VOC | mIoUᶦ | — | 50.77% | +3.97 |
| VOC | mBOᶜ | 51.9% | 61.32% | +9.42 |
| VOC | mIoUᶜ | — | 56.30% | +7.00 |
| COCO | FG-ARI | 41.4% | 47.54% | +6.14 |
| COCO | mBOᶦ | 35.1% | 36.61% | +1.51 |
| COCO | mIoUᶦ | 36.1% | 36.41% | +0.31 |
Synthetic datasets (MOVi-C, MOVi-E):
- MOVi-C: FG-ARI=59.19% vs. best baseline 52.04% (+7.15%), mIoU=51.94% vs. 44.19% (+7.75%)
- MOVi-E: FG-ARI=59.04% vs. SlotAdapt 56.45% (+2.59%), mIoU=45.21% vs. 41.85% (+3.36%)
4.2 Compositional Image Generation
| Setting | LSD | SlotDiff. | SlotAdapt | CODA |
|---|---|---|---|---|
| Reconstruction FID | 35.54 | 19.45 | 10.86 | 10.65 |
| Reconstruction KID×1e3 | 19.09 | 5.85 | 0.39 | 0.35 |
| Composition FID | 167.23 | 64.21 | 40.57 | 31.03 |
| Composition KID×1e3 | 103.48 | 57.31 | 34.38 | 30.44 |
4.3 Ablation Analysis (VOC FG-ARI)
| CA | Reg | CA + Reg | CA+Reg+CO (CODA) |
|---|---|---|---|
| 15.44% | — | 19.21% | 32.23% |
| — | 19.21% | 19.62% | — |
| 11.96% | — | 15.48% | — |
| 19.62% | 47.03% | — | 32.23% |
Register slots alone (+Reg) produce an FG-ARI increase of +3.9% over frozen-U-Net baselines; addition of contrastive loss (+CO) provides a further +1.6% improvement.
5. Practical Considerations, Scalability, and Limitations
- Computational Overhead: R=77 register slots add only ~0.02% to per-step GPU time. Only Slot Attention (a few million parameters) and the cross-attention projections are updated; the rest of SD remains frozen.
- Scalability: Register slots and contrastive term generalize to larger diffusion backbones (e.g., SDXL, DiT) with no required architectural changes. Semantic slot count is user-controlled; register slots absorb residuals.
- Limitations: Slot count must be selected a priori; future work could include adaptive slot numbers. Reliance on DINOv2 and SD v1.5 may entail dataset bias and challenges for out-of-domain generalization. High-quality pixel-level reconstruction is hindered by the slot bottleneck. Extensions to larger diffusion/transformer models (SDXL, FLUX, DiTs) are suggested as promising directions.
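The overhead claim above can be made concrete with a back-of-the-envelope count; the cross-attention layer widths and the ~860M U-Net total below are ballpark assumptions of ours, not figures from the paper:

```python
# Rough count of CODA's trainable U-Net parameters: only the key, value, and
# output projections of each cross-attention layer. Widths are illustrative
# stand-ins for an SD v1.5-like U-Net, not exact values.
context_dim = 768                                    # slot / CLIP embedding width
xattn_widths = [320] * 4 + [640] * 4 + [1280] * 8    # hypothetical per-layer dims

# Per layer: key (ctx -> d), value (ctx -> d), output (d -> d) projections.
trainable = sum(2 * context_dim * d + d * d for d in xattn_widths)
total_unet = 860_000_000                             # ballpark SD v1.5 U-Net size

frac = trainable / total_unet
print(f"trainable ≈ {trainable / 1e6:.1f}M of ~860M ({100 * frac:.2f}%)")
```

Under these assumptions only a few percent of the U-Net is trained, which is why fine-tuning stays lightweight even with a large frozen backbone.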
In sum, CODA’s performance gains stem from three main innovations: register slots to isolate background attention, lightweight cross-attention fine-tuning to reduce text bias, and contrastive loss as a mutual information maximization surrogate. Collectively, CODA achieves state-of-the-art object-centric segmentation and compositional generation in both synthetic and real-world settings (Nguyen et al., 3 Jan 2026).