Coevolving Representation Diffusion (CoReDi)
- The paper introduces a unified framework where the learnable semantic projection coevolves with the diffusion process, significantly boosting convergence speed and sample quality.
- It employs a dynamic linear projection with batch normalization and regularization (VF, Ortho, Cov) to prevent feature collapse while maintaining diverse channel outputs.
- Experimental results on latent and pixel-space diffusion models show marked improvements in FID scores, achieving competitive performance with fewer training iterations.
Coevolving Representation Diffusion (CoReDi) is a unified framework for joint image–feature generative modeling in which the semantic representation space co-adapts with the diffusion model, rather than remaining fixed throughout training. CoReDi targets the limitations of previous approaches—where semantic features are projected into a static, low-dimensional space prior to joint diffusion—and demonstrates that allowing this space to evolve yields substantive gains in both convergence speed and sample quality. The methodology is applicable to both VAE latent diffusion and pixel-space diffusion models, and experiments on large-scale benchmarks such as ImageNet256 provide empirical support for its effectiveness (Kouzelis et al., 19 Apr 2026).
1. Motivation for Adaptive Representations
Standard joint image–feature diffusion models, such as ReDi (Kouzelis et al., 22 Apr 2025), project pretrained visual encoder features into a lower-dimensional semantic space using a fixed projection $P$, frequently computed by PCA. This static mapping produces compressed representations $z$ that participate in a shared diffusion process with VAE latents $x$. The forward process adds noise independently to both streams: $$x_t = \alpha_t x + \sigma_t \epsilon_x, \qquad z_t = \alpha_t z + \sigma_t \epsilon_z,$$ and the network learns to predict velocities by minimizing the squared errors relative to the noise residuals. However, when $P$ is fixed, the semantic space cannot specialize for the generative task, which constrains both training efficiency and achievable fidelity. CoReDi removes this bottleneck by making the semantic projection a learnable function $W_\phi$, allowing the representation space to evolve in tandem with the diffusion objective (Kouzelis et al., 19 Apr 2026).
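The joint noising step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `joint_forward`, the tensor shapes, and the linear schedule $\alpha_t = 1 - t$, $\sigma_t = t$ are all assumptions made for the example.

```python
import numpy as np

def joint_forward(x, z, t, rng):
    """Noise image latents x and semantic features z with independent Gaussian noise.

    Assumption: linear interpolant alpha_t = 1 - t, sigma_t = t.
    """
    alpha_t, sigma_t = 1.0 - t, t
    eps_x = rng.standard_normal(x.shape)  # noise for the image stream
    eps_z = rng.standard_normal(z.shape)  # separate noise for the feature stream
    x_t = alpha_t * x + sigma_t * eps_x
    z_t = alpha_t * z + sigma_t * eps_z
    # Velocity targets are the time derivative of the interpolant: eps - clean.
    v_x = eps_x - x
    v_z = eps_z - z
    return x_t, z_t, v_x, v_z

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8, 8))  # hypothetical VAE latents (B, C, H, W)
z = rng.standard_normal((2, 8, 8, 8))  # hypothetical projected features
x_t, z_t, v_x, v_z = joint_forward(x, z, t=0.3, rng=rng)
```

Under this schedule the clean sample can be recovered as `x_t - t * v_x`, which is a convenient sanity check for the targets.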
2. Mathematical Formulation
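Let $x$ denote either the VAE latent or the pixel-space representation associated with an image, and $f$ the set of high-dimensional features from a frozen encoder (e.g., DINOv2). CoReDi introduces a learnable linear projection $W_\phi$, replacing the fixed PCA step: $$\tilde{z} = W_\phi f,$$ followed by channelwise batch normalization without affine parameters: $$z = \mathrm{BN}(\tilde{z}).$$ The forward diffusion dynamics for the joint state $(x, z)$ are: $$x_t = \alpha_t x + \sigma_t \epsilon_x, \qquad z_t = \alpha_t z + \sigma_t \epsilon_z.$$ Velocity heads $v_\theta^{x}$ and $v_\theta^{z}$ predict the respective denoising terms. Crucially, CoReDi prevents degenerate evolution (e.g., feature collapse) by constructing the representation loss target using a stop-gradient $\mathrm{sg}[\cdot]$: $$\mathcal{L}_z = \mathbb{E}\left[\left\| v_\theta^{z}(x_t, z_t, t) - \left(\dot{\alpha}_t\,\mathrm{sg}[z] + \dot{\sigma}_t\,\epsilon_z\right) \right\|^2\right].$$ The loss for the image stream is: $$\mathcal{L}_x = \mathbb{E}\left[\left\| v_\theta^{x}(x_t, z_t, t) - \left(\dot{\alpha}_t\,x + \dot{\sigma}_t\,\epsilon_x\right) \right\|^2\right].$$ To forestall channel collapse or redundancy, an explicit regularization term $\mathcal{R}(\phi)$ is included, with three investigated forms: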
| Name | Effect |
|---|---|
| Feature Variance (VF) | Ensures non-trivial channelwise variance per spatial vector |
| Orthogonality (Ortho) | Promotes orthogonality in the projection matrix for decorrelated channels |
| Covariance (Cov) | Minimizes off-diagonal channel covariance across spatial locations |
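The final loss function is: $$\mathcal{L} = \mathcal{L}_x + \lambda_z\,\mathcal{L}_z + \lambda_{\mathrm{reg}}\,\mathcal{R}(\phi).$$ Gradients from the image loss $\mathcal{L}_x$ and the representation loss $\mathcal{L}_z$ update the diffusion model parameters $\theta$; $\mathcal{L}_z$ (excluding the stop-gradient term) and the regularizer $\mathcal{R}$ update the projection parameters $\phi$.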
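As a rough illustration of the projection, normalization, and regularization described in this section, the sketch below uses NumPy on a flat batch of tokens. All function names, shapes, and the weights `lambda_z` and `lambda_reg` are hypothetical. The penalty formulas are common textbook forms of an orthogonality and an off-diagonal covariance penalty, assumed here for illustration and not taken from the paper; the VF term is omitted because its exact definition is not given in this text.

```python
import numpy as np

def project_and_normalize(feats, W, eps=1e-5):
    """Linear projection followed by affine-free batch norm.

    feats: (N, D) encoder tokens, W: (d, D) projection. Returns (N, d).
    """
    z = feats @ W.T
    mean = z.mean(axis=0, keepdims=True)
    var = z.var(axis=0, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

def ortho_penalty(W):
    """Assumed form: squared Frobenius distance of W W^T from the identity."""
    gram = W @ W.T
    return np.sum((gram - np.eye(W.shape[0])) ** 2)

def cov_penalty(z):
    """Assumed form: mean squared off-diagonal entry mass of the channel covariance."""
    n, d = z.shape
    zc = z - z.mean(axis=0, keepdims=True)
    cov = zc.T @ zc / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return np.sum(off_diag ** 2) / d

def total_loss(loss_x, loss_z, reg, lambda_z, lambda_reg):
    """Weighted sum of image loss, representation loss, and regularizer."""
    return loss_x + lambda_z * loss_z + lambda_reg * reg

rng = np.random.default_rng(0)
feats = rng.standard_normal((512, 32))             # hypothetical frozen-encoder tokens
Q, _ = np.linalg.qr(rng.standard_normal((32, 8)))  # orthonormal columns
W = Q.T                                            # (8, 32) projection with orthonormal rows
z = project_and_normalize(feats, W)
reg = ortho_penalty(W) + cov_penalty(z)
```

With an orthonormal `W` the orthogonality penalty is numerically zero, so `reg` here is dominated by the small sample covariance between channels.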
3. Coevolution Algorithm and Gradient Flow
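CoReDi employs a synchronized optimization protocol for simultaneous evolution of the diffusion model (parameters $\theta$) and the semantic projection (parameters $\phi$). At a high level, each training step extracts features with the frozen encoder, projects them with the learnable linear map, applies affine-free batch normalization, noises the image latents and the projected features independently, predicts both velocities, forms the representation target under a stop-gradient, adds the regularization term, and updates $\theta$ and $\phi$ together.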
Batch normalization statistics are maintained via exponential moving averages. Hyperparameters for ImageNet256 are specified separately for latent space and pixel space, with equal learning rates for the diffusion model parameters $\theta$ and the projection parameters $\phi$.
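The moving-average bookkeeping for the normalization statistics can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the class name `RunningBatchNorm`, the momentum value, and the choice to use batch statistics in training and running statistics at evaluation follow the usual batch-norm convention and are not confirmed details of CoReDi.

```python
import numpy as np

class RunningBatchNorm:
    """Affine-free batch norm that tracks running statistics with an EMA."""

    def __init__(self, num_channels, momentum=0.1, eps=1e-5):
        self.momentum = momentum  # assumed value, not taken from the paper
        self.eps = eps
        self.running_mean = np.zeros(num_channels)
        self.running_var = np.ones(num_channels)

    def __call__(self, z, training=True):
        if training:
            # Normalize with the current batch statistics ...
            mean, var = z.mean(axis=0), z.var(axis=0)
            # ... and fold them into the exponential moving averages.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # At evaluation time, rely on the accumulated statistics only.
            mean, var = self.running_mean, self.running_var
        return (z - mean) / np.sqrt(var + self.eps)

bn = RunningBatchNorm(num_channels=3, momentum=0.5)
batch = np.array([[0.0, 2.0, 4.0], [2.0, 4.0, 6.0]])  # (N, channels)
train_out = bn(batch, training=True)
eval_out = bn(batch, training=False)
```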
4. Stability Mechanisms and Theoretical Intuition
Unconstrained learning of the projection $W_\phi$ risks degenerate solutions, such as all-zero or constant channel outputs. CoReDi integrates several stabilizers:
- Stop-gradient in representation loss: Ensures the projection cannot simply track its own evolving output, breaking degeneracy.
- Batch normalization (zero mean, unit variance): Suppresses scale collapse and restricts trivial fixed-point attractors.
- Explicit regularization (VF, Ortho, Cov): Promotes channelwise diversity and orthogonality.
Empirical ablation demonstrates that omitting any of these stabilizers leads to catastrophic feature collapse or divergent objectives (e.g., FID diverges without batch normalization and degrades without the stop-gradient or without explicit regularization). This structure is essential for stable coevolution of semantic space and diffusion model (Kouzelis et al., 19 Apr 2026).
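Two of these mechanisms can be illustrated with a small numerical experiment. The sketch below is a toy check, not a reproduction of the paper's ablation; the shapes and the helper `batch_norm_no_affine` are hypothetical, and `eps` is set to zero so the scale invariance is exact up to rounding. It shows that affine-free batch normalization removes the effect of shrinking the projection, but does not stop channels from becoming copies of one another, which is the failure mode the explicit regularizers target.

```python
import numpy as np

def batch_norm_no_affine(z, eps=0.0):
    """Per-channel standardization over the batch axis, with no learned scale or shift."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
feats = rng.standard_normal((256, 16))  # hypothetical encoder tokens (N, D)
W = rng.standard_normal((4, 16))        # hypothetical projection (d, D)

# 1) Scale collapse: shrinking W by 1000x leaves the normalized output unchanged.
z_full = batch_norm_no_affine(feats @ W.T)
z_shrunk = batch_norm_no_affine(feats @ (1e-3 * W).T)
scale_invariant = np.allclose(z_full, z_shrunk)

# 2) Redundancy: identical rows in W pass batch norm with unit variance,
#    yet every channel is perfectly correlated with every other channel.
W_dup = np.tile(W[:1], (4, 1))
z_dup = batch_norm_no_affine(feats @ W_dup.T)
channel_corr = np.corrcoef(z_dup.T)
```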
5. Experimental Evaluation
CoReDi is validated on both latent-space and pixel-space diffusion on ImageNet256, using frozen visual encoders such as DINOv2, MOCOv3, SigLIPv2, and MAE.
Latent-Space Diffusion
A comparison of CoReDi against the SiT, REPA, and ReDi baselines at two model scales:
| Model | Parameters | Iterations | FID↓ |
|---|---|---|---|
| SiT-B/2 | 130M | 400K | 33.0 |
| ReDi-B/2 | 130M | 400K | 21.4 |
| CoReDi-B/2 | 130M | 200K | 24.7 |
| CoReDi-B/2 | 130M | 400K | 16.4 |
| SiT-XL/2 | 675M | 7M | 8.3 |
| REPA-XL/2 | 675M | 4M | 5.9 |
| ReDi-XL/2 | 675M | 4M | 3.3 |
| CoReDi-XL/2 | 675M | 2M | 3.3 |
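At XL scale, CoReDi-XL/2 matches the FID of ReDi-XL/2 (3.3) in half the iterations (2M versus 4M). At B scale, CoReDi-B/2 improves on ReDi-B/2 from 21.4 to 16.4 at 400K iterations; its 200K checkpoint (24.7) already surpasses SiT-B/2 at 400K (33.0) but not ReDi-B/2. With classifier-free guidance, CoReDi-XL/2 is also reported to compare favorably with REPA and ReDi (Kouzelis et al., 19 Apr 2026).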
Pixel-Space Diffusion
For DeCo-L/16 (a pixel-space diffusion model) on ImageNet256:
| Model | Iter. | Params | FID↓ |
|---|---|---|---|
| DeCo-L/16 | 100K | 426M | 46.0* |
| DeCo-L/16 | 200K | 426M | 31.3 |
| CoReDi-L/16 | 100K | 426M | 31.5 |
| CoReDi-L/16 | 200K | 426M | 21.5 |
CoReDi reaches a near-equivalent FID in half the iterations of the DeCo baseline (31.5 at 100K versus 31.3 at 200K) and lowers FID from 31.3 to 21.5 at 200K. Regularization and loss-weight ablations show that the results are robust to hyperparameter selection, with only minor FID variation across the tested settings and a separately tuned optimal loss weight in pixel space.
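The size of these gains can be read directly off the two tables. The short calculation below only restates the tabulated FID values and iteration counts; the helper `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline, improved):
    """Fractional FID reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline

# Latent space, B scale: ReDi-B/2 vs. CoReDi-B/2, both at 400K iterations.
latent_b = relative_reduction(21.4, 16.4)
# Pixel space: DeCo-L/16 vs. CoReDi-L/16, both at 200K iterations.
pixel_l = relative_reduction(31.3, 21.5)
# Latent space, XL scale: same FID (3.3) at 4M vs. 2M iterations.
xl_iteration_ratio = 4_000_000 / 2_000_000

print(f"B/2 latent: {latent_b:.1%}, L/16 pixel: {pixel_l:.1%}, "
      f"XL/2 iteration ratio: {xl_iteration_ratio:.0f}x")
```

In relative terms, the matched-iteration FID reduction is roughly a quarter in the latent B/2 setting and roughly a third in the pixel-space L/16 setting.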
Representation Encoder Variation
Consistent FID improvements over fixed ReDi projections are observed for all tested frozen visual encoders, confirming the robustness of adaptive representation learning.
Evolution of Representation Structure
Analysis using Local vs. Distant Similarity (LDS), Correlation Decay Slope (CDS), and RMS Spatial Contrast (RMSC) metrics shows that learned representations increasingly develop spatial structure and self-similarity over the course of training, surpassing static PCA projections. This dynamic structuring correlates with improved sample quality and convergence.
6. Dynamics of Learned Semantic Space
The adaptive mapping $W_\phi$ specializes the semantic projection channels to complement low-level VAE latents for image synthesis. Early-stage channels are noisy and unstructured; as the objective jointly shapes the diffusion model parameters $\theta$ and the projection parameters $\phi$, each channel organizes into spatially coherent, semantically meaningful patterns. This coadaptation addresses a major limitation of fixed-projection approaches, making the generative process more responsive to semantic alignment and ultimately achieving faster, higher-quality synthesis.
7. Limitations and Future Research
Observed limitations of CoReDi include:
- Sensitivity to the loss weights, with separate tuning required for the pixel and latent regimes;
- Restriction to linear projections $W_\phi$; incorporating shallow nonlinear mappings or deeper adapters is a potential direction;
- Rigid freezing of the visual encoder; slow momentum-based joint fine-tuning may further enhance coadaptation.
A plausible implication is that principles underlying CoReDi—allowing representation spaces to coevolve with generative models—could generalize to other domains such as multimodal and self-supervised diffusion settings, potentially unlocking further advances in representation-aware generation (Kouzelis et al., 19 Apr 2026, Kouzelis et al., 22 Apr 2025).