
Coevolving Representation Diffusion (CoReDi)

Updated 24 April 2026
  • The paper introduces a unified framework where the learnable semantic projection coevolves with the diffusion process, significantly boosting convergence speed and sample quality.
  • It employs a dynamic linear projection with batch normalization and regularization (VF, Ortho, Cov) to prevent feature collapse while maintaining diverse channel outputs.
  • Experimental results on latent and pixel-space diffusion models show marked improvements in FID scores, achieving competitive performance with fewer training iterations.

Coevolving Representation Diffusion (CoReDi) is a unified framework for joint image–feature generative modeling in which the semantic representation space co-adapts with the diffusion model, rather than remaining fixed throughout training. CoReDi targets the limitations of previous approaches—where semantic features are projected into a static, low-dimensional space prior to joint diffusion—and demonstrates that allowing this space to evolve yields substantive gains in both convergence speed and sample quality. The methodology is applicable to both VAE latent diffusion and pixel-space diffusion models, and experiments on large-scale benchmarks such as ImageNet256 provide empirical support for its effectiveness (Kouzelis et al., 19 Apr 2026).

1. Motivation for Adaptive Representations

Standard joint image–feature diffusion models, such as ReDi (Kouzelis et al., 22 Apr 2025), project pretrained visual encoder features $z_0 \in \mathbb{R}^{L \times D}$ into a lower-dimensional semantic space using a fixed projection $P \in \mathbb{R}^{D \times d}$, frequently computed by PCA. This static mapping produces compressed representations $\tilde z_0 = z_0 P$ that participate in a shared diffusion process with VAE latents $x_0$. The forward process adds noise independently to both streams:

$$x_t = (1-t)x_0 + t\epsilon_x, \quad \tilde z_t = (1-t)\tilde z_0 + t\epsilon_z,$$

and the network learns to predict velocities $v^x_\theta, v^z_\theta$ by minimizing the squared errors relative to the noise residuals. However, when $P$ is fixed, the semantic space lacks the capacity to specialize for the generative task, constraining both training efficiency and achievable fidelity. CoReDi removes this bottleneck by making the semantic projection a learnable function $g_\phi$, allowing the representation space to evolve in tandem with the diffusion objective (Kouzelis et al., 19 Apr 2026).
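The forward process above can be illustrated with a minimal, dependency-free sketch. It assumes that the "noise residual" regressed by each velocity head is the time derivative of the linear interpolant, i.e. noise minus clean signal; the function names and toy values are hypothetical and not taken from the paper.

```python
import random

def interpolate(clean, noise, t):
    """Linear interpolant (1 - t) * clean + t * noise, applied elementwise."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]

def velocity_target(clean, noise):
    """Time derivative of the interpolant: noise - clean (assumed regression target)."""
    return [n - c for c, n in zip(clean, noise)]

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]       # toy stand-in for a flattened image latent
z0_tilde = [1.5, 0.0]       # toy stand-in for projected semantic features
eps_x = [rng.gauss(0.0, 1.0) for _ in x0]        # independent noise per stream
eps_z = [rng.gauss(0.0, 1.0) for _ in z0_tilde]
t = 0.3                     # both streams share the same timestep

x_t = interpolate(x0, eps_x, t)
z_t = interpolate(z0_tilde, eps_z, t)
v_x = velocity_target(x0, eps_x)
v_z = velocity_target(z0_tilde, eps_z)
```

At t = 0 the interpolant returns the clean signal and at t = 1 it returns pure noise, so the velocity target is constant along each sample's path.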

2. Mathematical Formulation

Let $x_0$ denote either the VAE latent or the pixel-space representation of an image, and $z_0 = \mathrm{VE}(x_0) \in \mathbb{R}^{L \times D}$ the set of high-dimensional features from a frozen encoder (e.g., DINOv2). CoReDi introduces a learnable linear projection $g_\phi$ with weights $W_\phi \in \mathbb{R}^{D \times d}$, replacing the fixed PCA step:

$$\hat z_0 = g_\phi(z_0) = z_0 W_\phi,$$

followed by channelwise batch normalization without affine parameters:

$$\tilde z_0 = \mathrm{BN}(\hat z_0).$$

The forward diffusion dynamics for the joint state are:

$$x_t = (1-t)x_0 + t\epsilon_x, \quad \tilde z_t = (1-t)\tilde z_0 + t\epsilon_z.$$

Velocity heads $v^x_\theta$ and $v^z_\theta$ predict the respective denoising terms. Crucially, CoReDi prevents degenerate evolution (e.g., feature collapse) by constructing the representation loss target using a stop-gradient $\mathrm{sg}(\cdot)$:

$$\mathcal{L}_z = \mathbb{E}\,\big\| v^z_\theta(x_t, \tilde z_t, t) - (\epsilon_z - \mathrm{sg}(\tilde z_0)) \big\|^2.$$

The loss for the image stream is:

$$\mathcal{L}_x = \mathbb{E}\,\big\| v^x_\theta(x_t, \tilde z_t, t) - (\epsilon_x - x_0) \big\|^2.$$

To forestall channel collapse or redundancy, an explicit regularization term $\mathcal{L}_{\mathrm{reg}}$ is included, with three investigated forms:

| Name | Regularization term | Effect |
|---|---|---|
| Feature Variance (VF) | […] | Ensures non-trivial channelwise variance per spatial vector |
| Orthogonality (Ortho) | […] | Promotes orthogonality in the projection weights for decorrelated channels |
| Covariance (Cov) | […] | Minimizes off-diagonal channel covariance across spatial locations |
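To make the effects in the table concrete, the sketch below shows one common way to write an orthogonality penalty and an off-diagonal covariance penalty. These forms are assumptions chosen for illustration; the exact expressions used by CoReDi are not given here, and the function names are hypothetical.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def ortho_penalty(w):
    """Assumed form: squared Frobenius distance between W^T W and the identity."""
    gram = matmul(transpose(w), w)          # (d x d) Gram matrix of the columns
    d = len(gram)
    return sum((gram[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(d) for j in range(d))

def cov_penalty(z):
    """Assumed form: sum of squared off-diagonal entries of the channel
    covariance of z, where z has shape (N x d) with N spatial vectors."""
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in z]
    cov = [[v / n for v in r] for r in matmul(transpose(centered), centered)]
    return sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j)

orthonormal_w = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # columns are orthonormal
redundant_z = [[1.0, 1.0], [-1.0, -1.0]]               # two identical channels
```

Under these assumed forms, `ortho_penalty(orthonormal_w)` is zero, while `cov_penalty(redundant_z)` is positive because the two channels carry the same signal.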

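The final loss is a weighted sum of the image-stream loss $\mathcal{L}_x$, the representation loss $\mathcal{L}_z$, and the regularizer $\mathcal{L}_{\mathrm{reg}}$:

$$\mathcal{L} = \mathcal{L}_x + \lambda_z \mathcal{L}_z + \lambda_{\mathrm{reg}} \mathcal{L}_{\mathrm{reg}}.$$

Gradients from $\mathcal{L}_x$ and $\mathcal{L}_z$ update the diffusion parameters $\theta$; $\mathcal{L}_z$ (excluding the stop-gradient term) and $\mathcal{L}_{\mathrm{reg}}$ update the projection parameters $\phi$.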
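The projection and affine-free batch normalization described in this section can be sketched as follows. This is a minimal illustration on plain Python lists, assuming the normalization statistics are taken over all spatial vectors in the batch; the weights and feature values are hypothetical toy numbers.

```python
import math

def project(features, weights):
    """Linear projection: (N x D) features times (D x d) weights -> (N x d)."""
    cols = list(zip(*weights))
    return [[sum(f * w for f, w in zip(row, col)) for col in cols]
            for row in features]

def batch_norm_no_affine(z, eps=1e-5):
    """Normalize each channel (column) to zero mean and unit variance.
    No learnable scale or shift is applied afterwards."""
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    variances = [sum((row[j] - means[j]) ** 2 for row in z) / n for j in range(d)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(d)]
            for row in z]

# Toy data: N = 4 spatial vectors with D = 3 features, projected to d = 2 channels.
z0 = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
w = [[0.5, -1.0], [1.0, 0.0], [0.0, 2.0]]     # hypothetical projection weights
z_tilde = batch_norm_no_affine(project(z0, w))

# Rescaling the weights leaves the normalized output (almost) unchanged.
w_scaled = [[10.0 * v for v in row] for row in w]
z_tilde_scaled = batch_norm_no_affine(project(z0, w_scaled))
```

Because the normalization removes the per-channel scale, shrinking or inflating the projection weights cannot by itself change the magnitude of the features seen by the diffusion model, which is one intuition for why this step discourages scale collapse.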

3. Coevolution Algorithm and Gradient Flow

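CoReDi employs a synchronized optimization protocol in which the diffusion model (parameters $\theta$) and the semantic projection (parameters $\phi$) evolve simultaneously. At a high level, each training step proceeds as follows:

  1. Sample a batch of images and compute $x_0$ and the frozen-encoder features $z_0$.
  2. Project and normalize the features: $\tilde z_0 = \mathrm{BN}(g_\phi(z_0))$.
  3. Sample a timestep $t$ and noise $\epsilon_x, \epsilon_z$, and form the noisy states $x_t$ and $\tilde z_t$.
  4. Predict the velocities $v^x_\theta$ and $v^z_\theta$, and evaluate the image and representation losses, applying the stop-gradient to the representation target.
  5. Add the regularization term and update $\theta$ and $\phi$ together.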
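The gradient flow of the representation loss can be illustrated on a deliberately tiny scalar problem. The sketch below is not the paper's algorithm: it assumes a one-parameter velocity head `v = theta * z_t`, a scalar projection `z_tilde = phi * z0`, and omits batch normalization, the image stream, and the regularizer. It only shows how a stop-gradient on the target changes which path the projection's gradient takes.

```python
def loss_and_grads(theta, phi, z0, eps, t):
    """Representation loss and hand-derived gradients for a scalar toy model."""
    z_tilde = phi * z0                        # learnable projection (toy, scalar)
    z_t = (1.0 - t) * z_tilde + t * eps       # noisy representation
    target = eps - z_tilde                    # stop-gradient: held constant w.r.t. phi
    residual = theta * z_t - target           # toy velocity head minus target
    loss = residual ** 2
    grad_theta = 2.0 * residual * z_t
    # phi receives gradient only through the network input z_t, not the target.
    grad_phi = 2.0 * residual * theta * (1.0 - t) * z0
    return loss, grad_theta, grad_phi

def coevolution_step(theta, phi, z0, eps, t, lr=0.01):
    """One simultaneous SGD update of the model (theta) and projection (phi)."""
    loss, g_theta, g_phi = loss_and_grads(theta, phi, z0, eps, t)
    return theta - lr * g_theta, phi - lr * g_phi, loss

theta, phi = 0.5, 2.0
theta, phi, loss = coevolution_step(theta, phi, z0=1.5, eps=-0.3, t=0.4)
```

Without the stop-gradient, `grad_phi` would gain an extra term from the target, giving the projection a direct route to shrink the loss by moving its own regression target rather than by producing features that are easier to denoise.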

Batch normalization statistics are maintained via exponential moving averages. Typical hyperparameters for ImageNet256 include […] (latent space), […] (pixel space), and […], with equal learning rates for $\theta$ and $\phi$ ([…]).
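An exponential moving average of a batch statistic can be written in a couple of lines. The momentum value below is a hypothetical placeholder, not a setting reported for CoReDi.

```python
def ema_update(running, batch_stat, momentum=0.1):
    """Move the running statistic a fraction `momentum` toward the batch value."""
    return (1.0 - momentum) * running + momentum * batch_stat

running_mean = 0.0
for batch_mean in [1.0, 1.0, 1.0]:      # three batches with the same mean
    running_mean = ema_update(running_mean, batch_mean)
```

With a constant batch statistic, the running value approaches it geometrically, so a small momentum gives smoother but slower-tracking statistics.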

4. Stability Mechanisms and Theoretical Intuition

Unconstrained learning of $g_\phi$ risks degenerate solutions, such as all-zero or constant channel outputs. CoReDi integrates several stabilizers:

  1. Stop-gradient in representation loss: Ensures $g_\phi$ cannot simply track its own evolving output, breaking degeneracy.
  2. Batch normalization (zero mean, unit variance): Suppresses scale collapse and restricts trivial fixed-point attractors.
  3. Explicit regularization (VF, Ortho, Cov): Promotes channelwise diversity and orthogonality.

Empirical ablation demonstrates that omitting any stabilization leads to catastrophic feature collapse or divergent objectives (e.g., FID diverges without batch norm, is […] without stop-gradient, and […] without explicit regularization). This structure is essential for stable coevolution of semantic space and diffusion model (Kouzelis et al., 19 Apr 2026).

5. Experimental Evaluation

CoReDi is validated on both latent-space and pixel-space diffusion on ImageNet256, using frozen visual encoders such as DINOv2, MOCOv3, SigLIPv2, and MAE.

Latent-Space Diffusion

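A comparison of CoReDi with the SiT, REPA, and ReDi baselines at two model scales:

| Model | Parameters | Iterations | FID↓ |
|---|---|---|---|
| SiT-B/2 | 130M | 400K | 33.0 |
| ReDi-B/2 | 130M | 400K | 21.4 |
| CoReDi-B/2 | 130M | 200K | 24.7 |
| CoReDi-B/2 | 130M | 400K | 16.4 |
| SiT-XL/2 | 675M | 7M | 8.3 |
| REPA-XL/2 | 675M | 4M | 5.9 |
| ReDi-XL/2 | 675M | 4M | 3.3 |
| CoReDi-XL/2 | 675M | 2M | 3.3 |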

At the XL/2 scale, CoReDi matches ReDi's FID of 3.3 in half the training iterations (2M versus 4M). At the B/2 scale with an equal budget of 400K iterations, it improves FID from 21.4 (ReDi) to 16.4, and after only 200K iterations it already outperforms SiT-B/2 trained for 400K (24.7 versus 33.0). With classifier-free guidance, CoReDi-XL/2 achieves FID […] in […] epochs (versus […] for REPA, […] for ReDi) (Kouzelis et al., 19 Apr 2026).
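The headline comparisons can be checked with simple arithmetic on the table values. The helper names below are hypothetical; the numbers are taken directly from the latent-space table.

```python
def relative_reduction(baseline_fid, new_fid):
    """Fractional FID reduction relative to the baseline (lower FID is better)."""
    return (baseline_fid - new_fid) / baseline_fid

def iteration_speedup(baseline_iters, new_iters):
    """How many times fewer iterations the new model needs for the same FID."""
    return baseline_iters / new_iters

# B/2 scale, both models trained for 400K iterations: ReDi 21.4 vs CoReDi 16.4.
b2_gain = relative_reduction(21.4, 16.4)
# XL/2 scale, both models at FID 3.3: ReDi needs 4M iterations, CoReDi 2M.
xl_speedup = iteration_speedup(4_000_000, 2_000_000)

print(f"B/2 FID reduction at 400K iterations: {b2_gain:.1%}")   # 23.4%
print(f"XL/2 iteration speedup at FID 3.3: {xl_speedup:.1f}x")  # 2.0x
```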

Pixel-Space Diffusion

For DeCo-L/16 (a pixel-space diffusion model) on ImageNet256:

| Model | Iterations | Parameters | FID↓ |
|---|---|---|---|
| DeCo-L/16 | 100K | 426M | 46.0* |
| DeCo-L/16 | 200K | 426M | 31.3 |
| CoReDi-L/16 | 100K | 426M | 31.5 |
| CoReDi-L/16 | 200K | 426M | 21.5 |

CoReDi reaches a nearly equivalent FID in half the iterations of the DeCo baseline (31.5 at 100K versus 31.3 at 200K), and at 200K iterations it lowers FID from 31.3 to 21.5. Regularization and loss weight ablations show the results are robust to hyperparameter selection, yielding minor FID variation across […] and optimal […] in pixel space.

Representation Encoder Variation

Performance improvements of […] to […] FID points over fixed ReDi projections are observed for all tested frozen visual encoders, confirming the robustness of adaptive representation learning.

Evolution of Representation Structure

Analysis using Local vs. Distant Similarity (LDS), Correlation Decay Slope (CDS), and RMS Spatial Contrast (RMSC) metrics shows that learned representations increasingly develop spatial structure and self-similarity over the course of training, surpassing static PCA projections. This dynamic structuring correlates with improved sample quality and convergence.

6. Dynamics of Learned Semantic Space

The adaptive mapping $g_\phi$ specializes the semantic projection channels to complement low-level VAE latents for image synthesis. Early-stage channels are noisy and unstructured; as the objective jointly sculpts $\theta$ and $\phi$, each channel organizes into spatially coherent, semantically meaningful patterns. This coadaptation addresses a major limitation of fixed-projection approaches, making the generative process more responsive to semantic alignment and ultimately achieving faster, higher-quality synthesis.

7. Limitations and Future Research

Observed limitations of CoReDi include:

  • Sensitivity to loss weights […] and […], with parameter tuning required for pixel and latent regimes;
  • Restriction to linear projections $g_\phi$; incorporating shallow nonlinear mappings or deeper adapters is a potential direction;
  • Rigid freezing of the visual encoder; slow momentum-based joint fine-tuning may further enhance coadaptation.

A plausible implication is that principles underlying CoReDi—allowing representation spaces to coevolve with generative models—could generalize to other domains such as multimodal and self-supervised diffusion settings, potentially unlocking further advances in representation-aware generation (Kouzelis et al., 19 Apr 2026, Kouzelis et al., 22 Apr 2025).
