Coevolving Representation Diffusion (CoReDi)

Updated 24 April 2026

The paper introduces a unified framework where the learnable semantic projection coevolves with the diffusion process, significantly boosting convergence speed and sample quality.
It employs a dynamic linear projection with batch normalization and regularization (VF, Ortho, Cov) to prevent feature collapse while maintaining diverse channel outputs.
Experimental results on latent and pixel-space diffusion models show marked improvements in FID scores, achieving competitive performance with fewer training iterations.

Coevolving Representation Diffusion (CoReDi) is a unified framework for joint image–feature generative modeling in which the semantic representation space co-adapts with the diffusion model, rather than remaining fixed throughout training. CoReDi targets the limitations of previous approaches—where semantic features are projected into a static, low-dimensional space prior to joint diffusion—and demonstrates that allowing this space to evolve yields substantive gains in both convergence speed and sample quality. The methodology is applicable to both VAE latent diffusion and pixel-space diffusion models, and experiments on large-scale benchmarks such as ImageNet256 provide empirical support for its effectiveness (Kouzelis et al., 19 Apr 2026).

1. Motivation for Adaptive Representations

Standard joint image–feature diffusion models, such as ReDi (Kouzelis et al., 22 Apr 2025), operate by projecting pretrained visual encoder features $z_0 \in \mathbb{R}^{L \times D}$ into a lower-dimensional semantic space using a fixed projection $P \in \mathbb{R}^{D \times d}$, frequently computed by PCA. This static mapping produces compressed representations $\tilde z_0 = z_0 P$ that participate in a shared diffusion process with VAE latents $x_0$. The forward process adds noise independently to both streams: $x_t = (1-t)x_0 + t\epsilon_x, \quad \tilde z_t = (1-t)\tilde z_0 + t\epsilon_z,$ and the network learns to predict velocities $v^x_\theta, v^z_\theta$ by minimizing the squared errors relative to the velocity targets $\epsilon_x - x_0$ and $\epsilon_z - \tilde z_0$. However, when $P$ is fixed, the semantic space cannot specialize for the generative task, which constrains both training efficiency and achievable fidelity. CoReDi removes this bottleneck by making the semantic projection a learnable function $g_\phi$, allowing the representation space to evolve in tandem with the diffusion objective (Kouzelis et al., 19 Apr 2026).
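The fixed-projection baseline described above can be illustrated with a minimal NumPy sketch. The array shapes, the helper names `pca_projection` and `joint_forward`, and the choice to center features before the SVD are assumptions made for illustration; they are not taken from the ReDi implementation.

```python
import numpy as np

def pca_projection(z, d):
    """Return a fixed D x d projection built from the top-d principal directions.

    z has shape (N, L, D): N images, L patch tokens, D feature channels.
    """
    flat = z.reshape(-1, z.shape[-1])
    centered = flat - flat.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:d].T

def joint_forward(x0, z0_tilde, t, rng):
    """Noise both streams with independent Gaussian noise at a shared time t."""
    eps_x = rng.standard_normal(x0.shape)
    eps_z = rng.standard_normal(z0_tilde.shape)
    x_t = (1.0 - t) * x0 + t * eps_x
    z_t = (1.0 - t) * z0_tilde + t * eps_z
    # Velocity targets are the time derivatives of the linear interpolants.
    return x_t, z_t, eps_x - x0, eps_z - z0_tilde

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 16, 32))  # toy encoder features, D = 32
x0 = rng.standard_normal((4, 16, 8))   # toy image latents
P = pca_projection(z0, d=8)            # fixed: never updated during training
z0_tilde = z0 @ P                      # compressed features, shape (4, 16, 8)
x_t, z_t, vx_target, vz_target = joint_forward(x0, z0_tilde, t=0.3, rng=rng)
```

Because `P` is computed once and then frozen, nothing in the diffusion loss can reshape the semantic space, which is the limitation CoReDi addresses.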

2. Mathematical Formulation

Let $x_0$ denote either the VAE or pixel latent associated with an image, and $z_0 = \mathrm{VE}(x_0) \in \mathbb{R}^{L \times D}$ the set of high-dimensional features from a frozen encoder (e.g., DINOv2). CoReDi introduces a learnable linear projection $g_\phi$, replacing the fixed PCA step: $\hat z_0 = g_\phi(z_0) \in \mathbb{R}^{L \times d},$ followed by channelwise batch normalization without affine parameters: $\tilde z_0 = \mathrm{BN}(\hat z_0).$ The forward diffusion dynamics for the joint state are: $x_t = (1-t)x_0 + t\epsilon_x, \quad \tilde z_t = (1-t)\tilde z_0 + t\epsilon_z.$ Velocity heads $v^x_\theta$ and $v^z_\theta$ predict the respective denoising terms. Crucially, CoReDi prevents degenerate evolution (e.g., feature collapse) by constructing the representation loss target using a stop-gradient $\mathrm{sg}[\cdot]$: $\mathcal{L}_z = \| v^z_\theta(x_t, \tilde z_t, t) - (\epsilon_z - \mathrm{sg}[\tilde z_0]) \|^2.$ The loss for the image stream is: $\mathcal{L}_x = \| v^x_\theta(x_t, \tilde z_t, t) - (\epsilon_x - x_0) \|^2.$ To forestall channel collapse or redundancy, an explicit regularization term $\mathcal{L}_{\mathrm{reg}}$ is included, with three investigated forms:

Name	Regularization Term	Effect
Feature Variance (VF)	$\mathcal{L}_{\mathrm{VF}}$	Ensures non-trivial channelwise variance per spatial vector
Orthogonality (Ortho)	$\mathcal{L}_{\mathrm{Ortho}}$	Promotes orthogonality in the projection $g_\phi$ for decorrelated channels
Covariance (Cov)	$\mathcal{L}_{\mathrm{Cov}}$	Minimizes off-diagonal channel covariance across spatial locations
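To make the three effects in the table concrete, the following NumPy sketch gives one plausible form for each regularizer. These exact formulas are assumptions chosen to match the described effects (a hinge on per-token standard deviation, a Frobenius penalty on $W^\top W - I$, and a squared off-diagonal covariance penalty); they are illustrative stand-ins and should not be read as the definitions used in the paper.

```python
import numpy as np

def variance_reg(z, target_std=1.0, eps=1e-4):
    """Hinge penalty on the across-channel std of each spatial vector (assumed form)."""
    std = np.sqrt(z.var(axis=-1) + eps)  # one value per token, shape (N, L)
    return np.maximum(0.0, target_std - std).mean()

def ortho_reg(W):
    """Squared Frobenius distance between W^T W and the identity (assumed form)."""
    gram = W.T @ W
    return np.sum((gram - np.eye(W.shape[1])) ** 2)

def cov_reg(z):
    """Mean squared off-diagonal channel covariance over all tokens (assumed form)."""
    flat = z.reshape(-1, z.shape[-1])
    centered = flat - flat.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (flat.shape[0] - 1)
    off_diag = cov - np.diag(np.diag(cov))
    d = cov.shape[0]
    return np.sum(off_diag ** 2) / (d * (d - 1))

rng = np.random.default_rng(0)
W_orth, _ = np.linalg.qr(rng.standard_normal((32, 8)))  # orthonormal columns
collapsed = np.ones((4, 16, 8))                         # every channel identical
independent = rng.standard_normal((4, 16, 8))
shared = rng.standard_normal((4, 16, 1))
redundant = np.repeat(shared, 8, axis=-1)               # 8 copies of one channel

vf_collapsed = variance_reg(collapsed)       # large: no channelwise variance
ortho_orthonormal = ortho_reg(W_orth)        # near zero
cov_independent = cov_reg(independent)       # small
cov_redundant = cov_reg(redundant)           # large: channels fully correlated
```

In this toy setting each penalty is near zero for well-spread features and grows for the failure mode it targets, which is the behaviour the table attributes to VF, Ortho, and Cov.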

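The final loss function is: $\mathcal{L} = \mathcal{L}_x + \lambda_z \mathcal{L}_z + \lambda_{\mathrm{reg}} \mathcal{L}_{\mathrm{reg}},$ where $\mathcal{L}_x$ and $\mathcal{L}_z$ are the image-stream and representation-stream velocity losses, $\mathcal{L}_{\mathrm{reg}}$ is the regularizer, and $\lambda_z$, $\lambda_{\mathrm{reg}}$ are loss weights. Gradients from $\mathcal{L}_x$ and $\mathcal{L}_z$ affect the diffusion parameters $\theta$; $\mathcal{L}_z$ (excluding the stop-gradient term) and $\mathcal{L}_{\mathrm{reg}}$ modulate the projection parameters $\phi$.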
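The formulation in this section can be traced end to end in a forward-only NumPy sketch. The function names, toy shapes, the `zero_predictor` stand-in for the velocity network, and the weight `lam_z` are all hypothetical. NumPy has no automatic differentiation, so the stop-gradient is indicated only by a comment; in an autodiff framework the target would be wrapped in that framework's stop-gradient or detach operation.

```python
import numpy as np

def batch_norm_no_affine(z, eps=1e-5):
    """Normalize each channel to zero mean and unit variance over batch and tokens."""
    mean = z.mean(axis=(0, 1), keepdims=True)
    var = z.var(axis=(0, 1), keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

def coredi_losses(x0, z0, W, predict, t, rng):
    """Compute the image and representation velocity losses for one batch."""
    z0_tilde = batch_norm_no_affine(z0 @ W)   # learnable projection, then BN
    eps_x = rng.standard_normal(x0.shape)
    eps_z = rng.standard_normal(z0_tilde.shape)
    x_t = (1.0 - t) * x0 + t * eps_x
    z_t = (1.0 - t) * z0_tilde + t * eps_z
    vx, vz = predict(x_t, z_t, t)
    # With autodiff, z0_tilde inside this target would be treated as a constant
    # (stop-gradient), so W receives no gradient through the target.
    z_target = eps_z - z0_tilde
    loss_x = np.mean((vx - (eps_x - x0)) ** 2)
    loss_z = np.mean((vz - z_target) ** 2)
    return loss_x, loss_z, z0_tilde

def zero_predictor(x_t, z_t, t):
    """Hypothetical stand-in for the velocity network: predicts zero velocity."""
    return np.zeros_like(x_t), np.zeros_like(z_t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 16, 8))
z0 = rng.standard_normal((4, 16, 32))
W = rng.standard_normal((32, 8)) / np.sqrt(32)  # toy projection weights
loss_x, loss_z, z0_tilde = coredi_losses(x0, z0, W, zero_predictor, t=0.5, rng=rng)
lam_z = 1.0  # hypothetical loss weight, not a value from the source
total = loss_x + lam_z * loss_z
```

A regularization term would be added to `total` with its own weight; it is omitted here to keep the data flow easy to follow.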

3. Coevolution Algorithm and Gradient Flow

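CoReDi employs a synchronized optimization protocol for simultaneous evolution of the diffusion model (parameters $\theta$) and the semantic projection ($g_\phi$). At a high level, each training step proceeds as follows:

Encode the image with the frozen visual encoder to obtain $z_0$, then apply $g_\phi$ and batch normalization to obtain $\tilde z_0$.
Sample a time $t$ and independent noise for each stream, and form the noisy pair $(x_t, \tilde z_t)$.
Predict both velocities, then compute the image loss, the representation loss against the stop-gradient target, and the regularizer.
Update $\theta$ and $\phi$ jointly from the combined loss.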

Batch normalization statistics are maintained via exponential moving averages. Typical hyperparameters for ImageNet256 are specified separately for the latent-space and pixel-space settings, with equal learning rates for $\theta$ and $\phi$.
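The moving-average bookkeeping for batch normalization can be sketched as follows. The class name `RunningBatchNorm` and the momentum value are assumptions for illustration; the source does not state which momentum is used.

```python
import numpy as np

class RunningBatchNorm:
    """Channelwise batch norm without affine parameters, tracking EMA statistics."""

    def __init__(self, num_channels, momentum=0.1, eps=1e-5):
        self.momentum = momentum  # assumed value, not taken from the source
        self.eps = eps
        self.running_mean = np.zeros(num_channels)
        self.running_var = np.ones(num_channels)

    def __call__(self, z, training=True):
        if training:
            # Batch statistics over the batch and token axes.
            mean = z.mean(axis=(0, 1))
            var = z.var(axis=(0, 1))
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # At evaluation time, use the accumulated moving averages.
            mean, var = self.running_mean, self.running_var
        return (z - mean) / np.sqrt(var + self.eps)

bn = RunningBatchNorm(num_channels=4)
rng = np.random.default_rng(0)
for _ in range(200):
    batch = 3.0 + 2.0 * rng.standard_normal((8, 16, 4))  # mean 3, variance 4
    bn(batch, training=True)
eval_out = bn(3.0 + 2.0 * rng.standard_normal((8, 16, 4)), training=False)
```

After enough training batches the running statistics approach the true channel mean and variance, so normalization in evaluation mode no longer depends on the composition of a single batch.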

4. Stability Mechanisms and Theoretical Intuition

Unconstrained learning of $g_\phi$ risks degenerate solutions, such as all-zero or constant channel outputs. CoReDi integrates several stabilizers:

Stop-gradient in representation loss: Ensures $g_\phi$ cannot simply track its own evolving output, breaking degeneracy.
Batch normalization (zero mean, unit variance): Suppresses scale collapse and restricts trivial fixed-point attractors.
Explicit regularization (VF, Ortho, Cov): Promotes channelwise diversity and orthogonality.

Empirical ablation shows that omitting any one stabilizer leads to feature collapse or a divergent objective: FID diverges without batch normalization and worsens without the stop-gradient or without explicit regularization. All three mechanisms are therefore needed for stable coevolution of the semantic space and the diffusion model (Kouzelis et al., 19 Apr 2026).
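The role of the stop-gradient can be seen in a deliberately contrived toy, sketched below under strong assumptions: there is no batch normalization, the projection is a plain matrix `W`, and the "predictor" ignores its input and always outputs the noise. Gradients are estimated by central finite differences, and "stop-gradient" is emulated by computing the target from a frozen copy of `W`. None of this is the paper's setup; it only illustrates why a target that depends on the projection invites collapse.

```python
import numpy as np

rng = np.random.default_rng(0)
z0 = rng.standard_normal((64, 8))     # toy encoder features, D = 8
eps_z = rng.standard_normal((64, 2))  # noise for the projected stream, d = 2

def rep_loss(W, W_frozen=None):
    """Toy representation loss; pass W_frozen to emulate a stop-gradient target."""
    target_W = W if W_frozen is None else W_frozen
    target = eps_z - z0 @ target_W
    prediction = eps_z  # toy predictor: always outputs the noise
    return np.mean((prediction - target) ** 2)

def numerical_grad(f, W, h=1e-5):
    """Central finite-difference gradient of the scalar function f at W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = h
        grad[idx] = (f(W + step) - f(W - step)) / (2 * h)
    return grad

W = rng.standard_normal((8, 2))
# Without stop-gradient: the loss equals mean((z0 @ W)^2), so the gradient
# pushes W toward zero, i.e. toward all-zero projected features.
grad_no_sg = numerical_grad(lambda V: rep_loss(V), W)
# With stop-gradient: the target is computed from a frozen copy of W, so the
# projection receives no gradient through the target.
grad_sg = numerical_grad(lambda V: rep_loss(V, W_frozen=W), W)
```

In a real model the projection would still receive gradient through the noisy input $\tilde z_t$; the toy removes that path to isolate the contribution of the target.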

5. Experimental Evaluation

CoReDi is validated on both latent-space and pixel-space diffusion on ImageNet256, using frozen visual encoders such as DINOv2, MOCOv3, SigLIPv2, and MAE.

Latent-Space Diffusion

A comparison of CoReDi with the SiT, REPA, and ReDi baselines at two model scales:

Model	Parameters	Iterations	FID↓
SiT-B/2	130M	400K	33.0
ReDi-B/2	130M	400K	21.4
CoReDi-B/2	130M	200K	24.7
CoReDi-B/2	130M	400K	16.4
SiT-XL/2	675M	7M	8.3
REPA-XL/2	675M	4M	5.9
ReDi-XL/2	675M	4M	3.3
CoReDi-XL/2	675M	2M	3.3

CoReDi matches or surpasses the baselines' FID with fewer training iterations. At XL/2 scale it matches ReDi's FID of 3.3 in half the iterations (2M versus 4M). At B/2 scale it reaches 16.4 at 400K iterations versus 21.4 for ReDi, and its 200K checkpoint (24.7) already improves on SiT-B/2 at 400K (33.0). Results with classifier-free guidance are also reported for CoReDi-XL/2 in comparison with REPA and ReDi (Kouzelis et al., 19 Apr 2026).

Pixel-Space Diffusion

For DeCo-L/16 (a pixel-space diffusion model) on ImageNet256:

Model	Iter.	Params	FID↓
DeCo-L/16	100K	426M	46.0*
DeCo-L/16	200K	426M	31.3
CoReDi-L/16	100K	426M	31.5
CoReDi-L/16	200K	426M	21.5

CoReDi reaches FID 31.5 at 100K iterations, close to the DeCo baseline's 31.3 at 200K, and improves to 21.5 at 200K. Regularization and loss-weight ablations indicate that the results are robust to hyperparameter selection, with only minor FID variation across the tested settings and a separately identified optimum for pixel space.
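The size of these gains can be checked with a few lines of arithmetic. The FID values below are copied from the latent-space and pixel-space tables; the helper `relative_reduction` is introduced here only for illustration.

```python
def relative_reduction(baseline_fid, new_fid):
    """Fractional FID reduction relative to the baseline (lower FID is better)."""
    return (baseline_fid - new_fid) / baseline_fid

# Same iteration budget, fixed versus coevolving representation.
latent_b2 = relative_reduction(21.4, 16.4)  # ReDi-B/2 vs CoReDi-B/2 at 400K
pixel_l16 = relative_reduction(31.3, 21.5)  # DeCo-L/16 vs CoReDi-L/16 at 200K

# Iterations needed to reach FID 3.3 at XL/2 scale: ReDi 4M, CoReDi 2M.
xl_speedup = 4_000_000 / 2_000_000

print(f"Latent B/2 FID reduction: {latent_b2:.1%}")   # 23.4%
print(f"Pixel L/16 FID reduction: {pixel_l16:.1%}")   # 31.3%
print(f"XL/2 iteration speedup:   {xl_speedup:.1f}x") # 2.0x
```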

Representation Encoder Variation

FID improvements over fixed ReDi projections are observed for all tested frozen visual encoders, confirming the robustness of adaptive representation learning.

Evolution of Representation Structure

Analysis using Local vs. Distant Similarity (LDS), Correlation Decay Slope (CDS), and RMS Spatial Contrast (RMSC) metrics shows that learned representations increasingly develop spatial structure and self-similarity over the course of training, surpassing static PCA projections. This dynamic structuring correlates with improved sample quality and convergence.

6. Dynamics of Learned Semantic Space

The adaptive mapping $g_\phi$ specializes the semantic projection channels to complement low-level VAE latents for image synthesis. Early-stage channels are noisy and unstructured; as the objective jointly sculpts $\theta$ and $\phi$, each channel organizes into spatially coherent, semantically meaningful patterns. This coadaptation addresses a major limitation of fixed-projection approaches, making the generative process more responsive to semantic alignment and ultimately achieving faster, higher-quality synthesis.

7. Limitations and Future Research

Observed limitations of CoReDi include:

Sensitivity to the representation-loss and regularization weights, with parameter tuning required for pixel and latent regimes;
Restriction to linear projections $g_\phi$; incorporating shallow nonlinear mappings or deeper adapters is a potential direction;
Rigid freezing of the visual encoder; slow momentum-based joint fine-tuning may further enhance coadaptation.
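As a rough illustration of what a slow momentum-based update could look like, the sketch below applies a standard exponential-moving-average rule to a dictionary of parameters. This is one common interpretation of "momentum-based" fine-tuning, not a method described in the source; the function name, parameter layout, and momentum value are all hypothetical.

```python
import numpy as np

def momentum_update(slow_params, fast_params, momentum=0.999):
    """Move each slow parameter a small step toward its fast counterpart."""
    return {
        name: momentum * slow_params[name] + (1.0 - momentum) * fast_params[name]
        for name in slow_params
    }

slow = {"w": np.zeros(3)}  # e.g. the slowly changing encoder weights
fast = {"w": np.ones(3)}   # e.g. a copy trained directly by gradient descent
for _ in range(1000):
    slow = momentum_update(slow, fast)
```

With a momentum close to 1 the slow parameters drift only gradually, which is the property that would keep the representation target from shifting abruptly under the diffusion model.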

A plausible implication is that principles underlying CoReDi—allowing representation spaces to coevolve with generative models—could generalize to other domains such as multimodal and self-supervised diffusion settings, potentially unlocking further advances in representation-aware generation (Kouzelis et al., 19 Apr 2026, Kouzelis et al., 22 Apr 2025).


References

Kouzelis et al. (19 Apr 2026). Coevolving Representations in Joint Image-Feature Diffusion.

Kouzelis et al. (22 Apr 2025). Boosting Generative Image Modeling via Joint Image-Feature Synthesis.
