Boosting Generative Image Modeling via Joint Image-Feature Synthesis
Published 22 Apr 2025 in cs.CV | (2504.16064v1)
Abstract: Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a novel generative image modeling framework that seamlessly bridges this gap by leveraging a diffusion model to jointly model low-level image latents (from a variational autoencoder) and high-level semantic features (from a pretrained self-supervised encoder like DINO). Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise, significantly enhancing both generative quality and training efficiency, all while requiring only minimal modifications to standard Diffusion Transformer architectures. By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance, which leverages learned semantics to steer and refine image generation. Evaluated in both conditional and unconditional settings, our method delivers substantial improvements in image quality and training convergence speed, establishing a new direction for representation-aware generative modeling.
The paper introduces ReDi, a novel framework that jointly diffuses low-level image latents and high-level semantic features to improve generative performance.
It leverages a unified denoising model with token fusion and PCA-based dimensionality reduction to efficiently integrate VAE latents and DINOv2 features.
Representation Guidance further refines sample quality and accelerates training, achieving state-of-the-art FID scores on ImageNet.
This paper introduces ReDi (Representation-Diffusion), a novel framework for generative image modeling that enhances Latent Diffusion Models (LDMs) by integrating high-level semantic features directly into the diffusion process (2504.16064). The core idea is to move beyond simply aligning diffusion-model features with external representations (as in REPA [Yu2025repa]) and instead jointly model both low-level image latents (from a VAE encoder, x₀) and high-level semantic features (from a pretrained encoder like DINOv2, z₀) within the same diffusion framework.
Methodology:
Joint Diffusion Process: The standard diffusion process is extended to handle both VAE latents (x₀) and semantic features (z₀). During the forward process, Gaussian noise is added independently to x₀ and z₀ using the same noise schedule ᾱ_t.
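This forward step can be sketched in PyTorch as follows (function name and tensor shapes are illustrative, not from the paper):

```python
import torch

def jointly_noise(x0, z0, alpha_bar_t):
    """Forward diffusion applied jointly: VAE latents x0 and semantic
    features z0 share the same schedule alpha_bar(t) but receive
    independent Gaussian noise draws."""
    eps_x = torch.randn_like(x0)   # noise for image latents
    eps_z = torch.randn_like(z0)   # independent noise for features
    x_t = alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * eps_x
    z_t = alpha_bar_t.sqrt() * z0 + (1 - alpha_bar_t).sqrt() * eps_z
    return x_t, z_t, eps_x, eps_z
```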
Unified Denoising Model: A single diffusion model (implemented with DiT or SiT architectures) takes the noisy latents x_t and noisy features z_t as input and predicts the noise added to each: ε_θ^x(x_t, z_t, t) and ε_θ^z(x_t, z_t, t).
Joint Training Objective: The model is trained with a combined loss that minimizes the noise-prediction error for both modalities:

L = E[ ‖ε_x − ε_θ^x(x_t, z_t, t)‖² + λ_z ‖ε_z − ε_θ^z(x_t, z_t, t)‖² ],

where λ_z balances the contribution of the feature denoising loss.
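The combined objective can be sketched as follows (function name and the λ_z value are illustrative, not the paper's setting):

```python
import torch
import torch.nn.functional as F

def joint_loss(eps_x_pred, eps_z_pred, eps_x, eps_z, lambda_z=0.5):
    """Joint denoising objective: MSE on the image-latent noise plus a
    lambda_z-weighted MSE on the semantic-feature noise."""
    loss_x = F.mse_loss(eps_x_pred, eps_x)  # image-latent branch
    loss_z = F.mse_loss(eps_z_pred, eps_z)  # feature branch
    return loss_x + lambda_z * loss_z
```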
Token Fusion Strategies: Two methods are explored for combining the VAE and semantic feature tokens within the transformer:
Merged Tokens (MR): Tokens are embedded separately and then summed channel-wise before entering the transformer. This is computationally efficient because it does not increase the sequence length, and it is the default method.
Separate Tokens (SP): Tokens are embedded separately and concatenated along the sequence dimension. This preserves modality-specific information longer but increases computational cost (doubles sequence length).
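The two strategies can be sketched as follows (module name, model dimension, and the use of linear token embeddings are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TokenFusion(nn.Module):
    """Sketch of the two fusion strategies. 'MR' sums the embedded
    token streams channel-wise (sequence length stays L); 'SP'
    concatenates them along the sequence axis (length becomes 2L)."""
    def __init__(self, c_x=16, c_z=32, d_model=256, mode="MR"):
        super().__init__()
        self.embed_x = nn.Linear(c_x, d_model)  # VAE-latent tokens
        self.embed_z = nn.Linear(c_z, d_model)  # semantic-feature tokens
        self.mode = mode

    def forward(self, x_tokens, z_tokens):      # both (B, L, C)
        ex, ez = self.embed_x(x_tokens), self.embed_z(z_tokens)
        if self.mode == "MR":
            return ex + ez                       # (B, L, D)
        return torch.cat([ex, ez], dim=1)        # (B, 2L, D)
```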
Dimensionality Reduction: Because raw DINOv2 features (C_z = 768) are much higher-dimensional than VAE latents (C_x = 16), PCA is applied to the DINOv2 features to reduce their channel dimension (C_z′ = 8 found optimal) before they enter the diffusion process.
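A minimal sketch of this reduction via SVD-based PCA (function name is illustrative; in practice the PCA basis would be fit once on a large feature sample and reused):

```python
import numpy as np

def pca_reduce(feats, k=8):
    """Project per-token features (N, 768) onto the top-k principal
    components; k=8 matches the paper's reported optimum."""
    centered = feats - feats.mean(axis=0, keepdims=True)
    # Right singular vectors of the centered data are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T  # (N, k) reduced features
```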
Representation Guidance (RG): A novel inference-time technique leverages the learned semantic predictions (ε_θ^z) to guide the image generation process (ε_θ^x). It modifies the predicted noise for the image latents based on the consistency between the predicted image and predicted features, in a classifier-free-guidance-style combination:

ε̃_θ^x(x_t, z_t, t) = ε_θ^x(x_t, z_t, t) + w_r (ε_θ^x(x_t, z_t, t) − ε_θ^x(x_t, t)),

where w_r is the guidance scale. This requires training the model with occasional feature dropout (probability p_drop) so that it also learns the feature-unconditional prediction ε_θ^x(x_t, t).
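Assuming a classifier-free-guidance-style combination, the guidance step reduces to a one-liner (function name and default w_r are illustrative):

```python
def representation_guided_eps(eps_cond, eps_uncond, w_r=1.0):
    """Representation Guidance sketch: push the image-noise prediction
    along the direction implied by the learned features. eps_uncond is
    the feature-dropout (unconditional) prediction; w_r=0 recovers the
    unguided prediction."""
    return eps_cond + w_r * (eps_cond - eps_uncond)
```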
Implementation Details:
Uses standard DiT and SiT architectures (B/2, L/2, XL/2) with minimal modifications.
VAE: SD-VAE-FT-EMA (8× downsampling, 4 channels → 32×32×4 latents for 256×256 images), patchified to L = 256 tokens with C_x = 16 channels (4 channels × 2×2 patch).
Semantic Features: DINOv2-B with registers (768-D), reduced to 8-D via PCA; resized and patchified to L = 256 tokens with C_z = 32 channels (8 PCA dims × 2×2 patch).
Training: ImageNet 256x256, batch size 256.
Sampling: DDPM (DiT) or SDE Euler–Maruyama (SiT), 250 NFE.
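For the SiT variant, one Euler–Maruyama SDE step can be sketched generically (function name and arguments are illustrative; the actual drift and diffusion terms come from the trained model and the SDE schedule):

```python
import torch

def euler_maruyama_step(x, drift, diffusion, dt):
    """One Euler-Maruyama update for SDE sampling:
    x_{t+dt} = x_t + f(x, t) * dt + g(t) * sqrt(|dt|) * noise."""
    noise = torch.randn_like(x)
    return x + drift * dt + diffusion * abs(dt) ** 0.5 * noise
```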
Key Results and Contributions:
Improved Generative Performance: ReDi significantly improves FID scores compared to baseline DiT/SiT models and the REPA method across various model sizes and training durations. For instance, SiT-XL/2 + ReDi achieves FID 5.1 at 1M iterations, while REPA needs 4M iterations for FID 5.9.
Accelerated Convergence: ReDi drastically speeds up training. DiT-XL/2 + ReDi reaches convergence ~23x faster than vanilla DiT-XL/2 and ~6x faster than SiT-XL/2 + REPA to reach comparable FID scores.
State-of-the-Art Results: Achieves competitive FID scores on ImageNet 256x256, reaching 1.64 FID with SiT-XL/2 after 600 epochs using CFG.
Representation Guidance (RG): Demonstrates that RG further improves FID in both conditional (e.g., DiT-XL/2 FID drops from 8.7 to 5.9) and unconditional settings (e.g., DiT-XL/2 FID drops from 25.1 to 22.6), making unconditional generation quality closer to conditional.
Effectiveness of Joint Modeling: Shows that explicitly modeling the joint distribution of low-level latents and high-level features is more effective than aligning features via distillation (REPA).
Simplicity: The method requires minimal architectural changes and avoids complex auxiliary loss terms like distillation objectives.
Ablation Studies:
Confirmed the benefit of PCA, finding 8 components optimal for DINOv2.
Showed Merged Tokens provides a good trade-off between performance and computational efficiency compared to Separate Tokens.
Found that applying standard Classifier-Free Guidance (CFG) only to the VAE latents yields better results than applying it to both VAE and DINOv2 representations.
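This ablation choice can be sketched as follows (function name and guidance scale are illustrative): CFG is applied to the image-latent prediction only, and the feature prediction passes through unguided.

```python
def cfg_latents_only(eps_x_cond, eps_x_uncond, eps_z_cond, w=1.5):
    """Apply classifier-free guidance to the VAE-latent branch only,
    leaving the DINOv2-feature prediction unchanged."""
    eps_x = eps_x_uncond + w * (eps_x_cond - eps_x_uncond)
    return eps_x, eps_z_cond
```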
In summary, ReDi presents a simple yet effective method to boost diffusion model performance and efficiency by jointly modeling image latents and pre-computed semantic features. The introduction of Representation Guidance provides an additional mechanism to improve sample quality, especially for unconditional generation.