- The paper introduces Dispersive Loss, a novel regularization technique that disperses internal representations to enhance diffusion-based generative models.
- The paper derives Dispersive Loss variants from InfoNCE, Hinge, and Covariance losses, which consistently improve FID scores on benchmarks such as ImageNet and CIFAR-10.
- The paper demonstrates that integrating Dispersive Loss is computationally efficient and scalable, improving performance across various model sizes and training regimes.
The paper "Diffuse and Disperse: Image Generation with Representation Regularization" (2506.09027) introduces Dispersive Loss, a novel plug-and-play regularization technique designed to enhance diffusion-based generative models. The core idea is to encourage the internal representations within these models to spread out, or disperse, in the hidden feature space. This approach draws an analogy to the repulsive effect seen in contrastive self-supervised learning but crucially, it operates without requiring positive sample pairs. This design choice simplifies implementation and avoids interference with the standard regression-based training objectives of diffusion models.
The authors position Dispersive Loss as a minimalist and self-contained solution, contrasting it with methods like Representation Alignment (REPA) (2410.06940), which rely on pre-trained models, additional parameters, and external data. Dispersive Loss, in contrast, requires no pre-training, introduces no new learnable parameters, and uses no external data, making it a lightweight addition to existing diffusion model training pipelines.
The total loss function is a weighted sum of the standard diffusion loss ($\mathcal{L}_{\text{Diff}}$) and the Dispersive Loss ($\mathcal{L}_{\text{Disp}}$):
$\mathcal{L}(X) = \mathbb{E}_{x_i \in X}\bigl[\mathcal{L}_{\text{Diff}}(x_i)\bigr] + \lambda\, \mathcal{L}_{\text{Disp}}(X)$
where X is a batch of noisy images, and λ is a weighting hyperparameter. The Dispersive Loss is applied directly to intermediate representations extracted from the diffusion model.
Several variants of Dispersive Loss are proposed, derived by adapting existing contrastive loss functions and removing their positive-pair alignment terms:
- InfoNCE-based Dispersive Loss: This variant is derived from the InfoNCE loss. If $z_i$ and $z_j$ are intermediate representations for two samples in a batch, and $D$ is a dissimilarity function (e.g., negative cosine similarity or squared ℓ2 distance), this loss is formulated as:
$\mathcal{L}_{\text{Disp}} = \log \, \mathbb{E}_{i,j}\bigl[\exp\bigl(-D(z_i, z_j)/\tau\bigr)\bigr]$
where τ is a temperature hyperparameter. The paper notes that using the squared ℓ2 distance for D performs particularly well. A simple PyTorch-style implementation of this variant with the squared ℓ2 distance:
```python
import torch

def disp_loss(Z, tau):
    # Z: flattened intermediate representations (N x D); tau: temperature
    D = torch.pdist(Z, p=2) ** 2                  # pairwise squared L2 distances
    return torch.log(torch.exp(-D / tau).mean())  # log of the mean of exp(-D/tau)
```
- Hinge Loss-based Dispersive Loss: Derived from classical contrastive learning, this variant only considers the repulsion of negative pairs (all pairs in the batch, since there are no explicit positive pairs):
$\mathcal{L}_{\text{Disp}} = \mathbb{E}_{i,j}\bigl[\max\bigl(0,\, \epsilon - D(z_i, z_j)\bigr)^2\bigr]$
where ϵ is a margin.
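For reference, a minimal sketch of this variant in the same PyTorch style as the InfoNCE code above, assuming the squared ℓ2 distance for D (the margin `eps` is a hyperparameter):
```python
import torch

def disp_loss_hinge(Z, eps):
    # Z: flattened intermediate representations (N x D); eps: margin
    D = torch.pdist(Z, p=2) ** 2                      # pairwise squared L2 distances
    return torch.clamp(eps - D, min=0).pow(2).mean()  # squared hinge: repel pairs closer than the margin
```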
- Covariance Loss-based Dispersive Loss: Inspired by methods like Barlow Twins [zbontar2021barlow], this variant encourages the off-diagonal elements of the covariance matrix of the batch's representations to be zero:
$\mathcal{L}_{\text{Disp}} = \sum_{m,n} \operatorname{Cov}_{mn}^2$
assuming representations are normalized, making diagonal elements of the covariance matrix implicitly one.
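A minimal sketch of this variant, assuming Barlow Twins-style standardization of each feature dimension across the batch (so the diagonal of the covariance matrix is approximately one and only the off-diagonal terms contribute gradients); the exact normalization used in the paper may differ:
```python
import torch

def disp_loss_cov(Z):
    # Z: flattened intermediate representations (N x D)
    Z = (Z - Z.mean(dim=0)) / (Z.std(dim=0) + 1e-6)        # standardize each feature dimension
    cov = (Z.T @ Z) / Z.shape[0]                           # D x D covariance matrix (diagonal ~ 1)
    return cov.pow(2).sum() - cov.diagonal().pow(2).sum()  # sum of squared off-diagonal entries
```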
The overall training process with Dispersive Loss is illustrated by the following pseudocode:
```python
def loss(pred, Z, tgt, lamb, tau):
    L_diff = ((pred - tgt) ** 2).mean()  # standard diffusion loss (e.g., MSE regression)
    L_disp = disp_loss(Z, tau)           # dispersive loss (e.g., the InfoNCE-based variant above)
    return L_diff + lamb * L_disp
```
Experimental Evaluation and Key Findings:
The method was extensively evaluated on the ImageNet 256x256 dataset using DiT [dit] and SiT [sit] models.
- Superiority over Contrastive Loss: Dispersive Loss consistently improved FID scores over baselines. In contrast, directly applying standard contrastive losses (requiring two views and positive pairs) was found to be sensitive to noise augmentation strategies and sometimes degraded performance.
- Robustness of Variants: All proposed Dispersive Loss variants (InfoNCE ℓ2, InfoNCE cosine, Hinge, Covariance) showed improvements, with InfoNCE using ℓ2 distance performing best (e.g., improving SiT-B/2 FID by 11.35% from 36.49 to 32.35 without CFG).
- Block Choice: Applying Dispersive Loss to various intermediate blocks (or to all blocks) yielded benefits. The best result came from applying it to all blocks, but a single block (e.g., at one quarter of the network depth) was nearly as good and is used as the default. The ℓ2 norm of representations increased throughout the network, even in blocks where the loss was not directly applied.
- Hyperparameter Sensitivity: The method showed robustness across a range of λ (loss weight, e.g., 0.25 to 1.0) and τ (temperature, e.g., 0.25 to 2.0) values, consistently outperforming baselines.
- Model Scalability: Dispersive Loss provided consistent FID improvements across different model sizes (S, B, L, XL) for both DiT and SiT. Notably, relative and absolute improvements were often larger for stronger baselines and larger models, suggesting effective regularization.
- Long Training Schedules: For SiT-XL/2, Dispersive Loss continued to provide significant gains (e.g., FID from 2.06 to 1.97 with SDE sampling and CFG after extensive training) even when baselines were very strong.
- Comparison with REPA: While REPA (2410.06940) achieves a slightly better FID (1.80 for SiT-XL/2), it relies on a large pre-trained model (DINOv2) and external data. Dispersive Loss achieves a competitive FID (1.97) without these dependencies.
- One-Step Generation: Applied to MeanFlow [geng2025mean] models, Dispersive Loss improved performance (e.g., MeanFlow-XL/2 FID from 3.43 to 3.21), achieving state-of-the-art results for one-step diffusion-based generation on ImageNet 256x256.
- Other Metrics/Datasets: Improvements were also observed in Inception Scores on ImageNet and FID on CIFAR-10 using U-Net architectures.
Implementation Considerations:
- Computational Overhead: The additional computational cost is negligible as Dispersive Loss operates on already computed intermediate representations from the same input batch.
- Ease of Integration: It's a "plug-and-play" module requiring minimal code changes: extracting intermediate features and adding the scalar loss term (a minimal hook-based sketch follows this list).
- No Architectural Changes: No new layers (like projection heads common in contrastive learning) or parameters are added to the model.
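As an illustration of the integration cost, the sketch below captures one intermediate block's output with a forward hook during the ordinary forward pass and adds the dispersive term to the training loss. The module path `model.blocks[k]` and the simplified forward call are assumptions for a DiT/SiT-style transformer, not the paper's reference code; `disp_loss` is the function defined earlier.
```python
def training_step(model, x_noisy, target, k, lamb, tau):
    # Capture the intermediate representation of block k via a forward hook.
    feats = {}
    def hook(module, inputs, output):
        feats["Z"] = output.flatten(1)  # flatten tokens/channels into (N x D)
    handle = model.blocks[k].register_forward_hook(hook)  # assumes a `blocks` ModuleList

    pred = model(x_noisy)  # single forward pass (timestep/conditioning args omitted in this sketch)
    handle.remove()

    L_diff = ((pred - target) ** 2).mean()  # standard diffusion regression loss
    L_disp = disp_loss(feats["Z"], tau)     # dispersive loss on the captured features
    return L_diff + lamb * L_disp
```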
In conclusion, Dispersive Loss offers a simple yet effective way to regularize internal representations in diffusion models, leading to improved generation quality across various model architectures, scales, and training regimes. Its self-contained nature, requiring no external data or pre-training, makes it a practical and broadly applicable technique for enhancing generative models.