Ambient Diffusion Omni: Training Good Models with Bad Data (2506.10038v1)

Published 10 Jun 2025 in cs.GR, cs.AI, and cs.LG

Abstract: We show how to use low-quality, synthetic, and out-of-distribution images to improve the quality of a diffusion model. Typically, diffusion models are trained on curated datasets that emerge from highly filtered data pools from the Web and other sources. We show that there is immense value in the lower-quality images that are often discarded. We present Ambient Diffusion Omni, a simple, principled framework to train diffusion models that can extract signal from all available images during training. Our framework exploits two properties of natural images -- spectral power law decay and locality. We first validate our framework by successfully training diffusion models with images synthetically corrupted by Gaussian blur, JPEG compression, and motion blur. We then use our framework to achieve state-of-the-art ImageNet FID, and we show significant improvements in both image quality and diversity for text-to-image generative modeling. The core insight is that noise dampens the initial skew between the desired high-quality distribution and the mixed distribution we actually observe. We provide rigorous theoretical justification for our approach by analyzing the trade-off between learning from biased data versus limited unbiased data across diffusion times.


Summary

  • The paper introduces Ambient Diffusion Omni, a method that uses both low-quality and OOD images by annotating noise levels to boost diffusion training.
  • It employs noise level and crop-based classifiers to optimally integrate data in high-noise and low-noise regimes, addressing corruptions and local patch consistency.
  • Empirical results show improved FID and diversity on benchmarks like CIFAR-10, FFHQ, and ImageNet by effectively balancing bias and variance.

This paper introduces "Ambient Diffusion Omni" (Ambient-o), a framework for training diffusion models that can effectively utilize low-quality, synthetic, and out-of-distribution (OOD) images, which are typically discarded during dataset curation (2506.10038). The core idea is that these "bad" data still contain valuable signals that can improve model performance, particularly when handled appropriately based on the diffusion process's noise levels.

The method exploits two key properties of natural images: their spectral power law decay and locality. It proposes different strategies for high-noise and low-noise diffusion regimes:

1. Learning in the High-Noise Regime (Leveraging Low-Quality Data)

  • Insight: Adding Gaussian noise contracts distributional distances. As diffusion time $t$ (and thus noise) increases, the difference between the clean image distribution $p_t$ and the corrupted image distribution $\tilde{p}_t$ diminishes. Low-quality samples therefore become useful for training denoisers at high noise levels ($t > t_n^{\min}$).
  • Implementation:
    • Noise Level Annotation: To determine the minimum noise level $t_n^{\min}$ at which corrupted data can be used, a time-conditional classifier $c^{\mathrm{noise}}_{\theta}(x_t, t)$ is trained. This classifier learns to distinguish between noised clean samples (from a small set $S_G$) and noised corrupted samples (from a set $S_B$).
    • The classifier is trained using the objective:

      $$J_{\mathrm{noise}}(\theta) = \sum_{x_0 \in S_G}\mathbb{E}_{x_t \mid x_0}\left[ -\log c^{\mathrm{noise}}_{\theta}(x_t, t)\right] + \sum_{y_0 \in S_B}\mathbb{E}_{y_t \mid y_0}\left[-\log\left(1 - c^{\mathrm{noise}}_{\theta}(y_t, t)\right)\right]$$

    • Sample-Dependent Annotation: Instead of a single $t_n^{\min}$ for all low-quality data, each sample $w_0^{(i)}$ can be assigned its own minimum usable noise time $t^{\min}_i = \inf\{t \in [0, T] : \mathbb{E}_{w_t \mid w_0^{(i)}}[ c^{\mathrm{noise}}_{\theta}(w_t, t) ] > \tau\}$, where $\tau$ is a threshold (e.g., $0.5 - \epsilon$). A sample is considered "usable" at time $t$ if the classifier is sufficiently confused about its origin once that much noise is added.

    • Training Objective: The diffusion model $h_{\theta}(x_t, t)$ is then trained using an objective similar to Ambient Diffusion, generalized for sample-specific annotation times $t_i^{\min}$ (a code sketch follows this list):

      $$J_{\mathrm{ambient\text{-}o}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0, T]} \sum_{i:\, t^{\min}_i < t}\mathbb{E}_{x_t \mid x_{t^{\min}_i}^{(i)}} \left[ \left\| \alpha(t, t^{\min}_i)\, h_{\theta}(x_t, t) + \left(1 - \alpha(t, t^{\min}_i)\right) x_t - x_{t^{\min}_i}^{(i)}\right\|^2\right]$$

      where $\alpha(t, t^{\min}_i) = \frac{\sigma^2(t) - \sigma^2(t^{\min}_i)}{\sigma^2(t)}$, and $x_{t^{\min}_i}^{(i)}$ is the $i$-th datapoint $w_0^{(i)}$ with Gaussian noise of level $\sigma(t_i^{\min})$ added.

    • This effectively treats arbitrarily corrupted data as data corrupted by a known amount of additive Gaussian noise, at the cost of information loss from the added noise during annotation.

  • Limitations: The method works best for high-frequency corruptions (like blur) because Gaussian noise primarily dampens high frequencies. Low-frequency corruptions (color shifts, contrast reduction) are more challenging.
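
To make the pieces above concrete, here is a minimal PyTorch sketch of the noise-classifier objective $J_{\mathrm{noise}}$, the sample-dependent annotation of $t_i^{\min}$, and the ambient-o loss. It is a sketch under stated assumptions, not the paper's released code: `classifier`, `model`, and `sigma` are assumed callables, and the Monte Carlo annotation loop is one plausible way to approximate the expectation over $w_t \mid w_0^{(i)}$.

```python
import torch
import torch.nn.functional as F

def classifier_loss(classifier, clean_batch, bad_batch, t, sigma):
    """J_noise as a per-batch mean: BCE between noised clean samples (label 1)
    and noised corrupted samples (label 0). classifier(x_t, t) is assumed to
    output the probability that x_t originated from the clean set S_G."""
    scale = sigma(t)
    x_t = clean_batch + scale * torch.randn_like(clean_batch)
    y_t = bad_batch + scale * torch.randn_like(bad_batch)
    p_clean = classifier(x_t, t)
    p_bad = classifier(y_t, t)
    return (F.binary_cross_entropy(p_clean, torch.ones_like(p_clean))
            + F.binary_cross_entropy(p_bad, torch.zeros_like(p_bad)))

@torch.no_grad()
def annotate_t_min(classifier, w0, times, sigma, tau=0.45, n_mc=16):
    """Sample-dependent annotation: the earliest time (times assumed sorted
    ascending) at which the classifier is confused about w0, i.e. its mean
    score over n_mc independent noisings exceeds tau = 0.5 - eps."""
    for t in times:
        batch = w0.unsqueeze(0).expand(n_mc, *w0.shape).clone()
        noised = batch + sigma(t) * torch.randn_like(batch)
        if classifier(noised, t).mean() > tau:
            return t
    return times[-1]  # usable only at the highest noise level

def ambient_o_loss(model, w_tmin, t, t_min, sigma):
    """Ambient-o objective for one sample already noised to its annotation
    level t_min; only valid for t > t_min."""
    alpha = (sigma(t) ** 2 - sigma(t_min) ** 2) / sigma(t) ** 2
    # Diffuse from t_min to t by adding the residual Gaussian noise.
    extra_std = (sigma(t) ** 2 - sigma(t_min) ** 2) ** 0.5
    x_t = w_tmin + extra_std * torch.randn_like(w_tmin)
    pred = alpha * model(x_t, t) + (1.0 - alpha) * x_t
    return F.mse_loss(pred, w_tmin)
```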

2. Learning in the Low-Noise Regime (Leveraging Synthetic and OOD Data)

  • Insight: Natural images exhibit locality. At low diffusion times, denoising primarily relies on local image regions (small receptive fields). If OOD or synthetic data share similar local patch statistics with the target distribution, their patches can be used for training at low noise levels.
  • Implementation:
    • Crop Size Mapping: The paper establishes an empirical relationship between diffusion time $t$ and the minimal crop size $\mathrm{crop}(t)$ needed for optimal denoising at that time.
    • Crops Classifier: A classifier $c^{\mathrm{crops}}_{\theta}$ is trained to distinguish between crops from the clean distribution $p_0$ and crops from an OOD/synthetic distribution $\tilde{p}_0$.
    • Maximum Usable Time Annotation: A maximum noise time $t^{\max}_n$ is determined, below which the crops of OOD samples are indistinguishable from clean data crops (sketched after this list):
    • $t^{\max}_{n} = \sup\left\{t \in [0, T] : \frac{1}{|S_B|}\sum_{y_0 \in S_B} c^{\mathrm{crops}}_{\theta}\big(A(t)(y_t)\big) > \tau\right\}$, where $A(t)$ is a random patch selector of size $\mathrm{crop}(t)$.
    • For $t \leq t^{\max}_n$, OOD samples are used with the standard diffusion objective, as their local patches are considered equivalent to clean data patches.
  • The Donut Paradox: A sample $w_0^{(i)}$ might be usable for $t \geq t^{\min}_i$ (global noise makes it indistinguishable) and for $t \leq t^{\max}_i$ (local patches are indistinguishable), but not for $t \in (t^{\max}_i, t^{\min}_i)$. In this intermediate "donut hole" regime, the noise is insufficient to merge the global distributions, yet the required receptive field is large enough that patch-level differences become apparent.
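
The crop-based annotation can be sketched in the same spirit. Here `crops_classifier`, `crop_schedule` (the empirical $\mathrm{crop}(t)$ mapping), and the high-to-low scan over times are assumptions about one reasonable implementation, not the authors' code.

```python
import torch

def random_crop(img, size):
    """Uniformly sample a size x size patch from a (C, H, W) image."""
    _, h, w = img.shape
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    return img[:, top:top + size, left:left + size]

@torch.no_grad()
def annotate_t_max(crops_classifier, bad_set, times, sigma, crop_schedule, tau=0.45):
    """Return the largest time t at which the classifier's average score on
    random crop(t)-sized patches of the OOD set still exceeds tau."""
    for t in sorted(times, reverse=True):
        size = crop_schedule(t)  # minimal receptive field needed at time t
        scores = []
        for y0 in bad_set:
            y_t = y0 + sigma(t) * torch.randn_like(y0)
            scores.append(crops_classifier(random_crop(y_t, size)).item())
        if sum(scores) / len(scores) > tau:
            return t
    return 0.0  # patches never pass as in-distribution
```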

Theoretical Justification

The paper provides a theoretical analysis of the bias-variance trade-off. It compares training with $n_1$ clean samples (Algorithm 1) against training with $n_1 + n_2$ mixed-quality samples (Algorithm 2). Key results:

  • Diffusion modeling is linked to Gaussian kernel density estimation.
  • Theorem (Gaussian kernel density estimation): Provides a bound on the distance $d(p_\sigma, \hat{p}_\sigma)$ between the smoothed distribution and its Gaussian kernel density estimate.
  • Theorem (distance contraction under noise): Shows that $d(P \circledast \mathcal{N}(0,\sigma^2 I),\, Q \circledast \mathcal{N}(0,\sigma^2 I)) \le d(P,Q) \cdot \frac{D}{2\sigma}$, i.e., Gaussian noise contracts the distance between distributions. This implies that for sufficiently high noise $\sigma_t$ (i.e., $t \ge t^{\min}_n$), a larger dataset of mixed-quality samples (Algorithm 2) can estimate $p_t$ better than a smaller dataset of only clean samples (Algorithm 1): the bias introduced by low-quality data is offset by the variance reduction from having more samples. The classifier $c^{\mathrm{noise}}_{\theta}$ empirically finds this $t^{\min}_n$.
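
As a quick numerical illustration (ours, not from the paper) of the contraction theorem, the snippet below estimates the total variation distance between two 1D distributions before and after convolving them with Gaussian noise; the estimated distance shrinks as $\sigma$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
p = rng.normal(0.0, 1.0, n)  # stand-in for the target distribution P
q = rng.normal(0.8, 1.0, n)  # stand-in for the biased distribution Q

def tv_distance(a, b, bins=400, lo=-20.0, hi=20.0):
    """Histogram estimate of the total variation distance between samples."""
    ha, _ = np.histogram(a, bins=bins, range=(lo, hi), density=True)
    hb, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    width = (hi - lo) / bins
    return 0.5 * np.abs(ha - hb).sum() * width

for sigma in [0.0, 1.0, 2.0, 4.0]:
    noise = rng.normal(0.0, sigma, n) if sigma > 0 else 0.0
    print(f"sigma={sigma:.1f}  TV ~ {tv_distance(p + noise, q + noise):.3f}")
# The printed TV distance decreases monotonically in sigma: at high diffusion
# times, biased samples become nearly as informative as clean ones.
```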

Experiments and Results

  • Controlled Experiments (CIFAR-10, FFHQ):
    • Successfully trained models with synthetically corrupted data (Gaussian blur, JPEG compression). Ambient-o outperformed baselines (training on all data equally, or only on filtered clean data).
    • For Gaussian blur on CIFAR-10 (10% clean, 90% blurred with $\sigma_B = 0.6$), Ambient-o achieved FID 5.34, while "Only Clean" got 8.79 and "All data" got 11.42. The average annotated noise level $\bar{\sigma}_{t_n}^{\min}$ was 1.38.
    • The low-noise regime strategy was validated by training a dog generation model using 10% dog images and adding cat images as OOD data. This improved FID from 12.08 (dogs only) to 8.92 (dogs + cats with classifier-based crop usage).
  • ImageNet:
    • Used CLIP-IQA to label ImageNet samples into high-quality (top 10%) and low-quality (bottom 90%) sets (a scoring sketch appears after this results list).
    • Ambient-o-XXL+crops (using both high-noise and low-noise strategies) achieved state-of-the-art FID on ImageNet-512:
    • Test FID (no CFG): 2.78 (vs. EDM2-XXL: 2.88)
    • Test FID (w/ CFG): 2.53 (vs. EDM2-XXL: 2.73)
    • This demonstrates benefits on real-world datasets with naturally heterogeneous quality.
  • Text-to-Image (MicroDiffusion):
    • Trained on a mix of datasets including DiffusionDB (lower-quality synthetic data). Ambient-o treated DiffusionDB samples by assigning them $\sigma_{\min} = 2$ (i.e., they are only used in the high-noise regime).
    • Improved FID on COCO zero-shot generation from 12.37 (baseline) to 10.61 (Ambient-o).
    • Showed better diversity compared to fine-tuning on only high-quality data, which tends to reduce diversity. Ambient-o achieved a >13% increase in DINO Vendi Diversity.
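
For the CLIP-IQA quality split mentioned in the ImageNet results above, a scoring pipeline along these lines could be used. This is a hedged sketch: it assumes the third-party `pyiqa` package and its `clipiqa` metric, and the quantile-threshold logic is our illustration rather than the paper's exact pipeline.

```python
import pyiqa
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clipiqa = pyiqa.create_metric("clipiqa", device=device)  # no-reference IQA

def split_by_quality(image_paths, top_fraction=0.10):
    """Score images with CLIP-IQA and split into a high-quality set S_G
    (top fraction) and a low-quality set S_B (the rest)."""
    scores = torch.tensor([clipiqa(path).item() for path in image_paths])
    cutoff = torch.quantile(scores, 1.0 - top_fraction)
    high = [p for p, s in zip(image_paths, scores) if s >= cutoff]
    low = [p for p, s in zip(image_paths, scores) if s < cutoff]
    return high, low
```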

Practical Implementation Considerations

  • Classifier Training: The method requires training auxiliary classifiers ($c^{\mathrm{noise}}_{\theta}$ and $c^{\mathrm{crops}}_{\theta}$). This adds computational overhead but can be amortized if the classifiers are general or if annotation times are hand-picked based on quality proxies (as done in the text-to-image experiment with DiffusionDB).
  • Loss Pre-conditioning: The paper mentions using a pre-conditioning weight $\lambda_{\mathrm{amb}}(\sigma, \sigma_{\min}) = \sigma^4 / (\sigma^2 - \sigma_{\min}^2)^2$ for the ambient loss, similar to EDM-2, and a buffer zone around $\sigma_{\min}$ to prevent singularities (a minimal sketch follows this list).
  • Data Annotation: The quality of the high-quality set $S_G$ and low-quality set $S_B$ (or the quality metric used to define them, like CLIP-IQA) is crucial, as annotations for other samples depend on their similarity to these sets.
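
A minimal sketch of the pre-conditioning weight with an assumed buffer around $\sigma_{\min}$; the guard value `eps` is our choice, since the paper only states that a buffer zone is used.

```python
def lambda_amb(sigma: float, sigma_min: float, eps: float = 1e-2) -> float:
    """lambda_amb = sigma^4 / (sigma^2 - sigma_min^2)^2, guarded so the
    denominator never collapses as sigma approaches sigma_min."""
    gap = max(sigma * sigma - sigma_min * sigma_min, eps)
    return sigma ** 4 / gap ** 2
```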

Takeaways

  1. Low-quality in-distribution images and high-quality out-of-distribution images can both be used to train models that generate high-quality in-distribution images.
  2. Real datasets contain heterogeneous samples. Ambient-o explicitly accounts for this quality variability, improving generation quality.
  3. Ambient-o treats synthetic data as a form of corrupted data, leading to superior visual quality and diversity compared to relying only on real samples or simple filtering.

The paper provides a principled way to incorporate a wider range of data into diffusion model training, potentially reducing the reliance on expensive and heuristic data filtering processes. It shows practical improvements in image quality and diversity across various tasks and datasets.
