Ambient Diffusion Omni: Training Good Models with Bad Data (2506.10038v1)

Published 10 Jun 2025 in cs.GR, cs.AI, and cs.LG

Abstract: We show how to use low-quality, synthetic, and out-of-distribution images to improve the quality of a diffusion model. Typically, diffusion models are trained on curated datasets that emerge from highly filtered data pools from the Web and other sources. We show that there is immense value in the lower-quality images that are often discarded. We present Ambient Diffusion Omni, a simple, principled framework to train diffusion models that can extract signal from all available images during training. Our framework exploits two properties of natural images -- spectral power law decay and locality. We first validate our framework by successfully training diffusion models with images synthetically corrupted by Gaussian blur, JPEG compression, and motion blur. We then use our framework to achieve state-of-the-art ImageNet FID, and we show significant improvements in both image quality and diversity for text-to-image generative modeling. The core insight is that noise dampens the initial skew between the desired high-quality distribution and the mixed distribution we actually observe. We provide rigorous theoretical justification for our approach by analyzing the trade-off between learning from biased data versus limited unbiased data across diffusion times.


Summary

  • The paper introduces Ambient Diffusion Omni, a method that uses both low-quality and OOD images by annotating noise levels to boost diffusion training.
  • It employs noise level and crop-based classifiers to optimally integrate data in high-noise and low-noise regimes, addressing corruptions and local patch consistency.
  • Empirical results show improved FID and diversity on benchmarks like CIFAR-10, FFHQ, and ImageNet by effectively balancing bias and variance.

This paper introduces "Ambient Diffusion Omni" (Ambient-o), a framework for training diffusion models that can effectively utilize low-quality, synthetic, and out-of-distribution (OOD) images, which are typically discarded during dataset curation (2506.10038). The core idea is that these "bad" data still contain valuable signals that can improve model performance, particularly when handled appropriately based on the diffusion process's noise levels.

The method exploits two key properties of natural images: their spectral power law decay and locality. It proposes different strategies for high-noise and low-noise diffusion regimes:

1. Learning in the High-Noise Regime (Leveraging Low-Quality Data)

  • Insight: Adding Gaussian noise contracts distributional distances. As diffusion time $t$ (and thus noise) increases, the difference between the clean image distribution $p_t$ and the corrupted image distribution $\tilde{p}_t$ diminishes. Low-quality samples therefore become useful for training denoisers at high noise levels ($t > t_n^{\min}$).
  • Implementation:
    • Noise Level Annotation: To determine the minimum noise level $t_n^{\min}$ at which corrupted data can be used, a time-conditional classifier $c^{\mathrm{noise}}_{\theta}(x_t, t)$ is trained. This classifier learns to distinguish between noised clean samples (from a small set $S_G$) and noised corrupted samples (from a set $S_B$).
    • The classifier is trained using the objective:

      $$J_{\mathrm{noise}}(\theta) = \sum_{x_0 \in S_G}\mathbb{E}_{x_t \mid x_0}\left[ -\log c^{\mathrm{noise}}_{\theta}(x_t, t)\right] + \sum_{y_0 \in S_B}\mathbb{E}_{y_t \mid y_0}\left[-\log\left(1 - c^{\mathrm{noise}}_{\theta}(y_t, t)\right)\right]$$

    • Sample-Dependent Annotation: Instead of a single $t_n^{\min}$ for all low-quality data, each sample $w_0^{(i)}$ can be assigned its own minimum usable noise time $t^{\min}_i = \inf\{t \in [0, T] : \mathbb{E}_{w_t \mid w_0^{(i)}}[ c^{\mathrm{noise}}_{\theta}(w_t, t) ] > \tau\}$, where $\tau$ is a threshold (e.g., $0.5 - \epsilon$). A sample is considered "usable" at time $t$ if the classifier is sufficiently confused about its origin once that much noise is added.

    • Training Objective: The diffusion model $h_{\theta}(x_t, t)$ is then trained using an objective similar to Ambient Diffusion, generalized for sample-specific annotation times $t_i^{\min}$ (a code sketch follows this list):

      $$J_{\mathrm{ambient\text{-}o}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0, T]} \sum_{i:\, t^{\min}_i < t}\mathbb{E}_{x_t \mid x_{t^{\min}_i}^{(i)}} \left[ \left\| \alpha(t, t^{\min}_i)\, h_{\theta}(x_t, t) + \left(1 - \alpha(t, t^{\min}_i)\right) x_t - x_{t^{\min}_i}^{(i)}\right\|^2\right]$$

      where $\alpha(t, t^{\min}_i) = \frac{\sigma^2(t) - \sigma^2(t^{\min}_i)}{\sigma^2(t)}$, and $x_{t^{\min}_i}^{(i)}$ is the $i$-th datapoint $w_0^{(i)}$ with Gaussian noise of level $\sigma(t_i^{\min})$ added.

    • This effectively treats arbitrarily corrupted data as data corrupted by a known amount of additive Gaussian noise, at the cost of information loss from the added noise during annotation.

  • Limitations: The method works best for high-frequency corruptions (like blur) because Gaussian noise primarily dampens high frequencies. Low-frequency corruptions (color shifts, contrast reduction) are more challenging.
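
To make the pieces above concrete, here is a minimal PyTorch sketch of the noise-classifier objective $J_{\mathrm{noise}}$, the sample-dependent annotation of $t_i^{\min}$, and the ambient-o loss. It is a sketch under stated assumptions, not the paper's released code: `classifier`, `model`, and `sigma` are assumed callables, and the Monte Carlo annotation loop is one plausible way to approximate the expectation over $w_t \mid w_0^{(i)}$.

```python
import torch
import torch.nn.functional as F

def classifier_loss(classifier, clean_batch, bad_batch, t, sigma):
    """J_noise as a per-batch mean: BCE between noised clean samples (label 1)
    and noised corrupted samples (label 0). classifier(x_t, t) is assumed to
    output the probability that x_t originated from the clean set S_G."""
    scale = sigma(t)
    x_t = clean_batch + scale * torch.randn_like(clean_batch)
    y_t = bad_batch + scale * torch.randn_like(bad_batch)
    p_clean = classifier(x_t, t)
    p_bad = classifier(y_t, t)
    return (F.binary_cross_entropy(p_clean, torch.ones_like(p_clean))
            + F.binary_cross_entropy(p_bad, torch.zeros_like(p_bad)))

@torch.no_grad()
def annotate_t_min(classifier, w0, times, sigma, tau=0.45, n_mc=16):
    """Sample-dependent annotation: the earliest time (times assumed sorted
    ascending) at which the classifier is confused about w0, i.e. its mean
    score over n_mc independent noisings exceeds tau = 0.5 - eps."""
    for t in times:
        batch = w0.unsqueeze(0).expand(n_mc, *w0.shape).clone()
        noised = batch + sigma(t) * torch.randn_like(batch)
        if classifier(noised, t).mean() > tau:
            return t
    return times[-1]  # usable only at the highest noise level

def ambient_o_loss(model, w_tmin, t, t_min, sigma):
    """Ambient-o objective for one sample already noised to its annotation
    level t_min; only valid for t > t_min."""
    alpha = (sigma(t) ** 2 - sigma(t_min) ** 2) / sigma(t) ** 2
    # Diffuse from t_min to t by adding the residual Gaussian noise.
    extra_std = (sigma(t) ** 2 - sigma(t_min) ** 2) ** 0.5
    x_t = w_tmin + extra_std * torch.randn_like(w_tmin)
    pred = alpha * model(x_t, t) + (1.0 - alpha) * x_t
    return F.mse_loss(pred, w_tmin)
```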

2. Learning in the Low-Noise Regime (Leveraging Synthetic and OOD Data)

  • Insight: Natural images exhibit locality. At low diffusion times, denoising primarily relies on local image regions (small receptive fields). If OOD or synthetic data share similar local patch statistics with the target distribution, their patches can be used for training at low noise levels.
  • Implementation:
    • Crop Size Mapping: The paper establishes an empirical relationship between diffusion time $t$ and the minimal crop size $\mathrm{crop}(t)$ needed for optimal denoising at that time.
    • Crops Classifier: A classifier $c^{\mathrm{crops}}_{\theta}$ is trained to distinguish between crops from the clean distribution $p_0$ and crops from an OOD/synthetic distribution $\tilde{p}_0$.
    • Maximum Usable Time Annotation: A maximum noise time $t^{\max}_n$ is determined, below which the crops of OOD samples are indistinguishable from clean data crops (sketched after this list):
    • $t^{\max}_{n} = \sup\left\{t \in [0, T] : \frac{1}{|S_B|}\sum_{y_0 \in S_B} c^{\mathrm{crops}}_{\theta}\big(A(t)(y_t)\big) > \tau\right\}$, where $A(t)$ is a random patch selector of size $\mathrm{crop}(t)$.
    • For $t \leq t^{\max}_n$, OOD samples are used with the standard diffusion objective, as their local patches are considered equivalent to clean data patches.
  • The Donut Paradox: A sample $w_0^{(i)}$ might be usable for $t \geq t^{\min}_i$ (global noise makes it indistinguishable) and for $t \leq t^{\max}_i$ (local patches are indistinguishable), but not for $t \in (t^{\max}_i, t^{\min}_i)$. In this intermediate "donut hole" regime, the noise is insufficient to merge the global distributions, yet the required receptive field is large enough that patch-level differences become apparent.
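
The crop-based annotation can be sketched in the same spirit. Here `crops_classifier`, `crop_schedule` (the empirical $\mathrm{crop}(t)$ mapping), and the high-to-low scan over times are assumptions about one reasonable implementation, not the authors' code.

```python
import torch

def random_crop(img, size):
    """Uniformly sample a size x size patch from a (C, H, W) image."""
    _, h, w = img.shape
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    return img[:, top:top + size, left:left + size]

@torch.no_grad()
def annotate_t_max(crops_classifier, bad_set, times, sigma, crop_schedule, tau=0.45):
    """Return the largest time t at which the classifier's average score on
    random crop(t)-sized patches of the OOD set still exceeds tau."""
    for t in sorted(times, reverse=True):
        size = crop_schedule(t)  # minimal receptive field needed at time t
        scores = []
        for y0 in bad_set:
            y_t = y0 + sigma(t) * torch.randn_like(y0)
            scores.append(crops_classifier(random_crop(y_t, size)).item())
        if sum(scores) / len(scores) > tau:
            return t
    return 0.0  # patches never pass as in-distribution
```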

Theoretical Justification

The paper provides a theoretical analysis of the bias-variance trade-off. It compares training with $n_1$ clean samples (Algorithm 1) against training with $n_1 + n_2$ mixed-quality samples (Algorithm 2). Key results:

  • Diffusion modeling is linked to Gaussian kernel density estimation.
  • Theorem (Gaussian kernel density estimation): Provides a bound on the distance $d(p_\sigma, \hat{p}_\sigma)$ between the smoothed distribution and its Gaussian kernel density estimate.
  • Theorem (distance contraction under noise): Shows that $d(P \circledast \mathcal{N}(0,\sigma^2 I),\, Q \circledast \mathcal{N}(0,\sigma^2 I)) \le d(P,Q) \cdot \frac{D}{2\sigma}$, i.e., Gaussian noise contracts the distance between distributions. This implies that for sufficiently high noise $\sigma_t$ (i.e., $t \ge t^{\min}_n$), a larger dataset of mixed-quality samples (Algorithm 2) can estimate $p_t$ better than a smaller dataset of only clean samples (Algorithm 1): the bias introduced by low-quality data is offset by the variance reduction from having more samples. The classifier $c^{\mathrm{noise}}_{\theta}$ empirically finds this $t^{\min}_n$.
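
As a quick numerical illustration (ours, not from the paper) of the contraction theorem, the snippet below estimates the total variation distance between two 1D distributions before and after convolving them with Gaussian noise; the estimated distance shrinks as $\sigma$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
p = rng.normal(0.0, 1.0, n)  # stand-in for the target distribution P
q = rng.normal(0.8, 1.0, n)  # stand-in for the biased distribution Q

def tv_distance(a, b, bins=400, lo=-20.0, hi=20.0):
    """Histogram estimate of the total variation distance between samples."""
    ha, _ = np.histogram(a, bins=bins, range=(lo, hi), density=True)
    hb, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    width = (hi - lo) / bins
    return 0.5 * np.abs(ha - hb).sum() * width

for sigma in [0.0, 1.0, 2.0, 4.0]:
    noise = rng.normal(0.0, sigma, n) if sigma > 0 else 0.0
    print(f"sigma={sigma:.1f}  TV ~ {tv_distance(p + noise, q + noise):.3f}")
# The printed TV distance decreases monotonically in sigma: at high diffusion
# times, biased samples become nearly as informative as clean ones.
```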

Experiments and Results

  • Controlled Experiments (CIFAR-10, FFHQ):
    • Successfully trained models with synthetically corrupted data (Gaussian blur, JPEG compression). Ambient-o outperformed baselines (training on all data equally, or only on filtered clean data).
    • For Gaussian blur on CIFAR-10 (10% clean, 90% blurred with $\sigma_B = 0.6$), Ambient-o achieved FID 5.34, while "Only Clean" got 8.79 and "All data" got 11.42. The average annotated noise level $\bar{\sigma}_{t_n}^{\min}$ was 1.38.
    • The low-noise regime strategy was validated by training a dog generation model using 10% dog images and adding cat images as OOD data. This improved FID from 12.08 (dogs only) to 8.92 (dogs + cats with classifier-based crop usage).
  • ImageNet:
    • Used CLIP-IQA to label ImageNet samples into high-quality (top 10%) and low-quality (bottom 90%) sets (a scoring sketch appears after this results list).
    • Ambient-o-XXL+crops (using both high-noise and low-noise strategies) achieved state-of-the-art FID on ImageNet-512:
    • Test FID (no CFG): 2.78 (vs. EDM2-XXL: 2.88)
    • Test FID (w/ CFG): 2.53 (vs. EDM2-XXL: 2.73)
    • This demonstrates benefits on real-world datasets with naturally heterogeneous quality.
  • Text-to-Image (MicroDiffusion):
    • Trained on a mix of datasets including DiffusionDB (lower-quality synthetic data). Ambient-o treated DiffusionDB samples by assigning them $\sigma_{\min} = 2$ (i.e., they are only used in the high-noise regime).
    • Improved FID on COCO zero-shot generation from 12.37 (baseline) to 10.61 (Ambient-o).
    • Showed better diversity compared to fine-tuning on only high-quality data, which tends to reduce diversity. Ambient-o achieved a >13% increase in DINO Vendi Diversity.
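
For the CLIP-IQA quality split mentioned in the ImageNet results above, a scoring pipeline along these lines could be used. This is a hedged sketch: it assumes the third-party `pyiqa` package and its `clipiqa` metric, and the quantile-threshold logic is our illustration rather than the paper's exact pipeline.

```python
import pyiqa
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clipiqa = pyiqa.create_metric("clipiqa", device=device)  # no-reference IQA

def split_by_quality(image_paths, top_fraction=0.10):
    """Score images with CLIP-IQA and split into a high-quality set S_G
    (top fraction) and a low-quality set S_B (the rest)."""
    scores = torch.tensor([clipiqa(path).item() for path in image_paths])
    cutoff = torch.quantile(scores, 1.0 - top_fraction)
    high = [p for p, s in zip(image_paths, scores) if s >= cutoff]
    low = [p for p, s in zip(image_paths, scores) if s < cutoff]
    return high, low
```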

Practical Implementation Considerations

  • Classifier Training: The method requires training auxiliary classifiers ($c^{\mathrm{noise}}_{\theta}$ and $c^{\mathrm{crops}}_{\theta}$). This adds computational overhead but can be amortized if the classifiers are general or if annotation times are hand-picked based on quality proxies (as done in the text-to-image experiment with DiffusionDB).
  • Loss Pre-conditioning: The paper mentions using a pre-conditioning weight $\lambda_{\mathrm{amb}}(\sigma, \sigma_{\min}) = \sigma^4 / (\sigma^2 - \sigma_{\min}^2)^2$ for the ambient loss, similar to EDM-2, and a buffer zone around $\sigma_{\min}$ to prevent singularities (a minimal sketch follows this list).
  • Data Annotation: The quality of the high-quality set $S_G$ and low-quality set $S_B$ (or the quality metric used to define them, like CLIP-IQA) is crucial, as annotations for other samples depend on their similarity to these sets.
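
A minimal sketch of the pre-conditioning weight with an assumed buffer around $\sigma_{\min}$; the guard value `eps` is our choice, since the paper only states that a buffer zone is used.

```python
def lambda_amb(sigma: float, sigma_min: float, eps: float = 1e-2) -> float:
    """lambda_amb = sigma^4 / (sigma^2 - sigma_min^2)^2, guarded so the
    denominator never collapses as sigma approaches sigma_min."""
    gap = max(sigma * sigma - sigma_min * sigma_min, eps)
    return sigma ** 4 / gap ** 2
```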

Takeaways

  1. Low-quality in-distribution images and high-quality out-of-distribution images can both be used to train models that generate high-quality in-distribution images.
  2. Real datasets contain heterogeneous samples. Ambient-o explicitly accounts for this quality variability, improving generation quality.
  3. Ambient-o treats synthetic data as a form of corrupted data, leading to superior visual quality and diversity compared to relying only on real samples or simple filtering.

The paper provides a principled way to incorporate a wider range of data into diffusion model training, potentially reducing the reliance on expensive and heuristic data filtering processes. It shows practical improvements in image quality and diversity across various tasks and datasets.
