Don't drop your samples! Coherence-aware training benefits Conditional diffusion (2405.20324v1)

Published 30 May 2024 in cs.CV and cs.LG

Abstract: Conditional diffusion models are powerful generative models that can leverage various types of conditional information, such as class labels, segmentation masks, or text captions. However, in many real-world scenarios, conditional information may be noisy or unreliable due to human annotation errors or weak alignment. In this paper, we propose the Coherence-Aware Diffusion (CAD), a novel method that integrates coherence in conditional information into diffusion models, allowing them to learn from noisy annotations without discarding data. We assume that each data point has an associated coherence score that reflects the quality of the conditional information. We then condition the diffusion model on both the conditional information and the coherence score. In this way, the model learns to ignore or discount the conditioning when the coherence is low. We show that CAD is theoretically sound and empirically effective on various conditional generation tasks. Moreover, we show that leveraging coherence generates realistic and diverse samples that respect conditional information better than models trained on cleaned datasets where samples with low coherence have been discarded.

Summary

  • The paper introduces CAD, a coherence-aware training method that learns from both high- and low-coherence annotations without discarding valuable data.
  • It integrates a latent coherence score into the diffusion process, boosting image quality in text-to-image, class-conditional, and semantic map generation tasks.
  • Empirical results demonstrate that CAD outperforms baselines in FID scores and user preference, proving its effectiveness in managing noisy conditions.

Coherence-aware training in Conditional Diffusion Models

Conditional diffusion models have established themselves as vital tools in the generative modeling landscape, leveraging additional information such as class labels, semantic maps, or text captions to guide the generation process. However, real-world conditional information is often noisy or unreliable due to annotation errors. Addressing this issue, the paper “Don't drop your samples! Coherence-aware training benefits Conditional Diffusion” by Dufour et al. introduces Coherence-Aware Diffusion (CAD), a method that integrates a measure of conditioning coherence into diffusion models so that they can learn effectively from noisy annotations.

Methodology

The proposed CAD framework introduces a coherence score associated with each data point, reflecting the quality of the conditional information. This score is utilized to condition the diffusion model alongside the conventional conditional information. Through this technique, the model learns to discount or ignore the conditioning when the coherence is low, thus preventing the discarding of valuable data points due to noisy annotations. The coherence score is embedded into a latent vector and merged with the condition to allow the model to dynamically adjust the reliance on conditional information during training. CAD also enhances the Classifier-Free Guidance (CFG) method, tailoring it to leverage coherence information for significant improvements in image quality.
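
To make this concrete, below is a minimal sketch of how a scalar coherence score might be embedded and merged with a condition embedding. The module name, the sinusoidal featurization, and merging by addition are illustrative assumptions; the paper specifies only that the coherence score is embedded into a latent vector and combined with the condition.

```python
import math
import torch
import torch.nn as nn

class CoherenceConditioner(nn.Module):
    """Embed a scalar coherence score and merge it with a condition embedding.

    Hypothetical sketch: the featurization and the additive merge are
    illustrative choices, not the authors' implementation.
    """

    def __init__(self, cond_dim: int, num_freqs: int = 16):
        super().__init__()
        self.num_freqs = num_freqs
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_freqs, cond_dim),
            nn.SiLU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, cond: torch.Tensor, coherence: torch.Tensor) -> torch.Tensor:
        # coherence: (batch,) scores in [0, 1]
        freqs = 2.0 ** torch.arange(self.num_freqs, device=coherence.device)
        angles = math.pi * coherence[:, None] * freqs[None, :]
        feats = torch.cat([angles.sin(), angles.cos()], dim=-1)
        # Merge by addition; concatenation with a projection would also work.
        return cond + self.mlp(feats)
```

During training, each sample is conditioned on its own coherence score; at sampling time, setting the score to 1 requests outputs that fully respect the conditioning, while lower scores progressively relax it.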

Experimental Setup

The experimental evaluation of CAD spans multiple conditional generation tasks, including:

  • Text-to-Image Generation: Text conditioning uses the CLIP score to estimate coherence between the image and its descriptive caption (a sketch of this estimate follows the list). The experiments utilize a modified RIN architecture combined with a FLAN-T5 XL encoder.
  • Class-Conditional Image Generation: The dataset comprises CIFAR-10 and ImageNet-64, with an error probability resampling scheme to simulate varying levels of label consistency. The coherence score is derived from an off-the-shelf classifier confidence estimator.
  • Semantic Map Conditioning: The model integrates pixel-level coherence scores derived from class boundaries or confidence estimations, tested on datasets like ADE20k and MS COCO.
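
As a concrete example of the CLIP-based coherence estimate mentioned in the first bullet, the snippet below computes an image-caption similarity with Hugging Face transformers. The choice of checkpoint (openai/clip-vit-base-patch32) and the rescaling of cosine similarity into [0, 1] are assumptions for illustration; the paper's exact CLIP variant and normalization may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical coherence estimator: cosine similarity between CLIP image
# and text embeddings, rescaled to [0, 1]. Checkpoint and rescaling are
# illustrative choices, not necessarily those used in the paper.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_coherence(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    sim = torch.cosine_similarity(img, txt).item()  # roughly in [-1, 1]
    return max(0.0, min(1.0, (sim + 1.0) / 2.0))    # map to [0, 1]
```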

Results

Text-Conditional Generation:

Empirical results demonstrate CAD’s superior performance over baseline, filtered, and weighted models across several metrics. On a subset of COCO, CAD achieved the lowest (best) FID of 69.4, significantly outperforming both the filtered (85.8) and baseline (91.9) models. Furthermore, user studies reflected a strong preference for images generated by CAD, with users favoring them in terms of both quality and alignment with prompts. CAD’s images adhered more closely to complex text prompts while better preserving content diversity.

Class-Conditional Generation:

On CIFAR-10 and ImageNet-64 with varying levels of label noise, CAD yielded better FID and accuracy metrics than baseline models. The method was particularly effective at adapting to low-coherence conditions, producing results that aligned with the provided class labels more accurately than the baseline. Importantly, the coherence-aware CFG further improved image quality without requiring conditioning dropout during training, as sketched below.
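
One plausible reading of the coherence-aware CFG remark is illustrated below: because the model learns to ignore the conditioning at coherence 0, that branch can stand in for the unconditional prediction of standard CFG, removing the need for condition dropout during training. The model(x_t, t, cond, coherence) interface is hypothetical.

```python
import torch

def coherence_aware_cfg(model, x_t, t, cond, guidance_scale: float = 3.0):
    """Guidance step using coherence in place of an unconditional branch.

    Hypothetical sketch: `model(x_t, t, cond, coherence)` is an assumed
    interface returning the predicted noise for one denoising step.
    """
    batch = x_t.shape[0]
    coh_hi = torch.ones(batch, device=x_t.device)   # fully trust the condition
    coh_lo = torch.zeros(batch, device=x_t.device)  # effectively unconditional
    eps_cond = model(x_t, t, cond, coh_hi)
    eps_uncond = model(x_t, t, cond, coh_lo)
    # Standard CFG extrapolation between the two predictions.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```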

Semantic Map Conditioning:

The inclusion of coherence maps in training improved both the visual quality and fidelity of generated images. CAD demonstrated greater flexibility in generating realistic content, particularly under low-coherence conditions. The improvements were evident in significant FID reductions on ADE20k and MS COCO when coherence maps were incorporated alongside semantic segmentation inputs.
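
For pixel-level coherence, one simple input layout, sketched below, appends the coherence map as an extra channel next to a one-hot semantic map. This layout is an assumption for illustration; how CAD actually injects the coherence map may differ.

```python
import torch
import torch.nn.functional as F

def build_semantic_condition(seg: torch.Tensor, coherence: torch.Tensor,
                             num_classes: int) -> torch.Tensor:
    """Stack a one-hot semantic map with a pixel-level coherence map.

    Hypothetical layout: seg is (B, H, W) integer labels, coherence is
    (B, H, W) scores in [0, 1]; output is (B, num_classes + 1, H, W).
    """
    one_hot = F.one_hot(seg.long(), num_classes).permute(0, 3, 1, 2).float()
    return torch.cat([one_hot, coherence.unsqueeze(1).float()], dim=1)
```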

Discussion

The CAD framework offers a flexible, theoretically grounded, and empirically validated approach to training conditional diffusion models. By incorporating coherence scores, it mitigates the negative impact of noisy annotations without discarding valuable data points. This methodology shows promise for enhancing generative quality across diverse conditional generation tasks.

The results of CAD, including its flexibility in adjusting coherence influence, suggest potential future extensions. For example, refining coherence score extraction methods or exploring adaptive coherence scoring techniques may further improve the model's robustness against annotation noise. Additionally, this work underlines the relevance of integrating extrinsic quality metrics directly into model training, a direction that could inspire future research in conditional generative modeling.

Conclusion

In summary, CAD represents a significant advancement in training conditional diffusion models with imperfect annotations. By embedding coherence-awareness, the model achieves high-quality, contextually coherent generative results across multiple tasks. This technique broadens the applicability of conditional diffusion models, promising improvements in diverse domains from text-to-image synthesis to class-conditional and semantic map-guided generation. Future work will likely explore deeper integrations of coherence metrics, further solidifying CAD’s utility in the evolving landscape of generative AI.