Structure-Guided Adversarial Training of Diffusion Models (2402.17563v2)

Published 27 Feb 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Diffusion models have demonstrated exceptional efficacy in various generative applications. While existing models focus on minimizing a weighted sum of denoising score matching losses for data distribution modeling, their training primarily emphasizes instance-level optimization, overlooking valuable structural information within each mini-batch, indicative of pair-wise relationships among samples. To address this limitation, we introduce Structure-guided Adversarial training of Diffusion Models (SADM). In this pioneering approach, we compel the model to learn manifold structures between samples in each training batch. To ensure the model captures authentic manifold structures in the data distribution, we advocate adversarial training of the diffusion generator against a novel structure discriminator in a minimax game, distinguishing real manifold structures from the generated ones. SADM substantially improves existing diffusion transformers (DiT) and outperforms existing methods in image generation and cross-domain fine-tuning tasks across 12 datasets, establishing a new state-of-the-art FID of 1.58 and 2.11 on ImageNet for class-conditional image generation at resolutions of 256x256 and 512x512, respectively.

Authors (5)
  1. Ling Yang
  2. Haotian Qian
  3. Zhilong Zhang
  4. Jingwei Liu
  5. Bin Cui

Summary

  • The paper introduces SADM, which integrates manifold structures via an adversarial framework to improve image generation fidelity.
  • It employs a structure discriminator to compute pairwise affinity metrics from embedding spaces, enhancing generative performance.
  • Evaluations on datasets like ImageNet demonstrate SADM’s ability to achieve state-of-the-art FID scores and rapid domain adaptation.

Enhancing Diffusion Models with Structure-Guided Adversarial Training

Introduction

Diffusion models have rapidly become a focal point in generative modeling, showcasing impressive results in tasks such as image and audio synthesis. Despite their remarkable capabilities, traditional training strategies primarily target instance-level fidelity, often overlooking the rich structural relationships intrinsic to batched training data. This paper introduces a novel training approach, Structure-guided Adversarial Training of Diffusion Models (SADM), designed to leverage manifold structures within training batches, fostering a deeper understanding of data distributions. Through adversarial interactions between a diffusion generator and a dedicated structure discriminator, SADM aims to capture authentic manifold structures, substantially enhancing diffusion model performance on image generation and cross-domain fine-tuning tasks.

Related Work

The paper situates SADM amid ongoing efforts to refine diffusion model training, distinguishing two main lines of research. One modifies the training objective to improve likelihood estimation, while the other introduces auxiliary models to improve training precision and stability. Both lines concentrate on instance-level optimization and underuse the structural information shared among samples within a batch. SADM diverges by targeting these overlooked pairwise relationships, proposing a structural learning paradigm that promises more comprehensive modeling of the data distribution.

Methodology

Beyond Instance-Level Training

SADM shifts the training focus from individual instances to the structural relations among samples within a batch. Ground-truth samples are projected into an embedding space, where pairwise relational metrics are computed and collected in an affinity matrix. This matrix, which represents the low-dimensional manifold structure of the batch, guides the diffusion training process. A novel component of SADM is its structure discriminator, trained adversarially to distinguish real manifold structures from generated ones, thereby strengthening the generative capabilities of the diffusion model.
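To make the batch-level objective concrete, below is a minimal sketch of how an affinity matrix could be computed from per-sample embeddings. It assumes cosine similarity as the pairwise metric and a generic encoder producing the embeddings; the paper's actual embedding space and relational metric may differ, and names such as `encoder`, `real_batch`, and `denoised_batch` are purely illustrative.

```python
import torch
import torch.nn.functional as F

def affinity_matrix(embeddings: torch.Tensor) -> torch.Tensor:
    """Pairwise affinity matrix for one training batch.

    embeddings: (B, D) tensor of per-sample features from some encoder.
    Cosine similarity is used as the pairwise metric purely for illustration.
    """
    z = F.normalize(embeddings, dim=-1)  # unit-norm rows
    return z @ z.t()                     # (B, B) pairwise similarities

# Hypothetical usage: compare real vs. generated batch structures.
# real_struct = affinity_matrix(encoder(real_batch))
# fake_struct = affinity_matrix(encoder(denoised_batch))
```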

Structure-Guided Adversarial Training

The adversarial component of SADM pits the diffusion generator against the structure discriminator in a minimax game, aiming to align the manifold structures of generated samples with those observed in real data. This process not only mitigates overfitting to simplistic structural patterns but also enriches the model's expressiveness and its fidelity to the underlying data distribution.
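The sketch below illustrates, under stated assumptions, how such a minimax objective might be wired together: the structure discriminator is trained to classify real versus generated affinity matrices, while the generator augments its usual denoising loss with an adversarial term. The binary cross-entropy formulation, the weight `lam`, and the `disc` module are assumptions made for illustration, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, real_struct, fake_struct):
    """Train the structure discriminator to separate real manifold
    structures (e.g., affinity matrices) from generated ones."""
    real_logits = disc(real_struct)
    fake_logits = disc(fake_struct.detach())  # no gradient into the generator
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def generator_loss(denoising_loss, disc, fake_struct, lam=0.1):
    """Keep the standard denoising objective and add an adversarial term
    that rewards generated structures the discriminator judges as real."""
    fake_logits = disc(fake_struct)
    adv = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    return denoising_loss + lam * adv
```

In such a setup the adversarial structural signal complements, rather than replaces, the instance-level denoising objective, since the generated structures would be computed from the model's denoised predictions at sampled timesteps.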

Evaluations

SADM's efficacy is evaluated across 12 image datasets, including ImageNet, where it establishes new state-of-the-art FID scores of 1.58 and 2.11 for class-conditional generation at 256x256 and 512x512 resolution, respectively. The experiments underscore SADM's strength both in capturing intricate data distributions and in enabling rapid cross-domain adaptation, marking a significant step in diffusion model optimization.

Future Developments

Reflecting on the implications of SADM, we anticipate its principles could inspire future enhancements in generative AI, particularly in tasks demanding nuanced distributional comprehension. The adversarial training framework, centered around structural integrity, might find resonance in other modeling challenges beyond image generation, potentially extending to sequential and 3D data representations.

Conclusion

Structure-Guided Adversarial Training of Diffusion Models emerges as a compelling advancement in generative modeling, foregrounding the importance of structural understanding in training diffusion models. By intricately weaving manifold structural awareness into the adversarial training regime, SADM not only achieves state-of-the-art performance in conventional tasks but also hints at the untapped potential of structure-informed methodologies in broader AI research landscapes.