Generative AI in Vision: A Survey on Models, Metrics and Applications (2402.16369v1)

Published 26 Feb 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Generative AI models have revolutionized various fields by enabling the creation of realistic and diverse data samples. Among these models, diffusion models have emerged as a powerful approach for generating high-quality images, text, and audio. This survey paper provides a comprehensive overview of generative AI diffusion and legacy models, focusing on their underlying techniques, their applications across different domains, and the challenges they face. We delve into the theoretical foundations of diffusion models, including concepts such as denoising diffusion probabilistic models (DDPM) and score-based generative modeling. Furthermore, we explore the diverse applications of these models in text-to-image generation, image inpainting, and image super-resolution, among others, showcasing their potential in creative tasks and data augmentation. By synthesizing existing research and highlighting critical advancements in this field, this survey aims to provide researchers and practitioners with a comprehensive understanding of generative AI diffusion and legacy models and inspire future innovations in this exciting area of artificial intelligence.

Generative AI Diffusion Models: Techniques, Applications, and Future Directions

Introduction

Generative models have undergone a significant transformation with the introduction of diffusion models, which offer a versatile framework for creating high-quality data samples across images, text, and audio. Although they originated as a method for denoising images, diffusion models have evolved to tackle a far broader range of creative and data augmentation tasks, thanks to their ability to capture complex data distributions. This survey provides an overview of state-of-the-art (SOTA) techniques in generative AI diffusion models, explores their broad applications, and discusses the challenges that lie ahead.

Evolution of Generative Models in Vision

The journey of generative models in vision began with relatively simple models like Hidden Markov Models (HMMs) and progressed through significant milestones such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), each introducing new capabilities and addressing limitations of previous models. Despite their successes, GANs and VAEs faced challenges relating to training stability, computational efficiency, and the ability to capture highly complex data distributions. The advent of diffusion models, inspired by principles of thermodynamics, marked a pivotal turn in this evolving landscape, offering a novel approach characterized by the gradual denoising of data to generate new samples.

A Deep Dive into Diffusion Models

Denoising Diffusion Probabilistic Models (DDPMs)

DDPMs represent a class of diffusion models that gradually introduce Gaussian noise to data, creating a series of increasingly distorted samples. A reverse process, leveraging a deep neural network, then works to progressively denoise these samples, reconstructing the original data or generating new samples from the same distribution. This denoising process is meticulously controlled through a predefined noise schedule, making DDPMs exceptionally adept at generating realistic and diverse samples.
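
To make this concrete, the following is a minimal sketch of the closed-form forward noising step and the simplified noise-prediction training loss used by DDPMs. It assumes PyTorch and a generic noise-prediction network, here called denoiser (e.g., a U-Net); the schedule values and tensor shapes are illustrative, not taken from the surveyed paper.

import torch
import torch.nn.functional as F

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product of alphas

def q_sample(x0, t, noise):
    # Closed-form forward process:
    # x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)
    abar = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over (B, C, H, W)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise

def ddpm_loss(denoiser, x0):
    # Simplified DDPM objective: predict the injected noise at a random step.
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return F.mse_loss(denoiser(x_t, t), noise)

At sampling time, the trained network is applied iteratively, starting from pure Gaussian noise and traversing the schedule in reverse to produce a clean sample.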

Noise-Conditional Score Models and SDE-Based Generative Models

Building on DDPMs, noise-conditional score models and SDE-based generative models introduce more sophisticated mechanisms for data generation. These models learn the score function, that is, the gradient of the log data density, refine the noise perturbation process, and employ stochastic differential equations (SDEs) to represent the diffusion process in continuous time. This mathematical rigor enhances the models' ability to generate new data samples with remarkable accuracy and efficiency.
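
For reference, the continuous-time view can be summarized in two equations: a forward SDE that perturbs data toward noise, and the classical reverse-time SDE that generates samples once the score is available. This is the standard formulation, not notation taken from the surveyed paper:

\begin{align}
  \mathrm{d}\mathbf{x} &= f(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}
    && \text{(forward diffusion)} \\
  \mathrm{d}\mathbf{x} &= \left[ f(\mathbf{x}, t) - g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right] \mathrm{d}t
    + g(t)\,\mathrm{d}\bar{\mathbf{w}}
    && \text{(reverse-time sampling)}
\end{align}

Here $f$ and $g$ are the drift and diffusion coefficients, $\mathbf{w}$ and $\bar{\mathbf{w}}$ are forward- and reverse-time Wiener processes, and a neural network $s_\theta(\mathbf{x}, t) \approx \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$, trained via score matching, is substituted for the true score during sampling.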

Applications Across Vision Tasks

Generative AI diffusion models have found applications in a wide array of vision tasks, demonstrating their versatility and effectiveness. Notable applications include text-to-image generation, image inpainting, and super-resolution, where these models excel at creating realistic images from textual descriptions, repairing damaged images, and enhancing image resolution, respectively. These applications underscore the potential of diffusion models to transcend traditional limitations and inspire novel solutions to longstanding challenges in the field of computer vision.
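
As an illustration of how accessible these applications have become, the sketch below generates an image from a text prompt with a pretrained latent diffusion model via the open-source Hugging Face diffusers library. The model identifier and prompt are examples chosen for illustration, not artifacts of the surveyed paper.

import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained text-to-image latent diffusion pipeline (example model id).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # iterative denoising is impractically slow on CPU

# Each call runs the full reverse-diffusion loop conditioned on the prompt.
image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")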

Challenges and Future Directions

Despite their promising advances, generative AI diffusion models face several challenges, such as the need for increased training stability, improved scalability, and enhanced interpretability. Addressing these challenges will be crucial for unlocking the models' full potential and fostering further innovation. Additionally, future research could explore applications in time-series forecasting, develop physics-inspired generative models, and address ethical considerations related to bias and societal impact.

Conclusion

The exploration of generative AI diffusion models marks a significant milestone in the evolution of generative models in vision. By offering a comprehensive overview of current techniques, applications, and challenges, this survey aims to inspire continued research and innovation in this exciting field. As the community tackles existing challenges and ventures into uncharted territories, generative AI diffusion models are poised to revolutionize the landscape of artificial intelligence, opening new avenues for creativity and data augmentation.

Authors (2)
  1. Gaurav Raut
  2. Apoorv Singh