
Kandinsky 3.0 Technical Report (2312.03511v3)

Published 6 Dec 2023 in cs.CV, cs.LG, and cs.MM

Abstract: We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion that continues the Kandinsky series and reflects our progress toward higher-quality, more realistic image generation. In this report we describe the model architecture, the data collection procedure, the training technique, and the production system for user interaction. We focus on the key components that, as identified through extensive experimentation, had the most significant impact on the quality of our model relative to others. We also describe extensions and applications of the model, including super-resolution, inpainting, image editing, image-to-video generation, and Kandinsky 3.1, a distilled version of Kandinsky 3.0 that performs inference in 4 steps of the reverse process, running 20 times faster with no decrease in visual quality. In side-by-side human preference comparisons, Kandinsky improves in text understanding and performs better on specific domains. The code is available at https://github.com/ai-forever/Kandinsky-3
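The released checkpoints can also be driven through the Hugging Face diffusers library. Below is a minimal text-to-image sketch, not taken from the report itself: the kandinsky-community/kandinsky-3 model id, the fp16 variant, the prompt, and the step count are all assumptions on top of the published code.

```python
# Minimal text-to-image sketch for Kandinsky 3.0 via Hugging Face diffusers.
# Assumptions (not from the report): the `kandinsky-community/kandinsky-3`
# checkpoint, the fp16 variant, and an available CUDA GPU.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-3",
    variant="fp16",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()  # trades some speed for lower VRAM use

image = pipe(
    "A photograph of a red fox in an autumn forest, golden hour lighting",
    num_inference_steps=25,  # assumed schedule for the base 3.0 model
    guidance_scale=3.0,      # assumed classifier-free guidance weight
).images[0]
image.save("kandinsky3_sample.png")
```

With the distilled Kandinsky 3.1 weights described in the abstract, num_inference_steps would drop to 4; the 25-step schedule here is a typical choice for the base 3.0 model, not a value prescribed by the report.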

Authors (9)
  1. Vladimir Arkhipkin (9 papers)
  2. Andrei Filatov (5 papers)
  3. Viacheslav Vasilev (8 papers)
  4. Anastasia Maltseva (4 papers)
  5. Said Azizov (1 paper)
  6. Igor Pavlov (10 papers)
  7. Julia Agafonova (4 papers)
  8. Andrey Kuznetsov (36 papers)
  9. Denis Dimitrov (27 papers)
Citations (6)
