- The paper presents Kandinsky, a model that combines CLIP text embeddings, an image-prior mapping, and latent diffusion for text-to-image synthesis.
- It achieves a state-of-the-art FID score of 8.03 on the COCO-30K dataset among open-source models, outperforming Stable Diffusion and GLIDE.
- Practical contributions include open-source code release and demo systems for interactive text-guided image editing across various applications.
Kandinsky: An Improved Text-to-Image Synthesis Model
The paper presents Kandinsky, a text-to-image synthesis model that combines a latent diffusion architecture with an image prior. Text-to-image generation has advanced considerably, particularly with the advent of diffusion-based models, which are known for high image quality. Kandinsky combines the image-prior and latent-diffusion approaches in a single model, which is the paper's central contribution.
Model Architecture
At the heart of the Kandinsky model is a pipeline with three primary stages: text encoding, embedding mapping (the image prior), and latent diffusion. For text encoding, the model uses CLIP text embeddings, which a separately trained image prior maps onto CLIP image embeddings; this prior, itself based on a diffusion process, strengthens the semantic alignment between text and images. A modified MoVQ implementation serves as the image autoencoder, and the latent diffusion stage uses a UNet operating in the autoencoder's latent space, enabling efficient inference and high-quality image synthesis. A structural sketch of these stages follows.
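The sketch below illustrates only the data flow between the three stages (text encoding, image prior, latent diffusion plus MoVQ decoding). All module internals, dimensions, step counts, and the toy update rule are illustrative placeholders, not the paper's actual configuration.

```python
# Minimal structural sketch of a Kandinsky-style pipeline in PyTorch.
# Every module below is a stand-in; real components (CLIP, the diffusion prior,
# the UNet, MoVQ) are far larger and trained on image-text data.
import torch
import torch.nn as nn

TEXT_DIM, IMG_EMB_DIM, LATENT_CH = 768, 768, 4   # placeholder sizes

class ImagePrior(nn.Module):
    """Maps a CLIP text embedding to a predicted CLIP image embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(TEXT_DIM, 1024), nn.GELU(),
                                 nn.Linear(1024, IMG_EMB_DIM))
    def forward(self, text_emb):
        return self.net(text_emb)

class TinyUNet(nn.Module):
    """Stand-in for the latent-diffusion UNet, conditioned on the image embedding."""
    def __init__(self):
        super().__init__()
        self.cond_proj = nn.Linear(IMG_EMB_DIM, LATENT_CH)
        self.conv = nn.Conv2d(LATENT_CH, LATENT_CH, 3, padding=1)
    def forward(self, latents, t, image_emb):
        # Timestep t is unused in this toy stand-in.
        cond = self.cond_proj(image_emb)[:, :, None, None]
        return self.conv(latents + cond)           # "predicted noise"

class MoVQDecoder(nn.Module):
    """Stand-in for the MoVQ autoencoder's decoder (latents -> RGB image)."""
    def __init__(self):
        super().__init__()
        self.up = nn.Sequential(nn.Upsample(scale_factor=8),
                                nn.Conv2d(LATENT_CH, 3, 3, padding=1))
    def forward(self, latents):
        return self.up(latents)

def generate(text_emb, steps=10):
    prior, unet, decoder = ImagePrior(), TinyUNet(), MoVQDecoder()
    image_emb = prior(text_emb)                    # stage 2: text -> image embedding
    latents = torch.randn(text_emb.size(0), LATENT_CH, 32, 32)
    for t in reversed(range(steps)):               # stage 3: iterative latent denoising
        noise_pred = unet(latents, t, image_emb)
        latents = latents - noise_pred / steps     # toy update, not a real sampler
    return decoder(latents)                        # decode latents to pixels

if __name__ == "__main__":
    fake_text_emb = torch.randn(1, TEXT_DIM)       # stage 1 (CLIP text encoding) mocked out
    print(generate(fake_text_emb).shape)           # torch.Size([1, 3, 256, 256])
```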
Experimental Evaluation
The Kandinsky model was evaluated extensively on the COCO-30K dataset, achieving a Fréchet Inception Distance (FID) of 8.03 at 256x256 resolution. This score positions Kandinsky as the best-performing open-source model by measured FID, surpassing other well-known models such as Stable Diffusion and GLIDE; the snippet after this paragraph shows how such a score is typically computed.
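For readers unfamiliar with the metric, the following sketch computes FID with the torchmetrics library (installed via `torchmetrics[image]`). The random tensors stand in for real COCO images and generated samples; the paper's exact evaluation protocol (Inception variant, preprocessing, use of 30K captions) may differ in detail.

```python
# Minimal FID computation sketch using torchmetrics; lower scores are better.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Placeholders for batches of real images and generated 256x256 samples,
# as uint8 tensors of shape (N, 3, H, W) in [0, 255].
real_batch = torch.randint(0, 256, (8, 3, 256, 256), dtype=torch.uint8)
fake_batch = torch.randint(0, 256, (8, 3, 256, 256), dtype=torch.uint8)

fid.update(real_batch, real=True)
fid.update(fake_batch, real=False)
print(float(fid.compute()))  # Kandinsky reports 8.03 on COCO-30K at 256x256
```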
The experimental setup included a rigorous ablation study to optimize the image prior design. Different configurations were explored, including linear, residual, and diffusion transformer priors. Interestingly, a simple linear mapping achieved the best FID score, suggesting a potentially linear relationship between the visual and textual embedding spaces (a minimal sketch of this variant follows). Further experiments incorporating latent quantization in the MoVQ autoencoder confirmed its effectiveness in improving image quality.
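The linear-prior variant from the ablation amounts to fitting a single linear map from CLIP text embeddings to CLIP image embeddings. The sketch below shows that idea; the embedding dimensions, learning rate, and random stand-in data are assumptions for illustration, since in practice the map is fit on precomputed CLIP pairs from a large image-text corpus.

```python
# Minimal sketch of a linear image prior: one linear layer trained with MSE
# to translate CLIP text embeddings into CLIP image embeddings.
import torch
import torch.nn as nn

TEXT_DIM = IMG_DIM = 768                      # placeholder CLIP embedding sizes
prior = nn.Linear(TEXT_DIM, IMG_DIM)
optimizer = torch.optim.Adam(prior.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Stand-ins for precomputed (text_embedding, image_embedding) pairs from CLIP.
text_embs = torch.randn(256, TEXT_DIM)
image_embs = torch.randn(256, IMG_DIM)

for step in range(100):                       # toy training loop
    pred = prior(text_embs)                   # predicted image embeddings
    loss = loss_fn(pred, image_embs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```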
Practical Contributions
In addition to theoretical insights, the paper emphasizes practical deployment of the Kandinsky model. The research team released the source code and pre-trained model checkpoints under the Apache 2.0 license, permitting both commercial and non-commercial use. A user-friendly demo system showcases Kandinsky's capabilities, supporting several generative modes, including inpainting, outpainting, and interactive text-guided image editing, on web-based platforms and Telegram bots; a hedged usage sketch is given below.
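The sketch below assumes the community Kandinsky 2.1 checkpoints on the Hugging Face Hub and the Kandinsky pipelines in the diffusers library; the officially released repository exposes its own interface, so the pipeline names, checkpoint IDs, and arguments here are assumptions that may differ from the authors' code.

```python
# Hypothetical text-to-image usage via diffusers (assumes a CUDA GPU is available).
import torch
from diffusers import KandinskyPriorPipeline, KandinskyPipeline

prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16).to("cuda")
decoder = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16).to("cuda")

prompt = "a watercolor painting of a lighthouse at dawn"
prior_out = prior(prompt)                      # maps the prompt to an image embedding
image = decoder(
    prompt,
    image_embeds=prior_out.image_embeds,
    negative_image_embeds=prior_out.negative_image_embeds,
    height=512, width=512, num_inference_steps=50,
).images[0]
image.save("lighthouse.png")
```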
Implications and Future Directions
Kandinsky's contributions have significant implications for computer vision and AI-driven content creation. It offers a robust framework for creating photorealistic images from textual input, with potential applications ranging from 3D object synthesis to video generation and controllable image editing. Looking ahead, future research may focus on newer image encoders, refined UNet architectures, and stronger semantic coherence between input text and generated imagery. Generating higher-resolution images and developing moderation layers or robust classifiers to manage abusive outputs also remain pertinent areas for exploration.
Overall, Kandinsky exhibits substantial improvements in text-to-image synthesis, providing a versatile platform for AI researchers and developers in the field. The model's open-source nature ensures broad accessibility, fostering ongoing innovation and application development in the text-to-image synthesis domain.