Kandinsky 3: Text-to-Image Synthesis for Multifunctional Generative Framework
The paper presents Kandinsky 3, a text-to-image (T2I) model built on latent diffusion, whose adaptability and efficiency across generation tasks make it a significant contribution to generative modeling. Developed by researchers from Sber AI and associated institutions, Kandinsky 3 stands out for its multifunctional generative capabilities and a streamlined design that supports applications such as inpainting, outpainting, image fusion, and video generation.
At the core of Kandinsky 3 is a latent diffusion architecture that supports not only text-to-image synthesis but also text-to-video (T2V) and image-to-video (I2V) generation. This adaptability, achieved with a deliberately simple and efficient design, marks a clear progression from its predecessors, including Kandinsky 2.
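Concretely, latent diffusion generation follows a simple loop: encode the prompt, iteratively denoise random latents, and decode the result to pixels. Below is a minimal sketch of that loop, assuming diffusers-style scheduler semantics; all module handles, shapes, and step counts are illustrative rather than the authors' code.

```python
import torch

@torch.no_grad()
def sample_t2i(prompt, text_encoder, unet, movq, scheduler,
               steps=50, latent_shape=(1, 4, 64, 64)):
    cond = text_encoder(prompt)              # text embeddings conditioning the U-Net
    latents = torch.randn(latent_shape)      # start from Gaussian noise in latent space
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:            # iteratively remove predicted noise
        noise_pred = unet(latents, t, cond)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return movq.decode(latents)              # decode latents back to pixel space
```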
Architectural Insights
Kandinsky 3 combines three modules: a text encoder, a diffusion U-Net, and an image decoder. The encoder of Flan-UL2, with 8.6 billion parameters, serves as the text encoder, giving the model strong language understanding. The image decoder comes from Sber-MoVQGAN, while the U-Net borrows convolutional blocks from BigGAN-deep for improved feature extraction. Importantly, the text encoder and image decoder remain frozen during U-Net training, reflecting a modular training strategy.
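The freezing strategy is easy to express in code. The sketch below shows the conceptual wiring and the standard noise-prediction training objective, assuming a diffusers-style scheduler; the class and its interfaces are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Kandinsky3LikeTrainer(nn.Module):
    """Conceptual wiring of the three modules; only the U-Net receives gradients."""
    def __init__(self, text_encoder, unet, movq_decoder):
        super().__init__()
        self.text_encoder = text_encoder.requires_grad_(False).eval()  # frozen Flan-UL2 encoder
        self.movq_decoder = movq_decoder.requires_grad_(False).eval()  # frozen Sber-MoVQGAN decoder
        self.unet = unet                                               # the only trainable module

    def training_loss(self, latents, prompt_tokens, scheduler):
        cond = self.text_encoder(prompt_tokens)
        noise = torch.randn_like(latents)
        t = torch.randint(0, scheduler.config.num_train_timesteps,
                          (latents.size(0),), device=latents.device)
        noisy = scheduler.add_noise(latents, noise, t)   # forward diffusion
        pred = self.unet(noisy, t, cond)                 # U-Net predicts the added noise
        return F.mse_loss(pred, noise)                   # standard epsilon-prediction objective
```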
The paper also presents an approach to diffusion model acceleration via a distilled version that triples inference speed while maintaining image quality. Text comprehension does see some compromise, a trade-off partially mitigated by using the distilled model as a refiner, which the authors report yields state-of-the-art results.
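One plausible reading of the refiner setup is a two-stage sampler: the full U-Net handles most denoising steps, and the distilled U-Net takes over only for the final ones, where its speed advantage costs little quality. The split point and interfaces below are assumptions for illustration.

```python
import torch

@torch.no_grad()
def generate_with_refiner(cond, base_unet, distilled_unet, scheduler, movq,
                          total_steps=50, refine_steps=4, latent_shape=(1, 4, 64, 64)):
    latents = torch.randn(latent_shape)
    scheduler.set_timesteps(total_steps)
    for i, t in enumerate(scheduler.timesteps):
        # Hand the last few steps to the ~3x faster distilled network.
        unet = distilled_unet if i >= total_steps - refine_steps else base_unet
        noise_pred = unet(latents, t, cond)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return movq.decode(latents)
```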
Multifunctional Generative Framework
Kandinsky 3's extension into image editing and video generation demonstrates the model's versatility. The paper explores several extensions:
- Inpainting and Outpainting: These tasks are handled by modifying the Kandinsky 3 model to accept additional input channels for the image and mask, trained on a comprehensive dataset with varied generated masks (a sketch of the channel modification follows this list).
- Image Editing and Fusion: Techniques such as image fusion and style transfer are integrated via adapter frameworks such as IP-Adapter and ControlNet, enabling a broad range of image manipulations.
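The channel modification for inpainting can be sketched as follows. The paper states only that extra channels are added for the image and mask; the channel counts and projection width below are assumptions for a 4-channel latent space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_C = 4
IN_CHANNELS = LATENT_C + LATENT_C + 1   # noisy latents + masked-image latents + binary mask

# Widened input convolution replacing the original 4-channel one (width assumed).
first_conv = nn.Conv2d(IN_CHANNELS, 320, kernel_size=3, padding=1)

def inpaint_unet_input(noisy_latents, masked_image_latents, mask):
    """mask is 1 where content should be regenerated, 0 where pixels are kept."""
    mask = F.interpolate(mask, size=noisy_latents.shape[-2:])  # match latent resolution
    return torch.cat([noisy_latents, masked_image_latents, mask], dim=1)
```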
Animation and T2V capabilities build on established pipelines such as Deforum, which control the temporal dynamics of the generated content. The T2V model uses Kandinsky 3 as its backbone, delivering high-resolution video synthesis on top of the stable base architecture.
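A common way to extend a T2I U-Net to video, consistent with this design, is to interleave temporal layers with the pretrained spatial blocks so that frames can exchange information. The module below is an illustrative temporal self-attention layer, not the released architecture; dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Attends across frames independently at each spatial location."""
    def __init__(self, channels: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):  # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        normed = self.norm(seq)
        out, _ = self.attn(normed, normed, normed)
        out = seq + out                                   # residual connection
        return out.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
```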
Evaluation and Open Access Deployment
The authors prioritize human evaluation over conventional metrics such as FID, arguing that automated scores miss nuances of perceived quality. Side-by-side assessments place Kandinsky 3 competitively against established models such as Midjourney and DALL-E 3, particularly in visual fidelity and alignment with textual descriptions.
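Such side-by-side studies typically reduce to tallying pairwise wins per model. The toy tally below illustrates the bookkeeping; the vote data is fabricated for demonstration and bears no relation to the paper's results.

```python
from collections import Counter

def win_rates(votes):
    """votes: one entry per pairwise comparison, naming the preferred model."""
    counts = Counter(votes)
    total = sum(counts.values())
    return {model: wins / total for model, wins in counts.items()}

# Illustrative input only, not the paper's numbers:
print(win_rates(["model_a", "model_b", "model_a", "model_a", "model_b"]))
```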
In a commendable move for transparency and collaborative development, Kandinsky 3 is released with open-source code and public checkpoints, reinforcing the model's utility for both academic and practical work on generative models. Access extends to user-friendly interfaces such as the FusionBrain website and a Telegram bot, broadening the model's reach beyond the research community.
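The public checkpoints are also usable through Hugging Face diffusers; the snippet below loads the community-hosted release. The model id and half-precision settings reflect the Hub listing and may differ from the authors' own repository.

```python
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
).to("cuda")

image = pipe("a red cat sitting on a windowsill, watercolor").images[0]
image.save("kandinsky3_sample.png")
```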
Implications and Future Directions
Kandinsky 3 exemplifies contemporary priorities in T2I modeling: adaptability, user accessibility, and inference speed. The model's comprehensive deployment strategy paves the way for scalable applications in creative industries, digital art, and multimedia content creation.
From a theoretical perspective, Kandinsky 3 demonstrates the viability of integrating multiple generative tasks within a unified model architecture, potentially influencing future research on more compact, generalizable models across input and output domains. Its distillation results may likewise serve as a foundation for further efficiency-focused work on large model architectures.
In conclusion, Kandinsky 3's release signifies an important milestone in open-source generative AI development, presenting a balance of technical sophistication and practical applicability while inviting further exploration and expansion by the research community.