Kandinsky 5.0: Open-Source Generative Models
- Kandinsky 5.0 is a suite of large-scale, open-source models for high-resolution text-to-image, image editing, and text-to-video generation using a unified Cross-Attention Diffusion Transformer.
- The framework features three key variants—Image Lite, Video Lite, and Video Pro—optimized across parameter regimes from 2B to 19B with innovations like LTF and NABLA for enhanced performance.
- It leverages extensive, curated multi-modal datasets and a multi-stage training pipeline incorporating flow matching, fine-tuning, and distillation to achieve scalable, high-quality generative outputs.
Kandinsky 5.0 is a family of large-scale, open-source foundation models for high-resolution image and video generation, comprising state-of-the-art text-to-image, in-context image editing, and text-to-video/image-to-video models. The framework features three principal model line-ups—Image Lite, Video Lite, and Video Pro—spanning a parameter regime from 2 billion to 19 billion. Kandinsky 5.0 leverages a unified Cross-Attention Diffusion Transformer (CrossDiT) backbone, an extensive, highly curated multi-modal dataset pipeline, innovative training techniques including flow matching, advanced supervised and reinforcement learning-based post-training, and extensive optimizations for scalability and throughput. All code, checkpoints, and recipes are freely available under an open-source license, targeting researchers and practitioners seeking extensible, high-quality generative capabilities (Arkhipkin et al., 19 Nov 2025).
1. Model Line-Ups and Core Architecture
Kandinsky 5.0 introduces three main model variants, each optimized for specific generative domains:
| Model Variant | Parameter Count | Primary Tasks | Max Resolution | Distinctive Features |
|---|---|---|---|---|
| Image Lite | 6B | Text-to-image, in-context image editing | 1408 px | High resolution, image editing |
| Video Lite | 2B | Text-to-video, image-to-video (≤10s, ≤768 px) | 768 px | Fast, lightweight, video focus |
| Video Pro | 19B | High-quality text-to-video and image-to-video (≤10s, HD) | 1408 px | Superior video generation |
All models share the same architectural backbone: a latent diffusion model whose denoising core is the Cross-Attention Diffusion Transformer (CrossDiT). Key CrossDiT modules include interleaved self-attention, cross-attention with text, MLP blocks with GeLU activations, residual connections after each sub-block, and adaptive normalization layers. Dual text encoders are integrated: Qwen2.5-VL (7B) provides dense text representations (dimension 3584, context 256), while CLIP ViT-L/14 (dimension 768, context 77) feeds the adaptive normalization path. Images are encoded with the FLUX.1-dev VAE, while videos use the HunyuanVideo VAE for temporal consistency.
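The block structure can be illustrated with a minimal PyTorch sketch. The module below is a simplified stand-in, assuming standard multi-head attention and a single adaLN-style modulation vector (scale, shift, gate per sub-block) predicted from the pooled CLIP-plus-timestep conditioning; the class name, signatures, positional-encoding details, and modulation granularity are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class CrossDiTBlock(nn.Module):
    """Simplified CrossDiT block: self-attention -> cross-attention (text) -> GeLU MLP,
    each with a residual connection and adaLN-style modulation (illustration only)."""
    def __init__(self, dim: int, n_heads: int, ffn_dim: int, cond_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        # adaLN: scale/shift/gate for the three sub-blocks, predicted from the
        # pooled conditioning vector (CLIP embedding + timestep embedding).
        self.ada = nn.Linear(cond_dim, 9 * dim)

    def forward(self, x, text_tokens, cond):
        # x:           (B, N, dim)  latent visual tokens
        # text_tokens: (B, T, dim)  text embeddings projected to the model width
        # cond:        (B, cond_dim) pooled CLIP + time embedding
        s1, b1, g1, s2, b2, g2, s3, b3, g3 = self.ada(cond).chunk(9, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.cross_attn(h, text_tokens, text_tokens, need_weights=False)[0]
        h = self.norm3(x) * (1 + s3.unsqueeze(1)) + b3.unsqueeze(1)
        return x + g3.unsqueeze(1) * self.mlp(h)
```

Reading the table below with "Model Emb" as the hidden size, "Linear Dim" as the MLP width, and "Time Emb" as the conditioning width, a Video Lite block would correspond roughly to CrossDiTBlock(dim=1792, ffn_dim=7168, cond_dim=512); the head count is not given in this summary.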
Variant-specific architectural hyperparameters are detailed as follows:
| Model | #CrossDiT Blocks | #LTF Blocks | Linear Dim | Model Emb | Time Emb |
|---|---|---|---|---|---|
| Image Lite | 50 | 2 | 10,240 | 2560 | 512 |
| Video Lite | 32 | 2 | 7168 | 1792 | 512 |
| Video Pro | 60 | 4 | 16,384 | 4096 | 1024 |
Notable architectural innovations unique to Kandinsky 5.0 include the Linguistic Token Refiner (LTF) for denoising text embeddings pre-fusion, and Neighborhood Adaptive Block-Level Attention (NABLA), a dynamic, block-sparse attention mechanism optimized for video.
2. Data Curation Lifecycle
Kandinsky 5.0’s data pipeline is characterized by massive scale, heterogeneity, and rigorous filtering to support robust multi-modal generation.
- Kandinsky T2I (text-to-image): 500 million images (LAION/COYO/web, min side ≥256 px).
- Kandinsky T2V (text-to-video): 250 million video scenes, 2–60 seconds, various aspect ratios.
- Kandinsky I2I (image editing instruction): ~150 million image pairs with edit descriptions.
- Kandinsky RCC (Russian Cultural Code): 229k videos, 768k images, manually curated with bilingual captions.
- Supervised Fine-Tuning (SFT) datasets: v1 (strict): 2,833 videos / 45k images; v2 (relaxed): 12,461 videos / 153k images.
Processing pipelines are domain-specific: images are filtered for resolution, deduplicated via perceptual pHash, cleansed of watermarks (ResNeXt101/YOLO), scored for quality (TOPIQ/Q-Align), text presence (CRAFT), and complexity (SAM 2/Sobel). Captioning uses InternVL2-26B, InternLM3-8B, and Qwen2.5VL-32B, with post-processing for text cleanliness. Video datasets undergo scene segmentation (PySceneDetect), deduplication, multi-stage technical and aesthetic quality assessment, object/scene tagging, synthetic captioning (Tarsier2-7B), and cluster-based sampling (InternVideo2-1B embeddings, 10k-cluster k-means). Instruction datasets apply similarity-based deduplication (CLIP, DINO, face), geometric verification (LoFTR + RANSAC), domain-specific exclusion heuristics, and instruction generation via fine-tuned GLM 4.5.
SFT data are domain- and subdomain-clustered with Qwen2.5-VL-Instruct-32B and CLIP, supporting granular supervised adaptation.
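To illustrate the perceptual-hash deduplication step mentioned above, the sketch below drops near-duplicate images by pHash Hamming distance. It relies on the third-party imagehash package and a hypothetical distance threshold, neither of which is specified in the report, so it is a stand-in for the production pipeline rather than a reproduction of it.

```python
from pathlib import Path
from PIL import Image
import imagehash  # third-party perceptual-hashing package (assumption: any pHash implementation works)

def dedup_by_phash(image_dir: str, max_hamming: int = 4) -> list[Path]:
    """Keep one representative per perceptual-hash cluster (greedy, O(n * kept))."""
    kept_hashes, kept_paths = [], []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))  # 64-bit DCT-based perceptual hash
        # '-' between two imagehash values is the Hamming distance
        if all(h - other > max_hamming for other in kept_hashes):
            kept_hashes.append(h)
            kept_paths.append(path)
    return kept_paths
```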
3. Multi-Stage Training Methodology
Kandinsky 5.0 employs a multi-regime, multi-stage training pipeline:
- Pre-training Regimes: Four settings—text-to-image (T2I), instruct editing, text-to-video (T2V), image-to-video (I2V)—with diffusion in latent space trained via a flow matching objective:
$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\left\| v_\theta(x_t, t, c) - (\epsilon - x_0) \right\|^2, \qquad x_t = (1-t)\,x_0 + t\,\epsilon,$$

where $x_0$ are the data latents, $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise, $v_\theta$ is the velocity predictor conditioned on the text embedding $c$, and $t \in [0,1]$ is the (discretized) diffusion time. A minimal code sketch of this objective follows the list below.
- Fine-Tuning Regimes:
- Image SFT: 153k images (EN/RU captions), organized into 9 domains with 2–9 subdomains each. Fine-tuning is performed per subdomain, and a "model-soup" stage combines the per-subdomain checkpoints via weighted averaging.
- Video SFT: 2.8k videos + 45k images per domain; two approaches are compared, standard fine-tuning and model-soup averaging, with the latter giving better stability and quality.
- Distillation: few-step "Flash" variants are distilled from the base models, cutting sampling from 100 to 16 network function evaluations (NFEs); see the efficiency table in Section 4.
- RL-based Post-training (Image Lite):
- Reward model: Qwen2.5-VL-7B, which outputs the probability of a "Yes" answer for (generated or real image, prompt) pairs.
- DRaFT-K fine-tuning: the loss combines a reward-maximization term with a KL regularizer toward the pre-trained model, backpropagating the reward gradient only through the last K denoising steps.
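The pre-training objective can be expressed compactly in code. The sketch below implements one flow-matching loss evaluation under the linear-interpolation parameterization written above; `model` is a stand-in for the CrossDiT velocity predictor and its call signature is an assumption, not the released API.

```python
import torch

def flow_matching_loss(model, x0, text_emb, pooled_cond):
    """One flow-matching training step (sketch): linear path from data latents x0 to noise."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)                  # t ~ U[0, 1]
    noise = torch.randn_like(x0)
    t_ = t.view(b, *([1] * (x0.dim() - 1)))              # broadcast t over latent dims
    x_t = (1.0 - t_) * x0 + t_ * noise                   # interpolated latent
    target_v = noise - x0                                # constant velocity along the linear path
    pred_v = model(x_t, t, text_emb, pooled_cond)        # CrossDiT velocity prediction (assumed signature)
    return torch.mean((pred_v - target_v) ** 2)
```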
4. Training and Inference Optimization Strategies
Kandinsky 5.0 incorporates pragmatic and analytical training and inference optimizations for both throughput and resource efficiency.
- Training Optimizations:
- Pre-encoded VAE latents are packed in .tar archives, streamed via a 100 Gbps link, dynamically batched by aspect ratio and time budget.
- Distributed training uses PyTorch FSDP/HSDP with sequence parallelism; text encoder on 32 GPUs, transformer on 64, leveraging NVLink interconnect and async checkpoints.
- Activation checkpointing and host offloading reduce peak activation memory by 40%.
- Analytical models provide a priori estimates of training step time and memory footprint (a rough illustrative estimator follows this list).
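The report's analytical expressions are not reproduced in this summary. As a stand-in, the sketch below applies the common 6 x parameters x tokens FLOPs rule and a simple FSDP-sharded state budget to produce the same kind of a-priori step-time and memory estimate; all constants (peak TFLOPS, MFU, bytes per parameter state) and the example numbers are assumptions, not values from the paper.

```python
def estimate_step(params_b: float, tokens_per_step: float, n_gpus: int,
                  peak_tflops: float = 989.0, mfu: float = 0.40,
                  bytes_per_param_state: int = 16) -> dict:
    """Rough a-priori training-step estimates for a dense transformer.

    params_b               model size in billions of parameters
    tokens_per_step        total latent tokens processed per optimizer step (global batch)
    peak_tflops            per-GPU peak (BF16 dense on H100 is ~989 TFLOPS); mfu = assumed utilization
    bytes_per_param_state  bf16 weights + fp32 grads/optimizer states, sharded by FSDP
    Ignores attention's quadratic term and activation memory; those need separate terms.
    """
    params = params_b * 1e9
    train_flops = 6.0 * params * tokens_per_step                     # forward + backward rule of thumb
    step_time_s = train_flops / (n_gpus * peak_tflops * 1e12 * mfu)
    sharded_state_gb = params * bytes_per_param_state / n_gpus / 1e9
    return {"step_time_s": step_time_s, "sharded_state_gb_per_gpu": sharded_state_gb}

# Illustrative only: a 19B model on 64 GPUs with ~2M latent tokens per step
print(estimate_step(params_b=19, tokens_per_step=2e6, n_gpus=64))
```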
- Inference Optimizations:
- VAE encoder tiling combined with torch.compile yields a further speedup.
- CrossDiT: torch.compile, fused kernels, MagCache (+46% throughput), Flash/Sage attention (≤5s clips), and NABLA for long/HD video.
- NABLA: blockwise Q/K pooling (block size 64), per-head CDF-based sparsity thresholds, union with sliding-tile patterns and fractal reordering; enables a large attention speedup at ~90% sparsity with negligible perceptual loss (a mask-construction sketch follows this list).
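A simplified, single-head sketch of the NABLA mask construction is shown below. The block size and CDF-style thresholding follow the description above, while the sliding-tile union, fractal reordering, and the fused block-sparse kernels are omitted, so this is an illustration under assumed tensor shapes rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def nabla_block_mask(q, k, block: int = 64, keep_mass: float = 0.9):
    """Block-level attention mask via pooled Q/K scores and a per-row CDF threshold.

    q, k: (seq, dim) single-head tensors; seq is assumed to be a multiple of `block`.
    Returns a (seq // block, seq // block) boolean mask of retained key blocks.
    """
    nb = q.shape[0] // block
    q_blk = q.view(nb, block, -1).mean(dim=1)            # blockwise pooling of queries
    k_blk = k.view(nb, block, -1).mean(dim=1)            # blockwise pooling of keys
    scores = F.softmax(q_blk @ k_blk.T / q_blk.shape[-1] ** 0.5, dim=-1)
    # Per query-block CDF threshold: keep the smallest set of key blocks whose
    # probability mass reaches keep_mass; on long video this concentrates the
    # mass on few blocks, giving high block sparsity.
    sorted_p, order = scores.sort(dim=-1, descending=True)
    cdf = sorted_p.cumsum(dim=-1)
    keep_sorted = cdf - sorted_p < keep_mass             # include the block that crosses the threshold
    mask = torch.zeros_like(keep_sorted)
    mask.scatter_(dim=-1, index=order, src=keep_sorted)
    return mask  # pass to a block-sparse attention kernel, e.g. unioned with a sliding-tile pattern
```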
Empirical performance on a single 80GB H100 GPU:
| Model | Frames | Resolution | NFEs | Time (s) | Mem (GB) |
|---|---|---|---|---|---|
| Video Lite 5s | 121 | 512×768 | 100 | 139 | 21 |
| Video Lite Flash | 121 | 512×768 | 16 | 35 | 21 |
| Video Pro 10s | 241 | 768×1280 | 100 | 3218* | 68 |
| Video Pro Flash | 241 | 768×1280 | 16 | 576* | 68 |
| Image Lite | 1 | 1024×1024 | 100 | 13 | 17 |
*With activation offloading.
5. Empirical Performance and Evaluation
Evaluation employs both human side-by-side (SBS) assessments and quantitative metrics:
- Human SBS: More than 20 annotators (5-way overlap) assessed prompt following (entity/action count, property/placement) and visual quality (composition, lighting, artifact rate, realism, motion coherence) on the Elementary.center platform (a toy aggregation sketch follows this list).
- Comparative Results:
- Video Lite vs. Sora (OpenAI) on MovieGen: exceeds Sora on motion, artifact reduction, and overall quality in ≥60% of 65k judgments; prompt following is at parity.
- Video Lite vs. Wan 2.2 5B/14B: outperforms on visual quality and motion; Wan leads on prompt granularity.
- Video Lite vs. Kandinsky 4.1: 59% motion/visual preference, prompt adherence parity.
- Video Pro vs. Veo 3/Fast: Veo 3 leads in prompt following; K5.0 Pro in visual quality/motion coherence.
- Quantitative Metrics: automatic metrics complement the SBS studies; the specific scores are reported in (Arkhipkin et al., 19 Nov 2025).
- Generation Efficiency: Video Lite Flash generates a 5 s clip in ~35 s on an H100; Video Pro Flash generates a 10 s, 768×1280 clip in ~576 s (with activation offloading).
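To make the aggregation of 5-way-overlap SBS judgments concrete, the toy sketch below computes a majority-vote win rate with a normal-approximation confidence interval; the tie handling and interval choice are assumptions, not the protocol used on Elementary.center.

```python
import math
from collections import Counter

def sbs_win_rate(judgments: list[list[str]], z: float = 1.96):
    """judgments: one inner list per prompt pair, e.g. 5 labels in {"A", "B", "tie"}.

    Returns model A's majority-vote win rate over decided pairs and a ~95% interval.
    """
    wins = losses = 0
    for labels in judgments:
        counts = Counter(labels)
        if counts["A"] > counts["B"]:
            wins += 1
        elif counts["B"] > counts["A"]:
            losses += 1                                   # tied pairs are excluded from the rate
    n = wins + losses
    if n == 0:
        return float("nan"), (float("nan"), float("nan"))
    p = wins / n
    half_width = z * math.sqrt(p * (1 - p) / n)           # normal-approximation interval
    return p, (p - half_width, p + half_width)

# Toy data: two prompt pairs, 5 annotators each
print(sbs_win_rate([["A", "A", "B", "A", "tie"], ["B", "A", "A", "A", "A"]]))
```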
6. Open-Source Availability and Model Extensibility
All code, training checkpoints, and pipeline integrations are freely accessible:
- Repositories: GitHub (https://github.com/kandinskylab/kandinsky-5), HuggingFace (https://huggingface.co/kandinskylab).
- License: MIT.
- Diffusers Integration: Direct compatibility with the diffusers Python package:
```python
from diffusers import KandinskyV5Img, KandinskyV5Video
model = KandinskyV5Img.from_pretrained("kandinskylab/kandinsky-5-image-lite")
pipe = model.pipeline()
image = pipe("A golden retriever puppy in a meadow at sunrise", guidance_scale=7.5).images[0]
```
- Customization and Adaptation: Recipes exist for domain-specific SFT (e.g., medical, cartoon, architectural), tuning the NABLA sparsity threshold for compute-quality trade-offs, and extension to multi-modal tasks (captioning, audio generation). The suite supports "foundation" model construction by weight-averaging across line-ups (a minimal averaging sketch appears after this list).
- Prospective Directions: Planned updates include longer context (>1024 tokens) text encoders for improved prompt alignment, unification toward a cross-modal T2I/T2V/I2I/I2V/V2A backbone, consumer-level real-time inference via further distillation/quantization, and curriculum-based pretraining to improve rare category generalization and mitigate dataset biases.
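As a minimal illustration of the weight-averaging ("model-soup") recipe used for SFT checkpoints and cross-line-up foundation construction, the sketch below averages compatible PyTorch state dicts; the checkpoint names and uniform weighting are placeholders rather than the released recipe.

```python
import torch

def model_soup(state_dict_paths: list[str], weights: list[float] | None = None) -> dict:
    """Weighted average of compatible checkpoints (same architecture and parameter keys)."""
    if weights is None:
        weights = [1.0 / len(state_dict_paths)] * len(state_dict_paths)  # uniform soup
    soup = None
    for path, w in zip(state_dict_paths, weights):
        sd = torch.load(path, map_location="cpu")
        if soup is None:
            soup = {key: w * value.float() for key, value in sd.items()}
        else:
            for key, value in sd.items():
                soup[key] += w * value.float()
    return soup

# Hypothetical per-subdomain SFT checkpoints merged into a single model
soup = model_soup(["sft_portraits.pt", "sft_landscapes.pt", "sft_animals.pt"])
# torch.save(soup, "image_lite_soup.pt")
```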
Kandinsky 5.0 represents an integrative advance in large-scale generative modeling, combining flexible transformer-diffusion architectures, robust multi-stage data pipelines, sparse attention for video, and advanced fine-tuning/distillation strategies as an open foundation for future research and application (Arkhipkin et al., 19 Nov 2025).