Kandinsky 5.0: Open-Source Generative Models
- Kandinsky 5.0 is a suite of large-scale, open-source models for high-resolution text-to-image, image editing, and text-to-video generation using a unified Cross-Attention Diffusion Transformer.
- The framework features three key variants—Image Lite, Video Lite, and Video Pro—optimized across parameter regimes from 2B to 19B with innovations like LTF and NABLA for enhanced performance.
- It leverages extensive, curated multi-modal datasets and a multi-stage training pipeline incorporating flow matching, fine-tuning, and distillation to achieve scalable, high-quality generative outputs.
Kandinsky 5.0 is a family of large-scale, open-source foundation models for high-resolution image and video generation, comprising state-of-the-art text-to-image, in-context image editing, and text-to-video/image-to-video models. The framework features three principal model line-ups—Image Lite, Video Lite, and Video Pro—spanning a parameter regime from 2 billion to 19 billion. Kandinsky 5.0 leverages a unified Cross-Attention Diffusion Transformer (CrossDiT) backbone, an extensive, highly curated multi-modal dataset pipeline, innovative training techniques including flow matching, advanced supervised and reinforcement learning-based post-training, and extensive optimizations for scalability and throughput. All code, checkpoints, and recipes are freely available under an open-source license, targeting researchers and practitioners seeking extensible, high-quality generative capabilities (Arkhipkin et al., 19 Nov 2025).
1. Model Line-Ups and Core Architecture
Kandinsky 5.0 introduces three main model variants, each optimized for specific generative domains:
| Model Variant | Parameter Count | Primary Tasks | Max Resolution | Distinctive Features |
|---|---|---|---|---|
| Image Lite | 6B | Text-to-image, in-context image editing | 1408 px | High resolution, image editing |
| Video Lite | 2B | Text-to-video, image-to-video (≤10s, ≤768 px) | 768 px | Fast, lightweight, video focus |
| Video Pro | 19B | High-quality text-to-video and image-to-video (≤10s, HD) | 1408 px | Superior video generation |
All models share the same architectural backbone: a latent diffusion model whose denoising core is the Cross-Attention Diffusion Transformer (CrossDiT). Key CrossDiT modules include interleaved self-attention, cross-attention with text, MLP blocks with GeLU activations, residual connections after each sub-block, and adaptive normalization layers. Dual text encoders are integrated: Qwen2.5-VL (7B) provides dense text representations (dimension 3584, context 256), while CLIP ViT-L/14 (dimension 768, context 77) feeds the adaptive normalization path. Images are encoded with the FLUX.1-dev VAE, while videos use the HunyuanVideo VAE for temporal consistency.
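The block structure can be illustrated with a minimal PyTorch sketch. The module below is a simplified stand-in, assuming standard multi-head attention and a single adaLN-style modulation vector (scale, shift, gate per sub-block) predicted from the pooled CLIP-plus-timestep conditioning; the class name, signatures, positional-encoding details, and modulation granularity are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class CrossDiTBlock(nn.Module):
    """Simplified CrossDiT block: self-attention -> cross-attention (text) -> GeLU MLP,
    each with a residual connection and adaLN-style modulation (illustration only)."""
    def __init__(self, dim: int, n_heads: int, ffn_dim: int, cond_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        # adaLN: scale/shift/gate for the three sub-blocks, predicted from the
        # pooled conditioning vector (CLIP embedding + timestep embedding).
        self.ada = nn.Linear(cond_dim, 9 * dim)

    def forward(self, x, text_tokens, cond):
        # x:           (B, N, dim)  latent visual tokens
        # text_tokens: (B, T, dim)  text embeddings projected to the model width
        # cond:        (B, cond_dim) pooled CLIP + time embedding
        s1, b1, g1, s2, b2, g2, s3, b3, g3 = self.ada(cond).chunk(9, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.cross_attn(h, text_tokens, text_tokens, need_weights=False)[0]
        h = self.norm3(x) * (1 + s3.unsqueeze(1)) + b3.unsqueeze(1)
        return x + g3.unsqueeze(1) * self.mlp(h)
```

Reading the table below with "Model Emb" as the hidden size, "Linear Dim" as the MLP width, and "Time Emb" as the conditioning width, a Video Lite block would correspond roughly to CrossDiTBlock(dim=1792, ffn_dim=7168, cond_dim=512); the head count is not given in this summary.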
Variant-specific architectural hyperparameters are detailed as follows:
| Model | #CrossDiT Blocks | #LTF Blocks | Linear Dim | Model Emb | Time Emb |
|---|---|---|---|---|---|
| Image Lite | 50 | 2 | 10,240 | 2560 | 512 |
| Video Lite | 32 | 2 | 7168 | 1792 | 512 |
| Video Pro | 60 | 4 | 16,384 | 4096 | 1024 |
Notable architectural innovations unique to Kandinsky 5.0 include the Linguistic Token Refiner (LTF) for denoising text embeddings pre-fusion, and Neighborhood Adaptive Block-Level Attention (NABLA), a dynamic, block-sparse attention mechanism optimized for video.
2. Data Curation Lifecycle
Kandinsky 5.0’s data pipeline is characterized by massive scale, heterogeneity, and rigorous filtering to support robust multi-modal generation.
- Kandinsky T2I (text-to-image): 500 million images (LAION/COYO/web, min side ≥256 px).
- Kandinsky T2V (text-to-video): 250 million video scenes, 2–60 seconds, various aspect ratios.
- Kandinsky I2I (image editing instruction): ~150 million image pairs with edit descriptions.
- Kandinsky RCC (Russian Cultural Code): 229k videos, 768k images, manually curated with bilingual captions.
- Supervised Fine-Tuning (SFT) datasets: v1 (strict): 2,833 videos / 45k images; v2 (relaxed): 12,461 videos / 153k images.
Processing pipelines are domain-specific: images are filtered for resolution, deduplicated via perceptual pHash, cleansed of watermarks (ResNeXt101/YOLO), scored for quality (TOPIQ/Q-Align), text presence (CRAFT), and complexity (SAM 2/Sobel). Captioning uses InternVL2-26B, InternLM3-8B, and Qwen2.5VL-32B, with post-processing for text cleanliness. Video datasets undergo scene segmentation (PySceneDetect), deduplication, multi-stage technical and aesthetic quality assessment, object/scene tagging, synthetic captioning (Tarsier2-7B), and cluster-based sampling (InternVideo2-1B embeddings, 10k-cluster k-means). Instruction datasets apply similarity-based deduplication (CLIP, DINO, face), geometric verification (LoFTR + RANSAC), domain-specific exclusion heuristics, and instruction generation via fine-tuned GLM 4.5.
SFT data are domain- and subdomain-clustered with Qwen2.5-VL-Instruct-32B and CLIP, supporting granular supervised adaptation.
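To illustrate the perceptual-hash deduplication step mentioned above, the sketch below drops near-duplicate images by pHash Hamming distance. It relies on the third-party imagehash package and a hypothetical distance threshold, neither of which is specified in the report, so it is a stand-in for the production pipeline rather than a reproduction of it.

```python
from pathlib import Path
from PIL import Image
import imagehash  # third-party perceptual-hashing package (assumption: any pHash implementation works)

def dedup_by_phash(image_dir: str, max_hamming: int = 4) -> list[Path]:
    """Keep one representative per perceptual-hash cluster (greedy, O(n * kept))."""
    kept_hashes, kept_paths = [], []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))  # 64-bit DCT-based perceptual hash
        # '-' between two imagehash values is the Hamming distance
        if all(h - other > max_hamming for other in kept_hashes):
            kept_hashes.append(h)
            kept_paths.append(path)
    return kept_paths
```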
3. Multi-Stage Training Methodology
Kandinsky 5.0 employs a multi-regime, multi-stage training pipeline:
- Pre-training Regimes: Four settings—text-to-image (T2I), instruct editing, text-to-video (T2V), image-to-video (I2V)—with diffusion in latent space trained via a flow matching objective:
$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\left\| v_\theta(x_t, t, c) - (\epsilon - x_0) \right\|^2, \qquad x_t = (1-t)\,x_0 + t\,\epsilon,$$

where $x_0$ are the data latents, $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise, $v_\theta$ is the velocity predictor conditioned on the text embedding $c$, and $t \in [0,1]$ is the (discretized) diffusion time. A minimal code sketch of this objective follows the list below.
- Fine-Tuning Regimes:
- Image SFT: 153k images (EN/RU captions), organized into 9 domains with 2–9 subdomains each. Fine-tuning is performed per subdomain, and a "model-soup" stage combines the per-subdomain checkpoints via weighted averaging.
- Video SFT: 2.8k videos + 45k images per domain; two approaches are compared, standard fine-tuning and model-soup averaging, with the latter giving better stability and quality.
- Distillation: few-step "Flash" variants are distilled from the base models, cutting sampling from 100 to 16 network function evaluations (NFEs); see the efficiency table in Section 4.
- RL-based Post-training (Image Lite):
- Reward model: Qwen2.5-VL-7B, which outputs the probability of a "Yes" answer for (generated or real image, prompt) pairs.
- DRaFT-K fine-tuning: the loss combines a reward-maximization term with a KL regularizer toward the pre-trained model, backpropagating the reward gradient only through the last K denoising steps.
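The pre-training objective can be expressed compactly in code. The sketch below implements one flow-matching loss evaluation under the linear-interpolation parameterization written above; `model` is a stand-in for the CrossDiT velocity predictor and its call signature is an assumption, not the released API.

```python
import torch

def flow_matching_loss(model, x0, text_emb, pooled_cond):
    """One flow-matching training step (sketch): linear path from data latents x0 to noise."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)                  # t ~ U[0, 1]
    noise = torch.randn_like(x0)
    t_ = t.view(b, *([1] * (x0.dim() - 1)))              # broadcast t over latent dims
    x_t = (1.0 - t_) * x0 + t_ * noise                   # interpolated latent
    target_v = noise - x0                                # constant velocity along the linear path
    pred_v = model(x_t, t, text_emb, pooled_cond)        # CrossDiT velocity prediction (assumed signature)
    return torch.mean((pred_v - target_v) ** 2)
```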
4. Training and Inference Optimization Strategies
Kandinsky 5.0 incorporates pragmatic and analytical training and inference optimizations for both throughput and resource efficiency.
- Training Optimizations:
- Pre-encoded VAE latents are packed in .tar archives, streamed via a 100 Gbps link, dynamically batched by aspect ratio and time budget.
- Distributed training uses PyTorch FSDP/HSDP with sequence parallelism; text encoder on 32 GPUs, transformer on 64, leveraging NVLink interconnect and async checkpoints.
- Activation checkpointing and host offloading reduce peak activation memory by 40%.
- Analytical models provide a priori estimates of training step time and memory footprint (a rough illustrative estimator follows this list).
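The report's analytical expressions are not reproduced in this summary. As a stand-in, the sketch below applies the common 6 x parameters x tokens FLOPs rule and a simple FSDP-sharded state budget to produce the same kind of a-priori step-time and memory estimate; all constants (peak TFLOPS, MFU, bytes per parameter state) and the example numbers are assumptions, not values from the paper.

```python
def estimate_step(params_b: float, tokens_per_step: float, n_gpus: int,
                  peak_tflops: float = 989.0, mfu: float = 0.40,
                  bytes_per_param_state: int = 16) -> dict:
    """Rough a-priori training-step estimates for a dense transformer.

    params_b               model size in billions of parameters
    tokens_per_step        total latent tokens processed per optimizer step (global batch)
    peak_tflops            per-GPU peak (BF16 dense on H100 is ~989 TFLOPS); mfu = assumed utilization
    bytes_per_param_state  bf16 weights + fp32 grads/optimizer states, sharded by FSDP
    Ignores attention's quadratic term and activation memory; those need separate terms.
    """
    params = params_b * 1e9
    train_flops = 6.0 * params * tokens_per_step                     # forward + backward rule of thumb
    step_time_s = train_flops / (n_gpus * peak_tflops * 1e12 * mfu)
    sharded_state_gb = params * bytes_per_param_state / n_gpus / 1e9
    return {"step_time_s": step_time_s, "sharded_state_gb_per_gpu": sharded_state_gb}

# Illustrative only: a 19B model on 64 GPUs with ~2M latent tokens per step
print(estimate_step(params_b=19, tokens_per_step=2e6, n_gpus=64))
```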
- Inference Optimizations:
- VAE encoder tiling combined with torch.compile yields a further speedup.
- CrossDiT: torch.compile, fused kernels, MagCache (+46% throughput), Flash/Sage attention (≤5s clips), and NABLA for long/HD video.
- NABLA: blockwise Q/K pooling (block size 64), per-head CDF-based sparsity thresholds, union with sliding-tile patterns and fractal reordering; enables a large attention speedup at ~90% sparsity with negligible perceptual loss (a mask-construction sketch follows this list).
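A simplified, single-head sketch of the NABLA mask construction is shown below. The block size and CDF-style thresholding follow the description above, while the sliding-tile union, fractal reordering, and the fused block-sparse kernels are omitted, so this is an illustration under assumed tensor shapes rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def nabla_block_mask(q, k, block: int = 64, keep_mass: float = 0.9):
    """Block-level attention mask via pooled Q/K scores and a per-row CDF threshold.

    q, k: (seq, dim) single-head tensors; seq is assumed to be a multiple of `block`.
    Returns a (seq // block, seq // block) boolean mask of retained key blocks.
    """
    nb = q.shape[0] // block
    q_blk = q.view(nb, block, -1).mean(dim=1)            # blockwise pooling of queries
    k_blk = k.view(nb, block, -1).mean(dim=1)            # blockwise pooling of keys
    scores = F.softmax(q_blk @ k_blk.T / q_blk.shape[-1] ** 0.5, dim=-1)
    # Per query-block CDF threshold: keep the smallest set of key blocks whose
    # probability mass reaches keep_mass; on long video this concentrates the
    # mass on few blocks, giving high block sparsity.
    sorted_p, order = scores.sort(dim=-1, descending=True)
    cdf = sorted_p.cumsum(dim=-1)
    keep_sorted = cdf - sorted_p < keep_mass             # include the block that crosses the threshold
    mask = torch.zeros_like(keep_sorted)
    mask.scatter_(dim=-1, index=order, src=keep_sorted)
    return mask  # pass to a block-sparse attention kernel, e.g. unioned with a sliding-tile pattern
```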
Empirical performance on a single 80GB H100 GPU:
| Model | Frames | Resolution | NFEs | Time (s) | Mem (GB) |
|---|---|---|---|---|---|
| Video Lite 5s | 121 | 512×768 | 100 | 139 | 21 |
| Video Lite Flash | 121 | 512×768 | 16 | 35 | 21 |
| Video Pro 10s | 241 | 768×1280 | 100 | 3218* | 68 |
| Video Pro Flash | 241 | 768×1280 | 16 | 576* | 68 |
| Image Lite | 1 | 1024×1024 | 100 | 13 | 17 |
*With activation offloading.
5. Empirical Performance and Evaluation
Evaluation employs both human side-by-side (SBS) assessments and quantitative metrics:
- Human SBS: More than 20 annotators (5-way overlap) assessed prompt following (entity/action count, property/placement) and visual quality (composition, lighting, artifact rate, realism, motion coherence) on the Elementary.center platform (a toy aggregation sketch follows this list).
- Comparative Results:
- Video Lite vs. Sora (OpenAI) on MovieGen: exceeds Sora on motion, artifact reduction, and overall quality in ≥60% of 65k judgments; prompt following is at parity.
- Video Lite vs. Wan 2.2 5B/14B: outperforms on visual quality and motion; Wan leads on prompt granularity.
- Video Lite vs. Kandinsky 4.1: 59% motion/visual preference, prompt adherence parity.
- Video Pro vs. Veo 3/Fast: Veo 3 leads in prompt following; K5.0 Pro in visual quality/motion coherence.
- Quantitative Metrics: automatic metrics complement the SBS studies; the specific scores are reported in (Arkhipkin et al., 19 Nov 2025).
- Generation Efficiency: Video Lite Flash generates a 5 s clip in ~35 s on an H100; Video Pro Flash generates a 10 s, 768×1280 clip in ~576 s (with activation offloading).
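To make the aggregation of 5-way-overlap SBS judgments concrete, the toy sketch below computes a majority-vote win rate with a normal-approximation confidence interval; the tie handling and interval choice are assumptions, not the protocol used on Elementary.center.

```python
import math
from collections import Counter

def sbs_win_rate(judgments: list[list[str]], z: float = 1.96):
    """judgments: one inner list per prompt pair, e.g. 5 labels in {"A", "B", "tie"}.

    Returns model A's majority-vote win rate over decided pairs and a ~95% interval.
    """
    wins = losses = 0
    for labels in judgments:
        counts = Counter(labels)
        if counts["A"] > counts["B"]:
            wins += 1
        elif counts["B"] > counts["A"]:
            losses += 1                                   # tied pairs are excluded from the rate
    n = wins + losses
    if n == 0:
        return float("nan"), (float("nan"), float("nan"))
    p = wins / n
    half_width = z * math.sqrt(p * (1 - p) / n)           # normal-approximation interval
    return p, (p - half_width, p + half_width)

# Toy data: two prompt pairs, 5 annotators each
print(sbs_win_rate([["A", "A", "B", "A", "tie"], ["B", "A", "A", "A", "A"]]))
```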
6. Open-Source Availability and Model Extensibility
All code, training checkpoints, and pipeline integrations are freely accessible:
- Repositories: GitHub (https://github.com/kandinskylab/kandinsky-5), HuggingFace (https://huggingface.co/kandinskylab).
- License: MIT.
- Diffusers Integration: Direct compatibility with the diffusers Python package:
```python
from diffusers import KandinskyV5Img, KandinskyV5Video
model = KandinskyV5Img.from_pretrained("kandinskylab/kandinsky-5-image-lite")
pipe = model.pipeline()
image = pipe("A golden retriever puppy in a meadow at sunrise", guidance_scale=7.5).images[0]
```
- Customization and Adaptation: Recipes exist for domain-specific SFT (e.g., medical, cartoon, architectural), tuning the NABLA sparsity threshold for compute-quality trade-offs, and extension to multi-modal tasks (captioning, audio generation). The suite supports "foundation" model construction by weight-averaging across line-ups (a minimal averaging sketch appears after this list).
- Prospective Directions: Planned updates include longer context (>1024 tokens) text encoders for improved prompt alignment, unification toward a cross-modal T2I/T2V/I2I/I2V/V2A backbone, consumer-level real-time inference via further distillation/quantization, and curriculum-based pretraining to improve rare category generalization and mitigate dataset biases.
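As a minimal illustration of the weight-averaging ("model-soup") recipe used for SFT checkpoints and cross-line-up foundation construction, the sketch below averages compatible PyTorch state dicts; the checkpoint names and uniform weighting are placeholders rather than the released recipe.

```python
import torch

def model_soup(state_dict_paths: list[str], weights: list[float] | None = None) -> dict:
    """Weighted average of compatible checkpoints (same architecture and parameter keys)."""
    if weights is None:
        weights = [1.0 / len(state_dict_paths)] * len(state_dict_paths)  # uniform soup
    soup = None
    for path, w in zip(state_dict_paths, weights):
        sd = torch.load(path, map_location="cpu")
        if soup is None:
            soup = {key: w * value.float() for key, value in sd.items()}
        else:
            for key, value in sd.items():
                soup[key] += w * value.float()
    return soup

# Hypothetical per-subdomain SFT checkpoints merged into a single model
soup = model_soup(["sft_portraits.pt", "sft_landscapes.pt", "sft_animals.pt"])
# torch.save(soup, "image_lite_soup.pt")
```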
Kandinsky 5.0 represents an integrative advance in large-scale generative modeling, combining flexible transformer-diffusion architectures, robust multi-stage data pipelines, sparse attention for video, and advanced fine-tuning/distillation strategies as an open foundation for future research and application (Arkhipkin et al., 19 Nov 2025).