
Kandinsky 5.0: Open-Source Generative Models

Updated 20 November 2025
  • Kandinsky 5.0 is a suite of large-scale, open-source models for high-resolution text-to-image, image editing, and text-to-video generation using a unified Cross-Attention Diffusion Transformer.
  • The framework features three key variants—Image Lite, Video Lite, and Video Pro—optimized across parameter regimes from 2B to 19B with innovations like LTF and NABLA for enhanced performance.
  • It leverages extensive, curated multi-modal datasets and a multi-stage training pipeline incorporating flow matching, fine-tuning, and distillation to achieve scalable, high-quality generative outputs.

Kandinsky 5.0 is a family of large-scale, open-source foundation models for high-resolution image and video generation, comprising state-of-the-art text-to-image, in-context image editing, and text-to-video/image-to-video models. The framework features three principal model line-ups—Image Lite, Video Lite, and Video Pro—spanning a parameter regime from 2 billion to 19 billion. Kandinsky 5.0 leverages a unified Cross-Attention Diffusion Transformer (CrossDiT) backbone, an extensive, highly curated multi-modal dataset pipeline, innovative training techniques including flow matching, advanced supervised and reinforcement learning-based post-training, and extensive optimizations for scalability and throughput. All code, checkpoints, and recipes are freely available under an open-source license, targeting researchers and practitioners seeking extensible, high-quality generative capabilities (Arkhipkin et al., 19 Nov 2025).

1. Model Line-Ups and Core Architecture

Kandinsky 5.0 introduces three main model variants, each optimized for specific generative domains:

| Model Variant | Parameter Count | Primary Tasks | Max Resolution | Distinctive Features |
|---|---|---|---|---|
| Image Lite | 6B | Text-to-image, in-context image editing | 1408 px | High resolution, image editing |
| Video Lite | 2B | Text-to-video, image-to-video (≤10 s, ≤768 px) | 768 px | Fast, lightweight, video focus |
| Video Pro | 19B | High-quality text-to-video and image-to-video (≤10 s, HD) | 1408 px | Superior video generation |

All models employ the same architectural backbone: a latent diffusion model whose denoising core is the transformer-based CrossDiT. Key CrossDiT modules include interleaved self-attention, cross-attention over text tokens, MLP blocks with GeLU activations, residual connections after each sub-block, and adaptive normalization layers. Dual text encoders are integrated: Qwen2.5-VL (7B) provides dense text representations (dimension 3584, context 256), and CLIP ViT-L/14 (dimension 768, context 77) conditions the adaptive normalization. Images are encoded with the FLUX.1-dev VAE, while videos use the HunyuanVideo VAE for temporal consistency.
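The block structure lends itself to a compact sketch. The following minimal PyTorch module is an illustration only: it assumes an AdaLN-style parameterization of the adaptive normalization, stock multi-head attention layers, and Video Lite dimensions from the table below; the released implementation may differ in detail.

import torch
import torch.nn as nn

class CrossDiTBlock(nn.Module):
    """Illustrative CrossDiT-style block: self-attention, cross-attention over
    text tokens, a GeLU MLP, residual connections, and adaptive LayerNorm whose
    scales/shifts are predicted from the time/CLIP conditioning vector."""

    def __init__(self, dim: int, text_dim: int, cond_dim: int, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(cond_dim, 6 * dim)  # scale/shift pairs for 3 sub-blocks

    def forward(self, x, text_ctx, cond):
        # cond: pooled time/CLIP embedding -> adaptive scales (s*) and shifts (b*)
        s1, b1, s2, b2, s3, b3 = self.ada(cond).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        x = x + self.cross_attn(h, text_ctx, text_ctx, need_weights=False)[0]
        h = self.norm3(x) * (1 + s3) + b3
        return x + self.mlp(h)

# Video Lite-sized example: 1792-dim visual tokens, 3584-dim Qwen2.5-VL text tokens.
block = CrossDiTBlock(dim=1792, text_dim=3584, cond_dim=512)
out = block(torch.randn(1, 256, 1792), torch.randn(1, 256, 3584), torch.randn(1, 512))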

Variant-specific architectural hyperparameters are detailed as follows:

| Model | #CrossDiT Blocks | #LTF Blocks | Linear Dim | Model Emb Dim | Time Emb Dim |
|---|---|---|---|---|---|
| Image Lite | 50 | 2 | 10,240 | 2560 | 512 |
| Video Lite | 32 | 2 | 7168 | 1792 | 512 |
| Video Pro | 60 | 4 | 16,384 | 4096 | 1024 |

Notable architectural innovations unique to Kandinsky 5.0 include the Linguistic Token Refiner (LTF) for denoising text embeddings pre-fusion, and Neighborhood Adaptive Block-Level Attention (NABLA), a dynamic, block-sparse attention mechanism optimized for video.

2. Data Curation Lifecycle

Kandinsky 5.0’s data pipeline is characterized by massive scale, heterogeneity, and rigorous filtering to support robust multi-modal generation.

  • Kandinsky T2I (text-to-image): 500 million images (LAION/COYO/web, min side ≥256 px).
  • Kandinsky T2V (text-to-video): 250 million video scenes, 2–60 seconds, various aspect ratios.
  • Kandinsky I2I (image editing instruction): ~150 million image pairs with edit descriptions.
  • Kandinsky RCC (Russian Cultural Code): 229k videos, 768k images, manually curated with bilingual captions.
  • Supervised Fine-Tuning (SFT) datasets: v1 (strict): 2,833 videos / 45k images; v2 (relaxed): 12,461 videos / 153k images.

Processing pipelines are domain-specific: images are filtered for resolution, deduplicated via perceptual pHash, cleansed of watermarks (ResNeXt101/YOLO), scored for quality (TOPIQ/Q-Align), text presence (CRAFT), and complexity (SAM 2/Sobel). Captioning uses InternVL2-26B, InternLM3-8B, and Qwen2.5VL-32B, with post-processing for text cleanliness. Video datasets undergo scene segmentation (PySceneDetect), deduplication, multi-stage technical and aesthetic quality assessment, object/scene tagging, synthetic captioning (Tarsier2-7B), and cluster-based sampling (InternVideo2-1B embeddings, 10k-cluster k-means). Instruction datasets apply similarity-based deduplication (CLIP, DINO, face), geometric verification (LoFTR + RANSAC), domain-specific exclusion heuristics, and instruction generation via fine-tuned GLM 4.5.
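As an illustration of the deduplication stage, the snippet below sketches perceptual-hash filtering with the imagehash and Pillow libraries; the hash type, Hamming-distance threshold, and linear scan are assumptions for clarity rather than the paper's exact settings.

from pathlib import Path
from PIL import Image
import imagehash

def dedup_by_phash(image_paths, max_hamming=4):
    """Drop images whose perceptual hash lies within `max_hamming` bits of an
    already-kept image (near-duplicate removal). Linear scan for clarity;
    large-scale pipelines bucket or index the hashes instead."""
    kept, kept_hashes = [], []
    for path in image_paths:
        h = imagehash.phash(Image.open(path))  # 64-bit perceptual hash
        if all(h - other > max_hamming for other in kept_hashes):
            kept.append(path)
            kept_hashes.append(h)
    return kept

unique_images = dedup_by_phash(sorted(Path("images").glob("*.jpg")))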

SFT data are domain- and subdomain-clustered with Qwen2.5-VL-Instruct-32B and CLIP, supporting granular supervised adaptation.
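A simplified version of such cluster-based selection, assuming precomputed embeddings (e.g., CLIP or InternVideo2 features) and using scikit-learn's KMeans, could look as follows; the cluster count and per-cluster quota are placeholders, not values from the paper.

import numpy as np
from sklearn.cluster import KMeans

def cluster_balanced_sample(embeddings: np.ndarray, n_clusters: int = 1000,
                            per_cluster: int = 10, seed: int = 0):
    """Cluster embedding vectors and draw up to `per_cluster` items from each
    cluster, so the selected subset covers the embedding space evenly."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(embeddings)
    rng = np.random.default_rng(seed)
    selected = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        take = min(per_cluster, len(members))
        selected.extend(rng.choice(members, size=take, replace=False).tolist())
    return selected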

3. Multi-Stage Training Methodology

Kandinsky 5.0 employs a multi-regime, multi-stage training pipeline:

  • Pre-training Regimes: Four settings—text-to-image (T2I), instruct editing, text-to-video (T2V), image-to-video (I2V)—with diffusion in latent space trained via a flow matching objective:

$$\mathcal{L}_{\mathrm{flow}} = \mathbb{E}_{t,\,x_0,\,\varepsilon}\Bigl\|\,v_\theta(x_t, t) - \frac{x_{t+1}-x_t}{\Delta t}\Bigr\|^2,$$

where $x_t$ are the latents, $v_\theta$ is the velocity predictor, and $\Delta t$ is the time discretization (a minimal training-step sketch follows this list).

  • Fine-Tuning Regimes:
    • Image SFT: 153k images (EN/RU captions), 9 domains with 2–9 subdomains each; models are fine-tuned per subdomain, and a "model soup" combines checkpoints weighted by $\sqrt{\text{subdomain size}}$.
    • Video SFT: 2.8k videos + 45k images per domain, two approaches—standard fine-tuning and model-soup averaging (better stability/quality).
  • Distillation:
    • CFG Distillation: Reduces from 100 to 50 NFEs by regressing onto the guided teacher trajectory at guidance scale $s=5$ (a distillation-step sketch also follows this list).
    • Consistency/Adversarial Distillation: TSCD (trajectory-segmented consistency) for students with as few as 16 NFEs; adversarial hinge-loss post-training (RMSprop, $\mathrm{lr}_G = 10^{-6}$, $\mathrm{lr}_D = 10^{-4}$).
  • RL-based Post-training (Image Lite):
    • Reward model: Qwen2.5-VL-7B, outputting “Yes” probability for (gen/real, prompt) tuples.
    • DRaFT-K fine-tuning: the loss combines a reward-based term with a KL penalty ($\beta_{\mathrm{KL}} = 2\cdot 10^{-2}$), backpropagating through the last $K=10$ denoising steps.
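For concreteness, here is a minimal PyTorch sketch of one flow-matching training step over pre-encoded latents. It assumes a linear interpolation path between data latents and Gaussian noise (under which the finite-difference target above reduces to the constant velocity $\varepsilon - x_0$) and a placeholder model signature; it is not the released training loop.

import torch
import torch.nn.functional as F

def flow_matching_step(model, optimizer, x0, text_ctx, cond):
    """One flow-matching step: sample t, form the interpolated latent x_t, and
    regress the predicted velocity onto the path's velocity target."""
    b = x0.shape[0]
    noise = torch.randn_like(x0)
    t = torch.rand(b, device=x0.device)            # t ~ U(0, 1)
    t_b = t.view(b, *([1] * (x0.dim() - 1)))
    x_t = (1.0 - t_b) * x0 + t_b * noise           # latent on the linear path
    target = noise - x0                            # (x_{t+dt} - x_t) / dt for this path
    v_pred = model(x_t, t, text_ctx, cond)         # predicted velocity
    loss = F.mse_loss(v_pred, target)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()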
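The CFG distillation step can be sketched in the same style: the student regresses onto the teacher's classifier-free-guided prediction so that guidance is baked into a single forward pass. Model call signatures and the exact regression target are assumptions; the article only states guidance scale $s=5$ and teacher trajectory regression.

import torch
import torch.nn.functional as F

@torch.no_grad()
def guided_teacher_velocity(teacher, x_t, t, text_ctx, null_ctx, s=5.0):
    """Classifier-free-guided teacher target at guidance scale s."""
    v_cond = teacher(x_t, t, text_ctx)
    v_uncond = teacher(x_t, t, null_ctx)
    return v_uncond + s * (v_cond - v_uncond)

def cfg_distill_step(student, teacher, optimizer, x_t, t, text_ctx, null_ctx):
    """Regress the student's single unguided prediction onto the guided teacher
    output, removing the extra unconditional pass at inference time."""
    target = guided_teacher_velocity(teacher, x_t, t, text_ctx, null_ctx)
    loss = F.mse_loss(student(x_t, t, text_ctx), target)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()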

4. Training and Inference Optimization Strategies

Kandinsky 5.0 incorporates pragmatic and analytical training and inference optimizations for both throughput and resource efficiency.

  • Training Optimizations:
    • Pre-encoded VAE latents are packed in .tar archives, streamed via a 100 Gbps link, dynamically batched by aspect ratio and time budget.
    • Distributed training uses PyTorch FSDP/HSDP with sequence parallelism; text encoder on 32 GPUs, transformer on 64, leveraging NVLink interconnect and async checkpoints.
    • Activation checkpointing and host offloading reduce peak activation memory by 40%.
    • Analytical models provide a priori estimates of training step time and memory:

    $$\text{Step} = \frac{d}{d_0} \times \frac{1}{\,d_0 + 14\,\frac{S}{S_0} + 6\,\frac{d}{d_0} \times L\,B\,}$$

    $$\mathrm{Mem} = \frac{12\,L\,(9\,d_t d + 8\,d^2 + 2\,d_f d)}{N} + \max\!\Bigl(\frac{4\,L\,(9\,d_t d + 8\,d^2 + 2\,d_f d)}{N},\; 2S\,(L\,d\,o + 18\,d + 2\,d_f)\Bigr)$$

  • Inference Optimizations:

    • VAE encoder tiling with torch.compile yields a 2.5× speedup.
    • CrossDiT: torch.compile, fused kernels, MagCache (+46% throughput), Flash/Sage attention (≤5s clips), and NABLA for long/HD video.
    • NABLA: blockwise Q/K pooling (N=64), per-head CDF sparsity thresholds, union with sliding-tile patterns and fractal reordering; enables a 2.7× speedup at 90% sparsity with negligible perceptual loss (a simplified mask-construction sketch follows this list).
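The following is a simplified sketch of how a NABLA-style block-level mask could be constructed: pool queries and keys over blocks, score block pairs, and per attention head keep the smallest set of blocks whose softmax mass reaches a threshold. The block size matches the description above, but the mass threshold and pooling choice are assumptions, and the union with sliding-tile patterns and fractal reordering is omitted.

import torch

def nabla_block_mask(q, k, block=64, keep_mass=0.9):
    """Block-level attention mask: average-pool Q and K over blocks of `block`
    tokens, softmax the block-pair scores, and per head keep the top blocks
    whose cumulative probability reaches `keep_mass` (CDF thresholding)."""
    B, H, L, D = q.shape
    nb = L // block
    q_blk = q[:, :, :nb * block].reshape(B, H, nb, block, D).mean(dim=3)
    k_blk = k[:, :, :nb * block].reshape(B, H, nb, block, D).mean(dim=3)
    scores = torch.einsum("bhqd,bhkd->bhqk", q_blk, k_blk) / D ** 0.5
    probs = scores.softmax(dim=-1)                      # (B, H, nb, nb)
    sorted_p, order = probs.sort(dim=-1, descending=True)
    keep_sorted = sorted_p.cumsum(dim=-1) <= keep_mass  # keep until mass is reached
    keep_sorted[..., 0] = True                          # always keep the strongest block
    mask = torch.zeros_like(probs).scatter(-1, order, keep_sorted.float()).bool()
    return mask  # True = compute attention for this block pair

In the full method, the resulting block mask would be combined (unioned) with sliding-tile patterns before being passed to a block-sparse attention kernel.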

Empirical performance on a single 80GB H100 GPU:

| Model | Frames | Resolution | NFEs | Time (s) | Mem (GB) |
|---|---|---|---|---|---|
| Video Lite (5 s) | 121 | 512×768 | 100 | 139 | 21 |
| Video Lite Flash | 121 | 512×768 | 16 | 35 | 21 |
| Video Pro (10 s) | 241 | 768×1280 | 100 | 3218* | 68 |
| Video Pro Flash | 241 | 768×1280 | 16 | 576* | 68 |
| Image Lite | 1 | 1024×1024 | 100 | 13 | 17 |

*With activation offloading.

5. Empirical Performance and Evaluation

Evaluation employs both human side-by-side (SBS) assessments and quantitative metrics:

  • Human SBS: More than 20 annotators (5-way overlap) assessed prompt following (entity/action count, property/placement) and visual quality (composition, lighting, artifact rate, realism, motion coherence) on the Elementary.center platform.
  • Comparative Results:
    • Video Lite vs. Sora (OpenAI) on MovieGen: exceeds Sora in motion, artifact reduction, and overall quality in ≥60% of 65k judgments; prompt following is at parity.
    • Video Lite vs. Wan 2.2 5B/14B: outperforms on visual quality and motion; Wan leads on prompt granularity.
    • Video Lite vs. Kandinsky 4.1: 59% preference on motion/visual quality, with prompt adherence at parity.
    • Video Pro vs. Veo 3/Fast: Veo 3 leads in prompt following; K5.0 Pro in visual quality/motion coherence.
  • Quantitative Metrics:
    • FVD, VBench, CLIP-Score: K5.0 Pro claims state-of-the-art results on VBench and FVD on the MovieGen benchmark (a CLIP-Score computation sketch follows this list).
    • Distillation: NFE reduction from 100 to 16 demonstrates negligible degradation in FVD/CLIP-Score.
  • Generation Efficiency:
    • Video Lite Flash generates a 5s clip in ~35s on H100; Video Pro Flash (10s, 768×1280) in ~576s (with offloading).
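As a reference point for the CLIP-Score metric mentioned above, the snippet below computes a prompt-image similarity with the Hugging Face transformers CLIP implementation; the checkpoint and the cosine-similarity convention (often rescaled by 100 when reported) are common choices rather than the paper's exact evaluation setup.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings
    (higher = better prompt alignment)."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum(dim=-1))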

6. Open-Source Availability and Model Extensibility

All code, training checkpoints, and pipeline integrations are freely accessible:

# Illustrative usage; class and repository names follow the article and may
# differ from the final released integration.
from diffusers import KandinskyV5Img, KandinskyV5Video  # image and video line-ups

model = KandinskyV5Img.from_pretrained("kandinskylab/kandinsky-5-image-lite")
pipe = model.pipeline()  # build the text-to-image pipeline from the loaded model
image = pipe("A golden retriever puppy in a meadow at sunrise", guidance_scale=7.5).images[0]

  • Customization and Adaptation: Recipes exist for domain-specific SFT (e.g., medical, cartoon, architectural), tuning the NABLA sparsity threshold for compute-quality trade-offs, and extension to multi-modal tasks (captioning, audio generation). The suite supports “foundation” model construction by weight-averaging across line-ups (a minimal weight-averaging sketch follows this list).
  • Prospective Directions: Planned updates include longer context (>1024 tokens) text encoders for improved prompt alignment, unification toward a cross-modal T2I/T2V/I2I/I2V/V2A backbone, consumer-level real-time inference via further distillation/quantization, and curriculum-based pretraining to improve rare category generalization and mitigate dataset biases.
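A minimal sketch of the weight-averaging ("model soup") recipe referenced above, weighting per-subdomain checkpoints by the square root of subdomain size as in Section 3; state-dict-level averaging over shared parameter keys is an assumption, not the exact released procedure.

import math
import torch

def model_soup(state_dicts, subdomain_sizes):
    """Average checkpoints weighted by sqrt(subdomain size); assumes every
    checkpoint shares the same parameter keys and shapes."""
    weights = [math.sqrt(s) for s in subdomain_sizes]
    total = sum(weights)
    soup = {}
    for key in state_dicts[0]:
        soup[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts)) / total
    return soup

# e.g.: soup = model_soup([torch.load(p, map_location="cpu") for p in ckpt_paths], sizes)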

Kandinsky 5.0 represents an integrative advance in large-scale generative modeling, combining flexible transformer-diffusion architectures, robust multi-stage data pipelines, sparse attention for video, and advanced fine-tuning/distillation strategies as an open foundation for future research and application (Arkhipkin et al., 19 Nov 2025).
