
Kandinsky 5.0 Video Lite Diffusion Model

Updated 20 November 2025
  • Kandinsky 5.0 Video Lite is a 2B-parameter latent diffusion model that efficiently generates high-resolution, temporally consistent video clips.
  • It employs advanced components such as CrossDiT blocks, NABLA block-sparse attention, and state-of-the-art VAE methods for video encoding and decoding.
  • Designed for research and industry applications, it offers rapid synthesis of up to 10-second video clips and is accessible via open-source platforms like HuggingFace diffusers.

Kandinsky 5.0 Video Lite is a 2B-parameter foundation model for efficient, high-resolution text-to-video and image-to-video generation, and a core component of the Kandinsky 5.0 model suite. It is engineered for rapid synthesis of temporally consistent video clips of up to 10 seconds, leveraging a latent-diffusion approach optimized through architectural innovations, targeted data curation, and advanced training methodologies. All code and model checkpoints are MIT-licensed and accessible through the HuggingFace “diffusers” library (Arkhipkin et al., 19 Nov 2025).

1. Model Architecture

Kandinsky 5.0 Video Lite adopts a latent-diffusion pipeline under the Flow Matching paradigm, in which the model learns a deterministic mapping from noise to video data via ODE integration. The backbone, a Cross-Attention Diffusion Transformer (“CrossDiT”), comprises 32 CrossDiT blocks and two Linguistic Token Refiner blocks. Text conditioning employs Qwen2.5-VL (7B parameters, embedding size 3584, max context length 256) for primary prompts and CLIP ViT-L/14 (embedding size 768, context length 77) for additional semantic alignment.

For video encoding and decoding, the FLUX.1-dev VAE is used for single-image latents, while the HunyuanVideo VAE maintains temporal coherence within video latents. Each CrossDiT block integrates multi-head self-attention over 3D latent tokens (frames × spatial patches), cross-attention to text embeddings, and an MLP feed-forward pathway, all linked with residual connections. The inputs are noisy latents augmented with 3D rotary positional embeddings, time-step embeddings produced by an MLP, refined linguistic tokens, and CLIP text vectors, which are fused through an adaptive normalization layer.
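The following is a minimal PyTorch sketch of such a block, assuming a hypothetical hidden size, head count, and adaptive-normalization layout (shift/scale/gate modulation from the time-step embedding); it illustrates the structure described above rather than the released implementation.

```python
import torch
import torch.nn as nn

class CrossDiTBlock(nn.Module):
    """Illustrative CrossDiT-style block: self-attention over 3D latent tokens,
    cross-attention to text embeddings, an MLP, and time-step-conditioned
    adaptive normalization. All dimensions here are hypothetical."""

    def __init__(self, dim=2048, heads=16, text_dim=3584, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        # Adaptive normalization: the time-step embedding produces per-block
        # shift/scale/gate terms for the self-attention and MLP branches.
        self.ada = nn.Linear(dim, 6 * dim)

    def forward(self, x, text_tokens, t_emb):
        # x: (B, T*H*W, dim) flattened 3D latent tokens (frames x spatial patches)
        # text_tokens: (B, L, text_dim) refined linguistic tokens; t_emb: (B, dim)
        s1, sc1, g1, s2, sc2, g2 = self.ada(t_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + sc1) + s1
        x = x + g1 * self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_tokens, text_tokens, need_weights=False)[0]
        h = self.norm3(x) * (1 + sc2) + s2
        x = x + g2 * self.mlp(h)
        return x
```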

To achieve efficient scaling for long (up to 10 s) or high-resolution (up to 1024 px) videos, the NABLA (Neighborhood-Adaptive Block-Level Attention) mechanism is employed. NABLA applies block-wise pooling to the Q and K matrices (factor N = 64), computes cumulative distribution functions per head on the pooled scores, and sparsifies attention using a threshold mask, optionally extending sparsity with sliding-tile patterns to reduce border effects. This mechanism yields approximately a 2.7× speedup at 90% block sparsity with negligible loss in FVD, CLIP, or VBench metrics.
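A rough sketch of this masking logic is shown below, assuming block-mean pooling and a per-head cumulative-mass threshold; the block size, exact thresholding rule, and sliding-tile extension are assumptions rather than the released kernel.

```python
import torch

def nabla_block_mask(q, k, block=64, keep_mass=0.90):
    """Compute a per-head boolean block mask: pool Q/K into blocks, score block
    pairs, and keep the smallest set of key blocks covering `keep_mass` of the
    pooled attention distribution for each query block."""
    B, H, S, d = q.shape                                  # S must be divisible by `block`
    qp = q.view(B, H, S // block, block, d).mean(dim=3)   # block-pooled queries
    kp = k.view(B, H, S // block, block, d).mean(dim=3)   # block-pooled keys
    scores = torch.softmax(qp @ kp.transpose(-1, -2) / d ** 0.5, dim=-1)
    sorted_scores, order = scores.sort(dim=-1, descending=True)
    cdf = sorted_scores.cumsum(dim=-1)
    # Keep blocks until the cumulative mass reaches the threshold (CDF cut-off).
    keep_sorted = (cdf - sorted_scores < keep_mass).to(scores.dtype)
    keep = torch.zeros_like(scores).scatter(-1, order, keep_sorted)
    return keep.bool()                                    # (B, H, S//block, S//block)
```

The resulting boolean block mask would then be expanded to token granularity and handed to a block-sparse attention kernel.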

Key components are specified mathematically below; a short sampling sketch combining them follows the list:

  • Forward diffusion (variance preserving):

q(x_t \mid x_0) = \mathcal{N}\bigl(x_t;\, \alpha_t\, x_0,\; \sigma_t^2\, I\bigr)

  • Reverse (denoising) with Flow Matching:

\frac{dx}{dt} = v_\theta(x_t, t)

  • Classifier-Free Guidance:

\epsilon_{\mathrm{cfg}} = (1 + w)\,\epsilon_\theta(x_t, t \mid \text{prompt}) - w\,\epsilon_\theta(x_t, t \mid \varnothing)

with w ≈ 5.0

  • Multi-head attention:

\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\bigl(\tfrac{QK^T}{\sqrt{d_h}}\bigr)V
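As an illustration, the Flow-Matching ODE and classifier-free guidance above can be combined into a simple Euler sampler. The sketch below assumes a generic `model(x, t, cond)` callable that returns the predicted velocity; the step count is illustrative.

```python
import torch

@torch.no_grad()
def sample(model, x, text_emb, null_emb, steps=50, w=5.0):
    # Integrate dx/dt = v_theta(x_t, t) from t = 1 (pure noise) to t = 0 (data).
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for i in range(steps):
        t, dt = ts[i], ts[i + 1] - ts[i]
        v_cond = model(x, t, text_emb)      # conditional velocity prediction
        v_null = model(x, t, null_emb)      # unconditional prediction (empty prompt)
        v = (1 + w) * v_cond - w * v_null   # classifier-free guidance with w ≈ 5.0
        x = x + dt * v                      # Euler step of the flow ODE
    return x
```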

2. Training Pipeline

The pre-training dataset encompasses over 250 million video scenes (2–60 s each) aggregated from open platforms and public datasets. The curation pipeline integrates shot detection (PySceneDetect), minimum resolution filtering (short side ≥256 px), deduplication (video perceptual hashes), watermark removal (classifier plus YOLO averaged over 5 frames), and dynamic filtering via MS-SSIM at 2 FPS to eliminate static or hyper-dynamic clips. Technical and aesthetic quality are scored using DOVER and Q-Align, while textual and visual filters employ CRAFT, YOLOv8, CLIP, and VideoMAE modules. Synthetic captions are generated using Tarsier2-7B followed by English filtering and regex post-processing. For balanced representation, the data is clustered in embedding space (InternVideo2-1B, k-means, 10 000 clusters).
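As an example of the dynamic filter, the MS-SSIM check could look roughly like the sketch below; it assumes the third-party `pytorch_msssim` package, and the thresholds are illustrative rather than the values used in the actual pipeline.

```python
import torch
from pytorch_msssim import ms_ssim  # third-party MS-SSIM implementation

def is_static_or_hyperdynamic(frames, static_thresh=0.95, dynamic_thresh=0.30):
    """frames: (T, 3, H, W) float tensor in [0, 1], sampled from the clip at 2 FPS."""
    sims = torch.stack([
        ms_ssim(frames[i : i + 1], frames[i + 1 : i + 2], data_range=1.0)
        for i in range(frames.shape[0] - 1)
    ])
    mean_sim = sims.mean().item()
    # Nearly identical consecutive frames -> static clip; very low similarity -> hyper-dynamic.
    return mean_sim > static_thresh or mean_sim < dynamic_thresh
```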

Supervised Fine-Tuning (SFT) uses ≈2.8 k human-selected video scenes and 45 k images, organized by a VLM classifier into nine domains for parallel fine-tuning (batch size 64, learning rate 1e-5), with the resulting checkpoints merged by “soup” weight averaging.
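Such a “soup” merge amounts to averaging the fine-tuned checkpoints parameter-wise; a minimal sketch is given below, assuming uniform weights (whether the actual merge is uniform or weighted is not specified).

```python
import torch

def model_soup(checkpoint_paths):
    """Average several fine-tuned checkpoints parameter-wise (uniform 'soup')."""
    state_dicts = [torch.load(p, map_location="cpu") for p in checkpoint_paths]
    souped = {}
    for key in state_dicts[0]:
        souped[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return souped  # load with model.load_state_dict(souped)
```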

The multi-stage schedule proceeds through low-resolution pre-training (256×256, 10 k steps), mid-resolution 5 s clips (up to 768×512, 50 k steps), mid-resolution 10 s clips (10 k steps), SFT (~10 k steps), and a final distillation stage. Batch sizes, learning rates, and weight decay are adjusted at each stage.

The infrastructure supports large-scale distributed PyTorch training—64 GPUs for the transformer, 32 for text encoders—using NVIDIA H100s (8 GPUs/node, NVLink, InfiniBand), S3-based streaming of pre-encoded VAE latents, non-blocking GPU–CPU offloading for long-sequence RL stages, and gradient/EMA stabilization (AdamW optimizer, β₁=0.9, β₂=0.95, ε=1e-8, grad-clip=1, EMA=0.9999).
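The optimizer and stabilization settings listed above can be sketched as follows; the model, loss, and weight-decay value are placeholders, while the AdamW betas, epsilon, gradient clipping, and EMA decay follow the figures quoted in the text.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(8, 8)   # stand-in for the CrossDiT transformer
opt = torch.optim.AdamW(model.parameters(), lr=1e-5,
                        betas=(0.9, 0.95), eps=1e-8, weight_decay=1e-2)  # weight decay illustrative
ema = {k: v.detach().clone() for k, v in model.state_dict().items()}     # EMA shadow weights

def train_step(x, target, ema_decay=0.9999):
    opt.zero_grad(set_to_none=True)
    loss = F.mse_loss(model(x), target)                               # stand-in objective
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # grad-clip = 1
    opt.step()
    with torch.no_grad():                                             # EMA update, decay 0.9999
        for k, v in model.state_dict().items():
            ema[k].mul_(ema_decay).add_(v, alpha=1.0 - ema_decay)
    return loss.item()
```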

3. Inference Procedures and Computational Optimization

Inference uses a baseline of 100 neural-network function evaluations (NFEs, roughly 100 diffusion steps), generating 121 frames in 139 s (~0.87 FPS) on a single 80 GB NVIDIA H100. The distilled “Flash” variant reduces sampling to 16 NFEs, producing the same output in 35 s (~3.5 FPS); both modes run within a constant ~21 GB memory footprint.
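A hypothetical usage sketch through the HuggingFace diffusers library is shown below; the repository id, the pipeline call arguments, and the output handling are assumptions rather than the documented interface of the released checkpoints.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Placeholder repository id; substitute the actual Kandinsky 5.0 Video Lite checkpoint.
pipe = DiffusionPipeline.from_pretrained(
    "kandinsky-community/kandinsky-5-video-lite",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

out = pipe(
    prompt="a red fox running through fresh snow at golden hour",
    num_frames=121,            # ~5 s clip, as reported above
    num_inference_steps=100,   # baseline; the Flash variant uses ~16 NFEs
)
export_to_video(out.frames[0], "fox.mp4", fps=24)
```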

Principal optimizations include: NABLA block-sparse attention (2.7× acceleration), FlashAttention-2 or SageAttention for short clips (<5 s), MagCache for caching repeated diffusion layers (+46% speed), text encoder 8-bit quantization, and model-level refactorings using torch.compile to maximize GPU occupancy.

4. Performance Metrics and Comparative Evaluation

Automated evaluation combines Fréchet Video Distance (FVD), VBench, and CLIPScore. NABLA-sparse attention maintains near-parity against full attention on these metrics. CLIPScore indicates prompt alignment within 1–2% of compute-intensive full models.
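A rough sketch of a CLIPScore-style prompt-alignment check over sampled frames is given below, using the standard transformers CLIP interface; the exact evaluation protocol (frame sampling, CLIP variant, aggregation) is an assumption, not the paper's procedure.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_score(frames, prompt):
    """frames: list of PIL images sampled from a generated clip."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(-1)                # per-frame cosine similarity
    return 100.0 * sims.clamp(min=0).mean().item()  # CLIPScore convention: 100 * max(cos, 0)
```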

In human side-by-side assessments on the MovieGen benchmark (1,003 prompts, 5-rater overlap):

Comparison                  | Visual Quality | Motion Dynamics | Prompt Following
Kandinsky 5.0 Lite vs Sora  | +58% Lite      | +59% Lite       | +54% Lite
Lite vs Wan 2.2 5B          | +60% Lite      | +63% Lite       | –10% (Wan better)
Lite vs 4.1 Video           | +59% Lite      | +59% Lite       | ≃50/50 tie

Throughput, measured in frames per second (FPS) at the stated clip length and resolution, is as follows:

Model      | Clip length       | Resolution | FPS (full) | FPS (Flash)
Video Lite | 5 s (121 frames)  | 512×768    | 0.87       | 3.5
Video Lite | 10 s (241 frames) | 512×768    | 1.08       | 3.95

5. Constraints and Prospective Enhancements

Known limitations include prompt alignment that trails some state-of-the-art contemporaries (constrained in part by Qwen2.5-VL's maximum context length of 256 tokens) and degradation of physical coherence (notably for fluids or cloth) in long-range scenes exceeding 10 seconds. Dataset biases are present (cultural and object-style imbalance), and real-time 24 FPS synthesis on consumer hardware is not yet attainable. Prospective directions include larger-context language encoders, a unified image/video foundation architecture, and deeper investigation of sparsity and quantization strategies.

6. Principal Uses

Kandinsky 5.0 Video Lite addresses applications in text-to-video generation (social media ads, concept reels), image-to-video synthesis (product animation, storyboarding), content creation (prototyping, previsualization), and rapid video drafting for domains such as e-learning and micro-cinematics. Availability as open-source code and training checkpoints through the HuggingFace “diffusers” library facilitates research reproducibility and further experimentation (Arkhipkin et al., 19 Nov 2025).
