
Kandinsky 5.0 Video Lite Diffusion Model

Updated 20 November 2025
  • Kandinsky 5.0 Video Lite is a 2B-parameter latent diffusion model that efficiently generates high-resolution, temporally consistent video clips.
  • It employs advanced components such as CrossDiT blocks, NABLA block-sparse attention, and state-of-the-art VAE methods for video encoding and decoding.
  • Designed for research and industry applications, it offers rapid synthesis of up to 10-second video clips and is accessible via open-source platforms like HuggingFace diffusers.

Kandinsky 5.0 Video Lite is a 2B-parameter foundation model for efficient, high-resolution text-to-video and image-to-video generation, and a core component of the Kandinsky 5.0 model suite. It is engineered for rapid synthesis of temporally consistent video clips up to 10 seconds long, using a latent-diffusion approach optimized through architectural innovations, targeted data curation, and advanced training methodologies. All code and model checkpoints are MIT-licensed and accessible through the HuggingFace “diffusers” library (Arkhipkin et al., 19 Nov 2025).
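A minimal text-to-video invocation through the “diffusers” library is sketched below. The repository id and the generation keyword arguments (num_frames, num_inference_steps) are illustrative assumptions; the exact checkpoint name and pipeline signature should be taken from the official model card.

```python
# Minimal text-to-video sketch via the HuggingFace diffusers library.
# The checkpoint id and generation kwargs are illustrative assumptions; consult
# the official Kandinsky 5.0 Video Lite model card for the exact names.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "kandinskylab/Kandinsky-5.0-T2V-Lite",   # hypothetical repository id
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

result = pipe(
    prompt="A red fox running through fresh snow, cinematic lighting",
    num_frames=121,            # ~5 s clip at 24 FPS, per the defaults described below
    num_inference_steps=100,   # base model; the distilled Flash variant uses 16 NFEs
)
export_to_video(result.frames[0], "fox.mp4", fps=24)
```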

1. Model Architecture

Kandinsky 5.0 Video Lite adopts a latent-diffusion pipeline under the Flow Matching paradigm, wherein the model learns a deterministic mapping from noise to video data via ODE integration. The backbone, a Cross-Attention Diffusion Transformer (“CrossDiT”), incorporates 32 CrossDiT blocks and two Linguistic Token Refiner (LTR) blocks. Text conditioning employs Qwen2.5-VL (7B parameters, embedding size 3584, max context 256) for primary prompts, and CLIP ViT-L/14 (embedding 768, context 77) for additional semantic alignment.

For video encoding and decoding, FLUX.1-dev VAE is utilized for single-image latents, and HunyuanVideo VAE maintains temporal coherence within video latents. Each CrossDiT block integrates multi-head self-attention over 3D latent tokens (frames × spatial patches), cross-attention to text embeddings, and an MLP feed-forward pathway, all linked with residual connections. Inputs consist of noisy latents augmented with 3D rotary positional embeddings, MLP-provided time-step embeddings, refined linguistic tokens, CLIP text vectors, and a final fusion in an Adaptive Normalization Layer.
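A schematic PyTorch sketch of one CrossDiT-style block is shown below; it mirrors the description above (self-attention over flattened 3D latent tokens, cross-attention to refined text tokens, an MLP, residual connections, and adaptive normalization driven by the time-step embedding). Layer names and the exact modulation scheme are simplifications, not the released implementation.

```python
# Schematic CrossDiT-style block (simplified, not the released implementation).
# Tokens are flattened 3D latents (frames x spatial patches); `text` holds the
# refined linguistic tokens; `t_emb` is the MLP time-step embedding driving an
# AdaLN-style scale/shift/gate modulation of each branch.
import torch
import torch.nn as nn

class CrossDiTBlock(nn.Module):
    def __init__(self, dim: int, text_dim: int, n_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(
            dim, n_heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )
        # Adaptive modulation: per-branch scale, shift and gate from the time-step embedding.
        self.ada = nn.Linear(dim, 9 * dim)

    def forward(self, x, text, t_emb):
        # x: (B, N_tokens, dim); text: (B, N_text, text_dim); t_emb: (B, dim)
        s1, b1, g1, s2, b2, g2, s3, b3, g3 = self.ada(t_emb).chunk(9, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.cross_attn(h, text, text, need_weights=False)[0]
        h = self.norm3(x) * (1 + s3.unsqueeze(1)) + b3.unsqueeze(1)
        return x + g3.unsqueeze(1) * self.mlp(h)
```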

To achieve efficient scaling for long (up to 10 s) or high-resolution (up to 1024 px) videos, the NABLA (Neighborhood-Adaptive Block-Level Attention) mechanism is employed. NABLA applies block-wise pooling to Q, K matrices (factor N = 64), computes cumulative distribution functions per head on pooled scores, and sparsifies attention using a threshold mask, optionally extending sparsity with sliding-tile patterns to reduce border effects. This mechanism yields approximately 2.7× speedup at 90% block sparsity with negligible loss in FVD, CLIP, or VBench metrics.
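The following sketch illustrates the NABLA mask construction described above: queries and keys are pooled block-wise, the pooled scores are normalized per head, and a cumulative-mass threshold selects which attention blocks are computed. The block size, threshold value, and tensor layout are assumptions for illustration.

```python
# Illustrative NABLA-style block-mask construction (simplified).
# Q, K: (B, H, N, d). Tokens are grouped into blocks of size `block` (paper: 64);
# pooled scores are softmax-normalized per query block and a CDF threshold keeps
# the smallest set of key blocks covering `keep_mass` of the attention mass.
import torch

def nabla_block_mask(Q, K, block: int = 64, keep_mass: float = 0.9):
    B, H, N, d = Q.shape
    nb = N // block  # assume N is divisible by the block size for simplicity
    # Block-wise average pooling of queries and keys.
    Qp = Q.reshape(B, H, nb, block, d).mean(dim=3)      # (B, H, nb, d)
    Kp = K.reshape(B, H, nb, block, d).mean(dim=3)      # (B, H, nb, d)
    scores = (Qp @ Kp.transpose(-1, -2)) / d**0.5       # (B, H, nb, nb)
    probs = scores.softmax(dim=-1)
    # Per (head, query-block) CDF over key blocks, sorted by descending mass.
    sorted_p, idx = probs.sort(dim=-1, descending=True)
    cdf = sorted_p.cumsum(dim=-1)
    keep_sorted = cdf <= keep_mass
    keep_sorted[..., 0] = True                          # always keep the top block
    mask = torch.zeros_like(probs).scatter(-1, idx, keep_sorted.to(probs.dtype))
    return mask.bool()  # True = evaluate this (query-block, key-block) attention tile

# The resulting mask can be expanded to token resolution or passed to a
# block-sparse attention kernel; a sliding-tile pattern may be OR-ed in to
# mitigate the border effects noted above.
```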

Key components are specified mathematically below (a minimal sampling sketch in code follows the list):

  • Forward diffusion (variance preserving):

q(x_t \mid x_0) = \mathcal{N}\bigl(x_t;\ \alpha_t x_0,\ \sigma_t^2 I\bigr)

  • Reverse (denoising) with Flow Matching:

\frac{dx}{dt} = v_\theta(x_t, t)

  • Classifier-Free Guidance:

\epsilon_{\mathrm{cfg}} = (1 + w)\,\epsilon_\theta(x_t, t \mid \text{prompt}) - w\,\epsilon_\theta(x_t, t \mid \varnothing)

with w \approx 5.0

  • Multi-head attention:

\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\bigl(QK^T / \sqrt{d_h}\bigr)\,V
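A minimal sampling sketch combining these pieces is shown below: the learned velocity field is integrated with a fixed-step Euler ODE solver, and classifier-free guidance is applied at each step. The CFG formula above is written for an ε-prediction network; applying the same linear combination to the velocity prediction, as done here, is a common equivalent choice. The model callable, latent shape, and time direction follow one standard flow-matching convention and are placeholders.

```python
# Minimal Euler sampler for a flow-matching model with classifier-free guidance.
# `model(x, t, text)` stands in for the CrossDiT velocity predictor; passing
# text=None denotes the unconditional (empty-prompt) branch.
import torch

@torch.no_grad()
def sample(model, text_emb, shape, steps: int = 100, w: float = 5.0, device="cuda"):
    x = torch.randn(shape, device=device)            # start from pure noise
    ts = torch.linspace(0.0, 1.0, steps + 1, device=device)
    for i in range(steps):
        t = ts[i].expand(shape[0])
        v_cond = model(x, t, text_emb)               # conditional velocity
        v_uncond = model(x, t, None)                 # unconditional velocity
        v = (1 + w) * v_cond - w * v_uncond          # classifier-free guidance
        x = x + (ts[i + 1] - ts[i]) * v              # Euler step along dx/dt = v
    return x                                         # video latents; decode with the VAE
```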

2. Training Pipeline

The pre-training dataset encompasses over 250 million video scenes (2–60 s each) aggregated from open platforms and public datasets. The curation pipeline integrates shot detection (PySceneDetect), minimum resolution filtering (short side ≥256 px), deduplication (video perceptual hashes), watermark removal (classifier plus YOLO averaged over 5 frames), and dynamic filtering via MS-SSIM at 2 FPS to eliminate static or hyper-dynamic clips. Technical and aesthetic quality are scored using DOVER and Q-Align, while textual and visual filters employ CRAFT, YOLOv8, CLIP, and VideoMAE modules. Synthetic captions are generated using Tarsier2-7B followed by English filtering and regex post-processing. For balanced representation, the data is clustered in embedding space (InternVideo2-1B, k-means, 10 000 clusters).
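As an illustration of the dynamics filter mentioned above, the sketch below scores a clip by the mean MS-SSIM between consecutive frames sampled at 2 FPS and rejects clips that are nearly static (similarity close to 1) or hyper-dynamic (similarity close to 0). The thresholds and the pytorch_msssim dependency are assumptions; the paper does not publish exact cut-off values.

```python
# Illustrative dynamics filter: mean MS-SSIM between consecutive frames at 2 FPS.
# Thresholds are hypothetical; frames should be at full resolution (short side >= 256,
# per the resolution filter above) so the multi-scale SSIM pyramid is well defined.
import torch
from pytorch_msssim import ms_ssim  # pip install pytorch-msssim

def keep_clip(frames_2fps: torch.Tensor, static_thr=0.95, chaotic_thr=0.30) -> bool:
    # frames_2fps: (T, C, H, W) float tensor in [0, 1], sampled at 2 FPS.
    prev, nxt = frames_2fps[:-1], frames_2fps[1:]
    sims = ms_ssim(prev, nxt, data_range=1.0, size_average=False)  # (T-1,) per-pair scores
    mean_sim = sims.mean().item()
    # Reject near-static clips (very high inter-frame similarity) and
    # hyper-dynamic clips (very low similarity between neighbouring frames).
    return chaotic_thr < mean_sim < static_thr
```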

Supervised Fine-Tuning (SFT) comprises ≈2.8 k human-selected video scenes and 45 k images, organized by a VLM classifier into nine domains for parallel fine-tuning (bs=64, lr=1e-5); the resulting domain-specific checkpoints are aggregated by “soup” weight averaging (see the sketch below).
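The “soup” aggregation referenced above corresponds to uniform weight averaging of the domain-specific fine-tuned checkpoints; a generic sketch, with placeholder checkpoint paths, is given below.

```python
# Uniform "model soup": average the state dicts of domain-specific SFT checkpoints.
# Paths are placeholders; all checkpoints must share the same architecture.
import torch

def average_checkpoints(paths):
    soup = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if soup is None:
            soup = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in soup:
                soup[k] += state[k].float()
    return {k: v / len(paths) for k, v in soup.items()}

# Example: nine domain experts merged into a single checkpoint.
# merged = average_checkpoints([f"sft_domain_{i}.pt" for i in range(9)])
# torch.save(merged, "sft_soup.pt")
```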

The multi-stage schedule proceeds through incremental pre-training: a low-resolution (LR) stage (256×256, 10 k steps), a mid-resolution (MR) 5 s stage (up to 768×512, 50 k steps), an MR 10 s stage (10 k steps), SFT (~10 k steps), and a final distillation stage. Batch sizes, learning rates, and weight decay are adjusted at each stage.

The infrastructure supports large-scale distributed PyTorch training—64 GPUs for the transformer, 32 for text encoders—using NVIDIA H100s (8 GPUs/node, NVLink, InfiniBand), S3-based streaming of pre-encoded VAE latents, non-blocking GPU–CPU offloading for long-sequence RL stages, and gradient/EMA stabilization (AdamW optimizer, β₁=0.9, β₂=0.95, ε=1e-8, grad-clip=1, EMA=0.9999).
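The stabilization settings listed above map directly onto a standard PyTorch setup; the sketch below shows the optimizer construction, gradient clipping, and EMA update. The learning rate and weight decay are placeholders (they vary per stage), and the model, loss function, and data handling are elided.

```python
# Optimizer, gradient clipping and EMA matching the stabilization settings above.
# lr / weight_decay are stage-dependent placeholders; model and loss_fn are elided.
import torch

def make_optimizer(model, lr: float, weight_decay: float):
    return torch.optim.AdamW(
        model.parameters(), lr=lr, betas=(0.9, 0.95), eps=1e-8, weight_decay=weight_decay
    )

def train_step(model, ema_model, optimizer, batch, loss_fn, ema_decay: float = 0.9999):
    loss = loss_fn(model, batch)                       # e.g. the flow-matching objective above
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # grad-clip = 1
    optimizer.step()
    with torch.no_grad():                              # EMA decay = 0.9999
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(ema_decay).add_(p, alpha=1.0 - ema_decay)
    return loss.detach()
```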

3. Inference Procedures and Computational Optimization

Inference uses a baseline of 100 network function evaluations (NFEs; ~100 diffusion steps), generating 121 frames in 139 s (~0.87 FPS) on a single 80 GB NVIDIA H100. The distilled “Flash” variant reduces sampling to 16 NFEs, yielding the same output in 35 s (~3.5 FPS) with a constant ~21 GB memory footprint for both modes.

Principal optimizations include: NABLA block-sparse attention (2.7× acceleration), FlashAttention-2 or SageAttention for short clips (<5 s), MagCache for caching repeated diffusion layers (+46% speed), text encoder 8-bit quantization, and model-level refactorings using torch.compile to maximize GPU occupancy.
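Some of these optimizations can be enabled from user code when running the model through a diffusers-style pipeline; the sketch below applies CPU offloading and torch.compile to the transformer backbone. The repository id and the pipe.transformer attribute are assumptions, and MagCache / SageAttention integration is not shown.

```python
# Optional user-side inference optimizations for a diffusers-style pipeline (illustrative).
# The attribute name `pipe.transformer` is an assumption for a DiT-based pipeline.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "kandinskylab/Kandinsky-5.0-T2V-Lite",  # hypothetical repository id (see above)
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()              # keep idle submodules on CPU to cut peak VRAM
if hasattr(pipe, "transformer"):
    pipe.transformer = torch.compile(pipe.transformer)  # graph-compile the DiT backbone
```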

4. Performance Metrics and Comparative Evaluation

Automated evaluation combines Fréchet Video Distance (FVD), VBench, and CLIPScore. NABLA-sparse attention maintains near-parity against full attention on these metrics. CLIPScore indicates prompt alignment within 1–2% of compute-intensive full models.
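Prompt alignment in the CLIPScore style can be approximated per frame with torchmetrics, averaging over sampled frames of a generated clip; treating video alignment as the mean per-frame score and the choice of CLIP backbone are assumptions of this sketch.

```python
# Per-frame CLIPScore averaged over a clip (one common proxy for prompt alignment).
# frames: uint8 tensor (T, C, H, W) with values in 0-255; backbone id is illustrative.
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

def clip_alignment(frames: torch.Tensor, prompt: str) -> float:
    metric = CLIPScore(model_name_or_path="openai/clip-vit-large-patch14")
    # Score every frame against the same prompt; the metric returns the batch mean.
    return metric(list(frames), [prompt] * len(frames)).item()
```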

In human side-by-side assessments on the MovieGen benchmark (1,003 prompts, five raters per comparison):

Comparison                   Visual Quality   Motion Dynamics   Prompt Following
Kandinsky 5.0 Lite vs Sora   +58% Lite        +59% Lite         +54% Lite
Lite vs Wan 2.2 5B           +60% Lite        +63% Lite         –10% (Wan better)
Lite vs 4.1 Video            +59% Lite        +59% Lite         ≃50/50 tie

Throughput (frames generated per second of wall-clock time, at the stated clip length and resolution) is as follows:

Model        Clip length         Resolution   FPS (full)   FPS (flash)
Video Lite   5 s (121 frames)    512×768      0.87         3.5
Video Lite   10 s (241 frames)   512×768      1.08         3.95

5. Constraints and Prospective Enhancements

Known limitations include prompt alignment that falls behind some state-of-the-art contemporaries (Qwen2.5-VL max context length of 256 tokens) and degradation of physical coherence (notably for fluids and cloth) in long-range scenes exceeding 10 seconds. Dataset biases are present (cultural and object-style imbalance), and real-time 24 FPS synthesis on consumer hardware is not yet attainable. Foreseeable development directions include larger-context language encoders, a unified image/video “foundation” architecture, and deeper investigation of sparsity and quantization strategies.

6. Principal Uses

Kandinsky 5.0 Video Lite addresses applications in text-to-video generation (social media ads, concept reels), image-to-video synthesis (product animation, storyboarding), content creation (prototyping, previsualization), and rapid video drafting for domains such as e-learning and micro-cinematics. Availability as open-source code and training checkpoints through the HuggingFace “diffusers” library facilitates research reproducibility and further experimentation (Arkhipkin et al., 19 Nov 2025).

References (1)
