Kandinsky 5.0 Video Lite Diffusion Model
- Kandinsky 5.0 Video Lite is a 2B-parameter latent diffusion model that efficiently generates high-resolution, temporally consistent video clips.
- It employs advanced components such as CrossDiT blocks, NABLA block-sparse attention, and pre-trained VAEs (FLUX.1-dev for images, HunyuanVideo for video) for latent encoding and decoding.
- Designed for research and industry applications, it offers rapid synthesis of video clips up to 10 seconds long and is available as open source through the HuggingFace diffusers library.
Kandinsky 5.0 Video Lite is a 2B-parameter foundation model for efficient, high-resolution text-to-video and image-to-video generation, and a core component of the Kandinsky 5.0 model suite. It is engineered for rapid synthesis of temporally consistent video clips up to 10 seconds, leveraging a latent-diffusion approach optimized via architectural innovations, targeted data curation, and advanced training methodologies. All code and model checkpoints are MIT-licensed and accessible through the HuggingFace “diffusers” library (Arkhipkin et al., 19 Nov 2025).
1. Model Architecture
Kandinsky 5.0 Video Lite adopts a latent-diffusion pipeline under the Flow Matching paradigm, wherein the model learns a deterministic mapping from noise to video data via ODE integration. The backbone, a Cross-Attention Diffusion Transformer (“CrossDiT”), comprises 32 CrossDiT blocks and two Linguistic Token Refiner (LTR) blocks. Text conditioning employs Qwen2.5-VL (7B parameters, embedding size 3584, max context 256 tokens) for primary prompts and CLIP ViT-L/14 (embedding size 768, context 77 tokens) for additional semantic alignment.
For video encoding and decoding, the FLUX.1-dev VAE is used for single-image latents, while the HunyuanVideo VAE maintains temporal coherence within video latents. Each CrossDiT block integrates multi-head self-attention over 3D latent tokens (frames × spatial patches), cross-attention to text embeddings, and an MLP feed-forward pathway, all linked with residual connections. Inputs consist of noisy latents augmented with 3D rotary positional embeddings, MLP-provided time-step embeddings, refined linguistic tokens, and CLIP text vectors, fused through an adaptive normalization layer.
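A minimal PyTorch sketch of a CrossDiT-style block illustrating this layout; the hidden size, head count, MLP ratio, and pre-norm arrangement are illustrative assumptions, and the 3D rotary embeddings and adaptive normalization described above are omitted for brevity:

```python
import torch
import torch.nn as nn

class CrossDiTBlock(nn.Module):
    """Sketch of one CrossDiT-style block: self-attention over flattened 3D latent
    tokens, cross-attention to text embeddings, and an MLP, each with a residual
    path. Dimensions are illustrative, not the released configuration."""

    def __init__(self, dim=2048, text_dim=3584, heads=16, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x, text_tokens):
        # x: (batch, frames * spatial_patches, dim) flattened 3D latent tokens
        # text_tokens: (batch, text_len, text_dim) refined linguistic tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_tokens, text_tokens, need_weights=False)[0]
        x = x + self.mlp(self.norm3(x))
        return x
```

In the actual model, time-step embeddings and CLIP text vectors additionally modulate these sub-layers through the adaptive normalization layer described above.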
To scale efficiently to long (up to 10 s) or high-resolution (up to 1024 px) videos, the NABLA (Neighborhood-Adaptive Block-Level Attention) mechanism is employed. NABLA applies block-wise pooling to the Q and K matrices with a fixed downsampling factor, computes per-head cumulative distribution functions over the pooled attention scores, and sparsifies attention via a threshold mask, optionally combined with sliding-tile patterns to reduce border effects. At high block sparsity this yields roughly the 2.7× speedup reported in Section 3, with negligible loss in FVD, CLIP, or VBench metrics.
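A sketch of how such a block-level mask could be computed; the block size, mean-pooling choice, and cumulative-mass threshold (`keep_mass`) below are illustrative assumptions rather than the released implementation:

```python
import torch

def nabla_block_mask(q, k, block=64, keep_mass=0.9):
    """Sketch of a NABLA-style block mask (illustrative, not the reference code).

    q, k: (batch, heads, seq, dim). Tokens are pooled into blocks of size `block`,
    pooled attention scores are softmax-normalized per query block, and for each
    head we keep the smallest set of key blocks whose cumulative probability mass
    reaches `keep_mass`; the remaining blocks are masked out."""
    b, h, n, d = q.shape
    nb = n // block
    q_p = q[:, :, :nb * block].reshape(b, h, nb, block, d).mean(dim=3)
    k_p = k[:, :, :nb * block].reshape(b, h, nb, block, d).mean(dim=3)

    scores = torch.einsum("bhqd,bhkd->bhqk", q_p, k_p) / d ** 0.5
    probs = scores.softmax(dim=-1)

    # Sort key blocks by probability, form the cumulative distribution, and keep
    # blocks until `keep_mass` of the attention mass is covered.
    sorted_p, idx = probs.sort(dim=-1, descending=True)
    cdf = sorted_p.cumsum(dim=-1)
    keep_sorted = cdf - sorted_p < keep_mass            # blocks needed to reach the mass
    keep = torch.zeros_like(keep_sorted).scatter(-1, idx, keep_sorted)
    return keep                                         # (batch, heads, nb, nb) boolean mask
```

In a full attention kernel, this block mask would be expanded back to token resolution and optionally combined (OR-ed) with a sliding-tile pattern before the sparse attention computation.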
Key components are specified mathematically:
- Forward diffusion (variance preserving): $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$.
- Reverse (denoising) with Flow Matching: the network predicts a velocity field $v_\theta(x_t, t, c)$, and samples are obtained by integrating the ODE $\frac{dx_t}{dt} = v_\theta(x_t, t, c)$ from noise to data.
- Classifier-Free Guidance: $\tilde{v}_\theta(x_t, t, c) = v_\theta(x_t, t, \varnothing) + w\,\bigl(v_\theta(x_t, t, c) - v_\theta(x_t, t, \varnothing)\bigr)$, with $w$ the guidance scale and $\varnothing$ the null (unconditional) prompt.
- Multi-head attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\bigl(QK^{\top}/\sqrt{d_k}\bigr)V$, computed per head and concatenated across heads.
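A minimal sketch of a Flow Matching training step tying these formulas together, assuming the common linear-interpolation (rectified-flow) parameterization; the model signature and noising details are simplified placeholders:

```python
import torch

def flow_matching_loss(model, x0, text_emb):
    """Sketch of one Flow Matching training step (linear-interpolation form, an
    assumption here): interpolate clean latents x0 toward Gaussian noise and
    regress the predicted velocity onto the constant target (noise - x0)."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)     # one timestep per sample
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))          # broadcast over latent dims
    x_t = (1.0 - t_) * x0 + t_ * noise                # interpolated (noised) latents
    target = noise - x0                               # velocity target dx_t/dt
    v_pred = model(x_t, t, text_emb)                  # predicted velocity field
    return torch.mean((v_pred - target) ** 2)
```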
2. Training Pipeline
The pre-training dataset encompasses over 250 million video scenes (2–60 s each) aggregated from open platforms and public datasets. The curation pipeline integrates shot detection (PySceneDetect), minimum resolution filtering (short side ≥256 px), deduplication (video perceptual hashes), watermark removal (classifier plus YOLO averaged over 5 frames), and dynamic filtering via MS-SSIM at 2 FPS to eliminate static or hyper-dynamic clips. Technical and aesthetic quality are scored using DOVER and Q-Align, while textual and visual filters employ CRAFT, YOLOv8, CLIP, and VideoMAE modules. Synthetic captions are generated using Tarsier2-7B followed by English filtering and regex post-processing. For balanced representation, the data is clustered in embedding space (InternVideo2-1B, k-means, 10 000 clusters).
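As a concrete illustration of one stage of this pipeline, the dynamics filter can be sketched with the third-party `pytorch-msssim` package; the thresholds and keep/reject logic below are assumptions, not the released curation code:

```python
import torch
from pytorch_msssim import ms_ssim  # third-party package `pytorch-msssim`

def dynamics_filter(frames_2fps, static_thr=0.95, chaotic_thr=0.35):
    """Sketch of an MS-SSIM dynamics filter (thresholds are illustrative).

    frames_2fps: (T, 3, H, W) float tensor in [0, 1], frames sampled at 2 FPS.
    Returns True if the clip should be kept (neither static nor hyper-dynamic)."""
    sims = []
    for a, b in zip(frames_2fps[:-1], frames_2fps[1:]):
        sims.append(ms_ssim(a.unsqueeze(0), b.unsqueeze(0), data_range=1.0))
    mean_sim = torch.stack(sims).mean()
    if mean_sim > static_thr:      # near-identical consecutive frames: static clip
        return False
    if mean_sim < chaotic_thr:     # abrupt content changes: hyper-dynamic clip
        return False
    return True
```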
Supervised Fine-Tuning (SFT) comprises ≈2.8 k human-selected video scenes and 45 k images, organized by a VLM classifier into nine domains for parallel fine-tuning (batch size 64, learning rate 1e-5), with the resulting checkpoints aggregated by “soup” weight averaging.
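A minimal sketch of the “soup” aggregation step, assuming uniform averaging of the nine per-domain checkpoints:

```python
import torch

def soup(checkpoint_paths):
    """Average the parameters of several fine-tuned checkpoints uniformly
    (uniform weights are an assumption; a weighted average is equally plausible)."""
    state_dicts = [torch.load(p, map_location="cpu") for p in checkpoint_paths]
    averaged = {}
    for key in state_dicts[0]:
        averaged[key] = torch.stack(
            [sd[key].float() for sd in state_dicts]
        ).mean(dim=0)
    return averaged  # load with model.load_state_dict(averaged)
```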
The multi-stage schedule proceeds through incremental pre-training: a low-resolution (LR) stage (256×256, 10 k steps), a mid-resolution (MR) 5 s stage (up to 768×512, 50 k steps), an MR 10 s stage (10 k steps), SFT (~10 k steps), and a final distillation stage. Batch sizes, learning rates, and weight decay are adjusted at each stage.
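The schedule can be restated as a simple configuration list; values not given above are marked `None`, and the per-stage batch sizes, learning rates, and weight decay are omitted:

```python
# Restatement of the multi-stage schedule described above (illustrative only).
TRAINING_STAGES = [
    {"stage": "LR pre-training", "resolution": "256x256",       "steps": 10_000},
    {"stage": "MR 5 s",          "resolution": "up to 768x512", "steps": 50_000},
    {"stage": "MR 10 s",         "resolution": "up to 768x512", "steps": 10_000},
    {"stage": "SFT",             "resolution": None,            "steps": 10_000},
    {"stage": "distillation",    "resolution": None,            "steps": None},
]
```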
The infrastructure supports large-scale distributed PyTorch training—64 GPUs for the transformer, 32 for text encoders—using NVIDIA H100s (8 GPUs/node, NVLink, InfiniBand), S3-based streaming of pre-encoded VAE latents, non-blocking GPU–CPU offloading for long-sequence RL stages, and gradient/EMA stabilization (AdamW optimizer, β₁=0.9, β₂=0.95, ε=1e-8, grad-clip=1, EMA=0.9999).
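A sketch of the optimizer, gradient-clipping, and EMA setup under the hyperparameters listed above; the training-loop integration is simplified:

```python
import copy
import torch

def make_optimizer_and_ema(model, lr):
    """AdamW with the listed betas/epsilon plus a frozen EMA copy of the model."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr,
                            betas=(0.9, 0.95), eps=1e-8)
    ema_model = copy.deepcopy(model).eval()
    for p in ema_model.parameters():
        p.requires_grad_(False)
    return opt, ema_model

def train_step(model, ema_model, opt, loss, ema_decay=0.9999):
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # grad-clip = 1
    opt.step()
    with torch.no_grad():                                             # EMA update
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.lerp_(p, 1.0 - ema_decay)
```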
3. Inference Procedures and Computational Optimization
Inference uses a baseline of 100 network function evaluations (NFEs, roughly 100 diffusion steps), generating 121 frames in 139 s (~0.87 FPS) on a single 80 GB NVIDIA H100. The distilled “Flash” variant reduces sampling to 16 NFEs, producing the same output in 35 s (~3.5 FPS); both modes run within a ~21 GB memory footprint.
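A minimal sketch of a guided Euler sampler consistent with the Flow Matching formulation above; the guidance scale, the uniform step schedule, and the two network calls per step for guidance are simplifying assumptions, and the distilled Flash variant corresponds to a low step count such as `num_steps=16`:

```python
import torch

@torch.no_grad()
def sample_video_latents(model, text_emb, null_emb, shape, num_steps=100, w=5.0):
    """Integrate dx/dt = v(x, t, c) from t=1 (pure noise) to t=0 (clean latents),
    applying classifier-free guidance at every step. All values are illustrative."""
    x = torch.randn(shape)
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        t = torch.full((shape[0],), float(t_cur))
        v_cond = model(x, t, text_emb)                # conditional velocity
        v_uncond = model(x, t, null_emb)              # unconditional velocity
        v = v_uncond + w * (v_cond - v_uncond)        # classifier-free guidance
        x = x + (t_next - t_cur) * v                  # one Euler step of the ODE
    return x                                          # decode with the video VAE
```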
Principal optimizations include: NABLA block-sparse attention (2.7× acceleration), FlashAttention-2 or SageAttention for short clips (<5 s), MagCache for caching repeated diffusion layers (+46% speed), text encoder 8-bit quantization, and model-level refactorings using torch.compile to maximize GPU occupancy.
4. Performance Metrics and Comparative Evaluation
Automated evaluation combines Fréchet Video Distance (FVD), VBench, and CLIPScore. NABLA-sparse attention maintains near-parity with full attention on these metrics, and CLIPScore indicates prompt alignment within 1–2% of compute-intensive full models.
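For reference, a frame-averaged CLIPScore can be sketched with the `transformers` CLIP implementation; the 100× scaling convention and the frame-sampling protocol are assumptions, not necessarily the paper's exact evaluation setup:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def video_clip_score(frames, prompt, model_name="openai/clip-vit-large-patch14"):
    """Frame-averaged CLIPScore sketch: cosine similarity between each frame's CLIP
    embedding and the prompt embedding, clamped at zero and scaled by 100.

    frames: list of PIL.Image video frames; prompt: the text prompt."""
    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img @ txt.T).squeeze(-1)                  # cosine similarity per frame
    return 100.0 * cos.clamp(min=0).mean().item()
```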
In human side-by-side assessments on the MovieGen benchmark (1,003 prompts, 5-rater overlap):
| Comparison | Visual Quality | Motion Dynamics | Prompt Following |
|---|---|---|---|
| Kandinsky 5.0 Lite vs Sora | +58% Lite | +59% Lite | +54% Lite |
| Lite vs Wan 2.2 5B | +60% Lite | +63% Lite | –10% (Wan better) |
| Lite vs 4.1 Video | +59% Lite | +59% Lite | ≃50/50 tie |
Throughput at the listed clip lengths and resolutions is as follows:
| Model | Frames | Resolution | FPS (full) | FPS (Flash) |
|---|---|---|---|---|
| Video Lite, 5 s clip | 121 | 512×768 | 0.87 | 3.5 |
| Video Lite, 10 s clip | 241 | 512×768 | 1.08 | 3.95 |
5. Constraints and Prospective Enhancements
Known limitations include prompt alignment that lags some state-of-the-art contemporaries (partly attributable to Qwen2.5-VL's maximum context length of 256 tokens) and degraded physical coherence (notably for fluids and cloth) in scenes exceeding 10 seconds. Dataset biases are present (cultural and object-style imbalance), and real-time 24 FPS synthesis on consumer hardware is not yet attainable. Foreseeable development directions include larger-context language encoders, a unified image/video foundation architecture, and deeper investigation of sparsity and quantization strategies.
6. Principal Uses
Kandinsky 5.0 Video Lite addresses applications in text-to-video generation (social media ads, concept reels), image-to-video synthesis (product animation, storyboarding), content creation (prototyping, previsualization), and rapid video drafting for domains such as e-learning and micro-cinematics. Availability as open-source code and training checkpoints through the HuggingFace “diffusers” library facilitates research reproducibility and further experimentation (Arkhipkin et al., 19 Nov 2025).
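A minimal usage sketch through the generic `DiffusionPipeline` entry point; the repository id, call signature, and output attributes below are placeholders that may differ from the released pipeline:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Hypothetical repository name; check the official model card for the actual id.
pipe = DiffusionPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite",
    torch_dtype=torch.bfloat16,
).to("cuda")

result = pipe(
    prompt="A red fox running through fresh snow at sunset",
    num_inference_steps=16,          # low-NFE, Flash-style sampling
)

export_to_video(result.frames[0], "fox.mp4", fps=24)  # ~24 FPS matches 121 frames per 5 s clip
```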