Kandinsky 5.0 Video Pro: Advanced Video Synthesis
- Kandinsky 5.0 Video Pro is a cutting-edge text-to-video synthesis model that employs a 19B-parameter transformer, latent diffusion, and novel attention techniques for high-resolution, up-to-10-second video generation.
- It utilizes the CrossDiT framework with 60 stacked blocks that integrate self-attention, cross-attention, and feed-forward modules to efficiently encode complex spatio-temporal features.
- The model features a robust multi-stage training and fine-tuning pipeline with extensive data curation and optimization strategies, achieving state-of-the-art performance and open-source research accessibility.
Kandinsky 5.0 Video Pro is a foundation model for high-resolution video synthesis of up to 10 seconds, representing the most advanced architecture within the Kandinsky 5.0 generative model suite. It leverages a 19-billion-parameter transformer that combines latent diffusion in a flow-matching paradigm with novel attention and optimization strategies. The system enables state-of-the-art video generation and serves as a platform for both research and practical generative applications, with an emphasis on open-source accessibility and modular fine-tuning (Arkhipkin et al., 19 Nov 2025).
1. Architecture and Parameterization
Kandinsky 5.0 Video Pro centers on the Cross-Attention Diffusion Transformer (CrossDiT) framework, achieving scalability and synthesis quality through several architectural innovations. The model’s parameter allocation is as follows: Qwen2.5-VL text encoders (7B parameters), CrossDiT visual backbone and linguistic token-refiner (LTF) modules (≈10B parameters), auxiliary CLIP ViT-L/14 for embedding (0.3B parameters), and VAE encoders/decoders with miscellaneous heads (≈1.7B parameters).
The CrossDiT backbone comprises 60 stacked blocks, each containing self-attention, cross-attention (against refined text queries), and a feed-forward MLP sub-block, with all sub-blocks linked by residual connections:

$$
\begin{aligned}
h &= x^{(l)} + \mathrm{SelfAttn}\big(x^{(l)}\big),\\
h' &= h + \mathrm{CrossAttn}\big(h,\, c\big),\\
x^{(l+1)} &= h' + \mathrm{FFN}\big(h'\big),
\end{aligned}
$$

where $x^{(l)}$ refers to the visual latents at layer $l$ and $c$ is the text embedding stream. Critical dimensions include a linear attention key/query dimension of 16,384, hidden dimension of 4,096, and time embedding dimension of 1,024. Text conditioning proceeds via Qwen2.5-VL followed by four LTF blocks (CrossDiT sans cross-attention) to mitigate positional encoding bias. Rotary Position Embeddings (RoPE) and sinusoidal time embeddings are inserted at each block. High-dimensional video latents are formed by compressing input frames using a HunyuanVideo VAE, with 3D rotary positional embeddings for spatio-temporal addressing.
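A minimal PyTorch-style sketch of one such block is given below. The norm placement, head count, and the use of a 16,384-wide feed-forward projection are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class CrossDiTBlock(nn.Module):
    """Illustrative CrossDiT block: self-attention, cross-attention over refined
    text tokens, and a feed-forward MLP, all linked by residual connections.
    Dimensions and norm placement are assumptions, not the released code."""

    def __init__(self, dim=4096, n_heads=32, ffn_dim=16384):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim)
        )

    def forward(self, x, text):
        # x: (B, N_visual_tokens, dim) video latents; text: (B, N_text_tokens, dim)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text, text, need_weights=False)[0]
        x = x + self.ffn(self.norm3(x))
        return x
```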
A distinctive scalability feature is Neighborhood-Adaptive Block-Level Attention (NABLA). NABLA pools key/query tokens into size-64 blocks, sparsifies attention via cumulative distribution function (CDF) thresholds, and remaps the sparse selections to the original indices, accommodating sliding-tile unions to minimize boundary artifacts. This attention scheme reduces both training and inference costs (up to 2.7× acceleration) without extra loss terms.
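The sketch below illustrates the block-pooled, CDF-thresholded mask at the heart of this scheme; the pooling choice, per-query-block thresholding, and mass parameter are assumptions for exposition, not the exact published kernel.

```python
import torch

def nabla_block_mask(q, k, block=64, keep_mass=0.9):
    """Illustrative NABLA-style masking: pool queries/keys into blocks, score
    block pairs, and keep the smallest set of key blocks whose softmax mass
    reaches `keep_mass` (a CDF threshold). Assumes mean pooling and
    per-query-block thresholding; q, k have shape (B, H, N, D)."""
    B, H, N, D = q.shape
    nb = N // block
    # Pool tokens into (B, H, nb, D) block representatives.
    qb = q[:, :, : nb * block].reshape(B, H, nb, block, D).mean(dim=3)
    kb = k[:, :, : nb * block].reshape(B, H, nb, block, D).mean(dim=3)
    # Coarse block-level attention scores.
    scores = torch.softmax(qb @ kb.transpose(-1, -2) / D ** 0.5, dim=-1)
    # Sort descending and keep blocks until cumulative mass exceeds keep_mass.
    sorted_scores, order = scores.sort(dim=-1, descending=True)
    cdf = sorted_scores.cumsum(dim=-1)
    keep_sorted = cdf <= keep_mass
    keep_sorted[..., 0] = True  # always keep the strongest block
    # Remap the sparse selection back to the original block indices.
    block_mask = torch.zeros_like(keep_sorted).scatter(-1, order, keep_sorted)
    return block_mask  # (B, H, nb, nb) boolean mask over key blocks
```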
2. Data Curation and Multi-Stage Training
The training pipeline begins with a 250M scene text-to-video (T2V) corpus curated via multi-step quality and diversity filtering. Scene segmentation utilizes PySceneDetect to produce 2–60 second clips, applying initial filters based on pixel resolution (≥256 px) and deduplication through perceptual hashing. Advanced filtering encompasses watermark detection (classifier + YOLO), frame-wise MS-SSIM for structural dynamics, DOVER and Q-Align for technical and aesthetic scoring, CRAFT for text presence, object taxonomy (YOLOv8/CLIP), and camera/object assessment with VideoMAE.
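A hedged sketch of the first curation steps (scene segmentation, duration and resolution filtering, and perceptual-hash deduplication) follows; the thresholds and the specific helpers (OpenCV frame grabbing, `imagehash.phash`) describe one plausible implementation, not the paper's exact pipeline.

```python
import cv2
import imagehash
from PIL import Image
from scenedetect import detect, ContentDetector

def curate_clips(video_path, min_s=2.0, max_s=60.0, min_px=256, hash_dist=4):
    """Split a video into scenes and keep clips passing the duration,
    resolution, and near-duplicate filters. Thresholds are illustrative."""
    kept, seen_hashes = [], []
    cap = cv2.VideoCapture(video_path)
    for start, end in detect(video_path, ContentDetector()):
        duration = end.get_seconds() - start.get_seconds()
        if not (min_s <= duration <= max_s):
            continue
        # Grab a representative frame at the scene start.
        cap.set(cv2.CAP_PROP_POS_FRAMES, start.get_frames())
        ok, frame = cap.read()
        if not ok or min(frame.shape[:2]) < min_px:
            continue
        # Perceptual-hash deduplication against previously kept clips.
        h = imagehash.phash(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        if any(h - prev <= hash_dist for prev in seen_hashes):
            continue
        seen_hashes.append(h)
        kept.append((start.get_seconds(), end.get_seconds()))
    cap.release()
    return kept
```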
Synthetic captioning using Tarsier2-7B, with regex removal of generic or non-Latin text, standardizes textual conditioning. Clustering on InternVideo2-1B embeddings (10,000 K-means clusters) enables balanced sampling by visual content, and videos are sharded by three resolution tiers (256, 512, 1024 pixels).
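Below is a minimal sketch of the cluster-balanced sampling step, assuming precomputed InternVideo2 embeddings in a NumPy array; the use of scikit-learn's `MiniBatchKMeans` and the uniform-over-clusters sampling rule are assumptions, not the authors' exact recipe.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def balanced_sample(embeddings, n_clusters=10_000, per_cluster=100, seed=0):
    """Cluster video embeddings and sample uniformly across clusters so that
    visually over-represented content does not dominate training batches."""
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed)
    labels = km.fit_predict(embeddings)          # (N,) cluster id per video
    rng = np.random.default_rng(seed)
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if members.size == 0:
            continue
        take = min(per_cluster, members.size)
        selected.extend(rng.choice(members, size=take, replace=False))
    return np.array(selected)                    # indices into the corpus
```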
For post-training, the model leverages the Russian Cultural Code (RCC; ≈230k scenes with bilingual annotation) and a Supervised Fine-Tuning (SFT) set (~2.8k “strict”, 12.5k “relaxed” videos, labeled by 16 visual criteria and nine domain categories).
Pre-training is conducted in four stages, each employing joint T2I, T2V, and I2V optimization at sampling frequencies of 2%, 77%, and 21%, respectively. The loss is the flow-matching mean squared error

$$
\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\big\lVert v_\theta(x_t, t, c) - (\epsilon - x_0)\big\rVert_2^2\right], \qquad x_t = (1 - t)\,x_0 + t\,\epsilon,
$$

augmented by 10% unconditional (null-prompt) sampling. SFT proceeds by “model souping”: separate fine-tuning per domain (batch size 64, learning rate 1e-5), aggregated via weighted averaging (weights ∝ √dataset size), yielding higher visual fidelity and prompt adherence compared to monolithic fine-tuning.
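A minimal sketch of the souping step is shown below; it assumes per-domain fine-tuned checkpoints stored as PyTorch state dicts and implements only the √dataset-size weighting described above.

```python
import torch

def soup_checkpoints(checkpoint_paths, dataset_sizes):
    """Weighted-average ('soup') per-domain fine-tuned checkpoints, with weights
    proportional to the square root of each domain's dataset size.
    Assumes all checkpoints share the same architecture and parameter keys."""
    weights = torch.tensor([s ** 0.5 for s in dataset_sizes], dtype=torch.float64)
    weights = weights / weights.sum()

    souped = None
    for w, path in zip(weights.tolist(), checkpoint_paths):
        state = torch.load(path, map_location="cpu")
        if souped is None:
            souped = {k: w * v.to(torch.float64) for k, v in state.items()}
        else:
            for k, v in state.items():
                souped[k] += w * v.to(torch.float64)
    # Cast back to the dtype of the last-loaded checkpoint's tensors.
    return {k: v.to(state[k].dtype) for k, v in souped.items()}
```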
Distillation produces the Flash variants, reducing the number of neural function evaluations (NFEs) from 100 to 16 through an initial classifier-free guidance (CFG) distillation pass (to 50 NFEs) followed by a Trajectory-Segmented Consistency Distillation (TSCD) pass, the latter supplemented by adversarial post-training and re-noising to stabilize the discrimination objective.
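As an illustration of the first stage, the sketch below shows a generic CFG-distillation objective in which a student learns to reproduce the teacher's guided velocity in a single forward pass; the guidance scale, loss, and function signatures are assumptions, and the TSCD and adversarial stages are omitted.

```python
import torch
import torch.nn.functional as F

def cfg_distillation_loss(student, teacher, x_t, t, text_emb, null_emb, scale=5.0):
    """Generic CFG-distillation step (illustrative, not the paper's exact recipe):
    the student matches the teacher's guided velocity so that inference no longer
    needs two teacher evaluations (conditional + unconditional) per step."""
    with torch.no_grad():
        v_cond = teacher(x_t, t, text_emb)
        v_uncond = teacher(x_t, t, null_emb)
        v_guided = v_uncond + scale * (v_cond - v_uncond)
    v_student = student(x_t, t, text_emb)
    return F.mse_loss(v_student, v_guided)
```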
3. Optimization and Inference Workflows
Beyond NABLA, several system-level innovations drive efficient inference and scalable training. VAE encoder tiling, fused input/output, and torch.compile deliver a 2.5× speedup in latent encoding. Flash Attention and Sage Attention are used for standard-definition or sub-5-second video, with NABLA reserved for higher-resolution or longer sequences. An intermediate MagCache implementation accelerates diffusion-step computation by approximately 46%. Large-scale parallelism is achieved through Fully Sharded Data Parallel (FSDP) and sequence parallelism across up to 64 GPUs, with asynchronous activation checkpointing and host offloading during reinforcement learning phases. Analytical scaling models for step time and VRAM utilization inform batch-size and data-sharding decisions.
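The sketch below shows how some of these system-level pieces (FSDP sharding, activation checkpointing, and torch.compile) combine in standard PyTorch; the block class name and wrapping granularity are placeholders, and sequence parallelism, host offloading, and MagCache are omitted.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)

def wrap_for_training(model: torch.nn.Module) -> torch.nn.Module:
    """Illustrative large-scale training setup: shard parameters with FSDP,
    checkpoint transformer-block activations, and compile the result.
    Block selection and FSDP policy are placeholders, not the released config."""
    dist.init_process_group("nccl")
    model = FSDP(
        model.cuda(),
        use_orig_params=True,   # keeps original parameters visible to torch.compile
    )
    # Re-materialize block activations in the backward pass to save VRAM.
    apply_activation_checkpointing(
        model,
        check_fn=lambda m: m.__class__.__name__ == "CrossDiTBlock",  # placeholder
    )
    return torch.compile(model)
```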
4. Empirical Performance and Benchmarking
The inference costs for Video Pro and its Flash variant on a single NVIDIA H100 (80 GB VRAM) are as follows:
| Variant | Resolution (px) | Frames | NFEs | Generation Time, 10 s Video (s) |
|---|---|---|---|---|
| Video Pro | 512×768 | 241 | 100 | 1,158 |
| Flash (distilled) | 512×768 | 241 | 16 | 242 |
Video Pro attains leading scores on VBench and FVD leaderboards and demonstrates a +1.8 CLIP-score improvement over Kandinsky 4.1. Human side-by-side (SBS) evaluation on MovieGen prompts (1,003 prompts, 44 raters, ~65k judgments, 5× overlap, 71% inter-rater agreement) shows superior results versus peer models. Against Veo 3/Veo 3 Fast, Video Pro leads in Visual Quality (≈0.62–0.68) and Motion Dynamics (≈0.72–0.75) but underperforms in Prompt Following (≈0.44 vs. 0.58). Similar trends hold versus Wan 2.2 A14B, with Video Pro excelling in Temporal Coherence while showing modest gaps in fine-grained semantic alignment. Compared to its predecessor, Kandinsky 4.1, it gains +0.10 in Visual Quality and +0.15 in Motion Dynamics while holding parity or slightly improving in prompt fidelity (Arkhipkin et al., 19 Nov 2025).
5. Application Scope and Extension Modalities
Kandinsky 5.0 Video Pro functions as a stand-alone text-to-video (T2V) generator and serves as a base for fine-tuned expansion. In image-to-video (I2V) mode, the architecture generates motion by fixing the initial frame and incrementally noising subsequent frames. For conditional video editing, the CrossDiT core enables text-driven framewise modifications using I2I data samples. Domain-specialized variants are formed by adjusting the weights within the SFT-soup aggregation, permitting “softer” or “cinematic” generative moods. Prospective multimodal extensions include integration with audio-generative heads and simulation via world-model planning modules.
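A conceptual sketch of the I2V conditioning loop described above follows; the Euler flow-matching sampler, the way the clean first-frame latent is injected, and all function signatures are assumptions chosen for illustration, not the released sampler.

```python
import torch

@torch.no_grad()
def sample_i2v(model, first_frame_latent, text_emb, steps=16, shape=(16, 60, 64, 64)):
    """Illustrative image-to-video sampling: keep the first temporal latent
    pinned to the (clean) conditioning frame while the remaining frames are
    denoised from noise along a simple Euler flow-matching trajectory.
    `first_frame_latent` is assumed to have shape (1, C, H, W)."""
    c, t_frames, h, w = shape
    x = torch.randn(1, c, t_frames, h, w)          # start from pure noise
    ts = torch.linspace(1.0, 0.0, steps + 1)        # flow time: 1 = noise, 0 = data
    for i in range(steps):
        t_cur, t_next = ts[i], ts[i + 1]
        # Pin the first frame: re-inject the conditioning latent, re-noised
        # to the current flow time, before each velocity prediction.
        x[:, :, 0] = (1 - t_cur) * first_frame_latent + t_cur * torch.randn_like(first_frame_latent)
        v = model(x, t_cur.unsqueeze(0), text_emb)  # predicted velocity field
        x = x + (t_next - t_cur) * v                # Euler update (dt < 0)
    x[:, :, 0] = first_frame_latent                 # keep the exact first frame
    return x
```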
Kandinsky 5.0 Video Pro is released with open-source code and publicly available training checkpoints, supporting further research, domain transfer, and industry application across the generative modeling landscape (Arkhipkin et al., 19 Nov 2025).