CrossDiT Diffusion Transformer
- The CrossDiT Diffusion Transformer is a unified latent-diffusion model that integrates self-attention, cross-attention, and feed-forward modules for multi-modal generation.
- It combines a flow-matching diffusion loss, Qwen2.5-VL-based text conditioning, and neighborhood-adaptive block-level (NABLA) sparse attention to balance quality and efficiency in T2I, T2V, and image-editing scenarios.
- Empirical evaluations demonstrate that CrossDiT-based models deliver state-of-the-art performance in fidelity, speed, and computational efficiency across varied generative tasks.
The CrossDiT (Cross-Attention Diffusion Transformer) is the architectural foundation of the Kandinsky 5.0 model family for high-fidelity image and video generation. Engineered as a unified latent-diffusion transformer backbone, CrossDiT is designed to scale efficiently across parameter budgets, tasks, and modalities, including text-to-image (T2I), text-to-video (T2V), and instruction-based image editing. It is trained under a flow-matching diffusion paradigm and incorporates recipe-driven optimizations for both throughput and generation quality, exemplifying a modern approach to multi-modal foundation models (Arkhipkin et al., 19 Nov 2025).
1. Model Architecture and Backbone
CrossDiT is based on a stack of modular blocks, each incorporating three principal components: self-attention, cross-attention (conditionally on language embeddings), and a feed-forward network (MLP), each surrounded by pre-normalization and residual connections. All variants use a VAE encoder/decoder—FLUX.1-dev for images and HunyuanVideo VAE for videos—to map raw pixels to a compact latent representation, which serves as input for the transformer-based denoising process.
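The block structure described above can be summarized in a minimal PyTorch sketch. The module names, the use of nn.MultiheadAttention, and the omission of timestep/adaLN-style modulation are simplifications for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class CrossDiTBlock(nn.Module):
    """Minimal sketch of one CrossDiT block: pre-norm self-attention,
    cross-attention to text embeddings, and an MLP, each with a residual."""

    def __init__(self, dim: int, text_dim: int, n_heads: int, ff_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, ff_dim), nn.GELU(),
                                 nn.Linear(ff_dim, dim))

    def forward(self, x, text_emb):
        # x: latent tokens (B, N, dim); text_emb: refined text tokens (B, T, text_dim)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_emb, text_emb, need_weights=False)[0]
        x = x + self.mlp(self.norm3(x))
        return x
```

Stacking such blocks at the depths and widths listed in the table below yields the Lite and Pro variants.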
The text-conditioning pathway uses the Qwen2.5-VL text encoder (7B parameters, outputting 3584-dimensional embeddings), refined through a lightweight Linguistic Token Refiner (LTF) module. For long sequences, especially in video, CrossDiT incorporates Neighborhood-Adaptive Block-level (NABLA) sparse attention to accelerate computation while preserving fidelity.
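NABLA's exact criterion (adaptive, per-layer and per-head block selection) is specified in the paper; the following is only a rough block-sparse attention sketch under assumed names and a fixed keep fraction, pooling tokens into blocks, scoring block pairs, and retaining the top-scoring key blocks per query block:

```python
import torch

def block_sparse_mask(q, k, block: int = 64, keep: float = 0.1):
    """Rough sketch of block-level sparse attention selection (not the exact
    NABLA criterion): pool tokens into blocks, score block pairs, and keep
    only the top fraction of key blocks for each query block."""
    B, H, N, D = q.shape                              # assumes N is divisible by `block`
    nb = N // block
    qb = q.reshape(B, H, nb, block, D).mean(dim=3)    # pooled query blocks
    kb = k.reshape(B, H, nb, block, D).mean(dim=3)    # pooled key blocks
    scores = torch.einsum("bhid,bhjd->bhij", qb, kb) / D ** 0.5
    k_keep = max(1, int(keep * nb))
    top = scores.topk(k_keep, dim=-1).indices         # indices of kept key blocks
    block_mask = torch.zeros(B, H, nb, nb, dtype=torch.bool, device=q.device)
    block_mask.scatter_(-1, top, True)
    # Expand to token resolution; True marks positions that are attended.
    return block_mask.repeat_interleave(block, dim=2).repeat_interleave(block, dim=3)
```

The boolean mask follows the convention of torch.nn.functional.scaled_dot_product_attention (True = attend); production block-sparse kernels perform the selection without materializing a dense token-level mask.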
The architecture is parameterized according to deployment regime:
| Model | Blocks | LTF blocks | Hidden dim | Time dim | FF dim |
|---|---|---|---|---|---|
| Image Lite | 50 | 2 | 2560 | 512 | 10240 |
| Video Lite | 32 | 2 | 1792 | 512 | 7168 |
| Video Pro | 60 | 4 | 4096 | 1024 | 16384 |
Key computational frameworks:
- Flow-matching diffusion loss: $\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\big[\lVert v_\theta(x_t, t, c) - (\epsilon - x_0)\rVert^2\big]$, with interpolation path $x_t = (1 - t)\,x_0 + t\,\epsilon$ and text conditioning $c$.
- Transformer attention (single head): $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\big(QK^\top / \sqrt{d_k}\big)\,V$.
- Flow-matching sampling ODE (also used in the RL stage): $\frac{dx_t}{dt} = v_\theta(x_t, t, c)$, with the RL objective additionally regularized by a KL divergence against the SFT model (see Section 3).
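A minimal training-step sketch of the flow-matching loss above; the time-sampling distribution and the model's conditioning interface are assumptions for illustration:

```python
import torch

def flow_matching_loss(model, x0, text_emb):
    """Sketch of one flow-matching training step: interpolate data x0 toward
    Gaussian noise and regress the predicted velocity onto (eps - x0)."""
    b = x0.shape[0]
    eps = torch.randn_like(x0)                   # noise endpoint
    t = torch.rand(b, device=x0.device)          # time samples in [0, 1)
    t_exp = t.view(b, *([1] * (x0.dim() - 1)))   # broadcast over latent dims
    xt = (1 - t_exp) * x0 + t_exp * eps          # linear interpolation path
    v_target = eps - x0                          # path velocity d x_t / dt
    v_pred = model(xt, t, text_emb)              # conditional velocity prediction
    return torch.mean((v_pred - v_target) ** 2)
```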
2. Data Pipeline and Curation
Training the CrossDiT-centric Kandinsky 5.0 models relies on extensive multi-modal curation workflows:
- Text-to-Image: Approximately 500 million image-caption pairs are sourced from LAION, COYO, and similar large-scale datasets. Filtering includes minimum resolution (≥256px), perceptual-hash deduplication, watermark detection (ResNeXt 101, YOLO), technical and aesthetic scoring (TOPIQ, Q-Align), text-region exclusion (CRAFT), complexity analysis (SAM 2 and Sobel), and object/scene annotation (YOLOv8, CLIP). English captions are synthesized with InternVL2-26B and refined by InternLM3-8B; Russian with Qwen2.5-VL.
- Text-to-Video: Around 250 million video scenes undergo shot segmentation (PySceneDetect), with filtering similar to T2I, supplemented by MS-SSIM motion metrics, watermark checking, DOVER and Q-Align for quality, camera/object dynamics via VideoMAE, and scene clustering of InternVideo2 embeddings with k-means.
- Image Editing (I2I): 150 million image pairs filtered by CLIP/DINO similarity (>0.8), RANSAC-based geometric consistency checks, DINO-aligned crop scoring, and other heuristics. Edits are captioned by a fine-tuned GLM-4.5; a supervised subset (153k examples) is curated via Q-Align and human annotation.
Supervised fine-tuning data comprise ~153k image examples (dual-language captions), 2.8k video scenes, and 45k images, all grouped into nine domains via VLM classification or k-means over CLIP embeddings.
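As an illustration of the clustering-based domainization, the following sketch groups images into nine domains with k-means over CLIP embeddings; the checkpoint id, scikit-learn usage, and per-image processing are assumptions, not the paper's pipeline:

```python
import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

def assign_domains(image_paths, n_domains: int = 9):
    """Sketch: embed images with CLIP and group them into domains with k-means."""
    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
    feats = []
    with torch.no_grad():
        for path in image_paths:
            inputs = processor(images=Image.open(path), return_tensors="pt")
            feats.append(model.get_image_features(**inputs).squeeze(0).numpy())
    feats = np.stack(feats)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)   # cosine-style normalization
    return KMeans(n_clusters=n_domains).fit_predict(feats)  # domain label per image
```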
3. Multi-Stage Training and Optimization
CrossDiT-based models follow a staged training pipeline:
- Pre-training uses AdamW, progressive resolution (image: LR→MR→HR; video: 1s→5s→10s), and 10% unconditional (caption-dropout) training. Batch sizes (Image Lite: 8k→2k; Video Pro: 16k→665) and learning-rate schedules are tuned per regime; EMA decay is set to 0.9999, and NABLA attention is applied to longer and higher-resolution samples.
- Supervised Fine-Tuning (SFT) leverages "model soups": domain- or subdomain-specific fine-tuning at low learning rates, followed by weight-averaging of the resulting checkpoints (a minimal averaging sketch follows this list). Video SFT proceeds analogously.
- Distillation, specific to the Video Lite/Pro Flash models, involves CFG-guidance distillation (reducing the number of function evaluations from 50 to 16), followed by Hinge-GAN post-training on re-noised frames with a logit-normal noise schedule.
- RL-based Fine-Tuning (Image Lite only) employs a Qwen2.5-VL-7B reward model trained on human preference pairs and applies Direct Reward Fine-Tuning, with the RL loss regularized by a KL-divergence term against the SFT model.
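The SFT "model soup" step above reduces to averaging checkpoint weights; a minimal sketch, where the checkpoint paths and uniform default are illustrative:

```python
import torch

def average_checkpoints(paths, weights=None):
    """Sketch of an SFT 'model soup': average state dicts from domain-specific
    fine-tunes, either uniformly or with per-checkpoint weights."""
    if weights is None:
        weights = [1.0 / len(paths)] * len(paths)   # uniform soup by default
    soup = None
    for path, w in zip(paths, weights):
        state = torch.load(path, map_location="cpu")
        if soup is None:
            soup = {k: w * v.float() for k, v in state.items()}
        else:
            for k, v in state.items():
                soup[k] += w * v.float()
    return soup

# soup = average_checkpoints(["sft_portraits.pt", "sft_landscapes.pt"])  # hypothetical paths
# model.load_state_dict(soup)
```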
4. Inference, Acceleration, and System Integration
Inference and system integration for CrossDiT models feature multiple efficiency innovations:
- VAE Inference: Tiling and torch.compile yield 2.5× speedup and improved tile-border continuity.
- Attention Scaling: For short sequences, Flash/Sage attention is used; for higher resolutions or longer videos, NABLA achieves ≈90% sparsity and 2.7× speedup.
- MagCache: Diffusion-step caching accelerates generation by ≈46%.
- Text Encoder Quantization: INT4 quantization minimizes memory overhead.
- Distributed Training: FSDP/HSDP combined with sequence parallelism, sharded over 64 GPUs, with activation checkpointing and offloading in the RL stage.
- Dynamic Batching: Batches are grouped by aspect ratio and frame count to maximize GPU efficiency.
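For the VAE-side optimizations listed above, diffusers exposes tiled decoding on its autoencoders and torch.compile can be applied to the decoder; a minimal sketch reusing the checkpoint id from the usage example below (the exact pipeline class and the reported 2.5× figure come from the paper's own implementation):

```python
import torch
from diffusers import DiffusionPipeline

# Checkpoint id taken from the usage example below; the released pipeline
# class may differ from the generic DiffusionPipeline loader used here.
pipe = DiffusionPipeline.from_pretrained(
    "kandinskylab/kandinsky-5-image-lite", torch_dtype=torch.float16
).to("cuda")

# Tiled VAE decoding bounds memory at high resolution; torch.compile fuses
# and optimizes the decoder. The paper additionally smooths tile borders.
pipe.vae.enable_tiling()
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune")
```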
Performance metrics (H100, single GPU):
| Model | Frames | Resolution | NFEs | Time (s) | Memory (GB) |
|---|---|---|---|---|---|
| Video Lite 5s | 121 | 512×768 | 100 | 139 | 21 |
| Video Lite 5s Flash | 121 | 512×768 | 16 | 35 | 21 |
| Video Pro 10s | 241 | 512×768 | 100 | 1158 | 51 |
| Video Pro 10s Flash | 241 | 512×768 | 16 | 242 | 51 |
| Image Lite | 1 | 1024×1024 | 100 | 13 | 17 |
5. Empirical Evaluation and Comparison
Automated metrics and human evaluation demonstrate state-of-the-art performance:
- COCO-30k FID (Image Lite): Outperforms Stable Diffusion 2.1 and DALL·E 2.
- Video Metrics (Video Lite): FVD/CLIP/VBench parity or improvement over Wan 2.2 5B and Sora.
- Side-by-Side Human Study (MovieGen, 1,002 prompts, 5× overlap):
  - Video Lite vs Sora: 62% preference on object/action fidelity, 59% on motion/visual quality.
  - Video Lite vs Wan 2.2 5B/A14B: higher visual/motion scores, with a ~10–15 point drop in prompt adherence.
  - Video Lite vs Kandinsky 4.1: better motion (+59%) and 27% fewer artifacts.
  - Video Pro vs Veo 3/Fast: outperforms in quality/dynamics, with a <15% gap in prompt following.
- Flash-distilled models show <10% drop in visual/motion criteria for substantial speedup.
- Image Lite exceeds FLUX.1 and Qwen-Image in quality (+0.08) and prompt following (+0.05).
- Instruction-based editing matches or exceeds FLUX.1 Kontext in instruction following and aesthetics.
6. Applications and Deployment Recommendations
CrossDiT models enable high-resolution T2I (up to 1408px) for commercial, social, and product content; I2I for inpainting, style transfer, and sketch-to-photo translation; and T2V/I2V (5s/10s, up to 1408px, 24fps) for storyboarding, marketing, and virtual sets.
Deployment best practices include:
- Domain-specific SFT soups for style/content adaptation.
- Text encoder quantization (INT4) for inference efficiency.
- Flash/NABLA attention for latency-critical use.
- Flash variants (Video Lite/Pro) for ultra-fast sampling.
- Prompt engineering following standardized templates (illustrated below):
  - T2I: “[Subject] + [Style/Lighting] + [Details]”
  - T2V: “[Subject] + [Action] + [Environment] + [Cinematic params]”
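A small illustration of filling the templates above; the helper functions and example strings are made up:

```python
def t2i_prompt(subject: str, style_lighting: str, details: str) -> str:
    # "[Subject] + [Style/Lighting] + [Details]"
    return f"{subject}, {style_lighting}, {details}"

def t2v_prompt(subject: str, action: str, environment: str, cinematic: str) -> str:
    # "[Subject] + [Action] + [Environment] + [Cinematic params]"
    return f"{subject} {action} in {environment}, {cinematic}"

prompt = t2v_prompt(
    "a red fox", "leaping over a stream",
    "a misty autumn forest", "35mm lens, golden hour, slow motion",
)
```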
Example usage (Python, diffusers):
```python
import torch
from diffusers import KandinskyImagePipeline, KandinskyVideoPipeline
from diffusers.utils import export_to_video, load_image

# Pipeline classes and output fields follow the Kandinsky 5.0 release and may
# differ between diffusers versions; consult the model cards for exact usage.

# Text-to-image (Image Lite)
pipe_img = KandinskyImagePipeline.from_pretrained(
    "kandinskylab/kandinsky-5-image-lite", torch_dtype=torch.float16
).to("cuda")
image = pipe_img("A photorealistic portrait of a fox in a forest").images[0]

# Text-to-video (Video Lite)
pipe_vid = KandinskyVideoPipeline.from_pretrained(
    "kandinskylab/kandinsky-5-video-lite", torch_dtype=torch.float16
).to("cuda")
frames = pipe_vid("A sleek red car racing down a neon city street at night").frames[0]
export_to_video(frames, "out.mp4", fps=24)

# Image-to-video: animate a still frame with the same Video Lite pipeline
input_frame = load_image("street.png")
clip = pipe_vid(prompt="Camera pans upward", init_frame=input_frame).frames[0]
export_to_video(clip, "pan.mp4", fps=24)
```
7. Context and Impact within Foundation Models
CrossDiT, as instantiated in Kandinsky 5.0, exemplifies a modular, scalable approach to generative multi-modal learning, unifying high-fidelity image and video synthesis in a single transformer design leveraging latent diffusion, advanced attention mechanisms, and large-scale curation pipelines. Its open-source (MIT) release and comprehensive checkpoints facilitate downstream adaptation and benchmarking. The backbone integrates recent innovations in diffusion acceleration, clustering-based data domainization, and reward-model-driven fine-tuning, contributing to the accessible deployment of foundation models for both research and applied creative domains (Arkhipkin et al., 19 Nov 2025).