CrossDiT Diffusion Transformer
- The CrossDiT Diffusion Transformer is a unified latent-diffusion model that integrates self-attention, cross-attention, and feed-forward modules for multi-modal generation.
- It combines a flow-matching diffusion loss, Qwen2.5-VL-based text conditioning, and neighborhood-adaptive block-level (NABLA) sparse attention to balance quality and efficiency in T2I, T2V, and image-editing scenarios.
- Empirical evaluations demonstrate that CrossDiT-based models deliver state-of-the-art performance in fidelity, speed, and computational efficiency across varied generative tasks.
The CrossDiT (Cross-Attention Diffusion Transformer) is the architectural foundation of the Kandinsky 5.0 model family for high-fidelity image and video generation. Engineered as a unified latent-diffusion transformer backbone, CrossDiT is designed to scale efficiently across parameter budgets, tasks, and modalities, including text-to-image (T2I), text-to-video (T2V), and instruction-based image editing. It is trained under a flow-matching diffusion paradigm and incorporates recipe-driven optimizations for both throughput and generation quality, exemplifying a modern approach to multi-modal foundation models (Arkhipkin et al., 19 Nov 2025).
1. Model Architecture and Backbone
CrossDiT is based on a stack of modular blocks, each incorporating three principal components: self-attention, cross-attention (conditionally on language embeddings), and a feed-forward network (MLP), each surrounded by pre-normalization and residual connections. All variants use a VAE encoder/decoder—FLUX.1-dev for images and HunyuanVideo VAE for videos—to map raw pixels to a compact latent representation, which serves as input for the transformer-based denoising process.
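The block structure described above can be summarized in a minimal PyTorch sketch. The module names, the use of nn.MultiheadAttention, and the omission of timestep/adaLN-style modulation are simplifications for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class CrossDiTBlock(nn.Module):
    """Minimal sketch of one CrossDiT block: pre-norm self-attention,
    cross-attention to text embeddings, and an MLP, each with a residual."""

    def __init__(self, dim: int, text_dim: int, n_heads: int, ff_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, ff_dim), nn.GELU(),
                                 nn.Linear(ff_dim, dim))

    def forward(self, x, text_emb):
        # x: latent tokens (B, N, dim); text_emb: refined text tokens (B, T, text_dim)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_emb, text_emb, need_weights=False)[0]
        x = x + self.mlp(self.norm3(x))
        return x
```

Stacking such blocks at the depths and widths listed in the table below yields the Lite and Pro variants.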
The text-conditioning pathway uses the Qwen2.5-VL text encoder (7B parameters, outputting 3584-dimensional embeddings), refined through a lightweight Linguistic Token Refiner (LTF) module. For long sequences, especially in video, CrossDiT incorporates Neighborhood-Adaptive Block-level (NABLA) sparse attention to accelerate computation while preserving fidelity.
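NABLA's exact criterion (adaptive, per-layer and per-head block selection) is specified in the paper; the following is only a rough block-sparse attention sketch under assumed names and a fixed keep fraction, pooling tokens into blocks, scoring block pairs, and retaining the top-scoring key blocks per query block:

```python
import torch

def block_sparse_mask(q, k, block: int = 64, keep: float = 0.1):
    """Rough sketch of block-level sparse attention selection (not the exact
    NABLA criterion): pool tokens into blocks, score block pairs, and keep
    only the top fraction of key blocks for each query block."""
    B, H, N, D = q.shape                              # assumes N is divisible by `block`
    nb = N // block
    qb = q.reshape(B, H, nb, block, D).mean(dim=3)    # pooled query blocks
    kb = k.reshape(B, H, nb, block, D).mean(dim=3)    # pooled key blocks
    scores = torch.einsum("bhid,bhjd->bhij", qb, kb) / D ** 0.5
    k_keep = max(1, int(keep * nb))
    top = scores.topk(k_keep, dim=-1).indices         # indices of kept key blocks
    block_mask = torch.zeros(B, H, nb, nb, dtype=torch.bool, device=q.device)
    block_mask.scatter_(-1, top, True)
    # Expand to token resolution; True marks positions that are attended.
    return block_mask.repeat_interleave(block, dim=2).repeat_interleave(block, dim=3)
```

The boolean mask follows the convention of torch.nn.functional.scaled_dot_product_attention (True = attend); production block-sparse kernels perform the selection without materializing a dense token-level mask.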
The architecture is parameterized according to deployment regime:
| Model | Blocks | LTF blocks | Hidden dim | Time dim | FF dim |
|---|---|---|---|---|---|
| Image Lite | 50 | 2 | 2560 | 512 | 10240 |
| Video Lite | 32 | 2 | 1792 | 512 | 7168 |
| Video Pro | 60 | 4 | 4096 | 1024 | 16384 |
Key computational frameworks:
- Flow-matching diffusion loss: $\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\big[\lVert v_\theta(x_t, t, c) - (\epsilon - x_0)\rVert^2\big]$, with interpolation path $x_t = (1 - t)\,x_0 + t\,\epsilon$ and text conditioning $c$.
- Transformer attention (single head): $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\big(QK^\top / \sqrt{d_k}\big)\,V$.
- Flow-matching sampling ODE (also used in the RL stage): $\frac{dx_t}{dt} = v_\theta(x_t, t, c)$, with the RL objective additionally regularized by a KL divergence against the SFT model (see Section 3).
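A minimal training-step sketch of the flow-matching loss above; the time-sampling distribution and the model's conditioning interface are assumptions for illustration:

```python
import torch

def flow_matching_loss(model, x0, text_emb):
    """Sketch of one flow-matching training step: interpolate data x0 toward
    Gaussian noise and regress the predicted velocity onto (eps - x0)."""
    b = x0.shape[0]
    eps = torch.randn_like(x0)                   # noise endpoint
    t = torch.rand(b, device=x0.device)          # time samples in [0, 1)
    t_exp = t.view(b, *([1] * (x0.dim() - 1)))   # broadcast over latent dims
    xt = (1 - t_exp) * x0 + t_exp * eps          # linear interpolation path
    v_target = eps - x0                          # path velocity d x_t / dt
    v_pred = model(xt, t, text_emb)              # conditional velocity prediction
    return torch.mean((v_pred - v_target) ** 2)
```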
2. Data Pipeline and Curation
Training the CrossDiT-centric Kandinsky 5.0 models relies on extensive multi-modal curation workflows:
- Text-to-Image: Approximately 500 million image-caption pairs are sourced from LAION, COYO, and similar large-scale datasets. Filtering includes minimum resolution (≥256px), perceptual-hash deduplication, watermark detection (ResNeXt 101, YOLO), technical and aesthetic scoring (TOPIQ, Q-Align), text-region exclusion (CRAFT), complexity analysis (SAM 2 and Sobel), and object/scene annotation (YOLOv8, CLIP). English captions are synthesized with InternVL2-26B and refined by InternLM3-8B; Russian with Qwen2.5-VL.
- Text-to-Video: Around 250 million video scenes undergo shot segmentation (PySceneDetect), with filtering similar to T2I, supplemented by MS-SSIM motion metrics, watermark checking, DOVER and Q-Align for quality, camera/object dynamics via VideoMAE, and scene clustering of InternVideo2 embeddings with k-means.
- Image Editing (I2I): 150 million image pairs filtered by CLIP/DINO similarity (>0.8), RANSAC-based geometric consistency checks, DINO-aligned crop scoring, and other heuristics. Edits are captioned by a fine-tuned GLM-4.5; a supervised subset (153k examples) is curated via Q-Align and human annotation.
Supervised fine-tuning data comprise ~153k image examples (dual-language captions), 2.8k video scenes, and 45k images, all grouped into nine domains via VLM classification or k-means over CLIP embeddings.
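As an illustration of the clustering-based domainization, the following sketch groups images into nine domains with k-means over CLIP embeddings; the checkpoint id, scikit-learn usage, and per-image processing are assumptions, not the paper's pipeline:

```python
import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

def assign_domains(image_paths, n_domains: int = 9):
    """Sketch: embed images with CLIP and group them into domains with k-means."""
    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
    feats = []
    with torch.no_grad():
        for path in image_paths:
            inputs = processor(images=Image.open(path), return_tensors="pt")
            feats.append(model.get_image_features(**inputs).squeeze(0).numpy())
    feats = np.stack(feats)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)   # cosine-style normalization
    return KMeans(n_clusters=n_domains).fit_predict(feats)  # domain label per image
```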
3. Multi-Stage Training and Optimization
CrossDiT-based models follow a staged training pipeline:
- Pre-training uses AdamW, progressive resolution (image: LR→MR→HR; video: 1s→5s→10s), and 10% unconditional (caption-dropout) training. Batch sizes (Image Lite: 8k→2k; Video Pro: 16k→665) and learning-rate schedules are tuned per regime; EMA decay is set to 0.9999, and NABLA attention is applied to longer and higher-resolution samples.
- Supervised Fine-Tuning (SFT) leverages "model soups": domain- or subdomain-specific fine-tuning at low learning rates, followed by weight-averaging of the resulting checkpoints (a minimal averaging sketch follows this list). Video SFT proceeds analogously.
- Distillation, specific to the Video Lite/Pro Flash models, involves CFG-guidance distillation (reducing the number of function evaluations from 50 to 16), followed by Hinge-GAN post-training on re-noised frames with a logit-normal noise schedule.
- RL-based Fine-Tuning (Image Lite only) employs a Qwen2.5-VL-7B reward model trained on human preference pairs and applies Direct Reward Fine-Tuning, with the RL loss regularized by a KL-divergence term against the SFT model.
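The SFT "model soup" step above reduces to averaging checkpoint weights; a minimal sketch, where the checkpoint paths and uniform default are illustrative:

```python
import torch

def average_checkpoints(paths, weights=None):
    """Sketch of an SFT 'model soup': average state dicts from domain-specific
    fine-tunes, either uniformly or with per-checkpoint weights."""
    if weights is None:
        weights = [1.0 / len(paths)] * len(paths)   # uniform soup by default
    soup = None
    for path, w in zip(paths, weights):
        state = torch.load(path, map_location="cpu")
        if soup is None:
            soup = {k: w * v.float() for k, v in state.items()}
        else:
            for k, v in state.items():
                soup[k] += w * v.float()
    return soup

# soup = average_checkpoints(["sft_portraits.pt", "sft_landscapes.pt"])  # hypothetical paths
# model.load_state_dict(soup)
```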
4. Inference, Acceleration, and System Integration
Inference and system integration for CrossDiT models feature multiple efficiency innovations:
- VAE Inference: Tiling and torch.compile yield 2.5× speedup and improved tile-border continuity.
- Attention Scaling: For short sequences, Flash/Sage attention is used; for higher resolutions or longer videos, NABLA achieves ≈90% sparsity and 2.7× speedup.
- MagCache: Diffusion-step caching accelerates generation by ≈46%.
- Text Encoder Quantization: INT4 quantization minimizes memory overhead.
- Distributed Training: FSDP/HSDP combined with sequence parallelism, sharded over 64 GPUs, with activation checkpointing and offloading in the RL stage.
- Dynamic Batching: Batches are grouped by aspect ratio and frame count to maximize GPU efficiency.
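For the VAE-side optimizations listed above, diffusers exposes tiled decoding on its autoencoders and torch.compile can be applied to the decoder; a minimal sketch reusing the checkpoint id from the usage example below (the exact pipeline class and the reported 2.5× figure come from the paper's own implementation):

```python
import torch
from diffusers import DiffusionPipeline

# Checkpoint id taken from the usage example below; the released pipeline
# class may differ from the generic DiffusionPipeline loader used here.
pipe = DiffusionPipeline.from_pretrained(
    "kandinskylab/kandinsky-5-image-lite", torch_dtype=torch.float16
).to("cuda")

# Tiled VAE decoding bounds memory at high resolution; torch.compile fuses
# and optimizes the decoder. The paper additionally smooths tile borders.
pipe.vae.enable_tiling()
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune")
```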
Performance metrics (H100, single GPU):
| Model | Frames | Resolution | NFEs | Time (s) | Memory (GB) |
|---|---|---|---|---|---|
| Video Lite 5s | 121 | 512×768 | 100 | 139 | 21 |
| Video Lite 5s Flash | 121 | 512×768 | 16 | 35 | 21 |
| Video Pro 10s | 241 | 512×768 | 100 | 1158 | 51 |
| Video Pro 10s Flash | 241 | 512×768 | 16 | 242 | 51 |
| Image Lite | 1 | 1024×1024 | 100 | 13 | 17 |
5. Empirical Evaluation and Comparison
Automated metrics and human evaluation demonstrate state-of-the-art performance:
- COCO-30k FID (Image Lite): Outperforms Stable Diffusion 2.1 and DALL·E 2.
- Video Metrics (Video Lite): FVD/CLIP/VBench parity or improvement over Wan 2.2 5B and Sora.
- Side-by-Side Human Study (MovieGen, 1,002 prompts, 5× overlap):
  - Video Lite vs Sora: 62% preference on object/action fidelity, 59% on motion/visual quality.
  - Video Lite vs Wan 2.2 5B/A14B: higher visual/motion scores, with a ~10–15 point drop in prompt adherence.
  - Video Lite vs Kandinsky 4.1: better motion (+59%) and 27% fewer artifacts.
  - Video Pro vs Veo 3/Fast: outperforms in quality/dynamics, with a <15% gap in prompt following.
- Flash-distilled models show <10% drop in visual/motion criteria for substantial speedup.
- Image Lite exceeds FLUX.1 and Qwen-Image in quality (+0.08) and prompt following (+0.05).
- Instruction-based editing matches or exceeds FLUX.1 Kontext in instruction following and aesthetics.
6. Applications and Deployment Recommendations
CrossDiT models enable high-resolution T2I (up to 1408px) for commercial, social, and product content; I2I for inpainting, style transfer, and sketch-to-photo translation; and T2V/I2V (5s/10s, up to 1408px, 24fps) for storyboarding, marketing, and virtual sets.
Deployment best practices include:
- Domain-specific SFT soups for style/content adaptation.
- Text encoder quantization (INT4) for inference efficiency.
- Flash/NABLA attention for latency-critical use.
- Flash variants (Video Lite/Pro) for ultra-fast sampling.
- Prompt engineering following standardized templates (illustrated below):
  - T2I: “[Subject] + [Style/Lighting] + [Details]”
  - T2V: “[Subject] + [Action] + [Environment] + [Cinematic params]”
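A small illustration of filling the templates above; the helper functions and example strings are made up:

```python
def t2i_prompt(subject: str, style_lighting: str, details: str) -> str:
    # "[Subject] + [Style/Lighting] + [Details]"
    return f"{subject}, {style_lighting}, {details}"

def t2v_prompt(subject: str, action: str, environment: str, cinematic: str) -> str:
    # "[Subject] + [Action] + [Environment] + [Cinematic params]"
    return f"{subject} {action} in {environment}, {cinematic}"

prompt = t2v_prompt(
    "a red fox", "leaping over a stream",
    "a misty autumn forest", "35mm lens, golden hour, slow motion",
)
```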
Example usage (Python, diffusers):
```python
import torch
from diffusers import KandinskyImagePipeline, KandinskyVideoPipeline
from diffusers.utils import export_to_video, load_image

# Pipeline classes and output fields follow the Kandinsky 5.0 release and may
# differ between diffusers versions; consult the model cards for exact usage.

# Text-to-image (Image Lite)
pipe_img = KandinskyImagePipeline.from_pretrained(
    "kandinskylab/kandinsky-5-image-lite", torch_dtype=torch.float16
).to("cuda")
image = pipe_img("A photorealistic portrait of a fox in a forest").images[0]

# Text-to-video (Video Lite)
pipe_vid = KandinskyVideoPipeline.from_pretrained(
    "kandinskylab/kandinsky-5-video-lite", torch_dtype=torch.float16
).to("cuda")
frames = pipe_vid("A sleek red car racing down a neon city street at night").frames[0]
export_to_video(frames, "out.mp4", fps=24)

# Image-to-video: animate a still frame with the same Video Lite pipeline
input_frame = load_image("street.png")
clip = pipe_vid(prompt="Camera pans upward", init_frame=input_frame).frames[0]
export_to_video(clip, "pan.mp4", fps=24)
```
7. Context and Impact within Foundation Models
CrossDiT, as instantiated in Kandinsky 5.0, exemplifies a modular, scalable approach to generative multi-modal learning, unifying high-fidelity image and video synthesis in a single transformer design leveraging latent diffusion, advanced attention mechanisms, and large-scale curation pipelines. Its open-source (MIT) release and comprehensive checkpoints facilitate downstream adaptation and benchmarking. The backbone integrates recent innovations in diffusion acceleration, clustering-based data domainization, and reward-model-driven fine-tuning, contributing to the accessible deployment of foundation models for both research and applied creative domains (Arkhipkin et al., 19 Nov 2025).