CrossDiT Diffusion Transformer

Updated 20 November 2025
  • The CrossDiT Diffusion Transformer is a unified latent-diffusion model that integrates self-attention, cross-attention, and feed-forward modules for multi-modal generation.
  • It combines a flow-matching diffusion loss, Qwen2.5-VL-based text conditioning, and Neighborhood-Adaptive Block-Level (NABLA) sparse attention to balance quality and efficiency across T2I, T2V, and image-editing scenarios.
  • Empirical evaluations demonstrate that CrossDiT-based models deliver state-of-the-art performance in fidelity, speed, and computational efficiency across varied generative tasks.

The CrossDiT (Cross-Attention Diffusion Transformer) is the architectural foundation of the Kandinsky 5.0 model family for high-fidelity image and video generation. Engineered as a unified latent-diffusion transformer backbone, CrossDiT is designed to efficiently scale across parameter budgets, tasks, and modalities—including text-to-image (T2I), text-to-video (T2V), and instructive image editing scenarios. It is trained under a flow-matching diffusion paradigm and incorporates recipe-driven optimizations for both throughput and generation quality, exemplifying a modern approach to multi-modal foundation models (Arkhipkin et al., 19 Nov 2025).

1. Model Architecture and Backbone

CrossDiT is built from a stack of modular blocks, each comprising three principal components: self-attention, cross-attention conditioned on language embeddings, and a feed-forward network (MLP), each wrapped in pre-normalization and residual connections. All variants use a VAE encoder/decoder—FLUX.1-dev for images and HunyuanVideo VAE for videos—to map raw pixels to a compact latent representation, which serves as input for the transformer-based denoising process.
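
As a rough illustration, the skeleton of one such block can be sketched in PyTorch as follows; the module names, head count, and use of standard nn.MultiheadAttention and LayerNorm are simplifying assumptions, and the timestep-conditioning pathway is omitted.

import torch
import torch.nn as nn

class CrossDiTBlock(nn.Module):
    """Illustrative CrossDiT-style block: pre-norm self-attention,
    cross-attention over text embeddings, and an MLP, each with a
    residual connection."""
    def __init__(self, dim: int, text_dim: int, ff_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads,
                                                kdim=text_dim, vdim=text_dim,
                                                batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, dim))

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Self-attention over latent tokens (pre-norm + residual).
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention conditioned on language embeddings.
        h = self.norm2(x)
        x = x + self.cross_attn(h, text, text, need_weights=False)[0]
        # Position-wise feed-forward network.
        x = x + self.mlp(self.norm3(x))
        return x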

The text-conditioning pathway uses the Qwen2.5-VL text encoder (7B parameters, outputting 3584-dimensional embeddings), refined through a lightweight Linguistic Token Refiner (LTF) module. For long sequences, especially in video, CrossDiT incorporates Neighborhood-Adaptive Block-level (NABLA) sparse attention to accelerate computation while preserving fidelity.
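
For intuition, the following is a minimal sketch of block-level sparse attention with a fixed neighborhood window; NABLA's adaptive block-selection criterion is not reproduced here, so this shows only the masking mechanism, and the block size and window are illustrative.

import torch
import torch.nn.functional as F

def neighborhood_block_mask(num_tokens: int, block_size: int, window: int) -> torch.Tensor:
    """Boolean mask allowing each token to attend only to tokens whose block
    index lies within `window` blocks of its own (fixed window, not NABLA's
    adaptive selection rule)."""
    block_id = torch.arange(num_tokens) // block_size
    return (block_id[:, None] - block_id[None, :]).abs() <= window

def block_sparse_attention(q, k, v, block_size: int = 64, window: int = 2):
    # q, k, v: (batch, heads, seq, head_dim); True in the mask means "attend".
    mask = neighborhood_block_mask(q.shape[-2], block_size, window).to(q.device)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

The reported ≈90% sparsity and 2.7× speedup (Section 4) come from adaptively selecting which blocks participate rather than from a fixed window like the one above.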

The architecture is parameterized according to deployment regime:

Model      | Blocks | LTF blocks | Hidden dim (d) | Time dim | FF dim
Image Lite | 50     | 2          | 10,240         | 512      | 2560
Video Lite | 32     | 2          | 7168           | 512      | 1792
Video Pro  | 60     | 4          | 16,384         | 1024     | 4096

Key computational frameworks (a minimal sketch of the flow-matching loss follows this list):

  • Diffusion loss (flow-matching):

$$\mathcal{L}_\mathrm{diff} = \mathbb{E}_{t,x_0,\epsilon}\left\| \epsilon - \epsilon_\theta\bigl(x_t, t, \mathrm{text}\bigr) \right\|^2, \quad x_t = x_0 + \sigma(t)\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

  • Transformer attention (single head):

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left( \frac{QK^\top}{\sqrt{d}} \right) V$$

  • Flow-matching ODE divergence (RL stage):

$$\mathrm{KL}(p_{RL} \parallel p_{SFT}) = \sum_{t \in \mathcal{T}} \left\| v_{RL}(x_t, t) - v_{SFT}(x_t, t) \right\|^2$$
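
As a concrete reference for the flow-matching diffusion loss above, here is a minimal Monte-Carlo sketch; the linear schedule σ(t) = t and the generic `model(xt, t, text)` signature are assumptions for illustration, not the paper's exact recipe.

import torch

def diffusion_loss(model, x0: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo estimate of L_diff = E ||eps - eps_theta(x_t, t, text)||^2
    with x_t = x_0 + sigma(t) * eps. A linear sigma(t) = t is assumed."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)                # t ~ U(0, 1)
    eps = torch.randn_like(x0)                         # eps ~ N(0, I)
    sigma = t.view(b, *([1] * (x0.dim() - 1)))         # broadcast sigma(t) = t over latent dims
    xt = x0 + sigma * eps                              # noised latent
    eps_pred = model(xt, t, text_emb)                  # eps_theta(x_t, t, text)
    return ((eps - eps_pred) ** 2).mean()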

2. Data Pipeline and Curation

Training the CrossDiT-centric Kandinsky 5.0 models relies on extensive multi-modal curation workflows:

  • Text-to-Image: Approximately 500 million image-caption pairs are sourced from LAION, COYO, and similar large-scale datasets. Filtering includes minimum resolution (≥256px), perceptual-hash deduplication, watermark detection (ResNeXt 101, YOLO), technical and aesthetic scoring (TOPIQ, Q-Align), text-region exclusion (CRAFT), complexity analysis (SAM 2 and Sobel), and object/scene annotation (YOLOv8, CLIP). English captions are synthesized with InternVL2-26B and refined by InternLM3-8B; Russian with Qwen2.5-VL.
  • Text-to-Video: Around 250 million video scenes undergo shot segmentation (PySceneDetect), with filtering similar to the T2I pipeline, supplemented by MS-SSIM motion metrics, watermark checking, DOVER & Q-Align for quality, camera/object dynamics via VideoMAE, and scene clustering using InternVideo2 embeddings with k-means.
  • Image Editing (I2I): 150 million image pairs filtered by CLIP/DINO similarity (>0.8), RANSAC-based geometric checks, DINO-aligned crop scoring, and other heuristics. Edits are captioned by a fine-tuned GLM-4.5; a supervised subset (153k examples) is curated by Q-Align and human annotation.

Supervised fine-tuning data include ~153k image examples with dual-language captions, 2.8k video scenes, and 45k images, all grouped into nine domains by VLM classification or k-means over CLIP embeddings.
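
A minimal sketch of this embedding-based domain grouping, assuming scikit-learn k-means over pre-computed CLIP (or InternVideo2) embeddings; the L2 normalization and cluster count are illustrative choices.

import numpy as np
from sklearn.cluster import KMeans

def group_into_domains(embeddings: np.ndarray, n_domains: int = 9) -> np.ndarray:
    """Cluster per-sample embeddings into domain groups with k-means,
    as a stand-in for the paper's VLM/CLIP-based domainization."""
    # L2-normalize so Euclidean k-means approximates cosine-similarity clustering.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return KMeans(n_clusters=n_domains, n_init="auto", random_state=0).fit_predict(emb)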

3. Multi-Stage Training and Optimization

CrossDiT-based models follow a staged training pipeline:

  • Pre-training uses AdamW, progressive resolution and duration schedules (image: LR→MR→HR; video: 1s→5s→10s), and 10% unconditional loss injection. Batch sizes (Image Lite: 8k→2k; Video Pro: 16k→665) and LR schedules are tuned per regime. EMA decay is set to 0.9999. NABLA is applied for larger samples.
  • Supervised Fine-Tuning (SFT) leverages "model soups"—domain- or subdomain-specific fine-tuning at low learning rates, with checkpoints averaged using either equal weights or weights proportional to $\sqrt{\text{size}}$ (see the sketch after this list). Video SFT proceeds analogously.
  • Distillation, specific to Video Lite/Pro Flash models, involves CFG-guidance distillation (reducing needed function evaluations from 50 to 16 as per recent methods) and subsequent Hinge GAN post-training on re-noised frames using a Logit-Normal schedule.
  • RL-based Fine-Tuning (Image Lite only) employs a Qwen2.5-VL-7B reward model trained on human preference pairs and applies Direct Reward Fine-Tuning with an RL loss regularized by a KL divergence against the SFT model, with $\beta_{KL} = 2\times10^{-2}$.
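
As referenced in the SFT item above, here is a minimal sketch of "model soup" checkpoint averaging, assuming plain state-dict averaging with either uniform weights or weights proportional to √size; handling of non-float buffers is simplified.

import math
import torch

def soup_checkpoints(state_dicts, subset_sizes=None):
    """Average fine-tuned checkpoints into a single 'soup'.
    If subset_sizes is given, each checkpoint is weighted proportionally to
    sqrt(size); otherwise all checkpoints are weighted equally."""
    if subset_sizes is None:
        weights = [1.0] * len(state_dicts)
    else:
        weights = [math.sqrt(s) for s in subset_sizes]
    total = sum(weights)
    soup = {}
    for key in state_dicts[0]:
        # Averages every tensor; integer buffers would need special handling in practice.
        soup[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts)) / total
    return soup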

4. Inference, Acceleration, and System Integration

Inference and system integration for CrossDiT models feature multiple efficiency innovations:

  • VAE Inference: Tiling and torch.compile yield a 2.5× speedup and improved tile-border continuity.
  • Attention Scaling: For short sequences, Flash/Sage attention is used; for higher resolutions or longer videos, NABLA achieves ≈90% sparsity and 2.7× speedup.
  • MagCache: Diffusion-step caching accelerates generation by ≈46%.
  • Text Encoder Quantization: INT4 quantization minimizes memory overhead.
  • Distributed Training: F/HSDP + Sequence Parallel sharding over 64 GPUs, with activation checkpointing and offloading in the RL stage.
  • Dynamic Batching: Batches are grouped by aspect ratio, with frame counts packed so that $\sum_k t_k \approx t_{\max}$, to maximize GPU efficiency (see the sketch after this list).
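
A minimal sketch of the dynamic-batching rule above: greedily pack samples with matching aspect ratio until their frame counts approach t_max. The per-sample dictionary keys and the greedy strategy are illustrative assumptions.

from collections import defaultdict

def dynamic_batches(samples, t_max: int):
    """Group samples by aspect ratio, then greedily pack each group into
    batches whose total frame count stays within t_max (sum_k t_k <= t_max)."""
    by_ratio = defaultdict(list)
    for s in samples:                        # s: dict with 'aspect_ratio' and 'frames'
        by_ratio[s["aspect_ratio"]].append(s)

    batches = []
    for group in by_ratio.values():
        batch, frames = [], 0
        for s in sorted(group, key=lambda s: s["frames"], reverse=True):
            if batch and frames + s["frames"] > t_max:
                batches.append(batch)
                batch, frames = [], 0
            batch.append(s)
            frames += s["frames"]
        if batch:
            batches.append(batch)
    return batches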

Performance metrics (H100, single GPU):

Model               | Frames | Resolution | NFEs | Time (s) | Memory (GB)
Video Lite 5s       | 121    | 512×768    | 100  | 139      | 21
Video Lite 5s Flash | 121    | 512×768    | 16   | 35       | 21
Video Pro 10s       | 241    | 512×768    | 100  | 1158     | 51
Video Pro 10s Flash | 241    | 512×768    | 16   | 242      | 51
Image Lite          | 1      | 1024×1024  | 100  | 13       | 17

5. Empirical Evaluation and Comparison

Automated metrics and human evaluations indicate state-of-the-art or competitive performance:

  • COCO-30k FID (Image Lite): Outperforms Stable Diffusion 2.1 and DALL·E 2.
  • Video Metrics (Video Lite): FVD/CLIP/VBench parity or improvement over Wan 2.2 5B and Sora.
  • Side-by-Side Human Study (MovieGen, 1,002 prompts, 5× overlap):
    • Video Lite vs Sora: 62% object/action fidelity preference, 59% motion/visual quality.
    • Video Lite vs Wan 2.2 5B/A14B: higher visual/motion scores, ~10–15 point drop in prompt adherence.
    • Video Lite vs K4.1: motion (+59%), 27% fewer artifacts.
    • Video Pro vs Veo 3/Fast: outperforms in quality/dynamics, <15% gap in prompt following.
    • Flash-distilled models show <10% drop in visual/motion criteria for substantial speedup.
    • Image Lite exceeds FLUX.1 and Qwen-Image in quality (+0.08) and prompt following (+0.05).
    • Instructional editing matches or exceeds FLUX/Kontext for instruction and aesthetics.

6. Applications and Deployment Recommendations

CrossDiT models enable high-resolution T2I (up to 1408px) for commercial, social, and product content; I2I for inpainting, style transfer, and sketch-to-photo translation; and T2V/I2V (5s/10s, up to 1408px, 24fps) for storyboarding, marketing, and virtual sets.

Deployment best practices include:

  • Domain-specific SFT soups for style/content adaptation.
  • Text encoder quantization (INT4) for inference efficiency.
  • Flash/NABLA attention for latency-critical use.
  • Flash variants (Video Lite/Pro) for ultra-fast sampling.
  • Prompt engineering following standardized templates:
    • T2I: “[Subject] + [Style/Lighting] + [Details]”
    • T2V: “[Subject] + [Action] + [Environment] + [Cinematic params]”

Example usage (Python, diffusers):

import torch
from diffusers import KandinskyImagePipeline, KandinskyVideoPipeline
from diffusers.utils import export_to_video, load_image

# Text-to-image with Image Lite (output access follows the usual
# diffusers convention of a list of PIL images).
pipe_img = KandinskyImagePipeline.from_pretrained(
    "kandinskylab/kandinsky-5-image-lite",
    torch_dtype=torch.float16
).to("cuda")
image = pipe_img("A photorealistic portrait of a fox in a forest").images[0]

# Text-to-video with Video Lite; export_to_video writes the frame list to mp4.
pipe_vid = KandinskyVideoPipeline.from_pretrained(
    "kandinskylab/kandinsky-5-video-lite",
    torch_dtype=torch.float16
).to("cuda")
video = pipe_vid("A sleek red car racing down a neon city street at night").frames[0]
export_to_video(video, "out.mp4", fps=24)

# Image-to-video: condition the same pipeline on an initial frame.
input_frame = load_image("street.png")
clip = pipe_vid(prompt="Camera pans upward", init_frame=input_frame).frames[0]
export_to_video(clip, "pan.mp4", fps=24)

7. Context and Impact within Foundation Models

CrossDiT, as instantiated in Kandinsky 5.0, exemplifies a modular, scalable approach to generative multi-modal learning, unifying high-fidelity image and video synthesis in a single transformer design leveraging latent diffusion, advanced attention mechanisms, and large-scale curation pipelines. Its open-source (MIT) release and comprehensive checkpoints facilitate downstream adaptation and benchmarking. The backbone integrates recent innovations in diffusion acceleration, clustering-based data domainization, and reward-model-driven fine-tuning, contributing to the accessible deployment of foundation models for both research and applied creative domains (Arkhipkin et al., 19 Nov 2025).

References

  • Arkhipkin et al., 19 Nov 2025.