
CogVideoX: Open Text-to-Video Diffusion Model

Updated 19 November 2025
  • CogVideoX is a large-scale text-to-video diffusion model that combines a custom 3D VAE, modality-adaptive transformer, and latent diffusion process to generate semantically aligned videos.
  • It employs a multi-stage progressive training pipeline with mixed-resolution frame packing and hierarchical captioning to enhance data utility and video quality.
  • It achieves state-of-the-art benchmark results and serves as a base model for extensions such as quantization, human feedback alignment, and block-sparse attention for inference acceleration.

CogVideoX is a large-scale, open text-to-video diffusion model that combines a custom 3D spatio-temporal variational autoencoder, an expert transformer architecture with modality-adaptive LayerNorm, and advanced data and training pipelines to generate high-fidelity, semantically aligned videos from textual prompts. CogVideoX achieves state-of-the-art performance on both automatic and human evaluations, with open-source versions at the 2B and 5B parameter scales. It has become a reference architecture and testbed for numerous advances in video generative modeling, including quantization, physics modeling, model alignment with human feedback, and inference acceleration (Yang et al., 12 Aug 2024, Wang et al., 6 Dec 2024, Zhang et al., 29 May 2025, Huang et al., 16 May 2025, Gu et al., 14 Aug 2025).

1. Model Architecture and Principal Components

CogVideoX consists of three core components:

  1. 3D Causal Variational Autoencoder (VAE): Jointly compresses spatial and temporal dimensions of input frames into a latent space, mitigating sequence length explosion and preventing temporal flicker. The encoder and decoder comprise cascaded ResNet stages, with causal convolutions enforcing strict autoregressive temporal context. Overall, the VAE achieves a 4× reduction in temporal resolution and an 8×8 reduction in spatial resolution while preserving temporal coherence.
  2. Expert Transformer Backbone: Receives concatenated text and video latents, augmented with 3D Rotary Positional Encoding (3D-RoPE), and processes the sequence with multiple blocks of full spatio-temporal self-attention. Each block contains "expert adaptive LayerNorm"—distinct LayerNorm submodules for text and vision tokens, each parametrized by the diffusion timestep, addressing distributional differences and facilitating deep cross-modal fusion.
  3. Latent Diffusion Process: The denoising network ε_θ(x_t, t, Z), parameterized as a transformer, is trained to predict the noise target under a DDPM-style schedule (T = 1000 steps, v-prediction / zero-SNR regime), with the training objective minimizing the expected ℓ2 denoising loss (a minimal sketch follows this list).
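
The training objective in item 3 can be summarized in a short sketch. The following PyTorch-style snippet is a minimal, illustrative reconstruction of a v-prediction DDPM training step under the stated T = 1000 schedule; the model signature, latent shapes, and schedule handling are assumptions, not the released training code.

```python
import torch
import torch.nn.functional as F

def diffusion_training_loss(model, x0, text_cond, alphas_bar):
    """x0: clean video latents (B, C, F, H, W); alphas_bar: (T,) cumulative noise schedule."""
    B, T = x0.shape[0], alphas_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)        # uniformly sampled timestep
    a = alphas_bar[t].view(B, 1, 1, 1, 1)                   # broadcast over latent dims
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps              # forward (noising) process
    v_target = a.sqrt() * eps - (1 - a).sqrt() * x0         # v-prediction target
    v_pred = model(x_t, t, text_cond)                       # expert transformer prediction
    return F.mse_loss(v_pred, v_target)                     # expected L2 denoising loss
```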

Table: Key Model Variants

| Model Variant | Parameters | Max Video Length | FPS | Resolution |
|---|---|---|---|---|
| CogVideoX-2B | ~2B | 6 s (48 frames) | 8 | 720×480 |
| CogVideoX-5B | ~5B | 6 s (48 frames) | 8 | 720×480 |

The model can generate up to 10-second videos at 16 fps with 768×1360 resolution in specialized configurations (Yang et al., 12 Aug 2024).
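
Under the compression ratios above, the transformer's vision-token count can be estimated with simple arithmetic. The sketch below assumes a 2×2 spatial patchification before the transformer, which is an illustrative assumption rather than a confirmed implementation detail (exact causal handling of the first frame in the VAE may also differ).

```python
# Rough latent/token-count arithmetic for a 6 s, 8 fps, 720x480 clip (illustrative).
frames, height, width = 48, 480, 720
t_down, s_down = 4, 8                          # 4x temporal, 8x8 spatial VAE compression
lat_f, lat_h, lat_w = frames // t_down, height // s_down, width // s_down
patch = 2                                      # assumed 2x2 spatial patch size
vision_tokens = lat_f * (lat_h // patch) * (lat_w // patch)
print((lat_f, lat_h, lat_w), vision_tokens)    # (12, 60, 90) -> 16200 vision tokens
```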

2. Training Pipeline and Data Processing

CogVideoX employs a multi-stage progressive training schedule:

  • Stage 1: Low-resolution/short-duration pretraining (e.g., 360×640, 4 s).
  • Stage 2: Medium-resolution/medium-duration (720×480, 6 s).
  • Stage 3: High-quality finetuning on a curated dataset (watermark and subtitle removal, ~20% of the corpus).

To maximize data utility and generalization, mixed-duration batches are enabled via a "Multi-Resolution Frame Pack" strategy that zero-pads shorter videos. Diffusion timestep sampling uses "Explicit Uniform Sampling" across data-parallel shards, which empirically accelerates convergence and smooths the training loss landscape.
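
A minimal sketch of the Explicit Uniform Sampling idea, assuming each data-parallel rank draws timesteps only from its own contiguous slice of [0, T); rank assignment and data-loader integration are simplified here.

```python
import torch

def explicit_uniform_timesteps(batch_size: int, rank: int, world_size: int, T: int = 1000):
    """Each rank samples from its own slice of [0, T), so the ranks jointly cover the full range every step."""
    lo = (T * rank) // world_size
    hi = (T * (rank + 1)) // world_size
    return torch.randint(lo, hi, (batch_size,))

# Example: with 4 ranks and T = 1000, rank 0 draws from [0, 250), rank 1 from [250, 500), etc.
```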

Data processing includes:

  • Large-scale web crawl (~35M video clips, avg. 6 s).
  • Rigorous filtering via LLM-based video classifiers (Video-LLaMA), optical flow coherence, and aesthetic scoring.
  • Hierarchical captioning: Panda-70M for short captions, CogVLM for dense framewise annotation, GPT-4/Llama 2 for summary distillation, and further supervised finetuning for end-to-end CogVLM2-Caption generation.

3. Evaluation and Benchmarking

CogVideoX is evaluated across diverse benchmarks including VBench, Dynamic Quality (Devil), and GPT4o-MTScore. On VBench, the 5B variant outperforms contemporary models on dynamics-centric metrics:

| Model | Human Action | Scene | Dyn. Degree | Mult. Objects | Appearance Style | Dynamic Qual. | GPT4o-MTScore |
|---|---|---|---|---|---|---|---|
| AnimateDiff | 92.6 | 50.2 | 40.8 | 36.9 | 22.4 | – | 2.62 |
| VideoCrafter2 | 95.0 | 55.3 | 42.5 | 40.7 | 25.1 | 43.6 | 2.68 |
| LaVie-2 | 96.4 | 49.6 | 31.1 | 64.9 | 25.1 | – | 2.46 |
| CogVideoX-2B | 88.0 | 39.9 | 63.3 | 53.7 | 23.7 | 57.7 | 3.09 |
| CogVideoX-5B | 96.8 | 55.4 | 62.2 | 70.9 | 24.4 | 69.5 | 3.36 |

Human evaluation (blind comparison against Kling) shows CogVideoX-5B preferred on sensory quality, instruction following, physics simulation, and coverage, with a total score of 2.74 vs. 2.17 (max 5).

4. Model Extensions: Physics, Human Alignment, Quantization, Acceleration

4.1 Physics Modeling with VideoREPA

Baseline CogVideoX, though achieving high visual fidelity, demonstrates poor physical commonsense: objects may deform, float, or exhibit temporally incoherent dynamics, as reflected in subpar VideoPhy scores (e.g., 32.3 overall physical commonsense (PC) for CogVideoX-5B) (Zhang et al., 29 May 2025). VideoREPA introduces a Token Relation Distillation (TRD) loss, aligning spatio-temporal token similarities between CogVideoX and a frozen, physics-capable video foundation model (VideoMAEv2). This alignment injects relational priors (e.g., rigid-body rotation, contact dynamics), leading to gains on physical commonsense metrics: VideoREPA-5B achieves 40.1 PC on VideoPhy and 72.54 PC (vs. 67.97) on VideoPhy2, with qualitative improvements in rolling, liquid splashes, and object interactions.
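
The core of a TRD-style loss can be sketched as matching pairwise token-similarity matrices between the diffusion backbone and the frozen video foundation model. This is a hedged reconstruction: the feature extraction layers, projections, token alignment to a common grid (assumed identical N below), and the exact distance used by VideoREPA may differ.

```python
import torch
import torch.nn.functional as F

def token_relation_distillation(student_feats, teacher_feats):
    """student_feats: (B, N, Ds) diffusion hidden states; teacher_feats: (B, N, Dt) frozen VFM features."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    sim_s = s @ s.transpose(1, 2)        # (B, N, N) spatio-temporal token similarities (student)
    sim_t = t @ t.transpose(1, 2)        # (B, N, N) token similarities (teacher)
    return F.l1_loss(sim_s, sim_t)       # align relational structure rather than raw features
```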

4.2 Human Feedback Alignment with LiFT

LiFT leverages human-annotated ratings and rationales (LiFT-HRA, 10k+ samples) to train a reward model (LiFT-Critic), then fine-tunes CogVideoX-2B via reward-weighted likelihood. LiFT-finetuned CogVideoX-2B outperforms CogVideoX-5B across all 16 VBench metrics, most notably in subject/background consistency, imaging quality, and multi-object scenes. By incorporating rationales instead of raw preference scores alone, LiFT achieves fine-grained alignment with nuanced human criteria (Wang et al., 6 Dec 2024).
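
The reward-weighted fine-tuning step can be illustrated generically as below. This is a simple reward-weighted diffusion loss in the spirit described above, not LiFT's exact formulation; the normalization of critic scores and the 5D latent shape are assumptions.

```python
import torch
import torch.nn.functional as F

def reward_weighted_loss(model, x_t, t, cond, v_target, critic_scores):
    """critic_scores: (B,) non-negative scalar ratings from the reward model (e.g., LiFT-Critic)."""
    v_pred = model(x_t, t, cond)
    per_sample = F.mse_loss(v_pred, v_target, reduction="none").mean(dim=(1, 2, 3, 4))
    weights = critic_scores / critic_scores.sum().clamp(min=1e-8)   # assumed normalization
    return (weights * per_sample).sum()   # higher-rated samples contribute more to the update
```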

4.3 Quantization and Inference Efficiency

QVGen applies quantization-aware training (QAT) to CogVideoX-2B, enabling inference at W4A4 (or W3A3) bitwidths. Auxiliary modules Φ absorb quantization error during training and are then progressively decayed via SVD-based rank reduction, allowing their removal without quality loss. QVGen achieves metrics comparable to full precision at 4 bits (e.g., Dynamic Degree 67.22 vs. 67.78, Overall Consistency 24.61 vs. 25.06) with a 4× memory reduction and 1.21× higher throughput than BF16 on an A800; 3-bit QAT remains challenging, with some drop in scene consistency (Huang et al., 16 May 2025).
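
A highly simplified sketch of the auxiliary-module idea, assuming a symmetric per-tensor fake quantizer and a low-rank path Φ = B·A that can later be rank-reduced and dropped; the actual QVGen quantizer, rank-decay schedule, and activation quantization are more involved.

```python
import torch
import torch.nn as nn

def fake_quant(w, bits=4):
    """Symmetric per-tensor fake quantization (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

class QuantLinearWithAux(nn.Module):
    """Linear layer with straight-through fake quantization plus a low-rank auxiliary path (bias omitted)."""
    def __init__(self, linear: nn.Linear, rank: int = 16, bits: int = 4):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach().clone())
        self.bits = bits
        # Auxiliary Phi = B @ A absorbs quantization error during QAT, then is
        # progressively rank-reduced (e.g., via SVD) until it can be removed.
        self.A = nn.Parameter(torch.zeros(rank, linear.in_features))
        self.B = nn.Parameter(torch.randn(linear.out_features, rank) * 1e-3)

    def forward(self, x):
        w_q = self.weight + (fake_quant(self.weight, self.bits) - self.weight).detach()  # STE
        return x @ w_q.t() + (x @ self.A.t()) @ self.B.t()
```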

4.4 Acceleration with Block-Sparse Attention (BLADE)

Video-BLADE augments CogVideoX-5B with Adaptive Block-Sparse Attention (ASA) and Trajectory Distribution Matching (TDM) step distillation. ASA partitions the attention matrix into blocks and computes attention only on blocks above a learned importance threshold, reducing the memory and compute of attention, which otherwise scale quadratically with sequence length. Distilling from a 50-step dense teacher, the 8-step ASA_GT CogVideoX-5B student (sparsity α ≈ 0.82) achieves 8.89× acceleration while improving physics and creativity metrics (Physics: 0.618 vs. 0.539, Creativity: 0.546 vs. 0.458) and the overall VBench-2.0 score (0.569 vs. 0.534) (Gu et al., 14 Aug 2025).
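
The block-selection logic behind block-sparse attention can be sketched as follows. This illustrative version scores mean-pooled query/key blocks and masks the rest; a real kernel such as ASA uses learned gating and skips masked blocks entirely rather than materializing the full attention matrix, and the block size and keep ratio below are assumptions.

```python
import torch

def block_sparse_attention(q, k, v, block=64, keep_ratio=0.18):
    """q, k, v: (B, H, N, D) with N divisible by `block`; keep_ratio ~ 1 - sparsity."""
    B, H, N, D = q.shape
    nb = N // block
    qb = q.view(B, H, nb, block, D).mean(dim=3)               # mean-pooled query blocks
    kb = k.view(B, H, nb, block, D).mean(dim=3)               # mean-pooled key blocks
    block_scores = qb @ kb.transpose(-1, -2) / D ** 0.5       # (B, H, nb, nb) block importance
    k_keep = max(1, int(keep_ratio * nb))
    thresh = block_scores.topk(k_keep, dim=-1).values[..., -1:]
    block_mask = block_scores >= thresh                       # keep top key blocks per query block
    mask = block_mask.repeat_interleave(block, dim=-2).repeat_interleave(block, dim=-1)
    attn = (q @ k.transpose(-1, -2) / D ** 0.5).masked_fill(~mask, float("-inf"))
    return attn.softmax(dim=-1) @ v                           # attention restricted to kept blocks
```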

5. Limitations and Open Problems

CogVideoX, while advancing the state of the art, exhibits several limitations:

  • Physics simulation remains imperfect even after VideoREPA: physically implausible behaviors—e.g., melting, floating, or nonrigid deformation under force—arise due to the lack of explicit relational/structural priors in the base model (Zhang et al., 29 May 2025).
  • Long-duration (>48 frames) and ultra-high-resolution generalization is limited by VAE and context memory constraints; improved context-parallel or hierarchical latent approaches are needed (Yang et al., 12 Aug 2024).
  • The captioning pipeline, though accurate, still relies on expensive external LLMs for best results; tighter end-to-end multimodal captioning models are under development.
  • Streaming/interactive video synthesis is not demonstrated.
  • 3-bit quantization, though improved by QVGen, still incurs a loss of scene and overall consistency.
  • Fine-grained multi-actor temporal synchronization and rare background generalization remain challenging even with advanced alignment losses (Wang et al., 6 Dec 2024).

6. Future Directions and Broader Impact

CogVideoX provides a robust open foundation for further research in several directions:

  • Scaling model capacity (beyond 5B parameters) and temporal context for minute-long, high-resolution video synthesis.
  • Improved physics grounding via stronger VFM distillation, graph-relational modeling, or integration of neural physics engines.
  • Broader generalization via multi-modal extension (e.g., audio-video, text-image-video), and hierarchical or memory-augmented architectures.
  • Extending quantization and acceleration techniques to support real-time, resource-constrained deployment—particularly on edge devices.
  • Streamlining the data processing and captioning pipeline for full end-to-end automation.
  • Cross-domain generalization and safety–robustness benchmarks, especially in open-world and adversarial generative scenarios.

CogVideoX’s open release and transparent pipeline have made it a standard reference in both academic and applied settings. Its design has directly enabled advances in efficient training, model alignment, and downstream evaluation of physical and human-centric video generation performance (Yang et al., 12 Aug 2024, Wang et al., 6 Dec 2024, Zhang et al., 29 May 2025, Huang et al., 16 May 2025, Gu et al., 14 Aug 2025).

