Generative Video Models
- Generative video models are deep learning methods that synthesize coherent video sequences by modeling high-dimensional, time-varying visual data.
- They integrate discrete tokenization, continuous latent diffusion, and flow-based approaches to capture spatial and temporal dependencies.
- Applications include video prediction, editing, and compression, impacting fields from entertainment to robotics and virtual reality.
Generative video models are a class of machine learning models that learn the joint distribution of high-dimensional, time-varying visual data, enabling both synthesis and probabilistic prediction of video content. They unify unsupervised representation learning and generative modeling, extending principles from image synthesis into the spatio-temporal domain. Contemporary approaches encompass discrete and continuous latent variable models, probabilistic diffusion processes, normalizing flows, adversarial frameworks, and large-scale self-supervised training. The goal is to generate temporally coherent, spatially realistic, and semantically meaningful video sequences, thereby supporting tasks ranging from video prediction and inpainting to controllable synthesis, compression, editing, and multi-modal applications.
1. Modelling Principles and Discrete Approaches
Early generative video models adapt sequence modeling techniques from natural language processing. The foundational approach, as seen in "Video (language) modeling: a baseline for generative models of natural videos" (Ranzato et al., 2014), is to quantize each video frame into a grid of discrete tokens by clustering small image patches (e.g., 8×8) via k-means into a large codebook (e.g., 10,000 centroids). Each patch is thus represented by a discrete symbol, enabling the direct application of n-gram models, neural language models, and recurrent neural networks (RNNs):
$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t}), \qquad x_t \in \{1, \dots, K\},$$
where $K$ is the codebook size.
A recurrent Convolutional Neural Network (rCNN) is then used to incorporate both spatial (processing a 9×9 patch grid) and temporal dependencies, mapping spatial neighborhoods into embedding vectors that are updated over time and decoded to predict the most likely patch centroid for each spatial location. This approach enables classification-based prediction (rather than regression), reducing blurring and capturing complex motion patterns over short video horizons.
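As a concrete illustration of the tokenization step, the following sketch quantizes 8×8 patches with k-means (via scikit-learn) and maps each frame to a grid of discrete symbols. The patch size follows the paper, but the codebook size, frame resolution, and random toy video are placeholder choices, not the settings of Ranzato et al. (2014).

```python
import numpy as np
from sklearn.cluster import KMeans

PATCH, K = 8, 64  # 8x8 patches as in the paper; K is a toy codebook size (paper uses ~10,000)

def extract_patches(frames):
    """Split grayscale frames (T, H, W) into non-overlapping PATCH x PATCH patches."""
    T, H, W = frames.shape
    gh, gw = H // PATCH, W // PATCH
    cropped = frames[:, :gh * PATCH, :gw * PATCH]
    cropped = cropped.reshape(T, gh, PATCH, gw, PATCH).transpose(0, 1, 3, 2, 4)
    return cropped.reshape(T, gh, gw, PATCH * PATCH)  # (T, gh, gw, 64)

# Toy video: 16 random 64x64 grayscale frames standing in for real data.
video = np.random.rand(16, 64, 64).astype(np.float32)
patches = extract_patches(video)

# Fit the codebook on flattened patches, then assign every spatial location a token.
codebook = KMeans(n_clusters=K, n_init=10, random_state=0).fit(patches.reshape(-1, PATCH * PATCH))
tokens = codebook.predict(patches.reshape(-1, PATCH * PATCH)).reshape(patches.shape[:3])
print(tokens.shape)  # (16, 8, 8): one discrete symbol per patch, ready for sequence modeling
```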
Recent discrete approaches employ hierarchical vector-quantized autoencoders (VQ-VAEs) to encode videos into sequences of discrete tokens and apply autoregressive Transformer architectures for likelihood-based modeling over these tokens, as exemplified in VideoGPT (Yan et al., 2021). Here, a 3D convolutional VQ-VAE compresses the video, with axial self-attention further capturing long-range context without the memory cost of full 3D attention. The Transformer prior models $p(z) = \prod_{i} p(z_i \mid z_{<i})$, where the $z_i$ are the discrete video tokens, using spatio-temporal position encoding to retain structural consistency.
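The sketch below illustrates this likelihood factorization: a small causal Transformer in plain PyTorch scores a flattened sequence of VQ tokens autoregressively. It is not VideoGPT's architecture; the vocabulary size, sequence length, and model width are placeholder values.

```python
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, DIM = 1024, 256, 128  # placeholder codebook size, token count, width

class TokenPrior(nn.Module):
    """Minimal causal Transformer over a flattened spatio-temporal token sequence."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Embedding(SEQ_LEN, DIM)  # stand-in for spatio-temporal position encoding
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, dim_feedforward=4 * DIM, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, z):                       # z: (B, SEQ_LEN) integer tokens
        pos = torch.arange(z.size(1), device=z.device)
        h = self.tok(z) + self.pos(pos)
        # Causal mask so position i only attends to tokens < i, giving p(z_i | z_<i).
        mask = torch.triu(torch.full((z.size(1), z.size(1)), float("-inf"), device=z.device), diagonal=1)
        h = self.blocks(h, mask=mask)
        return self.head(h)                     # per-position logits over the codebook

model = TokenPrior()
z = torch.randint(0, VOCAB, (2, SEQ_LEN))       # two dummy token sequences
logits = model(z)
# Teacher-forced negative log-likelihood: predict token i+1 from tokens up to i.
nll = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, VOCAB), z[:, 1:].reshape(-1))
print(float(nll))
```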
2. Continuous Latent Representations and Diffusion Models
Generative video models increasingly rely on learning continuous latent representations for both computational tractability and sample quality. Video autoencoders and diffusion models project high-dimensional video cubes into compact latent spaces:
- Projected latent video diffusion models (PVDM) (Yu et al., 2023) demonstrate that projecting videos into a small set of 2D latent maps that separate static (content) from dynamic (motion along different axes) information enables efficient training of diffusion models on high-resolution, long video sequences. The diffusion process is then defined over these latent maps, maintaining efficiency and temporal coherence (a minimal latent-space denoising step is sketched after this list).
- CV-VAE (Zhao et al., 30 May 2024) describes a "compatible" video VAE where the latent space of a spatio-temporally compressed 3D VAE is aligned with that of an image VAE (e.g., from Stable Diffusion). Latent space regularization ensures compatibility, enabling direct transfer of pretrained text-to-image models for video generation with minimal retraining, thus enhancing efficiency and facilitating ultra-long, smooth generation with strong motion continuity.
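To make the general recipe concrete, here is a minimal, hypothetical training step for a diffusion model operating on pre-encoded video latents: plain DDPM-style noise prediction in PyTorch. The latent dimensionality, noise schedule, and the `denoiser` MLP are placeholders, not the PVDM or CV-VAE architectures.

```python
import torch
import torch.nn as nn

T_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, T_STEPS)          # simple linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)        # cumulative signal retention

# Placeholder denoiser over flattened latents; real models use 2D/3D U-Nets or Transformers.
LATENT_DIM = 4 * 16 * 16
denoiser = nn.Sequential(nn.Linear(LATENT_DIM + 1, 512), nn.SiLU(), nn.Linear(512, LATENT_DIM))
opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

def training_step(z0):
    """One DDPM-style step on video latents z0 of shape (B, LATENT_DIM)."""
    b = z0.size(0)
    t = torch.randint(0, T_STEPS, (b,))
    noise = torch.randn_like(z0)
    a = alpha_bar[t].unsqueeze(1)
    zt = a.sqrt() * z0 + (1.0 - a).sqrt() * noise    # forward diffusion q(z_t | z_0)
    t_feat = (t.float() / T_STEPS).unsqueeze(1)      # crude timestep conditioning
    pred = denoiser(torch.cat([zt, t_feat], dim=1))  # predict the added noise
    loss = nn.functional.mse_loss(pred, noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Dummy latents standing in for the output of a (frozen) video VAE encoder.
print(training_step(torch.randn(8, LATENT_DIM)))
```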
Photorealistic video generation via latent diffusion (Yu, 26 May 2024) operates entirely in a compressed latent space, leveraging 3D spatio-temporal networks for the denoising process and introducing lookup-free quantization (LFQ) to support large vocabularies without expensive codebooks. A temporal consistency loss is incorporated to penalize flickering, i.e., spurious frame-to-frame changes in the generated video.
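The exact form of the temporal loss is paper-specific; a common illustrative variant penalizes differences between consecutive generated frames relative to the ground-truth frame differences, as sketched below. The L1 norm and the weighting are assumptions for illustration.

```python
import torch

def temporal_consistency_loss(pred, target, weight=1.0):
    """Penalize flicker: generated frame-to-frame changes should match the real ones.

    pred, target: video tensors of shape (B, T, C, H, W).
    """
    pred_delta = pred[:, 1:] - pred[:, :-1]        # motion of the generated video
    target_delta = target[:, 1:] - target[:, :-1]  # motion of the reference video
    return weight * torch.mean(torch.abs(pred_delta - target_delta))

# Example: 2 clips of 8 RGB frames at 32x32.
pred = torch.rand(2, 8, 3, 32, 32)
target = torch.rand(2, 8, 3, 32, 32)
print(float(temporal_consistency_loss(pred, target)))
```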
GD-VDM (Lapid et al., 2023) and related two-stage diffusion models further improve the semantic fidelity of complex scenes by conditioning RGB generation on an initial depth video synthesized from a diffusion model, thus disentangling structure and appearance.
3. Stochastic, Flow-Based, and Hybrid Models
Normalizing flows for video generation, as in VideoFlow (Kumar et al., 2019), model each frame using invertible transformations, making data likelihoods exactly computable. A multi-scale flow maps each frame $x_t$ to a latent $z_t = f_\theta(x_t)$ through a composition of invertible steps $f_\theta = f_L \circ \cdots \circ f_1$, with
$$\log p_\theta(x_t) = \log p(z_t) + \sum_{k=1}^{L} \log \left| \det \frac{\partial f_k(h_{k-1})}{\partial h_{k-1}} \right|,$$
where $h_0 = x_t$, $h_L = z_t$, and the $h_k$ are intermediate representations.
Temporal dependencies are modeled by an autoregressive latent prior $p(z_t \mid z_{<t})$ with multi-scale latent variable partitioning. Flow-based models offer sharp stochastic predictions, exact likelihoods, and efficient sampling.
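The likelihood computation above reduces to summing per-step log-determinants along the flow. The toy sketch below uses a single element-wise affine flow step to make the change-of-variables bookkeeping explicit; it is a minimal illustration, not VideoFlow's multi-scale architecture.

```python
import math
import torch
import torch.nn as nn

class AffineFlowStep(nn.Module):
    """Invertible element-wise affine map z = x * exp(s) + b with exact log-det."""
    def __init__(self, dim):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(dim))   # log-scale
        self.b = nn.Parameter(torch.zeros(dim))   # shift

    def forward(self, x):
        z = x * torch.exp(self.s) + self.b
        log_det = self.s.sum().expand(x.size(0))  # log|det df/dx|, identical per sample here
        return z, log_det

    def inverse(self, z):
        """Exact inverse, which is what makes sampling and likelihoods cheap."""
        return (z - self.b) * torch.exp(-self.s)

def log_likelihood(flow, x):
    """log p(x) = log N(z; 0, I) + log|det df/dx| under a standard-normal base."""
    z, log_det = flow(x)
    base = -0.5 * (z ** 2).sum(dim=1) - 0.5 * z.size(1) * math.log(2 * math.pi)
    return base + log_det

flow = AffineFlowStep(dim=16)
x = torch.randn(4, 16)                            # four flattened-frame stand-ins
print(log_likelihood(flow, x))                    # exact per-sample log-likelihoods
```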
Low-rank models (Hyder et al., 2019) approach video representation as recovering a sequence of latent codes $z_1, \dots, z_T$ such that each frame is reconstructed as $x_t \approx G(z_t)$ for a generative network $G$, employing explicit temporal regularization:
- Similarity constraint: Encourages consecutive codes to remain close, i.e., keeps $\lVert z_{t+1} - z_t \rVert$ small.
- Low-rank constraint: All codes $z_t$ should lie in a low-dimensional subspace, enforced via SVD or PCA projection.
Frame interpolation is achieved by linear code interpolation: $\hat{z} = (1-\alpha)\, z_t + \alpha\, z_{t+1}$ for $\alpha \in (0, 1)$, with the interpolated frame given by $G(\hat{z})$.
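A minimal illustration of the interpolation idea, assuming a frozen decoder `G` that maps latent codes to frames (here a dummy linear layer); in practice the per-frame codes would first be recovered by optimization against the observed frames.

```python
import torch
import torch.nn as nn

latent_dim, frame_pixels = 64, 32 * 32
G = nn.Linear(latent_dim, frame_pixels)           # stand-in for a pretrained generator/decoder

def interpolate_frames(z_t, z_t1, num_steps=4):
    """Decode linearly interpolated codes between two consecutive frames' latents."""
    frames = []
    for i in range(1, num_steps + 1):
        alpha = i / (num_steps + 1)
        z_mid = (1 - alpha) * z_t + alpha * z_t1  # linear code interpolation
        frames.append(G(z_mid).reshape(32, 32))
    return torch.stack(frames)                    # (num_steps, 32, 32) intermediate frames

z_t, z_t1 = torch.randn(latent_dim), torch.randn(latent_dim)
print(interpolate_frames(z_t, z_t1).shape)
```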
4. Advancements in Architecture: 3D-Aware and Multi-View Models
Contemporary models address fundamental view-consistency and 3D-awareness:
- 3D-aware GANs (Bahmani et al., 2022) inject neural implicit scene representations (e.g., NeRF-style MLPs) and decompose the latent space into content (static 3D geometry/appearance) and motion (dynamic features). Temporal discriminators enforce video coherence, and latent space interpolation enables smooth transitions in pose, viewpoint, and object motion (a schematic of the content/motion latent split follows this list).
- VideoMV (Zuo et al., 18 Mar 2024) demonstrates that off-the-shelf video models incorporating temporal attention are significantly more consistent for multi-view image synthesis than 2D image-based counterparts. A 3D-aware denoising sampling process combines explicit 3D Gaussian models, rendered back into the denoising loop, providing robust multi-view consistency and rapid 3D asset generation.
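The content/motion decomposition can be sketched as latent bookkeeping: one static code shared across all frames plus one dynamic code per frame, fed to a frame synthesizer. The dummy linear decoder below stands in for the actual NeRF-based renderer and temporal discriminator, which are not reproduced here.

```python
import torch
import torch.nn as nn

content_dim, motion_dim, num_frames = 64, 16, 8
decoder = nn.Linear(content_dim + motion_dim, 3 * 32 * 32)   # stand-in for the frame synthesizer

def generate_clip(batch=2):
    """Sample one static content code per video and one motion code per frame."""
    z_content = torch.randn(batch, 1, content_dim).expand(-1, num_frames, -1)  # shared across time
    z_motion = torch.randn(batch, num_frames, motion_dim)                      # varies per frame
    z = torch.cat([z_content, z_motion], dim=-1)
    return decoder(z).view(batch, num_frames, 3, 32, 32)

print(generate_clip().shape)  # (2, 8, 3, 32, 32): a clip whose identity is fixed, motion varies
```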
5. Evaluation: Metrics, Benchmarks, and Human Alignment
Traditional metrics (PSNR, SSIM) are insufficient to capture the temporal and perceptual qualities of generated videos. Fréchet Video Distance (FVD) (Unterthiner et al., 2018) extends FID to video by embedding entire clips using I3D (trained on Kinetics) and measuring the 2-Wasserstein distance between multivariate Gaussians fitted to the real and generated clip embeddings:
$$\mathrm{FVD} = \lVert \mu_R - \mu_G \rVert_2^2 + \mathrm{Tr}\!\left( \Sigma_R + \Sigma_G - 2\,(\Sigma_R \Sigma_G)^{1/2} \right),$$
where $(\mu_R, \Sigma_R)$ and $(\mu_G, \Sigma_G)$ are the means and covariances of the real and generated embeddings.
FVD is shown to correlate strongly with human judgment across quality, diversity, and motion coherence. StarCraft 2 Video (SCV) benchmarks further isolate video generation challenges into compositional "unit tests" for motion, memory, and relational modeling.
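Given clip embeddings from a pretrained I3D network, FVD reduces to the Fréchet distance between two Gaussians. The numpy/scipy sketch below assumes the embeddings have already been extracted; the random arrays are placeholders for real and generated clips.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats, gen_feats):
    """2-Wasserstein distance between Gaussians fitted to clip embeddings."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    cov_mean = sqrtm(cov_r @ cov_g)                 # matrix square root of the product
    if np.iscomplexobj(cov_mean):                   # drop tiny imaginary parts from numerics
        cov_mean = cov_mean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * cov_mean))

# Placeholder embeddings: 256 real and 256 generated clips with 400-d features.
rng = np.random.default_rng(0)
real = rng.normal(size=(256, 400))
fake = rng.normal(loc=0.1, size=(256, 400))
print(frechet_distance(real, fake))
```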
On face- and semantic-specific video, new databases for Generative Face Video Coding (Chen et al., 9 Jun 2025) introduce mean opinion scores (MOS) and favor perceptual metrics (LPIPS, DISTS, TOPIQ) aligned to human perception.
6. Applied Directions: Compression, Editing, Multi-Modal, and Security
Generative video models have direct impact on practical applications:
- Compression: GFVC (Chen et al., 9 Jun 2025) replaces pixel-level video coding with generative synthesis from compact high-level features (e.g., landmarks, motion fields), outperforming VVC at ultra-low bitrates on perceptual quality.
- Editing and Propagation: Unified frameworks such as GenProp (Liu et al., 27 Dec 2024) propagate edits made on the first frame throughout the video using a selective content encoder and an image-to-video diffusion model. Region-aware losses and a mask prediction decoder separate edited from non-edited regions, supporting advanced tasks like object insertion, deletion, and effect removal.
- Open-World Video Foundation Models: InternVideo (Wang et al., 2022) integrates generative masked video modeling (masked autoencoders with tube masking and space-time attention) with discriminative video-language contrastive learning, coordinated via cross-model attention, supporting action recognition, retrieval, and open-world tasks across 39 benchmarks.
- Security and Forensics: VGMShield (Pang et al., 20 Feb 2024) covers detection, attribution, and misuse prevention for generated videos, exploiting spatial-temporal inconsistencies revealed by specificity-tuned feature backbones (I3D, MAE). Adversarial perturbations applied to seed images (for image-to-video pipelines) can thwart unauthorized generation (a minimal perturbation sketch follows this list).
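As an illustration of the misuse-prevention idea, the sketch below runs a few PGD-style steps that push a seed image's features away from their original values under a frozen encoder. The encoder, step size, and perturbation budget are all stand-ins, not VGMShield's actual pipeline.

```python
import torch
import torch.nn as nn

# Frozen stand-in for the image encoder at the front of an image-to-video generator.
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(), nn.Flatten())
for p in encoder.parameters():
    p.requires_grad_(False)

def protect_seed(image, eps=8 / 255, steps=10, alpha=2 / 255):
    """PGD-style perturbation that degrades the encoder features of a seed image."""
    with torch.no_grad():
        target_feat = encoder(image)              # features the attack tries to move away from
    adv = image.clone()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = -nn.functional.mse_loss(encoder(adv), target_feat)  # maximize feature drift
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv - alpha * grad.sign()                        # gradient step
            adv = image + (adv - image).clamp(-eps, eps)           # stay within the eps-ball
            adv = adv.clamp(0, 1)                                  # keep a valid image
    return adv.detach()

seed = torch.rand(1, 3, 64, 64)                   # dummy seed image
protected = protect_seed(seed)
print((protected - seed).abs().max())             # perturbation stays within the budget
```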
7. Scalability and Future Directions
Large-scale generative video models now achieve state-of-the-art visual quality and motion coherence previously available only in closed commercial systems. HunyuanVideo (Kong et al., 3 Dec 2024), an open-source text-to-video diffusion model with 13B parameters, leverages a unified 3D VAE–Transformer backbone, RoPE-based 3D position encodings, progressive curriculum training, and flow matching objectives:
$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\, x_0,\, x_1}\left[ \lVert v_\theta(x_t, t) - u_t \rVert^2 \right],$$
with $u_t = x_1 - x_0$ as the true latent velocity between noised and clean samples along the linear path $x_t = (1-t)\, x_0 + t\, x_1$.
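A minimal, self-contained version of this flow matching objective is shown below; the velocity network is a placeholder MLP over flattened latents, not HunyuanVideo's 3D VAE–Transformer backbone.

```python
import torch
import torch.nn as nn

LATENT_DIM = 256
v_theta = nn.Sequential(nn.Linear(LATENT_DIM + 1, 512), nn.SiLU(), nn.Linear(512, LATENT_DIM))

def flow_matching_loss(x0, x1):
    """x0: noise samples, x1: clean video latents, both of shape (B, LATENT_DIM)."""
    t = torch.rand(x0.size(0), 1)                 # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                  # point on the straight path noise -> data
    u_t = x1 - x0                                 # true velocity along that path
    pred = v_theta(torch.cat([xt, t], dim=1))     # predicted velocity v_theta(x_t, t)
    return nn.functional.mse_loss(pred, u_t)

x0 = torch.randn(8, LATENT_DIM)                   # noise
x1 = torch.randn(8, LATENT_DIM)                   # stand-in for VAE-encoded video latents
loss = flow_matching_loss(x0, x1)
loss.backward()                                   # train v_theta with any optimizer
print(float(loss))
```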
Further, DRA-Ctrl (Cao et al., 29 May 2025) shows that pretrained video generators, through dimension-reduction and transition schemes (employing high-regularity, mixup-based interpolation across spatial/temporal dimensions and tailored attention masks), outperform image-trained models on controllable synthesis (e.g., subject-driven generation, spatially aligned maps).
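The interpolation ingredient can be illustrated with a simple temporal mixup that blends a source and a target image across the frame axis to form a smooth pseudo-clip. This is only a schematic of the general idea, not DRA-Ctrl's actual transition scheme or attention masking.

```python
import torch

def temporal_mixup_clip(src, tgt, num_frames=8):
    """Blend two images (C, H, W) into a pseudo-video (T, C, H, W) along the frame axis."""
    weights = torch.linspace(0.0, 1.0, num_frames).view(-1, 1, 1, 1)
    return (1.0 - weights) * src.unsqueeze(0) + weights * tgt.unsqueeze(0)

src, tgt = torch.rand(3, 64, 64), torch.rand(3, 64, 64)
clip = temporal_mixup_clip(src, tgt)
print(clip.shape)  # (8, 3, 64, 64): a smooth transition a video model can treat as a clip
```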
Adaptive multi-modal architectures enable vision-LLMs to control video synthesis and facilitate integration with audio and other modalities (Yu, 26 May 2024).
Key prospects for the field include:
- More efficient and robust video compression and understanding through tighter integration of generative and discriminative objectives.
- Hierarchical, interpretable latent spaces for editability, traceability, and multi-modal conditional sampling.
- Dynamic, unified frameworks capable of real-time, multi-modal, and multi-view video synthesis and manipulation, bridging the gap between foundation text models and generative video models.
Generative video models presently stand as both a benchmark for progress in deep generative modeling and as an enabling technology for broad areas spanning entertainment, robotics, virtual reality, video communication, and beyond.