
Motion-Conditioned Video Diffusion

Updated 26 November 2025
  • The paper presents a motion-conditioned model that integrates motion cues with denoising diffusion for controlled video synthesis and anomaly detection.
  • It fuses spatiotemporal features from 3D CNNs with compact representations like dynamic images and Star RGB for effective motion encoding.
  • Empirical evaluations show notable improvements in temporal consistency, AUC gains, and cross-dataset generalization compared to unconditional approaches.

A motion-conditioned video diffusion architecture is a class of generative model that explicitly integrates motion-specific representations or controls into the denoising diffusion process for videos. Unlike unconditional video diffusion, motion-conditioned architectures utilize spatiotemporal motion cues—such as compact motion representations, tracklet sequences, optical flow, and trajectory embeddings—to guide the synthesis or assessment of video frames. This paradigm allows for fine-grained control, improved temporal consistency, and enhanced diagnosis of temporal anomalies, leveraging both data-driven and explicit conditioning mechanisms. Such models are central to contemporary solutions in video generation, controllable synthesis, and anomaly detection, spanning unsupervised, supervised, and zero-shot inference protocols.

1. Motion Representation Extraction

Motion-conditioned video diffusion models require both appearance and motion representations to govern the synthesis task or anomaly detection process. Typical pipelines extract two forms of motion information:

  • Spatiotemporal latent features: A pre-trained 3D CNN (e.g., 3D-ResNext101 or 3D-ResNet18) processes a video clip $C \in \mathbb{R}^{N \times 3 \times H \times W}$, producing a clip-level feature vector:

$$\mathrm{fea} = \mathcal{F}(C) \in \mathbb{R}^{f}$$

with $f = 2048$ (ResNext101) or $f = 512$ (ResNet18).

  • Compact motion representations: For efficient yet expressive motion cues, two options are prevalent:

    1. Star RGB: Collapses $N$ frames into a single RGB image, partitioning frame segments per channel and aggregating pairwise cosine similarities across pixel time series.
    2. Dynamic Image:

    $$d^{*} = \sum_{k=1}^{N} \alpha_k I_k, \qquad \alpha_k = 2k - N - 1$$

    where $I_k \in \mathbb{R}^{3 \times H \times W}$ is the $k$-th RGB frame. Both representations are then embedded by a 2D CNN (ResNet18/ResNet50):

    $$\mathrm{cond} = F_{\mathrm{cond}}(\text{motion image}) \in \mathbb{R}^{c}$$

    with $c = 512$ or $c = 2048$.

This fusion of appearance and compact motion encoding provides a rich, generalizable condition for unsupervised anomaly detection, and similar representations underpin other controlled generation frameworks (Tur et al., 2023).
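
The sketch below illustrates how these two inputs could be computed in practice. It is a minimal example assuming PyTorch, with torchvision backbones (`r3d_18` standing in for 3D-ResNet18 and `resnet18` as the condition encoder); the exact extractors and preprocessing in the cited work may differ.

```python
# Minimal sketch: compute a clip-level spatiotemporal feature and a compact
# motion condition (dynamic image) for one clip. The torchvision backbones
# (r3d_18, resnet18) are illustrative stand-ins, not the exact extractors
# used in the cited work.
import torch
import torchvision.models as models
from torchvision.models.video import r3d_18

def dynamic_image(clip: torch.Tensor) -> torch.Tensor:
    """Approximate rank pooling: d* = sum_k alpha_k I_k with alpha_k = 2k - N - 1.
    clip: (N, 3, H, W) -> dynamic image (3, H, W), rescaled to [0, 1]."""
    n = clip.shape[0]
    k = torch.arange(1, n + 1, dtype=clip.dtype, device=clip.device)
    alpha = 2.0 * k - n - 1.0
    d = (alpha.view(n, 1, 1, 1) * clip).sum(dim=0)
    return (d - d.min()) / (d.max() - d.min() + 1e-8)

# Clip-level feature extractor F (f = 512 for an 18-layer 3D backbone).
backbone_3d = r3d_18(weights=None).eval()
backbone_3d.fc = torch.nn.Identity()          # keep the pooled 512-d feature

# Condition encoder F_cond over the motion image (c = 512 for ResNet18).
cond_encoder = models.resnet18(weights=None).eval()
cond_encoder.fc = torch.nn.Identity()

clip = torch.rand(16, 3, 112, 112)            # N = 16 frames
with torch.no_grad():
    # The 3D CNN expects (B, 3, T, H, W); the 2D CNN expects (B, 3, H, W).
    fea = backbone_3d(clip.permute(1, 0, 2, 3).unsqueeze(0))      # (1, 512)
    cond = cond_encoder(dynamic_image(clip).unsqueeze(0))         # (1, 512)
print(fea.shape, cond.shape)
```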

2. Conditioning Mechanism and Integration

Motion conditioning is realized by injecting compact motion features into the denoising network at multiple points:

  • Pre-conditioning (k-Diffusion protocol): The diffusion step applies

$$D_\theta(x; \sigma_t) = c_{\mathrm{skip}}(\sigma_t)\, x + c_{\mathrm{out}}(\sigma_t)\, G_\theta\bigl(c_{\mathrm{in}}(\sigma_t)\, x;\ c_{\mathrm{noise}}(\sigma_t)\bigr)$$

The time embedding (Fourier features of the noise scalar $\sigma_t$) modulates each encoder/decoder layer via FiLM (feature-wise linear modulation).

  • Motion injection: The $\mathrm{cond}$ vector is projected to match the hidden block width and added element-wise to block activations directly after the FiLM injection, ensuring the denoising network is motion-aware at every depth.
  • Generalization: This protocol is agnostic to the network’s base architecture (fully-connected MLP for compact feature vectors, U-Net for spatially resolved latents), and supports diverse forms of motion conditions such as tracklets, box trajectories, and compact images (Tur et al., 2023, Li et al., 2023).
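
As a concrete illustration of the pre-conditioning step, the sketch below wraps a conditional denoiser in k-diffusion-style coefficients. The specific formulas for $c_{\mathrm{skip}}, c_{\mathrm{out}}, c_{\mathrm{in}}, c_{\mathrm{noise}}$ follow the common EDM (Karras-style) parameterization and are not given in the source; the `sigma_data` value, the inner network, and the motion vector `cond` are placeholders.

```python
# Sketch of k-diffusion-style pre-conditioning around a conditional denoiser
# G_theta. The coefficient formulas are the common EDM (Karras et al.)
# parameterization; sigma_data = 0.5 and the inner network are assumptions.
import torch

class PreconditionedDenoiser(torch.nn.Module):
    def __init__(self, inner: torch.nn.Module, sigma_data: float = 0.5):
        super().__init__()
        self.inner = inner                 # G_theta(x, c_noise, cond)
        self.sigma_data = sigma_data

    def forward(self, x, sigma, cond):
        # sigma: per-sample noise level, shape (B, 1) so it broadcasts over features.
        sd2, s2 = self.sigma_data ** 2, sigma ** 2
        c_skip = sd2 / (s2 + sd2)
        c_out = sigma * self.sigma_data / (s2 + sd2).sqrt()
        c_in = 1.0 / (s2 + sd2).sqrt()
        c_noise = sigma.log() / 4.0        # scalar fed to the Fourier time embedding
        # D_theta(x; sigma) = c_skip * x + c_out * G_theta(c_in * x; c_noise, cond)
        return c_skip * x + c_out * self.inner(c_in * x, c_noise, cond)
```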

3. Diffusion Process Formulation

The training and inference procedures both hinge on standard DDPM-style formulations:

  • Forward process (noising):

$$q(x_t \mid x_{t-1}) = \mathcal{N}\bigl(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\bigr)$$

  • Reverse process (denoising) with conditioning:

$$p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\bigl(x_{t-1};\ \mu_\theta(x_t, t, c),\ \sigma_t^2 \mathbf{I}\bigr)$$

where $c$ includes both the time embedding and the compact motion representation.

Each reverse step reconstructs from corrupted latent features, leveraging motion-encoded conditioning for improved recovery of normal patterns or for controlled synthesis.
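
A minimal sketch of these two processes under a standard linear beta schedule follows; the schedule length, the placeholder `eps_model`, and the fixed-variance reverse step are illustrative assumptions consistent with the DDPM formulation above.

```python
# Sketch: closed-form forward noising and one conditioned reverse (denoising)
# step in a DDPM. The linear beta schedule and eps_model are assumptions.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, eps):
    """Forward process in closed form: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps."""
    ab = alpha_bars[t]
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

@torch.no_grad()
def p_sample(eps_model, x_t, t, cond):
    """One step of p_theta(x_{t-1} | x_t, cond) with fixed variance sigma_t^2 = beta_t."""
    eps_hat = eps_model(x_t, torch.full((x_t.shape[0],), t), cond)
    mu = (x_t - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
    return mu if t == 0 else mu + betas[t].sqrt() * torch.randn_like(x_t)
```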

4. Network Architecture and Feature Injection

The denoising backbone is typically a multi-layer perceptron (MLP) for compact non-image features or a U-Net for high-dimensional latent video cubes.

  • MLP-based denoiser $G_\theta$:
    • Encoder: Linear($f \rightarrow 1024$) → ReLU → Linear($1024 \rightarrow 512$) → ReLU → Linear($512 \rightarrow 256$) → ReLU
    • Decoder: Linear($256 \rightarrow 256$) → ReLU → ... → Linear($1024 \rightarrow f$)
    • At every block, the time embedding is injected via FiLM and the motion encoding via linear addition.
  • Feature map structure: All internal representations are 1D, as spatial/temporal structure is pre-abstracted by feature extractors.
  • Motion fusion points: In deeper architectures, motion tokens or representations may be distributed via gated self-attention or cross-attention (TrackDiffusion (Li et al., 2023)), or by direct addition/projection in simple MLP setups.

This injection scheme ensures conditioning operates both globally (across layers) and locally (within blocks), supporting fine-grained anomaly scoring or controlled video synthesis.
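
A sketch of a single encoder block under this scheme is shown below, assuming PyTorch. The widths ($f = 512 \rightarrow 1024$) follow the layer description above; the sinusoidal time-embedding size and the exact FiLM/projection layers are assumptions for illustration.

```python
# Sketch of one MLP denoiser block: FiLM modulation from the time embedding,
# then additive injection of the projected motion condition. Widths follow
# the text (f = 512 -> 1024); the embedding size of 128 is an assumption.
import math
import torch
import torch.nn as nn

def fourier_features(sigma: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Sinusoidal embedding of the noise scalar, shape (B,) -> (B, dim)."""
    half = dim // 2
    freqs = torch.exp(torch.arange(half, device=sigma.device) * (-math.log(10000.0) / half))
    angles = sigma[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

class FiLMBlock(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, t_dim: int = 128, cond_dim: int = 512):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)
        self.to_scale_shift = nn.Linear(t_dim, 2 * out_dim)   # FiLM: gamma, beta
        self.cond_proj = nn.Linear(cond_dim, out_dim)         # motion cond -> block width
        self.act = nn.ReLU()

    def forward(self, h, t_emb, cond):
        h = self.lin(h)
        gamma, beta = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        h = gamma * h + beta                 # FiLM modulation by the time embedding
        h = h + self.cond_proj(cond)         # element-wise motion injection
        return self.act(h)

# Example: first encoder block (f = 512 -> 1024) of G_theta.
block = FiLMBlock(512, 1024)
x, sigma, cond = torch.randn(4, 512), torch.rand(4), torch.randn(4, 512)
out = block(x, fourier_features(sigma), cond)
print(out.shape)   # torch.Size([4, 1024])
```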

5. Loss Functions and Evaluation Metrics

Model learning and anomaly diagnosis are governed by noise-prediction losses and data-driven scoring protocols:

  • Training loss ("simple noise-prediction"):

$$\mathcal{L}_{\mathrm{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon} \left\| \epsilon - \epsilon_\theta(x_t, t, \mathrm{cond}) \right\|_2^2$$

with $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$.

  • Anomaly scoring (VAD):
  1. Extract the test feature vector $x_0$.
  2. Corrupt it to $x_t$ at a chosen noise level.
  3. Reconstruct via the reverse chain to obtain $\hat{x}_0$.
  4. Assign the anomaly score:

    $$S(C) = \| x_0 - \hat{x}_0 \|_2^2$$

  5. Threshold by batch statistics: $\tau = \mu + k\sigma$; flag as anomalous if $S > \tau$.
  • Generalization and performance: Conditioning on compact motion representations yields typical absolute AUC gains of 1–2% on standard benchmarks (ShanghaiTech, UCF-Crime) and improves cross-dataset scalability, with further absolute gains of 8–17% (Tur et al., 2023).
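
The sketch below ties the pieces together: the simple noise-prediction loss for training and the reconstruction-error score with batch-statistic thresholding at test time. The noise level `t_star`, the `reconstruct` callable (which would run the reverse chain of Section 3), and $k = 1$ are illustrative assumptions.

```python
# Sketch: simple noise-prediction loss and reconstruction-based anomaly scoring.
# eps_model, alpha_bars, and reconstruct are placeholders (see earlier sketches);
# t_star and k are illustrative choices.
import torch
import torch.nn.functional as F

def training_loss(eps_model, x0, cond, alpha_bars):
    """L_simple: mean squared error between true and predicted noise."""
    t = torch.randint(0, alpha_bars.numel(), (x0.shape[0],))
    eps = torch.randn_like(x0)
    ab = alpha_bars[t].unsqueeze(-1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps    # forward noising, closed form
    return F.mse_loss(eps_model(x_t, t, cond), eps)

@torch.no_grad()
def anomaly_scores(reconstruct, x0, cond, t_star=250):
    """Corrupt x0 to x_{t*}, reconstruct x0_hat via the reverse chain, score by error."""
    x0_hat = reconstruct(x0, cond, t_star)
    return ((x0 - x0_hat) ** 2).sum(dim=-1)           # S(C) = ||x0 - x0_hat||_2^2

def flag_anomalies(scores: torch.Tensor, k: float = 1.0) -> torch.Tensor:
    """Threshold tau = mu + k * sigma over the batch statistics."""
    tau = scores.mean() + k * scores.std()
    return scores > tau
```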

6. Variants, Extensions, and Benchmark Results

Numerous architectures extend this motion-conditioned paradigm:

  • TrackDiffusion: Embeds fine-grained box trajectories as conditioning tokens, with instance enhancer modules and cross-attention, yielding improved object tracking in generated videos (Li et al., 2023).
  • MOFT framework: Extracts motion features via PCA of intermediate U-Net activations, allowing training-free steering/control of cross-frame dynamics (Xiao et al., 23 May 2024).
  • Skeleton-based architectures: Use learned skeleton encoders for anomaly detection, leveraging multimodal future synthesis and advanced regularization protocols (MoCoDAD (Flaborea et al., 2023), DCMD (Wang et al., 23 Dec 2024)).
  • VAD pipeline: Patch-based architectures integrate both appearance and motion-encoded local patches, with learnable memory banks and dynamic skip-connections for fine-grained anomaly detection (Zhou et al., 12 Dec 2024).

Empirical validation consistently demonstrates improved motion fidelity, anomaly localization, and generalization to out-of-domain video distributions.


Motion-conditioned video diffusion architectures combine pretrained spatiotemporal feature extractors, compact yet expressive motion representations, and targeted conditioning within denoising diffusion networks. Motion cues are injected through FiLM or additive modulation at every network depth, and reconstruction-based evaluation delivers state-of-the-art unsupervised anomaly detection and controlled video synthesis (Tur et al., 2023, Li et al., 2023, Xiao et al., 23 May 2024, Zhou et al., 12 Dec 2024).
