Snap Video: Scalable Text-to-Video Synthesis

Updated 5 November 2025
  • Snap Video is a generative text-to-video synthesis framework that integrates spatiotemporal transformers with a custom diffusion process to achieve temporally consistent, high-fidelity videos.
  • The framework employs a Far-reaching Interleaved Transformer (FIT) backbone to efficiently compress spatial and temporal redundancies, enabling scalable training and faster inference.
  • Innovations such as input scaling and joint image-video training address SNR preservation and modality mismatches, setting a new benchmark in text-to-video generation quality.

Snap Video refers to a generative modeling framework and model family for text-to-video synthesis, built around large-scale spatiotemporal transformers and an extended diffusion process designed specifically for video. The framework systematically addresses the efficiency and quality limitations of prior video generation approaches, moving beyond adaptations of image models by introducing architectural and algorithmic innovations that enable scalable, high-fidelity, and temporally consistent video synthesis. The model is characterized by its Far-reaching Interleaved Transformer (FIT) backbone, a principled extension of the EDM (Elucidating the Design Space of Diffusion Models) framework to the video domain, and architectures tailored to exploiting spatial and temporal redundancy in video data (Menapace et al., 22 Feb 2024).

1. Motivations and Core Challenges

Text-to-video synthesis presents unique algorithmic and computational challenges. Prior methods often repurpose still-image diffusion models (U-Nets) and extend them with basic temporal modules; however, this approach is fundamentally misaligned with video’s high spatial and temporal redundancy. The key domain challenges are:

  • Spatiotemporal Redundancy: Consecutive frames exhibit strong correlations, making naive spatial processing redundant and inefficient.
  • Scalability Constraints: U-Net-based networks scale linearly in both compute and memory with increasing video length and resolution, severely limiting batch size, parameter count, and training throughput for video.
  • Temporal Coherence: Many models built on frame-wise image synthesis produce dynamic images rather than genuinely coherent videos, resulting in flickering or inconsistent object motion.
  • Train-Inference Mismatch: Existing diffusion training paradigms, when naively applied to videos, lead to undertrained high-frequency motion representations due to the aggregation of noise over many frames ("averaging effect").

Snap Video was designed to address these domain-specific inefficiencies via video-first architectural design and revised generative training objectives (Menapace et al., 22 Feb 2024).

2. Spatiotemporal Transformer Architecture (FIT) and Scalability

Far-reaching Interleaved Transformer (FIT)

The core innovation is the adoption of a transformer-based architecture that handles video as a joint sequence of spatiotemporal tokens:

  • Patch Tokenization: Input video is split into non-overlapping spatial patches per frame (e.g., 4×4 patches, flattened across all frames), giving a sequence length proportional to $T \cdot H \cdot W$.
  • Latent Tokens: A fixed set of learnable latent tokens (e.g., 768 for the 3.9B-parameter model) absorb and process patch information.
  • Interleaved Processing: Each FIT block consists of (see the sketch after this list):
    • Cross-attention from patch tokens to latent tokens (read operations),
    • Self-attention among latent tokens (global context),
    • Feedforward updates to both patch and latent tokens,
    • Optionally, conditioning on prompt, noise, and video metadata.
  • Self-conditioning: Latent tokens maintain compressed, decodable representations across denoising steps, facilitating efficient iterative refinement.
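
The interleaved read/compute/write pattern above can be illustrated with a minimal PyTorch sketch. This is an assumed simplification, not the authors' implementation: the class name, dimensions, residual structure, and the use of standard multi-head attention modules are illustrative choices.

```python
# Minimal sketch of one FIT-style block (assumed structure; illustrative only).
import torch
import torch.nn as nn

class FITBlock(nn.Module):
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        # "Read": latent tokens gather information from the patch tokens.
        self.read = nn.MultiheadAttention(dim, heads, batch_first=True)
        # "Compute": self-attention among the small set of latent tokens (global context).
        self.compute = nn.MultiheadAttention(dim, heads, batch_first=True)
        # "Write": patch tokens are updated from the latents (assumed here via cross-attention).
        self.write = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff_latent = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ff_patch = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, patches: torch.Tensor, latents: torch.Tensor):
        # patches: (B, T*H'*W', dim) flattened spatiotemporal patch tokens
        # latents: (B, L, dim) learnable latent tokens (e.g., L = 768)
        latents = latents + self.read(latents, patches, patches)[0]
        latents = latents + self.compute(latents, latents, latents)[0]
        latents = latents + self.ff_latent(latents)
        patches = patches + self.write(patches, latents, latents)[0]
        patches = patches + self.ff_patch(patches)
        return patches, latents
```

Stacking such blocks keeps the quadratic self-attention cost confined to the small latent set, while the long patch sequence is only touched by cross-attention and feedforward layers whose cost grows linearly with sequence length.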

Architecture Hyperparameters (Select Configurations):

Model Size | Patch Size | Latent Tokens | Input Resolution
3.9B       | 1×4×4      | 768           | 16 × 512 × 288 px video
500M       | 1×4×4      | 512           | 16 × 64 × 40 px video

The FIT design enables global, joint spatiotemporal modeling at scale, avoiding the inefficiencies of U-Net's per-frame spatial computation and limited temporal communication.
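
To make the compression concrete, the following sketch tokenizes a clip of the 3.9B configuration (16 frames at 512 × 288 px, 1×4×4 patches) and compares the patch-token count to the 768 latent tokens. The tensor-manipulation recipe and shapes are assumptions for illustration, not the paper's code.

```python
# Hedged sketch: patch tokenization for the 3.9B configuration in the table above.
import torch

B, T, C, H, W = 1, 16, 3, 288, 512     # batch, frames, channels, height, width
p = 4                                   # 1x4x4 patches: single-frame 4x4 spatial blocks
video = torch.randn(B, T, C, H, W)

patches = (
    video.unfold(3, p, p).unfold(4, p, p)                 # (B, T, C, H/p, W/p, p, p)
         .permute(0, 1, 3, 4, 2, 5, 6)                    # (B, T, H/p, W/p, C, p, p)
         .reshape(B, T * (H // p) * (W // p), C * p * p)  # joint spatiotemporal sequence
)
print(patches.shape)            # torch.Size([1, 147456, 48])
print(patches.shape[1] / 768)   # ~192 patch tokens per latent token
```

The roughly 192:1 ratio between patch and latent tokens is where the joint spatiotemporal compression pays off: global reasoning happens over a few hundred latents rather than over hundreds of thousands of patch tokens.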

Scalability and Efficiency

  • Training: FIT models train 3.31× faster than comparably sized U-Nets; batch sizes of 2048 videos and 2048 images are achievable, supporting model scaling to billions of parameters.
  • Inference: FIT is 4.49× faster than a U-Net of similar size at inference for video sequences of the same length and resolution. For a 3.9B-parameter FIT, inference is only ~1.24× slower than a 500M U-Net.
  • Resource Allocation: By concentrating computation in a compressible latent space, FIT supports larger model and batch sizes without quadratic memory scaling in sequence length (Menapace et al., 22 Feb 2024).

3. Diffusion Process: EDM Adaptation for Video

EDM Forward Process and Problematic SNR

The EDM forward process for images is $p(x_{\sigma} \mid x) \sim \mathcal{N}(x, \sigma^2 I)$, with loss $\mathcal{L}(D) = \mathbb{E}_{\sigma, x, n}\left[\lambda(\sigma)\,\lVert D(x_{\sigma}) - x \rVert_2^2\right]$. Directly extending this to video leads to a discrepancy: as the number of frames and the spatial resolution increase, the effective SNR rises (noise averages out across many correlated pixels), making the denoising task artificially easier during training than at inference.
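
A small numeric experiment makes this averaging effect tangible. The sketch below assumes perfectly correlated content (a constant image) and is purely illustrative, not taken from the paper.

```python
# Hedged illustration of the "averaging effect": for spatially correlated content,
# averaging over larger blocks suppresses i.i.d. noise, so the effective SNR grows
# with resolution (and, analogously, with the number of frames).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
sigma = 1.0
x = torch.ones(1, 1, 256, 256)              # perfectly correlated "signal"
x_sigma = x + sigma * torch.randn_like(x)   # image-EDM forward process at sigma = 1

for k in (1, 4, 16):                        # average over k x k pixel blocks
    pooled = F.avg_pool2d(x_sigma, k)
    print(f"{k:>2}x{k:<2} blocks: residual noise std ~ {(pooled - 1.0).std().item():.3f}")
# The residual noise std shrinks roughly as 1/k while the signal is unchanged,
# i.e., the denoising task gets easier as more correlated pixels are available.
```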

Input Scaling

Snap Video introduces an input scaling factor $\alpha = s\sqrt{T}$, where $s$ is the spatial upsampling ratio and $T$ is the number of frames. The forward diffusion process becomes $p(x_{\sigma} \mid x) \sim \mathcal{N}\left(\frac{x}{\alpha}, \sigma^2 I\right)$. This preserves the per-location SNR as video resolution or frame count increases, maintaining faithful modeling of the underlying data distribution.

The revised denoising objective is
$\mathcal{L}(F) = \mathbb{E}_{\sigma, x, n}\left[ w(\sigma)\,\left\| F(\mathrm{in}(\sigma)\, x_{\sigma}) - \mathrm{nrm}(\sigma)\, F \right\|_2^2 \right]$
where
$\mathrm{in}(\sigma) = \frac{1}{\sqrt{\sigma^2/\alpha^2 + 1}}, \qquad \mathrm{nrm}(\sigma) = \frac{1}{\sqrt{\sigma^2 + 1}}, \qquad w(\sigma) = \frac{1}{\sigma^2} + 1.$

This normalization ensures stability of the denoising score-matching objective and addresses the train-inference mismatch for high-dimensional video data.
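
A compact sketch of the scaled forward process and the preconditioning terms defined above follows; only the formulas come from the text, while the function names, signatures, and tensor handling are assumptions.

```python
# Hedged sketch of the video-adapted EDM pieces described above.
import torch

def input_scale(s: float, T: int) -> float:
    # alpha = s * sqrt(T): s = spatial upsampling ratio, T = number of frames
    return s * (T ** 0.5)

def forward_diffuse(x: torch.Tensor, sigma: float, alpha: float) -> torch.Tensor:
    # p(x_sigma | x) = N(x / alpha, sigma^2 I)
    return x / alpha + sigma * torch.randn_like(x)

def preconditioning(sigma: torch.Tensor, alpha: float):
    in_ = 1.0 / torch.sqrt(sigma ** 2 / alpha ** 2 + 1.0)   # in(sigma)
    nrm = 1.0 / torch.sqrt(sigma ** 2 + 1.0)                # nrm(sigma)
    w   = 1.0 / sigma ** 2 + 1.0                            # w(sigma)
    return in_, nrm, w
```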

Joint Image-Video Training

Treating images as infinite-framerate ($T \to \infty$) videos allows for seamless inclusion of images in the training distribution, further stabilizing training and improving modality generalization. Variable framerate sampling during training ensures robust behavior across different video lengths and types.
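
One plausible way to realize this in a data pipeline is sketched below; the sentinel framerate encoding, the mixing probability `p_image`, and the helper `resample` are hypothetical, not taken from the paper.

```python
# Hedged sketch of joint image-video batching with variable framerate sampling.
import random

def sample_training_example(videos, images, p_image=0.5):
    if random.random() < p_image:
        frame = random.choice(images)
        # An image is treated as the T -> infinity limit of a video (sentinel framerate).
        return {"frames": [frame], "framerate": float("inf")}
    clip = random.choice(videos)
    fps = random.choice([4, 8, 12, 24])              # variable framerate sampling
    return {"frames": clip.resample(fps), "framerate": fps}   # resample() is hypothetical
```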

4. Handling Redundancy: Joint Spatiotemporal Compression

Inspired by classical video codecs, Snap Video explicitly compresses redundancy in both space and time via its transformer design. Key aspects:

  • Patch Construction: Patches are single-frame spatial blocks, so temporal modeling is performed via the attention mechanism, not spatial stacking.
  • Patch Grouping: For cross-attention, all patches from all frames are considered jointly—temporal structure is preserved and exploited globally.
  • Latent Space Expansion: Larger numbers of latent tokens (e.g., 768 or more) enable the model to encode complex temporal relations such as object motion, scene changes, and camera movement.
  • Efficient Parameter Allocation: By concentrating representational power in the latent space and minimizing per-frame spatial computation, both training and inference are optimized for the high-redundancy video domain.

This design eliminates per-frame processing bottlenecks and allows substantial growth in model capacity and input video length/resolution.

5. Training Protocol and Prompt Conditioning

  • Optimizer: LAMB, supporting very large batch sizes (up to 4096 combined samples).
  • Scheduling: Cosine LR decay, 550k training steps, over 2.25B training instances.
  • EMA and Dropout: Standard techniques for model stability at large scale.
  • Conditioning: Text prompts, noise levels, framerate, and resolution are provided as conditioning vectors to the transformer, supporting flexible prompt-to-video mapping and dynamic guidance during synthesis.
  • Classifier-Free Guidance and Dynamic Thresholding: Prompt fidelity is controlled via dynamically thresholded classifier-free guidance with oscillation, mitigating over-saturation and drift during sampling (see the sketch after this list).
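
The guidance step can be sketched as follows. The percentile clamp follows the commonly used dynamic-thresholding recipe; the denoiser signature, guidance scale, and percentile are assumptions rather than the paper's exact settings, and the oscillation schedule is omitted.

```python
# Hedged sketch of classifier-free guidance with dynamic thresholding.
import torch

def guided_denoise(denoiser, x_sigma, sigma, cond, w_guide=7.5, pct=0.995):
    d_cond = denoiser(x_sigma, sigma, cond)    # conditional prediction
    d_uncond = denoiser(x_sigma, sigma, None)  # unconditional (dropped prompt)
    d = d_uncond + w_guide * (d_cond - d_uncond)          # classifier-free guidance
    # Dynamic thresholding: clamp to a per-sample percentile and rescale,
    # mitigating over-saturation at high guidance scales.
    s = torch.quantile(d.abs().flatten(1), pct, dim=1).clamp(min=1.0)
    s = s.view(-1, *([1] * (d.dim() - 1)))
    return d.clamp(-s, s) / s
```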

6. Quantitative Benchmarks and Qualitative Evaluation

Benchmarks

  • Datasets: UCF101 (action video) and MSR-VTT (captioned video).
  • Metrics: FVD (Fréchet Video Distance), FID (Fréchet Inception Distance), IS (Inception Score), CLIPSIM/CLIP-FID (CLIP-based alignment).
  • Results:
    • UCF101: FVD 200.2, FID 28.1, IS 38.89
    • MSR-VTT: CLIP-FID 9.35, FVD 104.0, CLIPSIM 0.2793
    • In all cases, Snap Video outperforms or matches state-of-the-art models including Make-A-Video, PYoCo, Video LDM, and Floor33.

User Studies

  • Photorealism: Snap Video is rated comparable or better than Gen-2.
  • Text Alignment: Preferred over Gen-2 (81%), Pika (80%), and Floor33 (81%).
  • Motion Fidelity/Quantity: Snap Video is preferred by margins of 85–96% for true temporal consistency.
  • Artifact Reduction: Flickering and dynamic image artifacts, which are prevalent in other models, are substantially reduced.

Qualitative Analysis

  • Snap Video produces temporally consistent, high-motion, and semantically accurate videos for a wide range of prompts, including artistic styles and synthesized camera movements. The model implicitly captures 3D geometry and handles novel view synthesis via its spatiotemporal representation.

7. Innovations, Limitations, and Prospects

Key Technical Contributions

  • FIT transformer backbone: Provides scalable, efficient global spatiotemporal modeling.
  • EDM adaptation with input scaling: Ensures SNR and noise scheduling are preserved across high-dimensional video domains.
  • Joint image-video training: Prevents modality mismatch and extends representations.
  • Empirical SOTA: Sets new benchmarks on public datasets and in blinded human evaluations for both motion quality and text-video alignment.

Limitations

  • While the architecture is scalable, extremely high spatiotemporal resolutions remain ultimately limited by hardware constraints.
  • Moiré patterns and other rare failure cases are not explicitly addressed, though they are not highlighted as prevailing artifacts in the reported results.

Plausible Implication

Continuous advances in FIT scaling, diffusion process refinement, and dataset expansion are likely to further enhance Snap Video's generalization, sample diversity, and temporal realism—potentially bridging the gap between generative models and high-fidelity video production pipelines in both research and applied settings.

References

See (Menapace et al., 22 Feb 2024) for technical and architectural details, empirical results, and ablation studies.
