Snap Video: Scalable Text-to-Video Synthesis

Updated 5 November 2025
  • Snap Video is a generative text-to-video synthesis framework that integrates spatiotemporal transformers with a custom diffusion process to achieve temporally consistent, high-fidelity videos.
  • The framework employs a Far-reaching Interleaved Transformer (FIT) backbone to efficiently compress spatial and temporal redundancies, enabling scalable training and faster inference.
  • Innovations such as input scaling and joint image-video training address SNR preservation and modality mismatches, setting a new benchmark in text-to-video generation quality.

Snap Video refers to a generative modeling framework and model family for text-to-video synthesis, built around large-scale spatiotemporal transformers and an extended diffusion process designed specifically for video. The framework systematically addresses the efficiency and quality limitations of prior video generation approaches, moving beyond adaptations of image models by introducing architectural and algorithmic innovations that enable scalable, high-fidelity, and temporally consistent video synthesis. The model is characterized by its Far-reaching Interleaved Transformer (FIT) backbone, a principled extension of the EDM (Elucidating the Design Space of Diffusion Models) framework to the video domain, and architectures tailored to exploiting spatial and temporal redundancy in video data (Menapace et al., 22 Feb 2024).

1. Motivations and Core Challenges

Text-to-video synthesis presents unique algorithmic and computational challenges. Prior methods often repurpose still-image diffusion models (U-Nets) and extend them with basic temporal modules; however, this approach is fundamentally misaligned with video’s high spatial and temporal redundancy. The key domain challenges are:

  • Spatiotemporal Redundancy: Consecutive frames exhibit strong correlations, making naive spatial processing redundant and inefficient.
  • Scalability Constraints: U-Net-based networks scale linearly in both compute and memory with increasing video length and resolution, severely limiting batch size, parameter count, and training throughput for video.
  • Temporal Coherence: Many models built on frame-wise image synthesis produce dynamic images rather than genuinely coherent videos, resulting in flickering or inconsistent object motion.
  • Train-Inference Mismatch: Existing diffusion training paradigms, when naively applied to videos, lead to undertrained high-frequency motion representations due to the aggregation of noise over many frames ("averaging effect").

Snap Video was designed to address these domain-specific inefficiencies via video-first architectural design and revised generative training objectives (Menapace et al., 22 Feb 2024).

2. Spatiotemporal Transformer Architecture (FIT) and Scalability

Far-reaching Interleaved Transformer (FIT)

The core innovation is the adoption of a transformer-based architecture that handles video as a joint sequence of spatiotemporal tokens:

  • Patch Tokenization: Input video is split into non-overlapping spatial patches per frame (e.g., 4×4 patches, flattened across all frames), giving a sequence length proportional to $T \cdot H \cdot W$.
  • Latent Tokens: A fixed set of learnable latent tokens (e.g., 768 for the 3.9B-parameter model) absorb and process patch information.
  • Interleaved Processing: Each FIT block consists of (see the sketch after this list):
    • Cross-attention from patch tokens to latent tokens (read operations),
    • Self-attention among latent tokens (global context),
    • Feedforward updates to both patch and latent tokens,
    • Optionally, conditioning on prompt, noise, and video metadata.
  • Self-conditioning: Latent tokens maintain compressed, decodable representations across denoising steps, facilitating efficient iterative refinement.
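
The interleaved read/compute/write pattern above can be illustrated with a minimal PyTorch sketch. This is an assumed simplification, not the authors' implementation: the class name, dimensions, residual structure, and the use of standard multi-head attention modules are illustrative choices.

```python
# Minimal sketch of one FIT-style block (assumed structure; illustrative only).
import torch
import torch.nn as nn

class FITBlock(nn.Module):
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        # "Read": latent tokens gather information from the patch tokens.
        self.read = nn.MultiheadAttention(dim, heads, batch_first=True)
        # "Compute": self-attention among the small set of latent tokens (global context).
        self.compute = nn.MultiheadAttention(dim, heads, batch_first=True)
        # "Write": patch tokens are updated from the latents (assumed here via cross-attention).
        self.write = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff_latent = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ff_patch = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, patches: torch.Tensor, latents: torch.Tensor):
        # patches: (B, T*H'*W', dim) flattened spatiotemporal patch tokens
        # latents: (B, L, dim) learnable latent tokens (e.g., L = 768)
        latents = latents + self.read(latents, patches, patches)[0]
        latents = latents + self.compute(latents, latents, latents)[0]
        latents = latents + self.ff_latent(latents)
        patches = patches + self.write(patches, latents, latents)[0]
        patches = patches + self.ff_patch(patches)
        return patches, latents
```

Stacking such blocks keeps the quadratic self-attention cost confined to the small latent set, while the long patch sequence is only touched by cross-attention and feedforward layers whose cost grows linearly with sequence length.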

Architecture Hyperparameters (Select Configurations):

Model Size | Patch Size | Latent Tokens | Input Resolution
3.9B       | 1×4×4      | 768           | 16 × 512 × 288 px video
500M       | 1×4×4      | 512           | 16 × 64 × 40 px video

The FIT design enables global, joint spatiotemporal modeling at scale, avoiding the inefficiencies of U-Net's per-frame spatial computation and limited temporal communication.
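
To make the compression concrete, the following sketch tokenizes a clip of the 3.9B configuration (16 frames at 512 × 288 px, 1×4×4 patches) and compares the patch-token count to the 768 latent tokens. The tensor-manipulation recipe and shapes are assumptions for illustration, not the paper's code.

```python
# Hedged sketch: patch tokenization for the 3.9B configuration in the table above.
import torch

B, T, C, H, W = 1, 16, 3, 288, 512     # batch, frames, channels, height, width
p = 4                                   # 1x4x4 patches: single-frame 4x4 spatial blocks
video = torch.randn(B, T, C, H, W)

patches = (
    video.unfold(3, p, p).unfold(4, p, p)                 # (B, T, C, H/p, W/p, p, p)
         .permute(0, 1, 3, 4, 2, 5, 6)                    # (B, T, H/p, W/p, C, p, p)
         .reshape(B, T * (H // p) * (W // p), C * p * p)  # joint spatiotemporal sequence
)
print(patches.shape)            # torch.Size([1, 147456, 48])
print(patches.shape[1] / 768)   # ~192 patch tokens per latent token
```

The roughly 192:1 ratio between patch and latent tokens is where the joint spatiotemporal compression pays off: global reasoning happens over a few hundred latents rather than over hundreds of thousands of patch tokens.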

Scalability and Efficiency

  • Training: FIT models train 3.31× faster than comparably sized U-Nets; batch sizes of 2048 videos and 2048 images are achievable, supporting model scaling to billions of parameters.
  • Inference: FIT is 4.49× faster than a U-Net of similar size at inference for video sequences of the same length and resolution. For a 3.9B-parameter FIT, inference is only ~1.24× slower than a 500M U-Net.
  • Resource Allocation: By concentrating computation in a compressible latent space, FIT supports larger model and batch sizes without quadratic memory scaling in sequence length (Menapace et al., 22 Feb 2024).

3. Diffusion Process: EDM Adaptation for Video

EDM Forward Process and Problematic SNR

The EDM forward process for images is $p(x_{\sigma} \mid x) \sim \mathcal{N}(x, \sigma^2 I)$, with loss $\mathcal{L}(D) = \mathbb{E}_{\sigma, x, n}\left[\lambda(\sigma)\,\lVert D(x_{\sigma}) - x \rVert_2^2\right]$. Directly extending this to video leads to a discrepancy: as the number of frames and the spatial resolution increase, the effective SNR rises (noise averages out across many correlated pixels), making the denoising task artificially easier during training than at inference.
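
A small numeric experiment makes this averaging effect tangible. The sketch below assumes perfectly correlated content (a constant image) and is purely illustrative, not taken from the paper.

```python
# Hedged illustration of the "averaging effect": for spatially correlated content,
# averaging over larger blocks suppresses i.i.d. noise, so the effective SNR grows
# with resolution (and, analogously, with the number of frames).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
sigma = 1.0
x = torch.ones(1, 1, 256, 256)              # perfectly correlated "signal"
x_sigma = x + sigma * torch.randn_like(x)   # image-EDM forward process at sigma = 1

for k in (1, 4, 16):                        # average over k x k pixel blocks
    pooled = F.avg_pool2d(x_sigma, k)
    print(f"{k:>2}x{k:<2} blocks: residual noise std ~ {(pooled - 1.0).std().item():.3f}")
# The residual noise std shrinks roughly as 1/k while the signal is unchanged,
# i.e., the denoising task gets easier as more correlated pixels are available.
```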

Input Scaling

Snap Video introduces an input scaling factor $\alpha = s\sqrt{T}$, where $s$ is the spatial upsampling ratio and $T$ is the number of frames. The forward diffusion process becomes $p(x_{\sigma} \mid x) \sim \mathcal{N}\left(\frac{x}{\alpha}, \sigma^2 I\right)$. This preserves the per-location SNR as video resolution or frame count increases, maintaining faithful modeling of the underlying data distribution.

The revised denoising objective is
$\mathcal{L}(F) = \mathbb{E}_{\sigma, x, n}\left[ w(\sigma)\,\left\| F(\mathrm{in}(\sigma)\, x_{\sigma}) - \mathrm{nrm}(\sigma)\, F \right\|_2^2 \right]$
where
$\mathrm{in}(\sigma) = \frac{1}{\sqrt{\sigma^2/\alpha^2 + 1}}, \qquad \mathrm{nrm}(\sigma) = \frac{1}{\sqrt{\sigma^2 + 1}}, \qquad w(\sigma) = \frac{1}{\sigma^2} + 1.$

This normalization ensures stability of the denoising score-matching objective and addresses the train-inference mismatch for high-dimensional video data.
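
A compact sketch of the scaled forward process and the preconditioning terms defined above follows; only the formulas come from the text, while the function names, signatures, and tensor handling are assumptions.

```python
# Hedged sketch of the video-adapted EDM pieces described above.
import torch

def input_scale(s: float, T: int) -> float:
    # alpha = s * sqrt(T): s = spatial upsampling ratio, T = number of frames
    return s * (T ** 0.5)

def forward_diffuse(x: torch.Tensor, sigma: float, alpha: float) -> torch.Tensor:
    # p(x_sigma | x) = N(x / alpha, sigma^2 I)
    return x / alpha + sigma * torch.randn_like(x)

def preconditioning(sigma: torch.Tensor, alpha: float):
    in_ = 1.0 / torch.sqrt(sigma ** 2 / alpha ** 2 + 1.0)   # in(sigma)
    nrm = 1.0 / torch.sqrt(sigma ** 2 + 1.0)                # nrm(sigma)
    w   = 1.0 / sigma ** 2 + 1.0                            # w(sigma)
    return in_, nrm, w
```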

Joint Image-Video Training

Treating images as infinite-framerate ($T \to \infty$) videos allows for seamless inclusion of images in the training distribution, further stabilizing training and improving modality generalization. Variable framerate sampling during training ensures robust behavior across different video lengths and types.
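
One plausible way to realize this in a data pipeline is sketched below; the sentinel framerate encoding, the mixing probability `p_image`, and the helper `resample` are hypothetical, not taken from the paper.

```python
# Hedged sketch of joint image-video batching with variable framerate sampling.
import random

def sample_training_example(videos, images, p_image=0.5):
    if random.random() < p_image:
        frame = random.choice(images)
        # An image is treated as the T -> infinity limit of a video (sentinel framerate).
        return {"frames": [frame], "framerate": float("inf")}
    clip = random.choice(videos)
    fps = random.choice([4, 8, 12, 24])              # variable framerate sampling
    return {"frames": clip.resample(fps), "framerate": fps}   # resample() is hypothetical
```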

4. Handling Redundancy: Joint Spatiotemporal Compression

Inspired by classical video codecs, Snap Video explicitly compresses redundancy in both space and time via its transformer design. Key aspects:

  • Patch Construction: Patches are single-frame spatial blocks, so temporal modeling is performed via the attention mechanism, not spatial stacking.
  • Patch Grouping: For cross-attention, all patches from all frames are considered jointly—temporal structure is preserved and exploited globally.
  • Latent Space Expansion: Larger numbers of latent tokens (e.g., 768 or more) enable the model to encode complex temporal relations such as object motion, scene changes, and camera movement.
  • Efficient Parameter Allocation: By concentrating representational power in the latent space and minimizing per-frame spatial computation, both training and inference are optimized for the high-redundancy video domain.

This design eliminates per-frame processing bottlenecks and allows substantial growth in model capacity and input video length/resolution.

5. Training Protocol and Prompt Conditioning

  • Optimizer: LAMB, supporting very large batch sizes (up to 4096 combined samples).
  • Scheduling: Cosine LR decay, 550k training steps, over 2.25B training instances.
  • EMA and Dropout: Standard techniques for model stability at large scale.
  • Conditioning: Text prompts, noise levels, framerate, and resolution are provided as conditioning vectors to the transformer, supporting flexible prompt-to-video mapping and dynamic guidance during synthesis.
  • Classifier-Free Guidance and Dynamic Thresholding: Prompt fidelity is controlled via dynamically thresholded classifier-free guidance with oscillation, mitigating over-saturation and drift during sampling (see the sketch after this list).
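
The guidance step can be sketched as follows. The percentile clamp follows the commonly used dynamic-thresholding recipe; the denoiser signature, guidance scale, and percentile are assumptions rather than the paper's exact settings, and the oscillation schedule is omitted.

```python
# Hedged sketch of classifier-free guidance with dynamic thresholding.
import torch

def guided_denoise(denoiser, x_sigma, sigma, cond, w_guide=7.5, pct=0.995):
    d_cond = denoiser(x_sigma, sigma, cond)    # conditional prediction
    d_uncond = denoiser(x_sigma, sigma, None)  # unconditional (dropped prompt)
    d = d_uncond + w_guide * (d_cond - d_uncond)          # classifier-free guidance
    # Dynamic thresholding: clamp to a per-sample percentile and rescale,
    # mitigating over-saturation at high guidance scales.
    s = torch.quantile(d.abs().flatten(1), pct, dim=1).clamp(min=1.0)
    s = s.view(-1, *([1] * (d.dim() - 1)))
    return d.clamp(-s, s) / s
```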

6. Quantitative Benchmarks and Qualitative Evaluation

Benchmarks

  • Datasets: UCF101 (action video) and MSR-VTT (captioned video).
  • Metrics: FVD (Fréchet Video Distance), FID (Fréchet Inception Distance), IS (Inception Score), CLIPSIM/CLIP-FID (CLIP-based alignment).
  • Results:
    • UCF101: FVD 200.2, FID 28.1, IS 38.89
    • MSR-VTT: CLIP-FID 9.35, FVD 104.0, CLIPSIM 0.2793
    • In all cases, Snap Video outperforms or matches state-of-the-art models including Make-A-Video, PYoCo, Video LDM, and Floor33.

User Studies

  • Photorealism: Snap Video is rated comparable or better than Gen-2.
  • Text Alignment: Preferred over Gen-2 (81%), Pika (80%), and Floor33 (81%).
  • Motion Fidelity/Quantity: Snap Video is preferred by margins of 85–96% for true temporal consistency.
  • Artifact Reduction: Flickering and dynamic image artifacts, which are prevalent in other models, are substantially reduced.

Qualitative Analysis

  • Snap Video produces temporally consistent, high-motion, and semantically accurate videos for a wide range of prompts, including artistic styles and synthesized camera movements. The model implicitly captures 3D geometry and handles novel view synthesis via its spatiotemporal representation.

7. Innovations, Limitations, and Prospects

Key Technical Contributions

  • FIT transformer backbone: Provides scalable, efficient global spatiotemporal modeling.
  • EDM adaptation with input scaling: Ensures SNR and noise scheduling are preserved across high-dimensional video domains.
  • Joint image-video training: Prevents modality mismatch and extends representations.
  • Empirical SOTA: Sets new benchmarks on public datasets and in blinded human evaluations for both motion quality and text-video alignment.

Limitations

  • While the architecture is scalable, extremely high spatiotemporal resolutions remain ultimately limited by hardware constraints.
  • Moiré patterns and other rare failure cases are not explicitly addressed, though they are not highlighted as prevailing artifacts in the reported results.

Plausible Implication

Continuous advances in FIT scaling, diffusion process refinement, and dataset expansion are likely to further enhance Snap Video's generalization, sample diversity, and temporal realism—potentially bridging the gap between generative models and high-fidelity video production pipelines in both research and applied settings.

References

See (Menapace et al., 22 Feb 2024) for technical and architectural details, empirical results, and ablation studies.
