Phenaki: Text-Conditioned Video Synthesis
- Phenaki is a generative modeling system that produces open-domain videos from evolving text prompts, enabling coherent visual storytelling.
- It employs a causal, temporally-factorized video tokenizer alongside a bidirectional masked Transformer to iteratively predict and refine video token grids.
- The approach supports segmental generation with re-encoding for smooth scene transitions and performs strongly on metrics such as FVD and FID.
Phenaki is a generative modeling system for producing open-domain videos of variable length, directly conditioned on sequences of textual prompts. It introduces architectural and algorithmic innovations in video tokenization and text-to-video synthesis that enable spatio-temporally coherent generation over arbitrary durations and evolving prompts. Phenaki is distinguished by its causal, temporally-factorized video tokenizer (C-ViViT) and a bidirectional masked Transformer, which together enable efficient and generalizable open-domain video generation from text (Villegas et al., 2022).
1. Model Architecture and Generation Pipeline
Phenaki's generative process comprises two principal components:
- Discrete Video Tokenizer (C-ViViT): An input video is encoded into a compact, variable-length 3D tensor of discrete codebook indices via a novel causal transformer-based tokenization scheme. The decoder reconstructs RGB video from these discrete tokens.
- Bidirectional Masked Transformer (MaskGIT): A 24-layer Transformer predicts sequences of video tokens conditioned on text embeddings from a frozen T5X encoder. Generation proceeds by iteratively sampling masked tokens, using classifier-free guidance to strengthen prompt adherence and control.
The generation workflow operates as follows. At inference, the MaskGIT model predicts a full grid of masked video tokens over tens of refinement steps, and the C-ViViT decoder transforms these discrete tokens into video frames. For long or multi-scene generation, a segmental approach is used: after generating each segment, the last frames are re-encoded to provide temporal context for subsequent prompts, enabling seamless transitions and "visual stories" with evolving textual guidance.
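The loop below is a minimal sketch of this workflow. The callables (`embed_text`, `sample_tokens`, `decode`, `encode`) are hypothetical stand-ins for the frozen T5X encoder, the MaskGIT sampler, and the C-ViViT decoder/encoder; they are not part of any released API, and the context size is illustrative.

```python
from typing import Callable, List
import torch

def generate_story(
    prompts: List[str],
    embed_text: Callable[[str], torch.Tensor],       # frozen T5X encoder
    sample_tokens: Callable[..., torch.Tensor],      # MaskGIT sampler
    decode: Callable[[torch.Tensor], torch.Tensor],  # C-ViViT decoder
    encode: Callable[[torch.Tensor], torch.Tensor],  # C-ViViT encoder
    k_context_frames: int = 5,                       # illustrative context size
) -> torch.Tensor:
    segments, context_tokens = [], None
    for prompt in prompts:
        text_emb = embed_text(prompt)
        # Sample a full token grid; any unmasked context tokens are kept
        # fixed so the new segment continues the previous one.
        tokens = sample_tokens(text=text_emb, context=context_tokens)
        frames = decode(tokens)                      # tokens -> RGB frames
        segments.append(frames)
        # Re-encode the tail of this segment as causal context for the
        # next prompt, enabling smooth transitions between scenes.
        context_tokens = encode(frames[-k_context_frames:])
    return torch.cat(segments, dim=0)
```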
2. Discrete Video Tokenizer: Architecture, Losses, and Temporal Factorization
The C-ViViT tokenizer processes an input video (e.g., 11 frames at 128×128 resolution) via patchification and stacked transformer blocks:
- Patchification: The first frame is decomposed into 2D patches; subsequent frames are decomposed into spatio-temporal patches.
- Transformer Blocks: An initial spatial transformer block (4 layers, all-to-all attention) operates at each time index independently, followed by a temporal block (4 layers, causal self-attention in time) applied per spatial patch location. Causality is enforced by masking future frames in the temporal attention matrix (see the sketch after this list).
- Vector Quantization: Output vectors are quantized against a learned codebook (8192 entries), yielding a discrete index for each patch and time position.
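A minimal PyTorch sketch of the causal temporal attention follows. The shapes assume 8×8 spatial patches on 128×128 frames and a temporal patch of 2 frames for subsequent frames, which is consistent with the 1536-tokens-per-clip figure in Section 6 but should be read as an illustration rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def causal_temporal_attention(q, k, v):
    """Attention over the time axis at one spatial patch location;
    a lower-triangular mask hides future frames (causality)."""
    # q, k, v: (batch * spatial_patches, time_steps, dim)
    t, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5                # (..., t, t)
    causal = torch.tril(torch.ones(t, t, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))        # mask future frames
    return F.softmax(scores, dim=-1) @ v

# 11 frames -> 1 spatial-only step + 5 temporal groups = 6 time steps;
# 16 x 16 = 256 patch locations per step gives 6 * 256 = 1536 tokens.
x = torch.randn(2 * 256, 6, 64)    # 2 clips, 256 locations, 6 steps, dim 64
out = causal_temporal_attention(x, x, x)
print(out.shape)                   # torch.Size([512, 6, 64])
```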
Loss Function: The tokenizer is trained with a composite objective

$$\mathcal{L} = \mathcal{L}_{\mathrm{VQ}} + \lambda_{\mathrm{VGG}}\,\mathcal{L}_{\mathrm{VGG}} + \lambda_{\mathrm{I3D}}\,\mathcal{L}_{\mathrm{I3D}} + \lambda_{\mathrm{Adv}}\,\mathcal{L}_{\mathrm{Adv}},$$

where $\mathcal{L}_{\mathrm{VQ}}$ is the vector-quantization loss, $\mathcal{L}_{\mathrm{VGG}}$ and $\mathcal{L}_{\mathrm{I3D}}$ are perceptual losses computed from VGG (per-frame) and I3D (spatio-temporal) features, $\mathcal{L}_{\mathrm{Adv}}$ is an adversarial loss with a StyleGAN-style discriminator, and the $\lambda$ weights are set as in the paper. The temporal factorization with causal self-attention ensures the tokenizer supports variable-length video encoding and decoding.
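A hedged sketch of this objective, with the feature extractors and discriminator passed in as placeholders and the weights chosen purely for illustration (the paper's exact values are not reproduced here):

```python
from typing import Callable
import torch

def tokenizer_loss(
    recon: torch.Tensor, target: torch.Tensor, vq_loss: torch.Tensor,
    vgg_feats: Callable, i3d_feats: Callable, disc: Callable,
    w_vgg: float = 1.0, w_i3d: float = 1.0, w_adv: float = 0.1,  # illustrative
) -> torch.Tensor:
    l_vgg = torch.mean((vgg_feats(recon) - vgg_feats(target)) ** 2)  # per-frame perceptual
    l_i3d = torch.mean((i3d_feats(recon) - i3d_feats(target)) ** 2)  # spatio-temporal perceptual
    l_adv = -disc(recon).mean()    # non-saturating generator term (sketch)
    return vq_loss + w_vgg * l_vgg + w_i3d * l_i3d + w_adv * l_adv
```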
3. Text Conditioning and MaskGIT Modeling
Text prompts are embedded using a frozen T5X encoder (vocabulary ≈32,000; 1024-dimensional output embeddings). The bidirectional masked Transformer (MaskGIT-style) receives these embeddings via learned cross-attention inserted after the standard self-attention in each layer.
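One conditioned layer can be sketched as follows; the dimensions are illustrative, and the pre-norm residual layout is an assumption rather than the paper's confirmed design:

```python
import torch
import torch.nn as nn

class ConditionedLayer(nn.Module):
    """Self-attention over video tokens, then learned cross-attention
    into frozen text embeddings, then an MLP (pre-norm residual blocks)."""
    def __init__(self, dim: int = 512, text_dim: int = 1024, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True,
                                                kdim=text_dim, vdim=text_dim)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]                             # video self-attention
        x = x + self.cross_attn(self.norm2(x), text_emb, text_emb)[0]  # attend to text
        return x + self.mlp(self.norm3(x))

layer = ConditionedLayer()
out = layer(torch.randn(2, 1536, 512), torch.randn(2, 64, 1024))  # illustrative shapes
```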
Masked visual token modeling is used for training: for each example, a random mask ratio $\gamma \in (0, 1]$ is sampled and $\lceil \gamma N \rceil$ of the $N$ video tokens are masked. The transformer minimizes the negative log-likelihood of the true tokens at masked positions, conditioned on the visible tokens and the text embedding $\mathbf{p}$:

$$\mathcal{L}_{\mathrm{mask}} = -\sum_{i \in \mathcal{M}} \log p\!\left(a_i \mid a_{\bar{\mathcal{M}}}, \mathbf{p}\right),$$

where $a_i$ is the true code index at masked position $i \in \mathcal{M}$ and $a_{\bar{\mathcal{M}}}$ denotes the unmasked tokens. Approximately 10% of training steps drop the text condition, facilitating classifier-free guidance at inference.
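A minimal training-step sketch under these definitions; the cosine mask schedule follows MaskGIT, and zeroing the text embedding as the "dropped" condition is an assumption:

```python
import torch
import torch.nn.functional as F

def mvtm_step(model, tokens, text_emb, mask_id, vocab_size, p_drop=0.1):
    """One masked-visual-token-modeling step: mask a random fraction of
    tokens, predict them from visible tokens + text, score NLL."""
    n = tokens.numel()
    ratio = torch.cos(torch.rand(()) * torch.pi / 2)        # cosine schedule, (0, 1]
    masked_pos = torch.randperm(n)[: max(1, int(ratio * n))]
    inputs = tokens.clone().view(-1)
    inputs[masked_pos] = mask_id                            # replace with MASK
    if torch.rand(()) < p_drop:                             # ~10% of steps:
        text_emb = torch.zeros_like(text_emb)               # drop text condition
    logits = model(inputs.view_as(tokens), text_emb)        # (..., vocab_size)
    return F.cross_entropy(logits.view(-1, vocab_size)[masked_pos],
                           tokens.view(-1)[masked_pos])     # NLL at masked positions
```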
Inference proceeds via parallel decoding iterations: at each step, logits for all masked tokens are predicted; the highest-confidence predictions are kept and the remainder re-masked, iteratively refining the grid until all tokens are filled. This requires far fewer passes than autoregressive decoding and, paired with the causal tokenization, allows efficient long-form or multi-scene generation.
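A sketch of this sampling loop with classifier-free guidance; `model` and `null_emb` are placeholders, and the cosine re-masking schedule follows MaskGIT:

```python
import math
import torch

@torch.no_grad()
def parallel_decode(model, text_emb, null_emb, num_tokens, mask_id,
                    steps=24, scale=2.0):
    """Iteratively fill a fully masked token sequence: sample all masked
    positions, keep the highest-confidence predictions, re-mask the rest."""
    tokens = torch.full((num_tokens,), mask_id, dtype=torch.long)
    for step in range(steps):
        cond, uncond = model(tokens, text_emb), model(tokens, null_emb)
        logits = uncond + scale * (cond - uncond)            # guided logits
        probs = logits.softmax(-1)                           # (num_tokens, vocab)
        sampled = torch.multinomial(probs, 1).squeeze(-1)
        sampled = torch.where(tokens == mask_id, sampled, tokens)  # keep filled
        conf = probs.gather(-1, sampled[:, None]).squeeze(-1)
        conf[tokens != mask_id] = float("inf")               # never re-mask filled
        # Cosine schedule: the masked fraction shrinks to zero by the last step.
        num_remask = int(math.cos((step + 1) / steps * math.pi / 2) * num_tokens)
        tokens = sampled.clone()
        if num_remask > 0:
            tokens[conf.argsort()[:num_remask]] = mask_id    # lowest confidence
    return tokens
```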
4. Data, Joint Training, and Style Transfer
Phenaki is trained on approximately 15 million video-text clips at 8 fps and two large image–text datasets (WebVid+LAION-400M, collectively ≈450M pairs). Each batch contains about 80% video and 10% from each image pool (batch size 512).
A dynamic masking strategy is applied: image data are treated as 1-frame “videos,” and only their spatial tokens are masked and predicted, skipping temporal positions. This joint training enables strong transfer of image-derived styles (e.g., “pencil sketch”) to video synthesis, enhancing visual diversity and controllability in open-domain prompts. Mixing ratios impact trade-offs between temporal dynamics (FVD) and appearance fidelity (FID_img).
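A sketch of this masking rule, under the assumption that image examples are padded to the video token-grid length:

```python
import torch

def build_mask(token_grid: torch.Tensor, num_frames: int, ratio: float):
    """Bernoulli mask over a (time_steps, h_patches, w_patches) token grid;
    for 1-frame (image) data, temporal slots are never masked or predicted."""
    mask = torch.rand_like(token_grid, dtype=torch.float) < ratio
    if num_frames == 1:
        mask[1:] = False          # skip temporal positions for image data
    return mask

img_tokens = torch.zeros(6, 16, 16, dtype=torch.long)  # image padded to clip length
m = build_mask(img_tokens, num_frames=1, ratio=0.5)
assert not m[1:].any()            # only spatial (first-step) tokens are predicted
```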
5. Variable-Length Generation and Story Prompting
The causal time-factorized tokenizer enables encoding any number of context frames. To build long or multi-scene outputs, the pipeline alternates prompt-conditioned MaskGIT sampling and re-encoding of recent frames for continuity. The process:
- Prompt $p_1$ → sample and decode the tokens for the first segment of frames.
- After $t$ frames, switch to the next prompt $p_2$.
- Re-encode the last $k$ generated frames with C-ViViT, condition on $p_2$, and sample the next segment; repeat for each subsequent prompt (cf. the generation sketch in Section 1).
This enables temporally coherent visual stories, including smooth morphing between distinct prompt semantics, with transitions such as "teddy bear → panda." Because only the last $k$ frames are carried forward, the finite context may limit very long-range consistency.
6. Evaluation and Benchmarks
Quantitative evaluation is conducted using established metrics:
- Video Reconstruction (Quantization) on Moments-in-Time: the C-ViViT tokenizer achieves FVD = 65.8 (vs. 166 for per-frame ViT-VQGAN) using 1536 tokens per clip, compared to 2560 tokens for the baseline. There is a small penalty in per-frame FID but a substantial gain in spatio-temporal consistency and token efficiency.
- Text-to-Video (Kinetics-400, zero-shot): Phenaki achieves FID_vid ≈ 3.8, outperforming NUWA (7.0), and FID_img ≈ 37 for open-domain, zero-shot generation.
- Video Prediction (BAIR Robot Push, Kinetics-600): Despite not being specialized for this task, Phenaki is competitive with state-of-the-art, as measured by FVD.
- Text-to-Image: CLIP score and FID on held-out LAION images demonstrate transferable style learning from the image pools.
Ablation studies reveal trade-offs: pure video training yields lowest FVD but highest FID_img; mixed batches (e.g., 80/20) improve appearance but slightly degrade temporal metrics.
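For context, FVD fits Gaussians to I3D features of real and generated clips and reports the Fréchet distance between them; a minimal sketch of the distance itself (feature extraction omitted):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two feature matrices,
    each of shape (num_videos, feature_dim)."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)          # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real              # drop numerical imaginary parts
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2 * covmean))
```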
7. Limitations and Prospective Directions
Phenaki's visual fidelity and coherence currently trail the photorealistic capabilities of static image generators, with observable artifacts in complex or high-motion scenes. The approach is constrained by a finite temporal context window (only the last $k$ frames condition each new segment), leading to gradual degradation of very long-range continuity. Higher resolutions and frame rates incur substantial computational costs, and the reliance on large, unfiltered web-scale corpora introduces bias and content-safety issues.
Potential enhancements include the development of more powerful long-range temporal models (e.g., hierarchical or memory-augmented transformers), improved dataset curation and safety mechanisms, and extensions to audio-video, interactive, or key-frame-controllable synthesis. These avenues may address both qualitative limitations and application scalability (Villegas et al., 2022).