
Phenaki: Text-Conditioned Video Synthesis

Updated 22 February 2026
  • Phenaki is a generative modeling system that produces open-domain videos from evolving text prompts, enabling coherent visual storytelling.
  • It employs a causal, temporally-factorized video tokenizer alongside a bidirectional masked Transformer to iteratively predict and refine video token grids.
  • The approach supports segmental generation with re-encoding for smooth scene transitions and demonstrates strong performance in metrics like FVD and FID.

Phenaki is a generative modeling system for producing open-domain videos of variable length, directly conditioned on sequences of textual prompts. It introduces architectural and algorithmic innovations in video tokenization and text-to-video synthesis that enable spatio-temporally coherent generation over arbitrary durations and evolving prompts. Phenaki is distinguished by its causal, temporally-factorized video tokenizer and a bidirectional masked Transformer, which together support efficient and generalizable open-domain video generation from text (Villegas et al., 2022).

1. Model Architecture and Generation Pipeline

Phenaki's generative process comprises two principal components:

  1. Discrete Video Tokenizer (C): An input video is encoded into a compact, variable-length 3D tensor of discrete codebook indices via a novel causal transformer-based tokenization scheme. The decoder reconstructs RGB video from these discrete tokens.
  2. Bidirectional Masked Transformer (MaskGIT): A 24-layer Transformer predicts sequences of video tokens conditioned on text embeddings derived from a frozen T5X encoder. Generation proceeds by iteratively sampling masked tokens, using classifier-free guidance for improved diversity and control.

The generation workflow operates as follows. At inference, the MaskGIT model predicts a full grid of masked video tokens over tens of refinement steps. The C decoder then transforms these discrete tokens into video frames. For long or multi-scene generation, a segmental approach is used: after generating each segment, the last $K$ frames are re-encoded to provide temporal context for subsequent prompts, enabling seamless transitions and "visual stories" with evolving textual guidance.

2. Discrete Video Tokenizer: Architecture, Losses, and Temporal Factorization

The C tokenizer processes an input $\mathbf{x}\in\mathbb{R}^{(T_x+1)\times H_x\times W_x\times C_x}$, e.g., 11 frames at $128\times128\times3$, via patchification and stacked transformer blocks:

  • Patchification: The first frame is decomposed into $8\times8\times3$ 2D patches; subsequent frames are decomposed into $2\times8\times8\times3$ spatio-temporal patches.
  • Transformer Blocks: An initial spatial transformer block (4 layers, all-to-all attention) operates at each time index independently, followed by a temporal block (4 layers, causal self-attention in time) applied per patch spatial location. Causality is enforced by masking future frames in the temporal attention matrix.
  • Vector Quantization: Output vectors $z(\tau,j)\in\mathbb{R}^{d_z}$ are quantized against a learned codebook ($|E|=8192$ entries), yielding discrete indices for each patch and time position.
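
Under this patchification, the token budget for the running example can be checked with a quick calculation (pure Python; the dimensions are the ones quoted in this section):

```python
# Token count for an 11-frame, 128x128 clip under the patchification above.
# First frame: 8x8 spatial patches; remaining frames: 2x8x8 spatio-temporal patches.
H, W = 128, 128
patch = 8

spatial_patches = (H // patch) * (W // patch)   # 16 * 16 = 256 tokens per time step

first_frame_steps = 1                           # first frame is tokenized alone
remaining_frames = 10                           # the other 10 of the 11 frames
temporal_steps = remaining_frames // 2          # grouped in pairs -> 5 steps

total_tokens = spatial_patches * (first_frame_steps + temporal_steps)
print(total_tokens)  # 1536, matching the per-clip token count in the evaluation section
```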

Loss Function: The tokenizer is trained with a composite objective:

$$\mathcal{L}_{C} = \mathcal{L}_{VQ} + \lambda_2\,\|x-\hat{x}\|_2^2 + \lambda_{IP}\,\mathcal{L}_{IP}(x,\hat{x}) + \lambda_{VP}\,\mathcal{L}_{VP}(x,\hat{x}) + \lambda_{Adv}\,\mathcal{L}_{Adv}(\hat{x})$$

where $\mathcal{L}_{VQ}$ is the vector-quantization loss, $\mathcal{L}_{IP}$ and $\mathcal{L}_{VP}$ are perceptual losses computed with VGG and I3D features, and $\mathcal{L}_{Adv}$ is an adversarial loss with a StyleGAN-style discriminator. Default weighting is $(\lambda_2, \lambda_{IP}, \lambda_{VP}, \lambda_{Adv}) = (1.0, 0.1, 1.0, 0.1)$. The temporal factorization with causal self-attention ensures the tokenizer supports variable-length video encoding and decoding.
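
As a concrete illustration of the $\mathcal{L}_{VQ}$ term, here is a minimal numpy sketch of the vector-quantization step: nearest-codebook lookup plus the standard VQ-VAE codebook and commitment terms. The shapes, the $d_z$ value, and the `vector_quantize` helper are illustrative assumptions, not Phenaki's actual implementation (which also uses stop-gradient operators during training):

```python
import numpy as np

def vector_quantize(z, codebook, beta=0.25):
    """Quantize encoder outputs z (N, d_z) to their nearest codebook entries (|E|, d_z).

    Returns the quantized vectors, their indices, and a VQ loss combining the
    codebook term and a beta-weighted commitment term, as in VQ-VAE.
    """
    # Pairwise squared distances between each z and each codebook entry.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)
    z_q = codebook[indices]

    codebook_loss = ((z_q - z) ** 2).mean()            # pulls codes toward encoder outputs
    commitment_loss = beta * ((z - z_q) ** 2).mean()   # pulls encoder toward chosen codes
    return z_q, indices, codebook_loss + commitment_loss

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8192, 32))   # |E| = 8192 entries; d_z = 32 is illustrative
z = rng.normal(size=(16, 32))
z_q, idx, vq_loss = vector_quantize(z, codebook)
print(z_q.shape, idx.shape)  # (16, 32) (16,)
```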

3. Text Conditioning and MaskGIT Modeling

Text prompts are tokenized using a frozen T5X encoder (vocabulary size ≈32,000, output embedding length $L \approx 64$, dimension 1024). The bidirectional masked Transformer (MaskGIT-style) receives these embeddings via learned cross-attention inserted after the standard self-attention in each layer.

Masked visual token modeling is used for training: for each example, a random mask ratio $\gamma_i\in[0,1]$ is sampled, and $\lceil \gamma_i N \rceil$ tokens are masked (out of $N$ total). The transformer minimizes the negative log-likelihood of the true tokens at masked positions, conditioned on visible positions and text:

$$\mathcal{L}_{mask} = -\sum_{i:\, m_i=1} \log p\left(a_i \mid a_{\bar{M}}, \mathrm{text}\right)$$

where $a_i$ is the true code index. Approximately 10% of training steps drop the text condition, facilitating classifier-free guidance at inference.
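
A minimal numpy sketch of this training objective, with a random-logit stand-in for the transformer (shapes and variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

N, V = 12, 8192                       # tokens per example; codebook size (toy shapes)
tokens = rng.integers(0, V, size=N)   # ground-truth video token indices a_i

# Sample a mask ratio gamma and mask ceil(gamma * N) positions.
gamma = rng.uniform(0.0, 1.0)
n_mask = max(1, int(np.ceil(gamma * N)))
masked = rng.choice(N, size=n_mask, replace=False)

# Stand-in for the transformer's logits over the codebook at each position;
# a real model conditions on the visible tokens and the text embedding
# (with the text dropped ~10% of the time for classifier-free guidance).
logits = rng.normal(size=(N, V))

def log_softmax(x):
    x = x - x.max(-1, keepdims=True)
    return x - np.log(np.exp(x).sum(-1, keepdims=True))

# Negative log-likelihood of the true indices at masked positions only.
loss = -log_softmax(logits)[masked, tokens[masked]].mean()
print(loss > 0)  # True: the loss is a mean of -log p terms, each positive
```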

Inference proceeds via parallel denoising iterations: at each step, logits for masked tokens are predicted; the highest-confidence predictions are kept and the remainder are re-masked, iteratively refining the grid until all tokens are filled. This produces improved sample efficiency over autoregressive decoding and, when paired with causal tokenization, allows for efficient long-form or multi-scene generation.
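The iterative refinement loop can be sketched as follows; `model` is a toy stand-in for the masked transformer, and the linear keep-schedule and guidance scale are illustrative choices rather than the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)
N, V = 16, 256          # token positions and vocabulary size (toy scale)
MASK = -1               # sentinel for a still-masked position
STEPS = 4
GUIDANCE = 2.0          # classifier-free guidance scale (illustrative value)

def model(tokens, conditioned):
    """Stand-in for the masked transformer: returns (N, V) logits.
    A real model attends over visible tokens and, if conditioned, text embeddings."""
    bias = 1.0 if conditioned else 0.0
    return rng.normal(size=(N, V)) + bias

tokens = np.full(N, MASK)
for step in range(STEPS):
    cond = model(tokens, conditioned=True)
    uncond = model(tokens, conditioned=False)
    logits = uncond + GUIDANCE * (cond - uncond)   # classifier-free guidance

    preds = logits.argmax(-1)
    conf = logits.max(-1)

    # Keep progressively more tokens each step; re-mask the low-confidence rest.
    n_keep = int(np.ceil((step + 1) / STEPS * N))
    conf[tokens != MASK] = np.inf                  # already-fixed tokens always stay
    keep = np.argsort(-conf)[:n_keep]

    new_tokens = np.full(N, MASK)
    new_tokens[keep] = np.where(tokens[keep] != MASK, tokens[keep], preds[keep])
    tokens = new_tokens

print((tokens != MASK).all())  # True: every position is filled after STEPS iterations
```

All positions are predicted in parallel at each step, which is what gives the sample-efficiency advantage over token-by-token autoregressive decoding described above.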

4. Data, Joint Training, and Style Transfer

Phenaki is trained on approximately 15 million video–text clips at 8 fps, together with two large image–text pools (an internal ≈50M-pair set plus LAION-400M, collectively ≈450M pairs). Each batch contains about 80% video and 10% from each image pool (batch size 512).

A dynamic masking strategy is applied: image data are treated as 1-frame “videos,” and only their spatial tokens are masked and predicted, skipping temporal positions. This joint training enables strong transfer of image-derived styles (e.g., “pencil sketch”) to video synthesis, enhancing visual diversity and controllability in open-domain prompts. Mixing ratios impact trade-offs between temporal dynamics (FVD) and appearance fidelity (FID_img).
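A sketch of this dynamic masking, assuming the token grid is laid out as (temporal steps × spatial tokens); the `sample_mask` helper and the grid sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mask(t_steps, s_tokens, gamma, is_image):
    """Boolean mask over a (t_steps, s_tokens) token grid.

    Image examples are treated as 1-frame 'videos': only the first time
    step's spatial tokens are eligible; temporal positions are skipped.
    """
    rows = 1 if is_image else t_steps
    eligible = rows * s_tokens
    n = int(np.ceil(gamma * eligible))
    picks = rng.choice(eligible, size=n, replace=False)
    mask = np.zeros(t_steps * s_tokens, dtype=bool)
    mask[picks] = True   # eligible positions occupy the first `rows` time steps
    return mask.reshape(t_steps, s_tokens)

video_mask = sample_mask(6, 256, 0.5, is_image=False)
image_mask = sample_mask(6, 256, 0.5, is_image=True)
print(video_mask.sum(), image_mask.sum())  # 768 128
print(image_mask[1:].any())                # False: no temporal positions masked
```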

5. Variable-Length Generation and Story Prompting

The causal time-factorized tokenizer enables encoding any number of context frames. To build long or multi-scene outputs, the pipeline alternates prompt-conditioned MaskGIT sampling and re-encoding of recent frames for continuity. The process:

  1. Prompt $p_0$ → sample $L_0$ frames.
  2. After $K$ frames, switch to prompt $p_1$.
  3. Re-encode the last $K$ frames with $C$, append the new prompt, and generate $L_1$ frames.

This enables temporally coherent visual stories, including smooth morphing between distinct prompt semantics, with transitions such as "teddy bear → panda." Finite context (fixed KK) may affect very long-range consistency.
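
The segmental control flow above can be sketched as follows; `encode`, `sample_segment`, and `decode` are toy stand-ins for the C tokenizer and the MaskGIT sampler, used only to show the re-encoding loop:

```python
K = 5  # frames of overlap re-encoded as context between segments

# Hypothetical stand-ins for the real components: frames and tokens are
# represented as integer ids rather than pixel arrays and codebook indices.
def encode(frames):                                  # frames -> context tokens
    return list(frames)

def sample_segment(context_tokens, prompt, length):  # tokens + text -> new tokens
    start = context_tokens[-1] + 1 if context_tokens else 0
    return [start + i for i in range(length)]

def decode(tokens):                                  # tokens -> frames
    return list(tokens)

prompts = ["teddy bear swimming", "panda swimming"]  # evolving story prompts
segment_lengths = [11, 11]

video, context = [], []
for prompt, length in zip(prompts, segment_lengths):
    tokens = sample_segment(encode(context), prompt, length)
    video.extend(decode(tokens))
    context = video[-K:]   # last K frames become context for the next prompt

print(len(video))  # 22 frames across two prompt-conditioned segments
```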

6. Evaluation and Benchmarks

Quantitative evaluation is conducted using established metrics:

  • Video Reconstruction (Quantization) on Moments-In-Time: C tokenizer achieves FVD=65.8 (vs. 166 for per-frame ViT-VQGAN) with 1536 tokens per clip, compared to 2560 tokens for baselines. There is a small penalty in FID but substantial gain in spatio-temporal consistency and token efficiency.
  • Text-to-Video (Kinetics-400, zero-shot): Phenaki achieves FID_vid ≈ 3.8, outperforming NUWA (7.0), and FID_img ≈ 37 for open-domain, zero-shot generation.
  • Video Prediction (BAIR Robot Push, Kinetics-600): Despite not being specialized for this task, Phenaki is competitive with state-of-the-art, as measured by FVD.
  • Text-to-Image: CLIP score and FID on held-out LAION images demonstrate transferable style learning from the image pools.

Ablation studies reveal trade-offs: pure video training yields lowest FVD but highest FID_img; mixed batches (e.g., 80/20) improve appearance but slightly degrade temporal metrics.

7. Limitations and Prospective Directions

Phenaki's performance on visual fidelity and coherence currently trails the photorealistic capabilities of static image generators, with observable artifacts in complex or high-motion scenes. The approach is constrained by a finite context window KK, leading to gradual degradation in very long temporal continuity. Higher resolutions and frame rates incur substantial computational costs. The model's reliance on large, unfiltered web-scale corpora introduces bias and content safety issues.

Potential enhancements include the development of more powerful long-range temporal models (e.g., hierarchical or memory-augmented transformers), improved dataset curation and safety mechanisms, and extensions to audio-video, interactive, or key-frame-controllable synthesis. These avenues may address both qualitative limitations and application scalability (Villegas et al., 2022).

References

  1. Villegas, R., Babaeizadeh, M., Kindermans, P.-J., et al. (2022). Phenaki: Variable Length Video Generation from Open Domain Textual Description. arXiv:2210.02399.
