
VPTR-FAR: Fully Autoregressive Video Prediction

Updated 14 May 2026
  • The paper presents a fully autoregressive model for video prediction that leverages a two-stage design combining a frozen frame autoencoder with VidHRFormer blocks.
  • It factorizes self-attention into local spatial attention and causal temporal attention, reducing computational cost by an order of magnitude compared with a naïve full spatio-temporal transformer.
  • Experimental results on benchmarks like KTH and BAIR confirm competitive or state-of-the-art video prediction performance with improved efficiency over traditional models.

VPTR-FAR (the Fully Autoregressive variant of VPTR) is a family of transformer-based architectures for video prediction that performs autoregression entirely over frame latents, using self-attention factorized into separate spatial and temporal components. Developed within the VPTR framework, it addresses the computational cost of video sequence modeling by decomposing attention and predicting continuous latents. In the context of VPTR, "FAR" refers specifically to fully autoregressive latent prediction of future video frames; the same acronym is used in the image domain for "Frequency Autoregressive" schemes, which are a different method. The factorized attention yields roughly an order-of-magnitude efficiency gain over a naïve transformer. VPTR-FAR delivers state-of-the-art or competitive results on standard conditional video prediction benchmarks while being substantially more efficient than ConvLSTM-based or flattened-token transformer counterparts (Ye et al., 2022).

1. Architectural Foundations: Two-Stage Latent Prediction

VPTR-FAR follows a two-stage design paradigm:

  1. Frame Autoencoder: A frame-level encoder–decoder is trained to map video frames $x \in \mathbb{R}^{H_0 \times W_0 \times 3}$ into a latent representation $z \in \mathbb{R}^{H \times W \times d_{\text{model}}}$, and back. The autoencoder is frozen after training.
  2. Latent Autoregression: The core VPTR-FAR model is a stack of $N_{\ell}$ VidHRFormer blocks. These blocks alternate:
    • Local spatial MHSA (multi-head self-attention) within non-overlapping $K \times K$ patches.
    • Convolutional feed-forward networks (Conv-FFN).
    • Temporal MHSA across $T$ frames, with a causal mask ensuring each frame $t$ only attends to frames up to $t$ (strictly autoregressive).
    • Standard MLP feed-forwards.

At inference, given latents $z_1, \ldots, z_L$, the model predicts future latents $\hat z_{L+1}, \ldots, \hat z_{L+N}$, decoding each to a frame $\hat x_t = \text{Dec}(\hat z_t)$. Two recursion variants are evaluated: RIP (recurrent over pixel space, involving decoding and re-encoding) and RIL (recurrent in latent space). RIP is empirically found to accumulate error more slowly than RIL (Ye et al., 2022).
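The two recursion variants differ only in what is fed back into the context after each step. The following is a minimal sketch of that difference, not the authors' implementation: latents are toy one-element lists, and `linear_extrapolation`, `clamp_decode`, and `identity_encode` are hypothetical stand-ins for the transformer, the frozen decoder, and the frozen encoder.

```python
from typing import Callable, List, Optional, Sequence

Latent = List[float]  # toy stand-in for an H x W x d_model latent tensor


def rollout(
    past_latents: Sequence[Latent],
    predict_next: Callable[[List[Latent]], Latent],
    n_future: int,
    encode: Optional[Callable[[List[float]], Latent]] = None,
    decode: Optional[Callable[[Latent], List[float]]] = None,
) -> List[Latent]:
    """Predict n_future latents autoregressively.

    With both encode and decode supplied, each prediction is decoded and
    re-encoded before being fed back (RIP). Otherwise the predicted latent
    is fed back directly (RIL).
    """
    use_rip = encode is not None and decode is not None
    context = [list(z) for z in past_latents]
    predictions: List[Latent] = []
    for _ in range(n_future):
        z_hat = predict_next(context)
        predictions.append(z_hat)
        context.append(encode(decode(z_hat)) if use_rip else z_hat)
    return predictions


# Hypothetical toy components, for illustration only.
def linear_extrapolation(context: List[Latent]) -> Latent:
    prev, last = context[-2], context[-1]
    return [2 * b - a for a, b in zip(prev, last)]


def clamp_decode(z: Latent) -> List[float]:
    # Pretend that valid "pixel" values live in [0, 3].
    return [min(max(v, 0.0), 3.0) for v in z]


def identity_encode(x: List[float]) -> Latent:
    return list(x)


past = [[0.0], [1.0]]
ril = rollout(past, linear_extrapolation, n_future=4)
rip = rollout(past, linear_extrapolation, n_future=4,
              encode=identity_encode, decode=clamp_decode)
print("RIL:", ril)
print("RIP:", rip)
```

In this toy setting the decode and re-encode round trip projects each fed-back latent onto the range the decoder can represent, so the two rollouts diverge once a prediction leaves that range. Whether this helps in practice depends on the real autoencoder.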

2. Attention Mechanism and Computational Efficiency

The VidHRFormer block at the heart of VPTR-FAR adopts a spatial-local plus temporal-separate attention factorization:

  • Local Spatial MHSA: For each frame, the latent grid is divided into $\frac{H}{K} \times \frac{W}{K}$ patches of size $K \times K$. Within each patch, tokens participate in conventional MHSA with standard query/key/value projections and no spatial mask.
  • Temporal MHSA: The tokens at each spatial location across the $T$ frames undergo MHSA with a causal mask added to the attention logits before the softmax:

$$M_{t,s} = \begin{cases} 0, & s \le t \\ -\infty, & s > t \end{cases}$$

This allows a full factorization $p(z_1, \ldots, z_T) = \prod_{t=1}^{T} p(z_t \mid z_1, \ldots, z_{t-1})$.

  • Complexity Reduction: The overall per-block cost is

$$O\!\left(T H W K^2 d_{\text{model}} + H W T^2 d_{\text{model}}\right),$$

which is significantly lower than the $O\!\left((T H W)^2 d_{\text{model}}\right)$ cost of a naïve spatio-temporal transformer. For practical settings of $T$, $H$, $W$, and $K$, the reduction is by an order of magnitude (Ye et al., 2022).
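The scaling argument above can be checked by counting query-key pairs directly. The sketch below enumerates every token pair on a small grid, counts how many fall inside the same spatial patch or at the same spatial location, and compares the totals with closed-form counts. The grid sizes are illustrative assumptions, not the paper's configuration, and the common $d_{\text{model}}$ factor is omitted since it multiplies every term.

```python
from itertools import product


def count_pairs_by_enumeration(T, H, W, K):
    """Count query-key pairs for local spatial, temporal, and full attention."""
    if H % K or W % K:
        raise ValueError("H and W must be divisible by the patch size K")
    tokens = list(product(range(T), range(H), range(W)))
    spatial = temporal = 0
    for t, h, w in tokens:
        for t2, h2, w2 in tokens:
            # Local spatial attention: same frame, same K x K patch.
            if t == t2 and h // K == h2 // K and w // K == w2 // K:
                spatial += 1
            # Temporal attention: same spatial location, any frame
            # (the causal mask is ignored; it only halves this term).
            if h == h2 and w == w2:
                temporal += 1
    return spatial, temporal, len(tokens) ** 2


def count_pairs_closed_form(T, H, W, K):
    """Closed-form pair counts matching the enumeration above."""
    return T * H * W * K * K, H * W * T * T, (T * H * W) ** 2


# Small grid: the enumeration and the closed form agree.
print(count_pairs_by_enumeration(4, 4, 4, 2))
print(count_pairs_closed_form(4, 4, 4, 2))

# Larger, purely illustrative grid (assumed sizes).
spatial, temporal, full = count_pairs_closed_form(T=20, H=16, W=16, K=4)
ratio = full / (spatial + temporal)
print(f"full / factorized = {ratio:.1f}")
```

The ratio simplifies to $THW / (K^2 + T)$, so the saving grows with the latent resolution and shrinks as the patch size or sequence length grows.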

3. Training Objective and Autoregressive Inference

VPTR-FAR employs teacher-forced training in latent space:

  • At each training iteration, ground-truth past latents are provided to predict the next.
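  • Each predicted latent $\hat z_t$ is decoded to $\hat x_t = \text{Dec}(\hat z_t)$; pixel-space supervision is applied.
  • The loss comprises:

    • Mean-squared error over the predicted frames $t$ (up to a normalization constant):

    $$\mathcal{L}_{\text{MSE}} = \sum_{t} \lVert \hat x_t - x_t \rVert_2^2$$

    • Gradient difference loss (GDL) as in Mathieu et al., for sharpening predictions:

    $$\mathcal{L}_{\text{GDL}} = \sum_{t} \sum_{i,j} \Big| \, |x_t^{i,j} - x_t^{i-1,j}| - |\hat x_t^{i,j} - \hat x_t^{i-1,j}| \, \Big|^{\alpha} + \Big| \, |x_t^{i,j-1} - x_t^{i,j}| - |\hat x_t^{i,j-1} - \hat x_t^{i,j}| \, \Big|^{\alpha}$$

    • The total loss is:

    $$\mathcal{L} = \mathcal{L}_{\text{MSE}} + \lambda_{\text{GDL}} \, \mathcal{L}_{\text{GDL}}$$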

(Ye et al., 2022)
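To make the roles of the two loss terms concrete, the following sketch computes a mean-squared error and a gradient difference loss on single-channel frames stored as nested lists. It follows the general GDL form of Mathieu et al.; the weight `lambda_gdl` and exponent `alpha` are assumed hyperparameters with placeholder defaults, not values taken from the paper.

```python
def mse(pred, target):
    """Mean squared error over all pixels of one frame."""
    n = sum(len(row) for row in target)
    return sum((p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row)) / n


def gdl(pred, target, alpha=1.0):
    """Gradient difference loss: compares absolute neighbor differences."""
    total = 0.0
    for i in range(len(target)):
        for j in range(len(target[0])):
            if i > 0:  # vertical neighbor
                g_true = abs(target[i][j] - target[i - 1][j])
                g_pred = abs(pred[i][j] - pred[i - 1][j])
                total += abs(g_true - g_pred) ** alpha
            if j > 0:  # horizontal neighbor
                g_true = abs(target[i][j] - target[i][j - 1])
                g_pred = abs(pred[i][j] - pred[i][j - 1])
                total += abs(g_true - g_pred) ** alpha
    return total


def total_loss(pred, target, lambda_gdl=1.0, alpha=1.0):
    """MSE plus weighted GDL (lambda_gdl is an assumed weight)."""
    return mse(pred, target) + lambda_gdl * gdl(pred, target, alpha)


edge = [[0.0, 1.0], [0.0, 1.0]]       # sharp vertical edge
blurred = [[0.5, 0.5], [0.5, 0.5]]    # edge averaged away
shifted = [[0.5, 1.5], [0.5, 1.5]]    # edge kept, brightness offset

print(mse(blurred, edge), gdl(blurred, edge))
print(mse(shifted, edge), gdl(shifted, edge))
```

The blurred and shifted predictions have the same MSE, but only the blurred one is penalized by the GDL term, which is why the term is described as sharpening predictions.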

True autoregressive prediction is performed at inference by recursively generating latents, optionally using RIP (decode and re-encode) or RIL (direct latent recursion). The causal masking in the temporal attention ensures strict adherence to the autoregressive factorization.
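The effect of the causal mask can be illustrated with a small softmax over temporal attention scores. This is a minimal sketch on plain Python lists, assuming a single head and a `scores[t][s]` matrix of query-key logits; it is not tied to any particular library.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[t][s], with future positions s > t masked."""
    T = len(scores)
    weights = []
    for t in range(T):
        allowed = scores[t][: t + 1]           # frames up to and including t
        peak = max(allowed)                    # subtract max for stability
        exps = [math.exp(s - peak) for s in allowed]
        norm = sum(exps)
        weights.append([e / norm for e in exps] + [0.0] * (T - t - 1))
    return weights


uniform = [[0.0] * 3 for _ in range(3)]
w = causal_attention_weights(uniform)
for row in w:
    print(row)

# Changing a score that points at a future frame leaves the output unchanged.
tampered = [[0.0, 0.0, 100.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
assert causal_attention_weights(tampered)[0] == w[0]
```

Because row $t$ never depends on entries for later frames, all time steps can be trained in parallel under teacher forcing while matching what the model sees during step-by-step inference.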

4. Quantitative Results and Benchmarks

VPTR-FAR demonstrates competitive or state-of-the-art results on established video prediction datasets, often matching or surpassing convolutional LSTM models with far greater efficiency:

| Dataset | PSNR (dB) | SSIM | LPIPS | Key Results |
|---|---|---|---|---|
| KTH (10→20) | 26.13 | 0.859 | 0.0796 | Outperforms/matches ConvLSTM SOTA in LPIPS, competitive in PSNR/SSIM |
| Moving MNIST | - | 0.844 | 0.1578 | Quality drop with digit overlap, generally matches baselines |
| BAIR (2→28) | 15.76 | 0.724 | 0.1107 | Comparable to early stochastic models (SVG-LP, SV2P) despite being deterministic |

Early frames predicted by VPTR-FAR are notably sharp; quality degrades for later frames due to error accumulation intrinsic to autoregressive inference, especially pronounced with direct latent recursion. RIP inference accumulates error notably more slowly (Ye et al., 2022).
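For reference, PSNR is a logarithmic function of the per-frame mean squared error, so growing rollout error shows up directly as a falling PSNR. The sketch below uses the standard definition $\text{PSNR} = 10 \log_{10}(\text{MAX}^2 / \text{MSE})$ on toy frames; the pixel range `max_value=1.0` is an assumption, and the numbers are illustrative rather than taken from the benchmarks.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for one frame (nested lists)."""
    squared = [(p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


target = [[0.0, 0.0], [0.0, 0.0]]
early = [[0.1, 0.1], [0.1, 0.1]]   # small error, as in an early frame
late = [[0.2, 0.2], [0.2, 0.2]]    # doubled error, as in a later frame

print(f"early: {psnr(early, target):.2f} dB")
print(f"late:  {psnr(late, target):.2f} dB")
```

Doubling the per-pixel error lowers PSNR by about 6 dB, which gives a sense of scale for per-frame quality curves.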

5. Comparison with Other Autoregressive and Non-autoregressive Schemes

VPTR-FAR should be contrasted with:

  • Partial and Non-autoregressive variants: While partial autoregressive models (e.g., conditioning only part of the future sequence) achieve similar performance at lower latency, non-autoregressive models mitigate the blurring artifacts common in AR approaches but impose additional modeling and loss complexities. Exhaustive experiments reveal that under identical spatial-temporal attention, VPTR-FAR is competitive or superior to these alternatives for various datasets (Ye et al., 2022).
  • Naïve Flattened Transformers: VPTR-FAR's spatial-local and temporal-separate attention block provides significant gains in memory and FLOPs, enabling much deeper or larger models within the same computational budget.
  • Continuous-Latent Autoregressive Models in the Image Domain: Although they differ in domain and methodology, VPTR-FAR for video (Ye et al., 2022) and frequency-autoregressive image models (e.g., Yu et al., 7 Mar 2025) both rest on continuous-token AR prediction; the latter additionally exploit spectral ordering in image generation.

6. Empirical Tradeoffs, Design Limitations, and Extensions

While VPTR-FAR achieves efficient autoregressive video prediction, key limitations and practical tradeoffs include:

  • Autoregressive Error Accumulation: Quality decays for later predicted frames due to error propagation, a ubiquitous issue in autoregressive generative models. RIP mitigates but does not eliminate this effect.
  • Frame Sharpness and Temporal Consistency: Early predictions are crisp, but as autoregressive steps proceed, reconstructions blur, and temporal drift may appear, especially under latent recursion.
  • Model Scope: The approach predicts strictly in the autoencoder's latent space, and performance is bounded by the expressiveness and fidelity of that representation.
  • Partial and Non-AR Variants: Partial-autoregressive variants can be preferable where inference speed is a constraint, while non-autoregressive extensions may yield higher perceptual quality for longer rollouts at the cost of added parameters and loss complexity.
  • Efficiency: The explicit separation of spatial and temporal attention is integral; naïvely applying full spatio-temporal self-attention over all frame tokens is computationally infeasible except for very low-resolution input.

VPTR-FAR sets a benchmark for efficient and principled autoregressive video prediction by combining autoregressive latent modeling, an efficient VidHRFormer attention block, and straightforward pixel-level objectives. These principles remain influential for subsequent advancements and cross-modal generative modeling architectures (Ye et al., 2022).
