VPTR-PAR: Efficient Video Prediction
- VPTR-PAR is a partially autoregressive video prediction model that uses separated spatial and temporal attention to balance causality and computational efficiency.
- The architecture leverages VidHRFormer blocks in an encoder-decoder structure with causal masking and cross-attention to integrate past frame features for future predictions.
- Empirical results on KTH and BAIR datasets demonstrate competitive fidelity with a 1.2× speedup over full autoregressive designs, confirming its practical efficiency.
VPTR-PAR is a partially autoregressive video prediction architecture built from efficient Transformer modules. It aims to balance the strong temporal modeling of fully autoregressive (AR) models against the speed of non-autoregressive (NAR) approaches. The model predicts future video frames using separated spatial and temporal attention, applying causality constraints only along the temporal dimension. This yields competitive prediction fidelity with faster inference than fully AR designs, making VPTR-PAR suitable for high-throughput predictive applications in computer vision and robotics (Ye et al., 2022).
1. Model Architecture and Attention Mechanisms
VPTR-PAR employs an encoder–decoder structure in which both parts are built from VidHRFormer blocks, designed for efficient local spatial–temporal factorization. The encoder processes the features of past frames, while the decoder generates future-frame features autoregressively, conditioning on the encoded history and on previously generated outputs.
Key characteristics:
- Encoder: Receives latent features of the past frames.
- Decoder: Predicts future frames. Teacher forcing during training allows all steps to be computed in parallel, while inference reverts to stepwise generation. A causal mask restricts temporal self-attention in the decoder, so each prediction can use only current or earlier frame features.
- VidHRFormer Block: The central module alternates (a) local spatial multi-head self-attention (MHSA) within non-overlapping patches of each frame, (b) convolutional feed-forward networks for spatial feature mixing, and (c) global temporal MHSA over frame sequences, optionally with learned relative positional encoding.
- Cross-Attention in Decoder: Before the last spatial feed-forward step, the decoder’s latent tokens attend to encoder outputs, integrating summary information from the encoded past.
This architecture allows parallel processing in spatial regions while enforcing temporal causality in generation.
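The factorization described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: learned query/key/value projections, multiple heads, the convolutional feed-forward layers, and cross-attention are all omitted. The function names (`local_spatial_attention`, `causal_temporal_attention`) and the `(T, H, W, C)` tensor layout are assumptions chosen for illustration, and the sketch assumes temporal attention is applied independently at each spatial location.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; the last two axes are (tokens, channels)."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        # Disallowed positions get -inf so they receive zero weight after softmax.
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def local_spatial_attention(x, patch):
    """Self-attention inside each non-overlapping patch x patch window of every frame."""
    T, H, W, C = x.shape
    hp, wp = H // patch, W // patch
    # Group tokens by window: (T, hp, wp, patch*patch, C).
    w = x.reshape(T, hp, patch, wp, patch, C).transpose(0, 1, 3, 2, 4, 5)
    w = w.reshape(T, hp, wp, patch * patch, C)
    w = attention(w, w, w)
    # Undo the grouping to recover the (T, H, W, C) layout.
    w = w.reshape(T, hp, wp, patch, patch, C).transpose(0, 1, 3, 2, 4, 5)
    return w.reshape(T, H, W, C)

def causal_temporal_attention(x):
    """Self-attention over time at each spatial location, masked so step t sees only steps <= t."""
    T, H, W, C = x.shape
    seq = x.reshape(T, H * W, C).transpose(1, 0, 2)   # (H*W, T, C)
    mask = np.tril(np.ones((T, T), dtype=bool))       # lower-triangular causal mask
    out = attention(seq, seq, seq, mask)
    return out.transpose(1, 0, 2).reshape(T, H, W, C)

# Toy input: 4 frames of 8x8 features with 16 channels (hypothetical sizes).
x = np.random.default_rng(0).normal(size=(4, 8, 8, 16))
y = causal_temporal_attention(local_spatial_attention(x, patch=4))
```

Because the spatial step never mixes frames and the temporal step is masked, perturbing the last frame of `x` leaves the outputs for all earlier frames unchanged, which is the causality property the decoder relies on.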
2. Mathematical Formulation
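The VPTR-PAR model predicts $N$ future frames $x_{L+1:L+N}$ from $L$ past frames $x_{1:L}$, targeting the conditional distribution:
$$p(x_{L+1:L+N} \mid x_{1:L}),$$
with the conditional modeled autoregressively:
$$p(x_{L+1:L+N} \mid x_{1:L}) = \prod_{t=L+1}^{L+N} p(x_t \mid x_{1:t-1}),$$
or, in chunked notation,
$$p(x_{L+1:L+N} \mid x_{1:L}) = \prod_{t=L+1}^{L+N} p(x_t \mid x_{1:L},\, x_{L+1:t-1}).$$
The attention operations are defined for both self- and cross-attention layers:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right) V,$$
where $d_k$ is the per-head key dimension and $M$ is the causal mask in the decoder's temporal self-attention ($-\infty$ at future positions, $0$ elsewhere) and zero in all other layers.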
Spatial positional encodings may be absolute (sin-cos) or relative within patches; temporal encodings are fixed absolute 1D.
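As an illustration of a fixed absolute 1D encoding, the following sketch builds the sinusoidal table from the original Transformer and adds it along the time axis. The base of 10000, the interleaved sin/cos layout, and the helper names `sincos_1d` and `add_temporal_encoding` are assumptions for illustration; the source does not specify these details.

```python
import numpy as np

def sincos_1d(num_positions, dim):
    """Fixed sinusoidal encoding of shape (num_positions, dim); dim must be even."""
    pos = np.arange(num_positions)[:, None]                       # (num_positions, 1)
    freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)  # (dim / 2,)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(pos * freq)   # even channels
    pe[:, 1::2] = np.cos(pos * freq)   # odd channels
    return pe

def add_temporal_encoding(feats):
    """Add the same time encoding to every spatial location of (T, H, W, C) features."""
    T, _, _, C = feats.shape
    return feats + sincos_1d(T, C)[:, None, None, :]

feats = np.zeros((10, 8, 8, 16))       # hypothetical feature sizes
encoded = add_temporal_encoding(feats)
```

Since the table is not learned, it can be computed for any sequence length at inference time without retraining.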
3. Training Objectives and Loss Functions
Training minimizes a composite loss over future frame predictions:
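- Mean-squared pixel error (MSE):
$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{t=L+1}^{L+N} \lVert \hat{x}_t - x_t \rVert_2^2$$
- Gradient-difference loss (GDL):
$$\mathcal{L}_{\mathrm{GDL}} = \sum_{t=L+1}^{L+N} \sum_{i,j} \Big( \big|\, |x_{t,i,j} - x_{t,i-1,j}| - |\hat{x}_{t,i,j} - \hat{x}_{t,i-1,j}| \,\big|^{\alpha} + \big|\, |x_{t,i,j} - x_{t,i,j-1}| - |\hat{x}_{t,i,j} - \hat{x}_{t,i,j-1}| \,\big|^{\alpha} \Big)$$
Here $\hat{x}_t$ and $x_t$ denote the predicted and ground-truth frames, $L$ is the number of past frames, $N$ the number of predicted frames, $(i, j)$ indexes pixels, and $\alpha$ is the GDL exponent.
The total loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \mathcal{L}_{\mathrm{GDL}}$$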
No auxiliary losses are employed beyond this sum (Ye et al., 2022).
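The two loss terms can be sketched in NumPy as follows. This is an illustrative version rather than the authors' code: it averages over all elements instead of summing, uses an exponent of 1 for the GDL term, adds the two terms with equal weight, and assumes frames are stored as `(T, H, W, C)` arrays. The function names are hypothetical.

```python
import numpy as np

def mse_loss(pred, target):
    """Mean-squared pixel error over all frames and pixels."""
    return float(np.mean((pred - target) ** 2))

def gdl_loss(pred, target, alpha=1.0):
    """Gradient-difference loss: compares absolute image gradients of prediction and target."""
    def abs_grads(x):
        dy = np.abs(x[:, 1:, :, :] - x[:, :-1, :, :])   # vertical neighbours
        dx = np.abs(x[:, :, 1:, :] - x[:, :, :-1, :])   # horizontal neighbours
        return dy, dx
    pdy, pdx = abs_grads(pred)
    tdy, tdx = abs_grads(target)
    return float(np.mean(np.abs(tdy - pdy) ** alpha) + np.mean(np.abs(tdx - pdx) ** alpha))

def total_loss(pred, target):
    """Unweighted sum of the two terms (the equal weighting is an assumption)."""
    return mse_loss(pred, target) + gdl_loss(pred, target)

rng = np.random.default_rng(0)
target = rng.random((3, 8, 8, 1))
shifted = target + 0.5   # uniform brightness offset: MSE penalizes it, GDL does not
```

The `shifted` example shows why the terms are complementary: a constant offset leaves every image gradient unchanged, so only the MSE term responds, whereas blurring would change the gradients and be penalized by the GDL term.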
4. Inference Workflow and Causality Controls
At inference, the “recurrent inference over pixel space” (RIP) approach is adopted to guard against feature-space drift. The pipeline proceeds as follows:
- Past frames are encoded via a CNN.
- The resulting CNN features pass through the Transformer encoder to yield a spatio-temporal memory.
- For each future time step, the decoder receives the most recent encoded latent feature along with the previously predicted future latents, using one forward pass per step. Each predicted latent is mapped back to the pixel domain and re-encoded before being appended to the autoregressive context.
The sequence is processed in parallel within each pass, but causality is enforced by a mask that prohibits decoder attention to future steps.
VPTR-PAR enables a notable inference speedup compared to full autoregressive designs, which require one pass per predicted frame through the full stack of Transformer blocks.
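The inference steps listed above can be sketched as a short loop. The four callables (`cnn_encode`, `transformer_encode`, `transformer_decode`, `cnn_decode`) are hypothetical stand-ins for the model components, not names from the released code, and the toy scalar stubs below exist only so the control flow can be run end to end.

```python
def rollout(past_frames, num_future, cnn_encode, transformer_encode,
            transformer_decode, cnn_decode):
    """Stepwise prediction with pixel-space re-encoding of every predicted frame."""
    past_feats = [cnn_encode(f) for f in past_frames]
    memory = transformer_encode(past_feats)      # encoded history, computed once
    context = [past_feats[-1]]                   # most recent encoded latent
    predictions = []
    for _ in range(num_future):
        # One decoder pass per step; causal masking is assumed to live inside the decoder.
        outputs = transformer_decode(context, memory)
        frame = cnn_decode(outputs[-1])          # map the newest latent to pixels
        predictions.append(frame)
        context.append(cnn_encode(frame))        # re-encode before extending the context
    return predictions

# Toy scalar stand-ins (assumptions for illustration only).
toy_frames = rollout(
    past_frames=[1.0, 2.0],
    num_future=3,
    cnn_encode=lambda f: f * 2.0,
    transformer_encode=lambda feats: sum(feats) / len(feats),
    transformer_decode=lambda ctx, mem: [c + 1.0 for c in ctx],
    cnn_decode=lambda z: z / 2.0,
)
```

The key design point is the last line of the loop: the context is extended with the re-encoded frame rather than the raw decoder output, which is what keeps the latents from drifting over long rollouts.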
5. Computational Complexity and Efficiency
The design of VPTR-PAR avoids the quadratic cost that a standard Transformer incurs on the flattened spatiotemporal sequence. A standard Transformer with sequence length $n$ costs $O(n^2)$ in attention, whereas a VidHRFormer block computes spatial attention only within each of the non-overlapping patches of a frame and temporal attention only across the frame sequence, so neither term scales with the square of the full spatiotemporal length. Over the predicted time steps, the full AR method accumulates substantial overhead; the partial AR approach reduces this by amortizing multi-frame processing and parallel computation within each forward pass. On standard datasets, VPTR-PAR is approximately 1.2× faster than its fully AR counterpart, and significantly more efficient than ConvLSTM-based baselines (Ye et al., 2022).
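To make the scaling argument concrete, the sketch below counts query-key pairs for full spatiotemporal attention versus a factorized scheme. It is a back-of-the-envelope estimate under stated assumptions, not the paper's complexity formula: it ignores the feature dimension, feed-forward layers, and convolutions, assumes square patches and per-location temporal attention, and the function names and sizes are hypothetical.

```python
def full_attention_pairs(T, H, W):
    """Query-key pairs when all T*H*W tokens attend to each other."""
    n = T * H * W
    return n * n

def factorized_attention_pairs(T, H, W, patch):
    """Pairs for patch-local spatial attention plus per-location temporal attention."""
    tokens_per_patch = patch * patch
    patches_per_frame = (H // patch) * (W // patch)
    spatial = T * patches_per_frame * tokens_per_patch ** 2
    temporal = H * W * T * T
    return spatial + temporal

# Hypothetical sizes: 20 frames of 8x8 latent tokens, 4x4 patches.
full = full_attention_pairs(20, 8, 8)
factorized = factorized_attention_pairs(20, 8, 8, patch=4)
ratio = full / factorized
```

With these sizes the counts are 1,638,400 pairs for full attention and 46,080 for the factorized scheme. Pair counts are not a runtime prediction, so this ratio should not be confused with the measured 1.2× speedup, which compares the partially and fully autoregressive variants of the same factorized architecture.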
6. Empirical Results and Ablations
Quantitative assessment on KTH and BAIR motion datasets demonstrates that VPTR-PAR achieves performance closely matching full AR versions in most metrics (PSNR, SSIM, LPIPS), while providing meaningful acceleration:
| Method | KTH PSNR | KTH SSIM | KTH LPIPS | BAIR PSNR | BAIR SSIM | BAIR LPIPS |
|---|---|---|---|---|---|---|
| VPTR-PAR | 25.40 | 0.836 | 0.0848 | 15.94 | 0.745 | 0.1048 |
| VPTR-FAR | 26.13 | 0.859 | 0.0796 | 15.76 | 0.724 | 0.1107 |
| VPTR-NAR | 26.96 | 0.879 | 0.0861 | 17.77 | 0.813 | 0.0700 |
Ablations confirm that the separation of spatial and temporal attention is sufficient for effective modeling; removal of the RIP step results in severe degradation (LPIPS ≈ 0.193 vs. 0.0848), highlighting the importance of mitigating latent drift via pixel-space re-encoding (Ye et al., 2022).
7. Implementation and Practical Details
Salient implementation parameters:
- Latent features: CNN feature maps with fixed height, width, and channel dimensions
- Local spatial attention over non-overlapping square patches within each frame
- Encoder: 4 VidHRFormer layers; Decoder: 8 layers + cross-attention and additional feed-forward
- 8-head MHSA
- AdamW optimizer with gradient clipping
- Absolute 2D/1D positional encoding and optional learned RPE (+0.3 dB PSNR)
- Source code and pretrained models are available at https://github.com/XiYe20/VPTR
VPTR-PAR, as a partially autoregressive variant, demonstrates that efficient transformer architectures can closely approach or match the performance of fully AR models in video prediction with materially improved efficiency and inference speed, validating the utility of spatial–temporal factorization and causally masked decoding (Ye et al., 2022).