VPTR-PAR: Efficient Video Prediction

Updated 14 May 2026

VPTR-PAR is a partially autoregressive video prediction model that uses separated spatial and temporal attention to balance causality and computational efficiency.
The architecture leverages VidHRFormer blocks in an encoder-decoder structure with causal masking and cross-attention to integrate past frame features for future predictions.
Empirical results on KTH and BAIR datasets demonstrate competitive fidelity with a 1.2× speedup over full autoregressive designs, confirming its practical efficiency.

VPTR-PAR is a partially autoregressive video prediction architecture built from efficient Transformer modules. It aims to balance the strong temporal modeling of fully autoregressive (AR) models against the speed of non-autoregressive (NAR) approaches. The model predicts future video frames using separated spatial and temporal attention, applying causality constraints only along the temporal dimension. The result is competitive prediction fidelity with faster inference than fully AR designs, which makes VPTR-PAR a candidate for high-throughput predictive applications in computer vision and robotics (Ye et al., 2022).

1. Model Architecture and Attention Mechanisms

VPTR-PAR uses an encoder–decoder structure in which both parts are built from VidHRFormer blocks, which are designed for efficient local spatial–temporal factorization. The encoder $\mathcal{T}_E$ processes the features of past frames, while the decoder $\mathcal{T}_D$ generates future-frame features autoregressively, conditioning on the encoded history and on previously generated outputs.

Key characteristics:

Encoder ( $\mathcal{T}_E$ ): Receives latent features of $L$ preceding frames.
Decoder ( $\mathcal{T}_D$ ): Predicts $N$ future frames, using teacher forcing during training for parallelization and stepwise generation at inference. A causal mask restricts temporal self-attention in $\mathcal{T}_D$ , so each prediction can only use features of the current or earlier frames.
VidHRFormer Block: The central module alternates (a) local spatial multi-head self-attention (MHSA) within non-overlapping $K\times K$ patches of each frame, (b) convolutional feed-forward networks for spatial feature mixing, and (c) global temporal MHSA over frame sequences, optionally with learned relative positional encoding.
Cross-Attention in Decoder: Before the last spatial feed-forward step, the decoder’s latent tokens attend to encoder outputs, integrating summary information from the encoded past.

This architecture allows parallel processing in spatial regions while enforcing temporal causality in generation.
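As a rough illustration of the factorization described above, the sketch below groups token indices the way local spatial attention and global temporal attention would see them. It is not taken from the VPTR code base; the function names `spatial_groups` and `temporal_groups` and the toy sizes are assumptions made for this example.

```python
from typing import List, Tuple

Token = Tuple[int, int, int]  # (frame index t, row h, column w)


def spatial_groups(T: int, H: int, W: int, K: int) -> List[List[Token]]:
    """Group tokens into non-overlapping K x K patches within each frame."""
    if H % K or W % K:
        raise ValueError("H and W must be divisible by K")
    groups = []
    for t in range(T):
        for h0 in range(0, H, K):
            for w0 in range(0, W, K):
                groups.append([(t, h, w)
                               for h in range(h0, h0 + K)
                               for w in range(w0, w0 + K)])
    return groups


def temporal_groups(T: int, H: int, W: int) -> List[List[Token]]:
    """Group tokens that share a spatial location across all frames."""
    return [[(t, h, w) for t in range(T)]
            for h in range(H) for w in range(W)]


# Toy sizes (illustrative only): 3 frames of 4 x 4 tokens, 2 x 2 patches.
s = spatial_groups(T=3, H=4, W=4, K=2)
g = temporal_groups(T=3, H=4, W=4)
print(len(s), len(s[0]))  # 12 patches of 4 tokens each
print(len(g), len(g[0]))  # 16 locations, each a sequence of 3 frames
```

Spatial attention would run independently inside each group returned by `spatial_groups`, and temporal attention inside each group returned by `temporal_groups`; only the latter needs a causal mask in the decoder.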

2. Mathematical Formulation

The VPTR-PAR model factorizes the video prediction task as:

$p(X_{1:L+N}) = p(X_{1:L}) \cdot p(X_{L+1:L+N} \mid X_{1:L})$

with the conditional modeled autoregressively:

$p(X_{L+1:L+N} \mid X_{1:L}) = \prod_{t=L+1}^{L+N} p(X_t \mid X_{1:t-1})$

or, in chunked notation,

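$p(X_{L+1:L+N} \mid X_{1:L}) = \prod_{t=L+1}^{L+N} p(X_t \mid X_{1:L},\, X_{L+1:t-1})$

where $X_{1:L}$ is the observed chunk handled by the encoder and $X_{L+1:t-1}$ is the chunk of already generated frames fed to the decoder (empty when $t = L+1$).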
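As a concrete instance of the autoregressive factorization, take $L = 2$ past frames and $N = 3$ future frames (toy values chosen for illustration). The product then expands to three conditionals, each depending on every earlier frame:

$p(X_{3:5} \mid X_{1:2}) = p(X_3 \mid X_{1:2}) \cdot p(X_4 \mid X_{1:3}) \cdot p(X_5 \mid X_{1:4})$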

The attention operations are defined for both self- and cross-attention layers:

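$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V$

Here $Q$, $K$, and $V$ are the query, key, and value matrices (this $K$ is unrelated to the patch size $K \times K$), $d_k$ is the per-head key dimension, and $M$ is an additive mask. In the decoder's temporal self-attention, $M$ is the causal mask: zero where a step attends to itself or to earlier steps, and $-\infty$ elsewhere. In cross-attention, queries come from the decoder and keys and values come from the encoder output, with no mask needed.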
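A minimal sketch of masked scaled dot-product attention, the standard operation behind both self- and cross-attention layers, is given below in plain Python. It is a single-head toy version without learned projections; the helper names `causal_mask` and `masked_attention` are assumptions for this example and do not come from the VPTR repository.

```python
import math
from typing import List

Matrix = List[List[float]]


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when query step i may attend to key step j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_attention(q: Matrix, k: Matrix, v: Matrix,
                     mask: List[List[bool]]) -> Matrix:
    """Scaled dot-product attention; disallowed positions get zero weight."""
    d_k = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        # Masked positions receive a score of -inf, hence weight exp(-inf) = 0.
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k)
                  if mask[i][j] else -math.inf
                  for j, k_j in enumerate(k)]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v_j[c] for w, v_j in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


# Three time steps with 2-dimensional features (toy values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = masked_attention(x, x, x, causal_mask(3))
print(y[0])  # the first step can only attend to itself
```

For cross-attention the same function applies with `q` taken from the decoder, `k` and `v` from the encoder output, and a mask that is `True` everywhere.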

Spatial positional encodings may be absolute (sin-cos) or relative within patches; temporal encodings are fixed absolute 1D.

3. Training Objectives and Loss Functions

Training minimizes a composite loss over future frame predictions:

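Mean-squared pixel error (MSE):

$\mathcal{L}_{\mathrm{MSE}} = \sum_{t=L+1}^{L+N} \lVert \hat{X}_t - X_t \rVert_2^2$

Gradient-difference loss (GDL):

$\mathcal{L}_{\mathrm{GDL}} = \sum_{t=L+1}^{L+N} \sum_{i,j} \Big( \big|\, |X_t^{i,j} - X_t^{i-1,j}| - |\hat{X}_t^{i,j} - \hat{X}_t^{i-1,j}| \,\big|^{\alpha} + \big|\, |X_t^{i,j} - X_t^{i,j-1}| - |\hat{X}_t^{i,j} - \hat{X}_t^{i,j-1}| \,\big|^{\alpha} \Big)$

The total loss:

$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \mathcal{L}_{\mathrm{GDL}}$

where $\hat{X}_t$ is the predicted frame at time $t$, $X_t^{i,j}$ is the pixel at row $i$ and column $j$, and $\alpha$ is the GDL exponent; normalization constants are omitted.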

No auxiliary losses are employed beyond this sum (Ye et al., 2022).
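To make the two loss terms concrete, here is a small plain-Python sketch on single-channel frames. The equal weighting of the two terms, the exponent `alpha = 1`, and the per-pixel averaging in `mse` are assumptions for illustration; the function names are hypothetical and not taken from the VPTR code.

```python
from typing import List, Sequence

Frame = List[List[float]]  # a single-channel image as rows of pixel values


def mse(pred: Frame, target: Frame) -> float:
    """Mean of squared pixel differences."""
    diffs = [(p - t) ** 2
             for pr, tr in zip(pred, target)
             for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def gdl(pred: Frame, target: Frame, alpha: float = 1.0) -> float:
    """Gradient-difference loss: compare absolute image gradients."""
    total = 0.0
    for i in range(len(pred)):
        for j in range(len(pred[0])):
            if i > 0:  # vertical gradient
                gp = abs(pred[i][j] - pred[i - 1][j])
                gt = abs(target[i][j] - target[i - 1][j])
                total += abs(gt - gp) ** alpha
            if j > 0:  # horizontal gradient
                gp = abs(pred[i][j] - pred[i][j - 1])
                gt = abs(target[i][j] - target[i][j - 1])
                total += abs(gt - gp) ** alpha
    return total


def total_loss(preds: Sequence[Frame], targets: Sequence[Frame]) -> float:
    """Sum of MSE and GDL over all predicted frames (equal weights assumed)."""
    return sum(mse(p, t) + gdl(p, t) for p, t in zip(preds, targets))


target = [[0.0, 1.0], [0.0, 1.0]]  # sharp vertical edge
blurry = [[0.5, 0.5], [0.5, 0.5]]  # flat prediction with no edge
print(mse(blurry, target), gdl(blurry, target))  # 0.25 2.0
```

The flat prediction has a moderate MSE but a large GDL because it misses the edge entirely, which is the kind of blur the gradient term is meant to penalize.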

4. Inference Workflow and Causality Controls

At inference, the “recurrent inference over pixel” (RIP) approach is adopted to guard against feature-space drift. The pipeline proceeds as follows:

Past frames are encoded via a CNN.
The CNN encoder output passes through $\mathcal{T}_E$ to yield a temporal/spatial memory.
For each future time step $t = L+1, \dots, L+N$, the decoder receives the most recent encoded latent feature, along with previously predicted future latents, using one forward pass for each step. The prediction for step $t$ is mapped back to the pixel domain and re-encoded before being appended to the autoregressive context.

The sequence is processed in parallel within each pass, but causality is enforced by a mask that prohibits decoder attention to future steps.

VPTR-PAR enables a notable inference speedup compared to full autoregressive designs requiring $N$ passes across the full stack of transformer blocks.
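The inference steps above can be sketched as a short rollout loop. This is only a structural outline under stated assumptions: the four callables (`cnn_encode`, `cnn_decode`, `t_encoder`, `t_decoder`) are hypothetical stand-ins for the CNN autoencoder and the two Transformer stacks, and the toy lambdas below replace real networks.

```python
from typing import Callable, List, Sequence

Frame = List[float]   # stand-in for an image
Latent = List[float]  # stand-in for a CNN feature map


def rip_rollout(past_frames: Sequence[Frame],
                n_future: int,
                cnn_encode: Callable[[Frame], Latent],
                cnn_decode: Callable[[Latent], Frame],
                t_encoder: Callable[[List[Latent]], List[Latent]],
                t_decoder: Callable[[List[Latent], List[Latent]], Latent],
                ) -> List[Frame]:
    """Predict n_future frames, re-encoding every prediction from pixels."""
    past_latents = [cnn_encode(f) for f in past_frames]
    memory = t_encoder(past_latents)  # computed once, reused at each step
    context = [past_latents[-1]]      # starts from the most recent past latent
    predictions = []
    for _ in range(n_future):
        next_latent = t_decoder(context, memory)  # one decoder pass per step
        frame = cnn_decode(next_latent)           # back to pixel space
        predictions.append(frame)
        context.append(cnn_encode(frame))         # re-encode before appending
    return predictions


# Toy stand-ins so the loop can run without any neural network.
frames = rip_rollout(
    past_frames=[[0.0], [1.0]],
    n_future=3,
    cnn_encode=lambda f: [2.0 * x for x in f],
    cnn_decode=lambda z: [x / 2.0 for x in z],
    t_encoder=lambda zs: zs,
    t_decoder=lambda ctx, mem: [x + 2.0 for x in ctx[-1]],
)
print(frames)  # [[2.0], [3.0], [4.0]]
```

The key design point is the last line of the loop: the decoder never consumes its own latent output directly, only a latent re-derived from the decoded frame.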

5. Computational Complexity and Efficiency

The design of VPTR-PAR avoids the quadratic cost that a standard Transformer incurs on the flattened spatiotemporal sequence. For $T$ frames of $H \times W$ feature tokens, a standard Transformer with sequence length $THW$ costs $O\big((THW)^2\big)$, while a VidHRFormer block achieves:

$O\big(T \cdot P \cdot K^4 + HW \cdot T^2\big)$

where $P = HW / K^2$ is the number of non-overlapping patches per frame. The first term covers local spatial attention within each $K \times K$ patch, and the second covers temporal attention at each spatial location. Over $N$ time steps, the full AR method accumulates substantial overhead; the partial AR approach circumvents this by amortizing multi-frame processing and parallel computation within each forward pass. On standard datasets, VPTR-PAR is approximately 1.2× faster than its fully AR counterpart, and significantly more efficient than ConvLSTM-based baselines (Ye et al., 2022).
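A back-of-the-envelope comparison helps show why factorizing attention matters. The sketch below counts query-key pairs only, ignoring the feature dimension and constant factors; the symbols `T` (frames), `H` and `W` (feature-map height and width), `K` (patch side), and the toy sizes are assumptions for illustration rather than the model's reported configuration.

```python
def full_attention_pairs(T: int, H: int, W: int) -> int:
    """Query-key pairs when all T*H*W tokens attend to each other."""
    n = T * H * W
    return n * n


def factorized_attention_pairs(T: int, H: int, W: int, K: int) -> int:
    """Pairs for local K x K spatial attention plus per-location temporal attention."""
    patches_per_frame = (H // K) * (W // K)
    spatial = T * patches_per_frame * (K * K) ** 2  # each patch has K*K tokens
    temporal = H * W * T * T                        # one length-T sequence per location
    return spatial + temporal


# Toy sizes: 20 frames of 8 x 8 tokens with 4 x 4 patches.
full = full_attention_pairs(20, 8, 8)
fact = factorized_attention_pairs(20, 8, 8, 4)
print(full, fact, round(full / fact, 1))  # 1638400 46080 35.6
```

With these toy sizes the factorized scheme evaluates roughly 36 times fewer pairs, though actual runtime gains depend on the implementation and hardware.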

6. Empirical Results and Ablations

Quantitative assessment on KTH and BAIR motion datasets demonstrates that VPTR-PAR achieves performance closely matching full AR versions in most metrics (PSNR, SSIM, LPIPS), while providing meaningful acceleration:

Method	KTH PSNR	KTH SSIM	KTH LPIPS	BAIR PSNR	BAIR SSIM	BAIR LPIPS
VPTR-PAR	25.40	0.836	0.0848	15.94	0.745	0.1048
VPTR-FAR	26.13	0.859	0.0796	15.76	0.724	0.1107
VPTR-NAR	26.96	0.879	0.0861	17.77	0.813	0.0700

Ablations confirm that the separation of spatial and temporal attention is sufficient for effective modeling; removal of the RIP step results in severe degradation (LPIPS ≈ 0.193 vs. 0.0848), highlighting the importance of mitigating latent drift via pixel-space re-encoding (Ye et al., 2022).
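For readers less familiar with the PSNR figures above, the following sketch shows the standard definition of PSNR from mean-squared error and how a dB gap translates into an error ratio. The conversion holds exactly for a single image; applying it to dataset-averaged PSNR, as done in the last lines, is only a rough guide. The function names are chosen for this example.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for pixel values in [0, max_value]."""
    if mse <= 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_psnr_gap(gap_db: float) -> float:
    """Factor by which MSE grows when PSNR drops by gap_db."""
    return 10.0 ** (gap_db / 10.0)


print(round(psnr(0.01), 2))  # 20.0

# KTH PSNR gap between VPTR-FAR (26.13) and VPTR-PAR (25.40) from the table.
ratio = mse_ratio_from_psnr_gap(26.13 - 25.40)
print(round(ratio, 2))
```

Because PSNR is logarithmic, a gap of under 1 dB corresponds to a modest change in squared error, which is consistent with describing the two variants as close on this metric.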

7. Implementation and Practical Details

Salient implementation parameters:

Latent features: $\mathcal{T}_E$ 4, $\mathcal{T}_E$ 5, $\mathcal{T}_E$ 6
Patch size $\mathcal{T}_E$ 7 ( $\mathcal{T}_E$ 8 patches per frame)
Encoder: 4 VidHRFormer layers; Decoder: 8 layers + cross-attention and additional feed-forward
8-head MHSA with per-head dimension $\mathcal{T}_E$ 9
AdamW optimizer, learning rate $L$ 0, gradient clipping
Absolute 2D/1D positional encoding and optional learned RPE (+0.3 dB PSNR)
Source code and pretrained models are available at https://github.com/XiYe20/VPTR

VPTR-PAR, as a partially autoregressive variant, demonstrates that efficient transformer architectures can closely approach or match the performance of fully AR models in video prediction with materially improved efficiency and inference speed, validating the utility of spatial–temporal factorization and causally masked decoding (Ye et al., 2022).


References

Ye et al. (2022). Video Prediction by Efficient Transformers.
