VPTR-NAR: Efficient Non-Autoregressive Video Prediction

Updated 14 May 2026

VPTR-NAR is a non-autoregressive model that uses Transformer architectures to predict future video frames with minimal delay.
It contrasts with autoregressive methods by generating frames in parallel, thereby reducing error propagation and boosting inference speed.
On BAIR, VPTR-NAR reports the best LPIPS among the three VPTR variants (0.0700, against 0.1048 for VPTR-PAR and 0.1107 for VPTR-FAR), with competitive PSNR and SSIM.

The term VPTR-PAR (Partially Autoregressive Model) refers to the partially autoregressive variant in the VPTR family of Transformer-based video prediction architectures introduced by Ye & Bilodeau in "Video Prediction by Efficient Transformers" (Ye et al., 2022). The abbreviation PAR is also used for the recently proposed Physical Autoregressive Model for robotic visual decision-making (Song et al., 13 Aug 2025). In both, the defining principle is an efficient autoregressive factorization, either over predicted visual tokens or over joint vision-action ("physical") tokens, enabling rapid, temporally consistent future prediction. This article treats each line of work separately, according to its documented methodology.

1. Architectural Foundations and Model Components

VPTR-PAR (Video Prediction by Efficient Transformers)

VPTR-PAR is a hierarchical encoder–decoder Transformer for video prediction, constructed from VidHRFormer blocks, which employ spatially localized multi-head self-attention (MHSA) and temporal MHSA in alternation. The stack comprises:

Encoder $\mathcal{T}_E$ : Processes past-frame features as latent tensors, using four VidHRFormer layers. Each block sequences local spatial MHSA (on non-overlapping frame patches), a depth-wise convolutional feed-forward network (Conv FFN), and temporal MHSA (using absolute 1D positional encodings).
Decoder $\mathcal{T}_D$ : Generates future-frame features autoregressively, conditioned on both encoder outputs and previously decoded frames, via eight VidHRFormer layers. A crucial modification is the insertion (before the final Conv FFN) of a cross-attention layer, where the decoder queries access encoder outputs with a temporal causal mask enforcing proper autoregressive flow.

Key in VPTR-PAR is the partially autoregressive regime: at every decoder step, the features of the last observed frame and of all previously predicted frames are provided as input, so that each prediction is conditioned on the full sequence available so far. Teacher-forcing is employed during training; at inference, predicted frames are decoded, mapped to pixel space, and re-encoded to latent features for subsequent prediction.
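The temporal causal mask mentioned for the decoder can be illustrated with a minimal sketch. The function names `causal_mask` and `visible_steps`, and the convention that `True` means "attention allowed", are assumptions made for illustration and are not taken from the VPTR code.

```python
def causal_mask(num_steps):
    """Return a num_steps x num_steps boolean matrix.

    mask[q][k] is True when query step q may attend to key step k,
    i.e. when k is not in the future of q.
    """
    return [[k <= q for k in range(num_steps)] for q in range(num_steps)]


def visible_steps(mask, q):
    """List the key steps that query step q is allowed to attend to."""
    return [k for k, allowed in enumerate(mask[q]) if allowed]
```

For four steps, `visible_steps(causal_mask(4), 2)` returns `[0, 1, 2]`: the third step sees itself and everything before it, but not the fourth.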

PAR (Physical Autoregressive Model for Manipulation)

The Physical Autoregressive Model (PAR) adapts autoregressive video modeling to robotics by fusing action and visual representations into a single "physical token" per timestep (Song et al., 13 Aug 2025). Architecturally:

Transformer Backbone: PAR shares the causal Transformer trained for video generation (NOVA), with rotary positional encoding, and takes a sequence of tokens encoding (i) language/task, (ii) image frames, and (iii) low-level action chunks.
Physical Tokenization: Each timestep's physical token $P_n$ concatenates $K_O$ visual tokens (from a frozen 3D-VAE) and $K_A$ action tokens (MLP-encoded control vectors). An initial "Begin-Of-Action" token is used at $n=0$ .
Diffusion-Transformer Decoder: Continuous-valued frame and action tokens are rendered via separate DiT modules, each performing diffusion denoising in the VAE latent space, conditioned on the Transformer's output vector at each step.

Specialized causal masking and cross-modal attention rules enable efficient temporal reasoning and support an implicit inverse-kinematics regime, where predicted visual features for a step inform the prediction of corresponding actions.
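One way to picture the physical-token layout and its masking is the sketch below. The exact attention rules of PAR are not given in this article, so `may_attend` encodes one plausible reading and should be treated as an assumption: earlier steps are fully visible, visual tokens see only same-step visual tokens, and action tokens see the same-step visual tokens plus earlier-or-equal action tokens in their chunk. All function names are hypothetical.

```python
def token_layout(num_steps, k_o, k_a):
    """Return one (step, kind, index) tuple per token.

    Each step contributes k_o visual tokens followed by k_a action tokens,
    mirroring the concatenation that forms one physical token.
    """
    layout = []
    for n in range(num_steps):
        layout += [(n, "visual", i) for i in range(k_o)]
        layout += [(n, "action", i) for i in range(k_a)]
    return layout


def may_attend(query, key):
    """Assumed attention rule between two tokens (illustrative only)."""
    q_step, q_kind, q_idx = query
    k_step, k_kind, k_idx = key
    if k_step < q_step:
        return True                 # everything in earlier steps is visible
    if k_step > q_step:
        return False                # nothing from future steps is visible
    if q_kind == "visual":
        return k_kind == "visual"   # visuals see only same-step visuals
    if k_kind == "visual":
        return True                 # actions see the same-step visual tokens
    return k_idx <= q_idx           # actions are causal within the chunk


def build_mask(layout):
    """Boolean mask[q][k]: True when token q may attend to token k."""
    return [[may_attend(q, k) for k in layout] for q in layout]
```

Under these assumed rules, the action tokens of a step can read that step's predicted visual tokens, which is the direction of information flow the implicit inverse-kinematics description requires.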

2. Mathematical and Algorithmic Formulation

VPTR-PAR

The model implements a joint factorization over observed and predicted frames: $p(X_{1:L+N}) = p(X_{1:L}) \times p(X_{L+1:L+N} \mid X_{1:L}),$ with conditional prediction factorized as

$p(X_{L+1:L+N} \mid X_{1:L}) = \prod_{t=L+1}^{L+N} p(X_t \mid X_{1:t-1}).$

The encoder consumes past latent features $z_{1},\ldots,z_{L}$, while at decode step $t$ the decoder inputs are $[z_L,\ldots,z_{L+t-1}]$, from which the decoder predicts $\hat{z}_{L+t}$. The corresponding image $\hat{X}_{L+t}$ is synthesized by the DecoderCNN and then re-encoded for the next step.

VidHRFormer blocks apply:

Local Spatial MHSA: Within-patch attention for low computational cost.
Temporal MHSA: Attention along the temporal axis with causal masking in the decoder.
Cross-Attention: In the decoder, links decoder tokens with encoder memories.
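The cost advantage of within-patch attention can be made concrete with a small counting sketch. The helper names and the example sizes (an 8 x 8 feature map with 4 x 4 windows) are hypothetical and chosen only for illustration.

```python
def partition_windows(height, width, patch):
    """Group (row, col) positions of a height x width map into
    non-overlapping patch x patch windows, in row-major window order."""
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch")
    windows = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            windows.append([(r, c)
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return windows


def attention_pairs(height, width, patch):
    """Count query-key pairs for local (per-window) and global attention."""
    windows = partition_windows(height, width, patch)
    local_pairs = sum(len(window) ** 2 for window in windows)
    global_pairs = (height * width) ** 2
    return local_pairs, global_pairs
```

With these example sizes, `attention_pairs(8, 8, 4)` gives 1024 local pairs against 4096 global pairs, a fourfold reduction per frame.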

PAR

PAR factorizes the probability of the image and action sequences, given the task, as

$p(P_{0:N} \mid c) = \prod_{n=0}^{N} p(P_n \mid c, P_{<n}),$

where $c$ denotes the language/task tokens. Each physical token $P_n$ (visual and action tokens) is autoregressively predicted using the Transformer backbone.

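The DiT modules optimize a denoising loss for both image and action components: $\mathcal{L}(z, x) = \mathbb{E}_{\varepsilon, s}\left[\lVert \varepsilon - \varepsilon_\theta(x_s \mid s, z) \rVert^2\right],$ with diffusion process $x_s = \sqrt{\bar{\alpha}_s}\, x + \sqrt{1 - \bar{\alpha}_s}\, \varepsilon$, where $x$ is the clean frame or action token, $z$ the Transformer output used as conditioning, $s$ the diffusion step, and $\varepsilon \sim \mathcal{N}(0, I)$.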
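The denoising objective can be sketched in a few lines, assuming the standard noise-prediction formulation of diffusion models (forward noising followed by a squared error on the noise). The function names, and the use of a single scalar `alpha_bar` in place of a full noise schedule, are simplifications for illustration rather than details of the PAR implementation.

```python
import math
import random


def add_noise(x0, alpha_bar, rng):
    """Forward diffusion on a flat token x0:
    x_s = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, 1)."""
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_s = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
           for x, e in zip(x0, eps)]
    return x_s, eps


def denoising_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)
```

At `alpha_bar = 1.0` the token is returned unchanged, and at `alpha_bar = 0.0` it is replaced entirely by noise; a denoiser that predicts the noise exactly attains zero loss.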

3. Training Procedures and Loss Functions

VPTR-PAR

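Training employs a composite loss: $\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \mathcal{L}_{\mathrm{GDL}},$ where $\mathcal{L}_{\mathrm{MSE}}$ is the pixel-wise mean squared error and $\mathcal{L}_{\mathrm{GDL}}$ is the gradient-difference loss, which enforces sharpness in predicted frames. No auxiliary losses are required beyond the pixel MSE and GDL terms.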
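A minimal sketch of a pixel MSE plus gradient-difference loss on small grayscale images follows. The equal weighting of the two terms, the exponent `alpha=1.0`, and the unnormalized GDL sum are assumptions; the article does not state the weights or normalization used in VPTR.

```python
def mse(pred, target):
    """Mean squared error over a 2D image given as a list of rows."""
    total = sum((p - t) ** 2
                for pred_row, target_row in zip(pred, target)
                for p, t in zip(pred_row, target_row))
    return total / (len(pred) * len(pred[0]))


def gdl(pred, target, alpha=1.0):
    """Gradient-difference loss: penalize mismatches between the absolute
    neighbor differences of the prediction and of the target."""
    height, width = len(pred), len(pred[0])
    loss = 0.0
    for i in range(height):
        for j in range(width):
            if i > 0:  # vertical neighbor difference
                grad_pred = abs(pred[i][j] - pred[i - 1][j])
                grad_true = abs(target[i][j] - target[i - 1][j])
                loss += abs(grad_true - grad_pred) ** alpha
            if j > 0:  # horizontal neighbor difference
                grad_pred = abs(pred[i][j] - pred[i][j - 1])
                grad_true = abs(target[i][j] - target[i][j - 1])
                loss += abs(grad_true - grad_pred) ** alpha
    return loss


def composite_loss(pred, target):
    """Pixel MSE plus GDL with equal weights (weighting is an assumption)."""
    return mse(pred, target) + gdl(pred, target)
```

For a sharp edge `[[0, 1], [0, 1]]` predicted as the flat image `[[0.5, 0.5], [0.5, 0.5]]`, the MSE is 0.25 while the GDL adds 2.0, which shows how the gradient term penalizes blur that the MSE alone tolerates.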

Teacher-forcing is used so that, at each training step, ground-truth latents are provided as input to the decoder, ensuring stable prediction and mitigating compounding errors during rollout.
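The teacher-forcing arrangement amounts to shifting the ground-truth latent sequence by one step. The sketch below assumes the decoder input starts at the last observed latent, as described in the formulation section; the function name is hypothetical.

```python
def teacher_forcing_pairs(latents, num_past):
    """Split ground-truth latents z_1..z_{L+N} into encoder inputs,
    decoder inputs, and prediction targets for one training sequence."""
    if not 1 <= num_past < len(latents):
        raise ValueError("need at least one past and one future latent")
    encoder_inputs = latents[:num_past]        # z_1 ... z_L
    decoder_inputs = latents[num_past - 1:-1]  # z_L ... z_{L+N-1}
    targets = latents[num_past:]               # z_{L+1} ... z_{L+N}
    return encoder_inputs, decoder_inputs, targets
```

Because every decoder input is a ground-truth latent, all future positions can be trained in one pass under a causal mask, rather than one step at a time.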

PAR

PAR's training loss balances observation (frame) and action objectives by summing the denoising losses of the DiT decoders for both modalities: $\mathcal{L} = \mathcal{L}_{O} + \mathcal{L}_{A},$ where $\mathcal{L}_{O}$ and $\mathcal{L}_{A}$ denote the frame and action denoising losses. Full-sequence teacher-forcing enables efficient parallelized training. During training, causal masks are enforced in temporal, within-chunk (action), and cross-modal attention.

4. Inference and Computational Efficiency

VPTR-PAR

Inference uses a "recurrent over pixel" (RIP) strategy to suppress drift: after each predicted frame, the output is mapped back to pixel space and re-encoded before being fed as input to the decoder in the next step. This procedure stabilizes predictions over long horizons. At decode step $t$, the $t$ decoder inputs $[z_L,\ldots,z_{L+t-1}]$ are processed in a single forward pass to produce the next frame.
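The rollout described above can be sketched as a short loop. The three callables stand in for the frame encoder, the Transformer, and the DecoderCNN; their names and signatures, and the toy stand-ins below them, are hypothetical and serve only to make the control flow runnable.

```python
def rollout_rip(past_frames, num_future, encode, predict_next, decode):
    """Predict num_future frames, re-encoding each predicted frame from
    pixel space before it is fed back to the decoder."""
    past_latents = [encode(frame) for frame in past_frames]
    decoder_inputs = [past_latents[-1]]        # starts from z_L
    predictions = []
    for _ in range(num_future):
        next_latent = predict_next(past_latents, decoder_inputs)
        frame = decode(next_latent)            # latent -> pixel space
        predictions.append(frame)
        decoder_inputs.append(encode(frame))   # pixel space -> latent
    return predictions


# Toy stand-ins: "frames" are floats, and the "model" adds one per step.
def toy_encode(frame):
    return frame * 2.0


def toy_decode(latent):
    return latent / 2.0


def toy_predict_next(past_latents, decoder_inputs):
    return decoder_inputs[-1] + 2.0
```

With the toy stand-ins, `rollout_rip([1.0, 2.0], 3, toy_encode, toy_predict_next, toy_decode)` continues the sequence as `[3.0, 4.0, 5.0]`. A recurrent-in-latent variant would append `next_latent` directly instead of `encode(frame)`.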

Compared to the fully autoregressive VPTR-FAR variant, which recomputes the full stack for each new prediction, VPTR-PAR is approximately 1.2 times faster on KTH and substantially more efficient than convolutional LSTM baselines. Complexity per decode pass is dominated by spatial and temporal MHSA over patches and scales with the number of spatial patches and the number of windowed patches.

PAR

Inference with PAR deploys a Key-Value (KV) cache, so each incremental step only updates the newest token, avoiding recomputation over the full history. Cross-modal masking ensures correct temporal and causal flow between the image and action channels, and the parallel DiT decoders synthesize continuous-valued outputs for both.
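The idea behind the KV cache can be shown with a single-head toy attention in pure Python. This is a generic sketch of incremental causal attention, not PAR's implementation; the class and function names are hypothetical.

```python
import math


def attend(query, keys, values):
    """Single-head softmax attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)                          # subtract max for stability
    weights = [math.exp(score - top) for score in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm
            for d in range(dim)]


class KVCache:
    """Keeps keys and values of past tokens so each step adds one entry."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        """Append the newest key/value and attend over the whole history."""
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def full_recompute(queries, keys, values):
    """Reference: causal attention for every position, from scratch."""
    return [attend(queries[i], keys[:i + 1], values[:i + 1])
            for i in range(len(queries))]
```

Feeding tokens one at a time through `KVCache.step` yields the same outputs as `full_recompute`, but each step projects only the newest token instead of the entire history.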

Distinct RoPE embeddings for visual and control streams preserve temporal information at differing sampling rates, critical for fine-grained visuomotor rollouts.

5. Empirical Performance and Comparative Analysis

VPTR-PAR

KTH Dataset: For 20-frame prediction after 10 past inputs:

Model	PSNR	SSIM	LPIPS
VPTR-PAR	25.40	0.836	0.0848
VPTR-FAR	26.13	0.859	0.0796
VPTR-NAR	26.96	0.879	0.0861

BAIR Dataset: For 28-frame prediction after 2 past inputs:

Model	PSNR	SSIM	LPIPS
VPTR-PAR	15.94	0.745	0.1048
VPTR-FAR	15.76	0.724	0.1107
VPTR-NAR	17.77	0.813	0.0700

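On KTH, VPTR-PAR trails the fully autoregressive variant slightly in PSNR and SSIM (25.40 vs. 26.13; 0.836 vs. 0.859), while on BAIR it is marginally ahead (15.94 vs. 15.76; 0.745 vs. 0.724), in both cases at reduced inference time. It also outperforms the convolutional-LSTM state of the art in LPIPS.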

Ablation shows the importance of the “recurrent over pixel” loop: switching to “recurrent-in-latent” causes LPIPS to degrade from 0.0848 to ≈0.193.
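For readers unfamiliar with the PSNR column, the metric follows directly from the mean squared error. The sketch below uses the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$; the function name and the list-of-rows image format are choices made for illustration.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images (lists of rows)."""
    count = 0
    squared_error = 0.0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            squared_error += (p - t) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR and SSIM indicate better reconstructions, whereas lower LPIPS is better, which is why the ablation's increase in LPIPS is a degradation.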

PAR

On the ManiSkill benchmark, PAR achieves:

Method	PushCube	PickCube	StackCube	Avg.
PAR	100%	73%	48%	74%
RDT (1.3B, pretrain)	100%	77%	74%	84%
Diffusion Policy	88%	40%	80%	69%

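PAR matches RDT on PushCube (100%) and comes close on PickCube (73% vs. 77%), but trails on StackCube (48% vs. 74%), for an average of 74% against 84%. It does so against a much larger action-pretrained model while using only video pretraining for world dynamics. Qualitative predictions closely align with ground-truth robotic trajectories, confirmed by pixel- and token-level attention maps.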
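The averages in the table can be checked from the per-task success rates. The snippet below copies the three rows as given and assumes the reported average is the unweighted mean rounded to the nearest percent.

```python
success = {
    "PAR": [100, 73, 48],
    "RDT": [100, 77, 74],
    "Diffusion Policy": [88, 40, 80],
}


def average_rate(rates):
    """Unweighted mean of per-task success rates, rounded to a whole percent."""
    return round(sum(rates) / len(rates))


averages = {name: average_rate(rates) for name, rates in success.items()}
```

Under that assumption the computed averages are 74, 84, and 69, in agreement with the Avg. column.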

6. Practical Implementation and Design Considerations

VPTR-PAR

Latent Shape: […].
Patch Size: […], giving […] patches per frame.
Layers: Encoder: 4 VidHRFormer; Decoder: 8 + cross-attention + additional FFN.
Optimization: AdamW (lr […]), gradient clipping.
Positional Encoding: Absolute 2D (spatial), absolute 1D (temporal); optional RPE (+0.3 dB PSNR gain).
Open Source: https://github.com/XiYe20/VPTR

PAR

Physical Token Structure: Concatenates 3D-VAE visual latents and MLP-encoded actions; initial step uses a special "Begin-Of-Action" token.
Transformer Backbone: Depth and head configuration matched to the original NOVA model.
Diffusion Decoder: Two DiT modules operate in continuous latent space.
Causal Masking: Carefully tailored to enforce both temporal and intra-token dependencies, cross-modal semantics, and action–vision alignment.
Parallel Training: Enabled by teacher-forcing.
Inference Optimization: KV-cache deployed for incremental token generation.

7. Broader Context and Implications

The VPTR-PAR approach demonstrates that careful attention to architectural efficiency and autoregressive dependencies can yield video prediction models competitive with much deeper ConvLSTM variants (Ye et al., 2022). Its partial autoregressive design enables faster rollout with only marginal loss in performance relative to full AR models.

The PAR variant for robotic manipulation indicates that pretrained video dynamics, fused directly with autoregressively generated actions, can achieve strong performance on challenging manipulation tasks without action pretraining (Song et al., 13 Aug 2025). This suggests that large-scale video modeling can provide a general world-model foundation for visuomotor decision-making, reducing the need for extensive supervised datasets of low-level robot actions. A plausible implication is the emergence of a new class of compact, generalist visuomotor agents built atop pretrained video Transformer backbones, with fine-grained, efficient action decoding realized through partial or physical autoregressive modeling.


References (2)

Video Prediction by Efficient Transformers (2022)

Physical Autoregressive Model for Robotic Manipulation without Action Pretraining (2025)
