
VPTR-NAR: Efficient Non-Autoregressive Video Prediction

Updated 14 May 2026
  • VPTR-NAR is a non-autoregressive model that uses Transformer architectures to predict future video frames with minimal delay.
  • It contrasts with autoregressive methods by generating frames in parallel, thereby reducing error propagation and boosting inference speed.
  • Empirical evaluations on benchmarks like BAIR show that VPTR-NAR achieves competitive quality metrics, particularly excelling in LPIPS scores.

The term VPTR-PAR (Partially Autoregressive Model) refers to the partially autoregressive variant in the VPTR family of Transformer-based video prediction architectures introduced by Ye & Bilodeau in "Video Prediction by Efficient Transformers" (Ye et al., 2022). The abbreviation PAR also denotes the more recently proposed Physical Autoregressive Model for robotic visual decision-making (Song et al., 13 Aug 2025). In both, the defining principle is an efficient autoregressive factorization, either over predicted visual tokens or over joint vision-action ("physical") tokens, that enables rapid, temporally consistent future prediction. This article treats each line of work separately, following its documented methodology.

1. Architectural Foundations and Model Components

VPTR-PAR (Video Prediction by Efficient Transformers)

VPTR-PAR is a hierarchical encoder–decoder Transformer for video prediction, constructed from VidHRFormer blocks, which employ spatially localized multi-head self-attention (MHSA) and temporal MHSA in alternation. The stack comprises:

  • Encoder $\mathcal{T}_E$: Processes past-frame features as latent tensors, using four VidHRFormer layers. Each block sequences local spatial MHSA (on non-overlapping frame patches), a depth-wise convolutional feed-forward network (Conv FFN), and temporal MHSA (using absolute 1D positional encodings).
  • Decoder $\mathcal{T}_D$: Generates future-frame features autoregressively, conditioned on both encoder outputs and previously decoded frames, via eight VidHRFormer layers. A crucial modification is the insertion (before the final Conv FFN) of a cross-attention layer, where the decoder queries access encoder outputs with a temporal causal mask enforcing proper autoregressive flow.
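
The temporal causal mask mentioned for the decoder can be illustrated with a minimal sketch. It is not taken from the VPTR code base: the function names and the toy 4-step sequence are assumptions, and plain Python lists stand in for attention tensors.

```python
import math

def causal_mask(num_steps):
    """mask[i][j] is True when query step i may attend to key step j (j <= i)."""
    return [[j <= i for j in range(num_steps)] for i in range(num_steps)]

def masked_softmax(scores, mask):
    """Row-wise softmax; positions that the mask disallows get zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        peak = max(s for s, ok in zip(row, allowed) if ok)  # for numerical stability
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy example: uniform attention logits over 4 time steps (hypothetical values).
scores = [[0.0] * 4 for _ in range(4)]
attn = masked_softmax(scores, causal_mask(4))
```

With uniform logits, step 0 attends only to itself, while step 3 spreads its weight evenly over steps 0 to 3; no step places weight on a later one.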

Central to VPTR-PAR is the partial autoregressive regime: at every decoder step, the features of all previously predicted frames, together with those of the last observed frame, are provided as input, so that each prediction is conditioned on the full sequence available so far. Teacher forcing is employed during training; at inference, predicted frames are decoded, mapped to pixel space, and re-encoded to latent features for subsequent prediction.

PAR (Physical Autoregressive Model for Manipulation)

The Physical Autoregressive Model (PAR) adapts autoregressive video modeling to robotics by fusing action and visual representations into a single "physical token" per timestep (Song et al., 13 Aug 2025). Architecturally:

  • Transformer Backbone: PAR shares the causal Transformer trained for video generation (NOVA), with rotary positional encoding, and takes a sequence of tokens encoding (i) language/task, (ii) image frames, and (iii) low-level action chunks.
  • Physical Tokenization: Each timestep's physical token $P_n$ concatenates $K_O$ visual tokens (from a frozen 3D-VAE) and $K_A$ action tokens (MLP-encoded control vectors). An initial "Begin-Of-Action" token is used at $n=0$.
  • Diffusion-Transformer Decoder: Continuous-valued frame and action tokens are rendered via separate DiT modules, each performing diffusion denoising in the VAE latent space, conditioned on the Transformer's output vector at each step.

Specialized causal masking and cross-modal attention rules enable efficient temporal reasoning and support an implicit inverse-kinematics regime, where predicted visual features for a step inform the prediction of corresponding actions.
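
As a rough illustration of physical tokenization, the sketch below concatenates visual and action tokens for one timestep. The sizes `K_O`, `K_A`, `DIM`, the zero-valued `BOA` embedding, and the choice to fill the action slots with `BOA` at the first step are all assumptions made for the example; the source only states that a "Begin-Of-Action" token is used at n = 0.

```python
K_O, K_A, DIM = 3, 2, 4      # hypothetical token counts and embedding width
BOA = [0.0] * DIM            # hypothetical "Begin-Of-Action" embedding

def physical_token(visual_tokens, action_tokens=None):
    """Concatenate K_O visual tokens with K_A action tokens for one timestep.

    When no action is available (assumed here to be the n = 0 case),
    the action slots are filled with the BOA embedding.
    """
    if len(visual_tokens) != K_O:
        raise ValueError("expected K_O visual tokens")
    if action_tokens is None:
        action_tokens = [BOA] * K_A
    if len(action_tokens) != K_A:
        raise ValueError("expected K_A action tokens")
    return list(visual_tokens) + list(action_tokens)

frame = [[1.0] * DIM for _ in range(K_O)]    # stand-in for 3D-VAE latents
action = [[0.5] * DIM for _ in range(K_A)]   # stand-in for MLP-encoded controls
first = physical_token(frame)                # n = 0: BOA in the action slots
later = physical_token(frame, action)        # n > 0: real action tokens
```

Each resulting token holds `K_O + K_A` embeddings, so the backbone sees one fixed-size unit per timestep.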

2. Mathematical and Algorithmic Formulation

VPTR-PAR

The model implements a joint factorization over observed and predicted frames:

$$p(X_{1:L+N}) = p(X_{1:L}) \times p(X_{L+1:L+N} \mid X_{1:L}),$$

with the conditional prediction factorized as

$$p(X_{L+1:L+N} \mid X_{1:L}) = \prod_{t=L+1}^{L+N} p(X_t \mid X_{1:t-1}).$$
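
To make the factorization concrete, here is a worked instance under assumed toy sizes, $L = 2$ observed frames and $N = 3$ predicted frames (these values are illustrative, not from the paper):

```latex
p(X_{3:5} \mid X_{1:2})
  = p(X_3 \mid X_{1:2}) \, p(X_4 \mid X_{1:3}) \, p(X_5 \mid X_{1:4})
```

Each factor conditions on every earlier frame, observed or predicted, which is what the decoder's causal masking enforces.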

The encoder consumes the past latent features $z_1,\ldots,z_L$, while at decode step $t$ the inputs are $[z_L,\ldots,z_{L+t-1}]$, allowing the decoder to predict $\hat{z}_{L+t}$. The corresponding image $\hat{X}_{L+t}$ is synthesized by the DecoderCNN and then re-encoded for the next step.

VidHRFormer blocks apply:

  • Local Spatial MHSA: Within-patch attention for low computational cost.
  • Temporal MHSA: Attention along the temporal axis with causal masking in the decoder.
  • Cross-Attention: In the decoder, links decoder tokens with encoder memories.
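
The cost benefit of within-patch attention can be sketched by counting query-key pairs. The example below is a simplified illustration: the 8×8 feature map and 4×4 window are assumed sizes, and the function names are hypothetical.

```python
def partition_windows(height, width, patch):
    """Group (row, col) positions into non-overlapping patch x patch windows."""
    if height % patch or width % patch:
        raise ValueError("feature map must be divisible by the patch size")
    windows = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            windows.append([(r, c)
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return windows

def attention_pairs(height, width, patch):
    """Return (local, global) counts of query-key pairs for one frame."""
    windows = partition_windows(height, width, patch)
    local_pairs = sum(len(w) ** 2 for w in windows)   # attention inside each window
    global_pairs = (height * width) ** 2              # attention over all positions
    return local_pairs, global_pairs

windows = partition_windows(8, 8, 4)          # assumed 8x8 map, 4x4 windows
local_pairs, global_pairs = attention_pairs(8, 8, 4)
```

For these assumed sizes, local attention evaluates 1,024 pairs instead of 4,096, a fourfold reduction.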

PAR

PAR models the joint probability of task, image, and action sequences as

[equation missing]

Each physical token $P_n$ (visual and action tokens) is predicted autoregressively by the Transformer backbone.

The DiT modules optimize a denoising loss for both the image and the action components; the loss and the forward diffusion process are given by [equations missing].

3. Training Procedures and Loss Functions

VPTR-PAR

Training employs a composite loss that combines a pixel-wise MSE term with the gradient-difference loss $\mathcal{L}_{\mathrm{GDL}}$, which enforces sharpness in predicted frames. No auxiliary losses are required beyond these two terms.
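
The two loss terms can be sketched on a toy image. The gradient-difference loss below follows the commonly used formulation (absolute differences of neighbouring-pixel gradients, with exponent `alpha`); the `gdl_weight` parameter and the 2×2 images are assumptions for illustration, not values from the paper.

```python
def mse(pred, target):
    """Mean squared error over all pixels of a 2D image (list of rows)."""
    count = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row)) / count

def gdl(pred, target, alpha=1.0):
    """Gradient-difference loss: penalizes mismatched vertical/horizontal edges."""
    total = 0.0
    for i in range(len(pred)):
        for j in range(len(pred[0])):
            if i > 0:  # vertical neighbour
                total += abs(abs(target[i][j] - target[i - 1][j])
                             - abs(pred[i][j] - pred[i - 1][j])) ** alpha
            if j > 0:  # horizontal neighbour
                total += abs(abs(target[i][j] - target[i][j - 1])
                             - abs(pred[i][j] - pred[i][j - 1])) ** alpha
    return total

def composite_loss(pred, target, gdl_weight=1.0):
    """MSE plus weighted GDL; the weight is a hypothetical parameter."""
    return mse(pred, target) + gdl_weight * gdl(pred, target)

sharp = [[0.0, 1.0], [0.0, 1.0]]     # target with a vertical edge
blurry = [[0.5, 0.5], [0.5, 0.5]]    # prediction that averages the edge away
```

On this toy pair the blurry prediction has a modest MSE of 0.25 but a GDL of 2.0, which shows how the gradient term penalizes lost edges that MSE alone tolerates.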

Teacher forcing is used so that, at each training step, ground-truth latents are provided as input to the decoder. This stabilizes training, since errors in earlier predictions cannot compound within a training sequence.
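
A minimal sketch of how teacher forcing pairs decoder inputs with targets, assuming the decoder inputs start at the last observed latent. The function name and the string placeholders for latents are hypothetical.

```python
def teacher_forcing_pairs(latents, num_past):
    """Split ground-truth latents z_1..z_{L+N} into decoder inputs and targets.

    Inputs are z_L..z_{L+N-1}; targets are the same sequence shifted by one
    step, z_{L+1}..z_{L+N}, so every position is trained in parallel.
    """
    if not 1 <= num_past < len(latents):
        raise ValueError("need at least one past and one future latent")
    decoder_inputs = latents[num_past - 1:-1]
    targets = latents[num_past:]
    return decoder_inputs, targets

latents = ["z1", "z2", "z3", "z4", "z5"]      # toy case: L = 2, N = 3
decoder_inputs, targets = teacher_forcing_pairs(latents, num_past=2)
```

Here the decoder receives `z2, z3, z4` and is trained to output `z3, z4, z5`, all taken from ground truth rather than from its own predictions.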

PAR

PAR's training loss balances observation (frame) and action objectives by summing the denoising losses from the DiT decoders for the two modalities. Full-sequence teacher forcing enables efficient parallelized training. During training, causal masks are enforced in temporal, within-chunk (action), and cross-modal attention.

4. Inference and Computational Efficiency

VPTR-PAR

Inference uses a "recurrent over pixel" (RIP) strategy to suppress drift: after each predicted frame, the output is mapped back to pixel space and re-encoded before being fed as input to the decoder in the next step. This procedure stabilizes predictions over long horizons. All decoder inputs for the frames available so far are processed in a single forward pass per predicted frame.
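
The rollout loop can be sketched as follows. This is a schematic under stated assumptions, not the released implementation: `encode`, `predict_next`, and `decode` are caller-supplied stand-ins for the frame encoder, the Transformer, and the DecoderCNN, and the toy lambdas operate on integers instead of images.

```python
def rollout(past_frames, num_future, encode, predict_next, decode):
    """Recurrent-over-pixel rollout with stand-in modules."""
    memory = [encode(frame) for frame in past_frames]   # encoder input z_1..z_L
    decoder_inputs = [memory[-1]]                        # start from z_L
    predicted_frames = []
    for _ in range(num_future):
        z_hat = predict_next(memory, decoder_inputs)     # one pass over all inputs so far
        frame = decode(z_hat)                            # map the latent to pixel space
        predicted_frames.append(frame)
        decoder_inputs.append(encode(frame))             # re-encode before the next step
    return predicted_frames

# Toy stand-ins (hypothetical): latents are doubled pixel values,
# and the "model" predicts the next frame as previous frame + 1.
frames = rollout(
    past_frames=[1, 2, 3],
    num_future=3,
    encode=lambda frame: frame * 2,
    predict_next=lambda memory, inputs: inputs[-1] + 2,
    decode=lambda latent: latent // 2,
)
```

The key design point is the `encode(frame)` call inside the loop: the next decoder input comes from the re-encoded pixel output, not from the predicted latent directly.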

Compared to the fully autoregressive VPTR-FAR variant, which recomputes the full stack for each new prediction, VPTR-PAR is approximately 1.2 times faster on KTH and substantially more efficient than convolutional LSTM baselines. Complexity is dominated by spatial and temporal MHSA over patches, scaling as

[complexity expression missing]

per decode pass, expressed in terms of the number of spatial patches and the number of windowed patches.

PAR

Inference with PAR deploys a Key-Value (KV) cache, so each incremental step only updates the newest token, avoiding recomputation over the full history. Cross-modal masking ensures correct temporal and causal flow between the image and action channels, and the parallel DiT decoders synthesize continuous-valued outputs for both.
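
The idea behind a KV cache can be shown with a single attention head in plain Python. This is a generic sketch, not PAR's implementation: identity projections are assumed (each token serves as its own query, key, and value), and the class and function names are hypothetical.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention of one query over stored keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * value[d] for e, value in zip(exps, values))
            for d in range(len(values[0]))]

class KVCache:
    """Keeps keys/values of earlier tokens so each step handles only the new one."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy token embeddings
cache = KVCache()
cached_outputs = [cache.step(tok, tok, tok) for tok in tokens]
full_last = attend(tokens[-1], tokens, tokens)   # recomputation over full history
```

The output for the newest token is identical whether computed incrementally from the cache or by re-running attention over the whole history; the cache only avoids recomputing keys and values for earlier tokens.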

Distinct RoPE embeddings for visual and control streams preserve temporal information at differing sampling rates, critical for fine-grained visuomotor rollouts.

5. Empirical Performance and Comparative Analysis

VPTR-PAR

KTH Dataset: For 20-frame prediction after 10 past inputs:

| Model | PSNR (dB) ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| VPTR-PAR | 25.40 | 0.836 | 0.0848 |
| VPTR-FAR | 26.13 | 0.859 | 0.0796 |
| VPTR-NAR | 26.96 | 0.879 | 0.0861 |

BAIR Dataset: For 28-frame prediction after 2 past inputs:

| Model | PSNR (dB) ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| VPTR-PAR | 15.94 | 0.745 | 0.1048 |
| VPTR-FAR | 15.76 | 0.724 | 0.1107 |
| VPTR-NAR | 17.77 | 0.813 | 0.0700 |

VPTR-PAR achieves PSNR and SSIM comparable to the fully autoregressive variant (0.73 dB lower PSNR on KTH, 0.18 dB higher on BAIR) while reducing inference time, and it outperforms convolutional-LSTM state-of-the-art models in LPIPS.

Ablation shows the importance of the “recurrent over pixel” loop: switching to “recurrent-in-latent” causes LPIPS to degrade from 0.0848 to ≈0.193.

PAR

On the ManiSkill benchmark, PAR achieves:

| Method | PushCube | PickCube | StackCube | Avg. |
|---|---|---|---|---|
| PAR | 100% | 73% | 48% | 74% |
| RDT (1.3B, pretrain) | 100% | 77% | 74% | 84% |
| Diffusion Policy | 88% | 40% | 80% | 69% |

PAR matches RDT on PushCube (100%) and comes within 4 points on PickCube (73% vs. 77%), but trails it on StackCube (48% vs. 74%), even though RDT is a much larger action-pretrained model and PAR uses only video pretraining for world dynamics. Qualitative predictions closely align with ground-truth robotic trajectories, confirmed by pixel- and token-level attention maps.
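
The reported averages can be checked directly from the per-task success rates. The snippet assumes the "Avg." column is the unweighted mean of the three tasks, rounded to the nearest percent; the variable names are illustrative.

```python
# Per-task success rates (%) as reported: PushCube, PickCube, StackCube.
results = {
    "PAR": (100, 73, 48),
    "RDT (1.3B, pretrain)": (100, 77, 74),
    "Diffusion Policy": (88, 40, 80),
}

def average_success(rates):
    """Unweighted mean of the task success rates, rounded to a whole percent."""
    return round(sum(rates) / len(rates))

averages = {method: average_success(rates) for method, rates in results.items()}
```

Under that assumption the computed means agree with the reported averages for all three methods.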

6. Practical Implementation and Design Considerations

VPTR-PAR

  • Latent Shape: [value missing].
  • Patch Size: [value missing], giving [value missing] patches per frame.
  • Layers: Encoder: 4 VidHRFormer; Decoder: 8 + cross-attention + additional FFN.
  • Optimization: AdamW (lr [value missing]), gradient clipping.
  • Positional Encoding: Absolute 2D (spatial), absolute 1D (temporal); optional RPE (+0.3 dB PSNR gain).
  • Open Source: https://github.com/XiYe20/VPTR

PAR

  • Physical Token Structure: Concatenates 3D-VAE visual latents and MLP-encoded actions; initial step uses a special "Begin-Of-Action" token.
  • Transformer Backbone: Depth and head configuration matched to the original NOVA model.
  • Diffusion Decoder: Two DiT modules operate in continuous latent space.
  • Causal Masking: Carefully tailored to enforce both temporal and intra-token dependencies, cross-modal semantics, and action–vision alignment.
  • Parallel Training: Enabled by teacher-forcing.
  • Inference Optimization: KV-cache deployed for incremental token generation.

7. Broader Context and Implications

The VPTR-PAR approach demonstrates that careful attention to architectural efficiency and autoregressive dependencies can yield video prediction models competitive with much deeper ConvLSTM variants (Ye et al., 2022). Its partial autoregressive design enables faster rollout with only marginal loss in performance relative to full AR models.

The PAR variant for robotic manipulation indicates that pretrained video dynamics, fused directly with autoregressively generated actions, can achieve strong performance on challenging manipulation tasks without action pretraining (Song et al., 13 Aug 2025). This suggests that large-scale video modeling can provide a universal world-model foundation for visuomotor decision-making, reducing the need for extensive supervised datasets of low-level robot actions. A plausible implication is the emergence of a new class of compact, generalist visuomotor agents built atop pretrained video Transformer backbones, with fine-grained, efficient action decoding realized through partial or physical autoregressive modeling.
