
Anticipative Video Transformer (AVT)

Updated 7 March 2026
  • The paper presents a novel attention-based framework that leverages causal self-attention for accurate action anticipation.
  • It employs a transformer architecture with a frame-level backbone and causal decoder to effectively model temporal progression.
  • Results show state-of-the-art performance on benchmarks like EpicKitchens and EGTEA Gaze+, highlighting its robust action forecasting capabilities.

An Anticipative Video Transformer (AVT) is a purely attention-based, end-to-end deep learning framework designed to model temporal progression in video streams for action anticipation. AVT leverages causal self-attention to maintain the sequential ordering of observed frames while capturing long-range dependencies, both of which are crucial for accurately forecasting imminent actions. The model is characterized by a transformer architecture that ingests sequential frame-level representations, applies temporally masked self-attention, and jointly optimizes multiple anticipative objectives. Results demonstrate AVT's state-of-the-art performance across major action anticipation benchmarks such as EpicKitchens-55, EpicKitchens-100, EGTEA Gaze+, and 50-Salads (Girdhar et al., 2021).

1. Motivation and Problem Setting

Action anticipation requires inferring the most probable next action based on the observed progression of prior actions in a temporally ordered video stream. This is substantively different from standard action recognition, which only classifies ongoing actions and does not reason over progression or structure in past events. AVT addresses the shortcomings of preceding LSTM-based and aggregation models by leveraging attention mechanisms that (1) encode all observed frames while preserving their temporal order and (2) allow direct modeling of long-range temporal dependencies without the limitations of recurrence or fixed-size pooling (Girdhar et al., 2021).

2. Architecture Overview

AVT's architecture consists of two core modules:

  • Frame-level backbone ($B$): Each frame $X_t$ is converted into a feature vector $z_t = B(X_t)$, where $B$ can be a Vision Transformer (ViT-B/16) or a standard CNN backbone (TSN, CSN). The ViT variant splits each $224\times224$ frame into patches, embeds them, adds spatial position embeddings, and processes them via stacked multi-head self-attention layers.
  • Causal transformer decoder ($D$): The sequence $\{z_1, \dots, z_T\}$ is input to a transformer decoder with $L$ stacked layers. Temporal positional encodings are added, and each decoder step $t$ attends only to $\{z_1, \dots, z_t\}$ via causal masking, enforcing online prediction constraints. Each decoder output $\hat z_t$ predicts the next frame's feature $z_{t+1}$, and a linear head $\theta$ projects $\hat z_t$ to class logits $\hat y_t$ for action anticipation at $T+1$.

Self-attention in both the backbone and the decoder is implemented as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}} + M\right)V,$$
where $M$ is an additive causal mask whose entries above the diagonal are $-\infty$, so each query position attends only to the current and preceding frames.
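
The following PyTorch sketch illustrates this temporally masked attention on a toy feature sequence; the single-head formulation, the dimensions, and the weight initialization are illustrative simplifications, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(z, w_q, w_k, w_v):
    """Single-head temporally masked self-attention over frame features z of shape (T, d).

    The additive mask M puts -inf above the diagonal, so the output at step t
    depends only on frames 1..t, matching softmax(QK^T / sqrt(d_k) + M) V.
    """
    T, d = z.shape
    q, k, v = z @ w_q, z @ w_k, z @ w_v
    scores = (q @ k.T) / (d ** 0.5)                                   # (T, T) attention logits
    mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)  # causal mask M
    return F.softmax(scores + mask, dim=-1) @ v                       # (T, d) causal outputs

# Toy usage: 8 observed frames with 64-dimensional features (sizes are illustrative).
T, d = 8, 64
z = torch.randn(T, d)
w_q, w_k, w_v = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
out = causal_self_attention(z, w_q, w_k, w_v)
print(out.shape)  # torch.Size([8, 64])
```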

3. Temporal Modeling and Dependency Capture

AVT maintains full sequential frame ordering by preserving the time index throughout encoding and masking attention in the decoder. Unlike aggregation methods, no pooling over the temporal axis is performed, ensuring that the learned representation $z_t$ encodes only the observed history up to time $t$. Long-range dependencies are captured efficiently, as each query at time $t$ can access all previous frames directly within an attention layer, thereby overcoming the vanishing-gradient and limited-memory issues inherent to recurrence-driven models. Multi-layer stacking further enables hierarchical, high-order temporal dependency modeling (Girdhar et al., 2021).
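
As a concrete check of this property, the short sketch below (using a hypothetical single attention layer with random weights, not AVT's trained decoder) confirms that outputs at earlier steps are unchanged when later frames are perturbed, which is exactly what permits online anticipation.

```python
import torch
from torch import nn

# Illustrative check: with a causal mask, the representation at step t is
# unaffected by frames after t.
torch.manual_seed(0)
T, d = 8, 64
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)  # block future frames

z = torch.randn(1, T, d)
z_perturbed = z.clone()
z_perturbed[:, 5:] += torch.randn(1, T - 5, d)  # change only the last three "future" frames

out, _ = attn(z, z, z, attn_mask=mask, need_weights=False)
out_p, _ = attn(z_perturbed, z_perturbed, z_perturbed, attn_mask=mask, need_weights=False)
print(torch.allclose(out[:, :5], out_p[:, :5], atol=1e-6))  # True: steps 1..5 unchanged
```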

4. Anticipative Training Objectives

AVT employs multi-task, joint training with three distinct losses:

  • Next-action cross-entropy loss:

$$L_{\text{next}} = -\log\,[\hat y_T]_{c_{T+1}}$$

This optimizes the final decoder representation to predict the action at $T+1$.

  • Predictive feature regression loss:

$$L_{\text{feat}} = \sum_{t=1}^{T-1} \|\hat z_t - z_{t+1}\|_2^2$$

Enforces that decoder outputs at each time step are predictive of the true future frame features.

  • Intermediate classification loss:

$$L_{\text{cls}} = \sum_{t=1}^{T-1} \begin{cases} -\log\,[\hat y_t]_{c_{t+1}}, & c_{t+1} \ge 0 \\ 0, & c_{t+1} = -1 \end{cases}$$

Directly encourages correct action labeling from intermediate predictions where labels are available.

The final objective is the sum
$$L = L_{\text{next}} + L_{\text{cls}} + L_{\text{feat}}.$$
Empirically, both auxiliary losses are essential for maximizing anticipative gains: ablation studies confirm that including $L_{\text{cls}}$ and $L_{\text{feat}}$ improves recall@5 by 1.3–3.5% over naïve next-step-only training (Girdhar et al., 2021).
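
A minimal sketch of how the three terms could be combined in PyTorch is given below; the tensor layout, the shifted-label convention (with -1 marking unlabeled steps), and the summed reductions used to mirror the written equations are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def avt_loss(logits, feats_pred, feats_true, next_labels):
    """Joint anticipative objective L = L_next + L_cls + L_feat (illustrative).

    logits:      (T, C) per-step class scores \hat{y}_t
    feats_pred:  (T, d) decoder outputs \hat{z}_t
    feats_true:  (T, d) backbone features z_t
    next_labels: (T,)   shifted labels c_{t+1}; -1 marks steps without a label
    """
    # L_next: the last decoder step must predict the action at T+1.
    l_next = F.cross_entropy(logits[-1:], next_labels[-1:])

    # L_feat: \hat{z}_t regresses the true next-frame feature z_{t+1} for t = 1..T-1.
    l_feat = F.mse_loss(feats_pred[:-1], feats_true[1:], reduction="sum")

    # L_cls: intermediate supervision, skipping steps whose label c_{t+1} is -1.
    l_cls = F.cross_entropy(logits[:-1], next_labels[:-1], ignore_index=-1, reduction="sum")

    return l_next + l_cls + l_feat

# Toy usage (shapes only): 8 observed steps, 10 classes, 64-dim features.
T, C, d = 8, 10, 64
labels = torch.tensor([2, -1, -1, 4, -1, 1, -1, 7])  # last entry stands in for c_{T+1}
loss = avt_loss(torch.randn(T, C), torch.randn(T, d), torch.randn(T, d), labels)
```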

5. Experimental Protocols and Results

AVT was evaluated on the following benchmarks:

| Dataset | Frames / Window | #Actions | τₐ (anticipation) | Metric | AVT Result |
| --- | --- | --- | --- | --- | --- |
| EpicKitchens-100 | 10 / 10 s | 3,807 | 1 s | Recall@5 | 14.9% (ViT, RGB only); 16.7% (AVT⁺ fusion) |
| EpicKitchens-55 | 10 / 10 s | 2,513 | 1 s | Top-1, Top-5 action | 14.4% (irCSN152 + IG65M) |
| EGTEA Gaze+ | 10 / 5 s | 106 | 0.5 s | Top-1 / class-mean | 43.0% (top-1), 35.2% (class-mean) |
| 50-Salads | 10 / 10 s | 17 | 1 s | Top-1 action | 48.0% |

AVT consistently outperformed RULSTM, ActionBanks, and baseline fusion models. It was the top performer in the EpicKitchens-100 CVPR'21 action anticipation challenge, reaching a class-mean recall@5 of 16.7% with RGB+OBJ fusion, compared to a previous best of 14.0% (Girdhar et al., 2021).
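
For reference, class-mean recall@5 (the headline EpicKitchens-100 anticipation metric) averages the per-class recall of top-5 predictions. A rough sketch of this computation follows; it ignores details of the official evaluation protocol, such as restricting the average to many-shot classes.

```python
import torch

def class_mean_recall_at_k(scores, labels, k=5):
    """Mean over classes of the fraction of samples whose true class is in the top-k.

    scores: (N, C) predicted class scores; labels: (N,) ground-truth class indices.
    """
    topk = scores.topk(k, dim=1).indices                # (N, k) top-k class ids per sample
    hit = (topk == labels[:, None]).any(dim=1).float()  # (N,) 1.0 if true class is in top-k
    recalls = [hit[labels == c].mean() for c in labels.unique()]
    return torch.stack(recalls).mean().item()

# Toy check: 4 samples over 6 classes (illustrative values, not benchmark numbers).
scores = torch.rand(4, 6)
labels = torch.tensor([0, 2, 2, 5])
print(class_mean_recall_at_k(scores, labels, k=5))
```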

6. Empirical Analysis and Interpretability

AVT’s temporal attention heads exhibited interpretable patterns: spatial attention in the ViT backbone localized hands and objects without supervision, while temporal attention emphasized long-range dependencies for actions requiring extended context (e.g., “open fridge” after “gather items”), and short-range attention for actions triggered by immediate cues (e.g., “turn off tap”). Rollout experiments, wherein the model recursively predicts multiple future action steps, demonstrated emergent chaining (“action schema” discovery, e.g. wash → dry hand → put → close), indicating the model's capacity to synthesize plausible extended action sequences (Girdhar et al., 2021).
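
Such a rollout can be sketched as a simple autoregressive loop; the `decoder` and `classifier` callables and their shapes below are hypothetical stand-ins for the trained AVT modules, not the authors' code.

```python
import torch

@torch.no_grad()
def rollout(decoder, classifier, feats, steps=4):
    """Recursively anticipate several future actions (illustrative loop).

    decoder:    maps a (1, T, d) feature sequence to (1, T, d) predicted next-step features
    classifier: linear head mapping a (1, d) feature to class logits
    feats:      (1, T, d) observed frame features from the backbone
    """
    predicted = []
    for _ in range(steps):
        z_hat = decoder(feats)[:, -1]                        # \hat z_T: predicted next feature
        predicted.append(classifier(z_hat).argmax(-1).item())
        feats = torch.cat([feats, z_hat[:, None]], dim=1)    # feed the prediction back in
    return predicted  # e.g. a chain such as wash -> dry hand -> put -> close
```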

7. Extensions, Limitations, and Future Directions

AVT's design is agnostic to the specific visual backbone and can be extended to incorporate multiple modalities (audio, object detections), as demonstrated in follow-on work using mid-level modality fusion transformers (Zhong et al., 2022). Open challenges identified include scaling to dense future prediction in continuous streams without dense labels, leveraging self-supervised or predictive objectives for pretraining on large video corpora, extending to spatio-temporal localization tasks, and deeper integration of multi-modal signals. A plausible implication is that future research may further improve anticipative modeling by unifying multi-modal information at earlier stages and exploring richer predictive self-attention objectives (Girdhar et al., 2021, Zhong et al., 2022).
