Anticipative Video Transformer (AVT)
- The paper presents a novel attention-based framework that leverages causal self-attention for accurate action anticipation.
- It employs a transformer architecture with a frame-level backbone and causal decoder to effectively model temporal progression.
- Results show state-of-the-art performance on benchmarks like EpicKitchens and EGTEA Gaze+, highlighting its robust action forecasting capabilities.
An Anticipative Video Transformer (AVT) is a purely attention-based, end-to-end deep learning framework designed to model temporal progression in video streams for action anticipation. AVT leverages causal self-attention to maintain the sequential ordering of observed frames while capturing long-range dependencies, crucial for accurately forecasting imminent actions. AVT is fundamentally characterized by a transformer architecture that ingests sequential frame-level representations, applies temporally-masked self-attention, and jointly optimizes multiple anticipative objectives. Results demonstrate AVT's state-of-the-art performance across major action anticipation benchmarks such as EpicKitchens-55, EpicKitchens-100, EGTEA Gaze+, and 50-Salads (Girdhar et al., 2021).
1. Motivation and Problem Setting
Action anticipation requires inferring the most probable next action based on the observed progression of prior actions in a temporally ordered video stream. This is substantively different from standard action recognition, which only classifies ongoing actions and does not reason over progression or structure in past events. AVT addresses the shortcomings of preceding LSTM-based and aggregation models by leveraging attention mechanisms that (1) encode all observed frames while preserving their temporal order and (2) allow direct modeling of long-range temporal dependencies without the limitations of recurrence or fixed-size pooling (Girdhar et al., 2021).
2. Architecture Overview
AVT's architecture consists of two core modules:
- Frame-level backbone ($\mathcal{B}$): Each frame $x_t$ is converted into a feature vector $z_t = \mathcal{B}(x_t)$, where $\mathcal{B}$ can be a Vision Transformer (ViT-B/16) or a standard CNN backbone (TSN, CSN). The ViT variant splits each frame into patches, embeds them, adds spatial position embeddings, and processes them via stacked multi-head self-attention layers.
- Causal transformer decoder ($\mathcal{D}$): The feature sequence $(z_1, \dots, z_T)$ is input to a transformer decoder with stacked layers. Temporal positional encodings are added, and decoder step $t$ attends only to $z_1, \dots, z_t$ via causal masking, enforcing online prediction constraints. Each decoder output $\hat{z}_{t+1}$ predicts the next frame's feature $z_{t+1}$, and a linear head projects it to class logits for action anticipation at $t{+}1$.
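The two-stage pipeline above can be sketched in a few lines of NumPy. This is illustrative only: random weights, a single decoder layer, no positional encodings, and toy dimensions; it shows only the data flow from frames to per-frame features to causally attended predictions to logits.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, D, C = 8, 16, 16, 32, 10   # frames, height, width, feature dim, classes

# Stand-in frame-level backbone: one linear projection per frame
# (in AVT this is a ViT or CNN producing one feature vector per frame).
frames = rng.standard_normal((T, H, W))
W_b = rng.standard_normal((H * W, D)) * 0.01
z = frames.reshape(T, -1) @ W_b       # (T, D) per-frame features z_1..z_T

# Stand-in causal decoder layer: masked self-attention over the features.
scores = z @ z.T / np.sqrt(D)
scores[np.triu(np.ones((T, T), bool), k=1)] = -np.inf   # block future frames
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)         # row-wise softmax
z_hat = w @ z                         # (T, D): z_hat[t] predicts z_{t+1}

# Linear head: class logits for the anticipated action at each next step.
W_h = rng.standard_normal((D, C)) * 0.01
logits = z_hat @ W_h                  # (T, C); logits[-1] anticipates t = T+1
print(logits.shape)                   # (8, 10)
```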
Self-attention in both backbone and decoder is implemented as scaled dot-product attention,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}} + M\right)V,$$

where in the decoder $M$ is a causal mask with $M_{ij} = 0$ for $j \le i$ and $M_{ij} = -\infty$ for $j > i$, so each position attends only to itself and earlier positions and the resulting attention matrix is lower-triangular; spatial attention in the backbone is unmasked.
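A minimal NumPy sketch of the masked attention above (single head, no learned projections; shapes are illustrative). The check at the end verifies the causality property: perturbing a future frame leaves all earlier outputs unchanged.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V: (T, d) arrays of per-frame queries, keys, values.
    Position t may attend only to positions <= t.
    """
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)               # (T, T) similarity logits
    mask = np.triu(np.ones((T, T), bool), k=1)  # True above diagonal (future)
    scores = np.where(mask, -np.inf, scores)    # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                          # (T, d) causally attended output

# Causality check: perturbing the last frame leaves earlier outputs unchanged.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
out = causal_attention(Q, K, V)
K2, V2 = K.copy(), V.copy()
K2[4] += 10.0; V2[4] += 10.0                    # change only the last frame
out2 = causal_attention(Q, K2, V2)
assert np.allclose(out[:4], out2[:4])           # first 4 outputs identical
```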
3. Temporal Modeling and Dependency Capture
AVT maintains full sequential frame ordering by preserving the time index throughout encoding and masking attention in the decoder. Unlike aggregation methods, no pooling over the temporal axis is performed, ensuring that the learned representation encodes only the observed history up to time $t$. Long-range dependencies are captured efficiently, as each query at time $t$ can directly access all previous frames within a single attention layer, thereby overcoming the vanishing-gradient and limited-memory issues inherent to recurrence-driven models. Multi-layer stacking further enables hierarchical, high-order temporal dependency modeling (Girdhar et al., 2021).
4. Anticipative Training Objectives
AVT employs multi-task, joint training with three distinct losses. Let $z_1, \dots, z_T$ be the backbone features of the observed frames, $\hat{z}_{t+1}$ the decoder's prediction of the feature at step $t{+}1$, and $c_t$ the action label at step $t$ (where available):
- Next-action cross-entropy loss: $\mathcal{L}_{\text{next}} = -\log p(c_{T+1} \mid \hat{z}_{T+1})$.
Optimizes the final decoder representation to predict the action at $T{+}1$.
- Predictive feature regression loss: $\mathcal{L}_{\text{feat}} = \sum_{t=1}^{T-1} \lVert \hat{z}_{t+1} - z_{t+1} \rVert_2^2$.
Enforces that decoder outputs at each time step are predictive of the true future frame features.
- Intermediate classification loss: $\mathcal{L}_{\text{cls}} = -\sum_{t \le T} \log p(c_t \mid \hat{z}_t)$.
Directly encourages correct action labeling from intermediate predictions where labels are available.
The final objective is the sum $\mathcal{L} = \mathcal{L}_{\text{next}} + \mathcal{L}_{\text{feat}} + \mathcal{L}_{\text{cls}}$. Empirically, both auxiliary losses are essential for maximizing anticipative gains: ablation studies confirm that including $\mathcal{L}_{\text{feat}}$ and $\mathcal{L}_{\text{cls}}$ improves recall@5 by 1.3–3.5% over naïve next-step-only training (Girdhar et al., 2021).
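The combined objective can be sketched as follows. This is a hedged illustration, not the paper's exact implementation: uniform loss weights, mean rather than sum reductions, and the helper `avt_joint_loss` are all assumptions made for compactness.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def avt_joint_loss(logits, labels, z_hat, z_true):
    """Sum of the three anticipative losses (uniform weights; a sketch).

    logits: (T, C) per-step class logits; labels: (T,) int action labels,
            where labels[-1] is the next (future) action.
    z_hat:  (T-1, D) predicted features for steps 2..T; z_true: their targets.
    """
    probs = softmax(logits)
    L_next = -np.log(probs[-1, labels[-1]])                  # next-action CE
    L_feat = np.mean((z_hat - z_true) ** 2)                  # feature regression
    L_cls = -np.mean(np.log(probs[np.arange(len(labels) - 1), labels[:-1]]))
    return L_next + L_feat + L_cls                           # joint objective

rng = np.random.default_rng(1)
T, C, D = 6, 4, 8
loss = avt_joint_loss(rng.standard_normal((T, C)),
                      rng.integers(0, C, size=T),
                      rng.standard_normal((T - 1, D)),
                      rng.standard_normal((T - 1, D)))
assert loss > 0 and np.isfinite(loss)
```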
5. Experimental Protocols and Results
AVT was evaluated on the following benchmarks:
| Dataset | Frames / obs. window | #Actions | Anticipation time τₐ | Metric | AVT Result |
|---|---|---|---|---|---|
| EpicKitchens-100 | 10/10s | 3,807 | 1s | Recall@5 | 14.9% (ViT, RGB only); 16.7% (AVT⁺ fusion) |
| EpicKitchens-55 | 10/10s | 2,513 | 1s | Top-1, Top-5 action | 14.4% (irCSN152+IG65M) |
| EGTEA Gaze+ | 10/5s | 106 | 0.5s | Top-1/class-mean | 43.0% (top-1), 35.2% (class-mean) |
| 50-Salads | 10/10s | 17 | 1s | Top-1 action | 48.0% |
AVT consistently outperformed RULSTM, ActionBanks, and baseline fusion models. It was the top performer in the EpicKitchens-100 CVPR’21 action anticipation challenge, with a class-mean recall@5 of 16.7% for RGB+OBJ fusion, compared to previous bests of 14.0% (Girdhar et al., 2021).
6. Empirical Analysis and Interpretability
AVT’s temporal attention heads exhibited interpretable patterns: spatial attention in the ViT backbone localized hands and objects without supervision, while temporal attention emphasized long-range dependencies for actions requiring extended context (e.g., “open fridge” after “gather items”), and short-range attention for actions triggered by immediate cues (e.g., “turn off tap”). Rollout experiments, wherein the model recursively predicts multiple future action steps, demonstrated emergent chaining (“action schema” discovery, e.g. wash → dry hand → put → close), indicating the model's capacity to synthesize plausible extended action sequences (Girdhar et al., 2021).
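The rollout procedure described above can be sketched generically: each predicted feature is appended to the sequence and fed back in, so the model recursively conditions on its own predictions. The `toy_step` function here is a stand-in (an exponential moving average), not AVT's decoder; only the feedback loop is the point.

```python
import numpy as np

def rollout(decoder_step, z_obs, n_future):
    """Greedy multi-step rollout: feed each predicted feature back as input.

    decoder_step: maps a feature sequence (t, D) -> predicted next feature (D,).
    z_obs: (T, D) observed features; returns (n_future, D) predicted features.
    """
    seq = list(z_obs)
    preds = []
    for _ in range(n_future):
        z_next = decoder_step(np.stack(seq))
        preds.append(z_next)
        seq.append(z_next)          # recursively condition on own prediction
    return np.stack(preds)

# Toy "decoder": exponential moving average of the sequence (stand-in only).
toy_step = lambda seq: 0.5 * seq[-1] + 0.5 * seq.mean(axis=0)

rng = np.random.default_rng(2)
future = rollout(toy_step, rng.standard_normal((4, 8)), n_future=3)
assert future.shape == (3, 8)
```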
7. Extensions, Limitations, and Future Directions
AVT's design is agnostic to the specific visual backbone and can be extended to incorporate multiple modalities (audio, object detections), as demonstrated in follow-on work using mid-level modality fusion transformers (Zhong et al., 2022). Open challenges identified include scaling to dense future prediction in continuous streams without dense labels, leveraging self-supervised or predictive objectives for pretraining on large video corpora, extending to spatio-temporal localization tasks, and deeper integration of multi-modal signals. A plausible implication is that future research may further improve anticipative modeling by unifying multi-modal information at earlier stages and exploring richer predictive self-attention objectives (Girdhar et al., 2021; Zhong et al., 2022).