Anticipative Video Transformer (AVT)
- The paper presents a novel attention-based framework that leverages causal self-attention for accurate action anticipation.
- It employs a transformer architecture with a frame-level backbone and causal decoder to effectively model temporal progression.
- Results show state-of-the-art performance on benchmarks like EpicKitchens and EGTEA Gaze+, highlighting its robust action forecasting capabilities.
An Anticipative Video Transformer (AVT) is a purely attention-based, end-to-end deep learning framework designed to model temporal progression in video streams for action anticipation. AVT leverages causal self-attention to maintain the sequential ordering of observed frames while capturing long-range dependencies, crucial for accurately forecasting imminent actions. AVT is fundamentally characterized by a transformer architecture that ingests sequential frame-level representations, applies temporally-masked self-attention, and jointly optimizes multiple anticipative objectives. Results demonstrate AVT's state-of-the-art performance across major action anticipation benchmarks such as EpicKitchens-55, EpicKitchens-100, EGTEA Gaze+, and 50-Salads (Girdhar et al., 2021).
1. Motivation and Problem Setting
Action anticipation requires inferring the most probable next action based on the observed progression of prior actions in a temporally ordered video stream. This is substantively different from standard action recognition, which only classifies ongoing actions and does not reason over progression or structure in past events. AVT addresses the shortcomings of preceding LSTM-based and aggregation models by leveraging attention mechanisms that (1) encode all observed frames while preserving their temporal order and (2) allow direct modeling of long-range temporal dependencies without the limitations of recurrence or fixed-size pooling (Girdhar et al., 2021).
2. Architecture Overview
AVT's architecture consists of two core modules:
- Frame-level backbone ($\mathcal{B}$): Each frame $x_t$ is converted into a feature vector $z_t = \mathcal{B}(x_t)$, where $\mathcal{B}$ can be a Vision Transformer (ViT-B/16) or a standard CNN backbone (TSN, CSN). The ViT variant splits each frame into patches, embeds them, adds spatial position embeddings, and processes them via stacked multi-head self-attention layers.
- Causal transformer decoder ($\mathcal{D}$): The feature sequence $(z_1, \dots, z_T)$ is input to a transformer decoder with stacked layers. Temporal positional encodings are added, and decoder step $t$ attends only to $z_1, \dots, z_t$ via causal masking, enforcing online prediction constraints. Each decoder output $\hat{z}_{t+1}$ predicts the next frame's feature $z_{t+1}$, and a linear head projects it to class logits for action anticipation at $t{+}1$.
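The two-stage pipeline above can be sketched in a few lines of NumPy. This is illustrative only: random weights, a single decoder layer, no positional encodings, and toy dimensions; it shows only the data flow from frames to per-frame features to causally attended predictions to logits.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, D, C = 8, 16, 16, 32, 10   # frames, height, width, feature dim, classes

# Stand-in frame-level backbone: one linear projection per frame
# (in AVT this is a ViT or CNN producing one feature vector per frame).
frames = rng.standard_normal((T, H, W))
W_b = rng.standard_normal((H * W, D)) * 0.01
z = frames.reshape(T, -1) @ W_b       # (T, D) per-frame features z_1..z_T

# Stand-in causal decoder layer: masked self-attention over the features.
scores = z @ z.T / np.sqrt(D)
scores[np.triu(np.ones((T, T), bool), k=1)] = -np.inf   # block future frames
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)         # row-wise softmax
z_hat = w @ z                         # (T, D): z_hat[t] predicts z_{t+1}

# Linear head: class logits for the anticipated action at each next step.
W_h = rng.standard_normal((D, C)) * 0.01
logits = z_hat @ W_h                  # (T, C); logits[-1] anticipates t = T+1
print(logits.shape)                   # (8, 10)
```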
Self-attention in both backbone and decoder is implemented as scaled dot-product attention,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}} + M\right)V,$$

where in the decoder $M$ is a causal mask with $M_{ij} = 0$ for $j \le i$ and $M_{ij} = -\infty$ for $j > i$, so each position attends only to itself and earlier positions and the resulting attention matrix is lower-triangular; spatial attention in the backbone is unmasked.
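A minimal NumPy sketch of the masked attention above (single head, no learned projections; shapes are illustrative). The check at the end verifies the causality property: perturbing a future frame leaves all earlier outputs unchanged.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V: (T, d) arrays of per-frame queries, keys, values.
    Position t may attend only to positions <= t.
    """
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)               # (T, T) similarity logits
    mask = np.triu(np.ones((T, T), bool), k=1)  # True above diagonal (future)
    scores = np.where(mask, -np.inf, scores)    # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                          # (T, d) causally attended output

# Causality check: perturbing the last frame leaves earlier outputs unchanged.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
out = causal_attention(Q, K, V)
K2, V2 = K.copy(), V.copy()
K2[4] += 10.0; V2[4] += 10.0                    # change only the last frame
out2 = causal_attention(Q, K2, V2)
assert np.allclose(out[:4], out2[:4])           # first 4 outputs identical
```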
3. Temporal Modeling and Dependency Capture
AVT maintains full sequential frame ordering by preserving the time index throughout encoding and masking attention in the decoder. Unlike aggregation methods, no pooling over the temporal axis is performed, ensuring that the learned representation encodes only the observed history up to time $t$. Long-range dependencies are captured efficiently, as each query at time $t$ can directly access all previous frames within a single attention layer, thereby overcoming the vanishing-gradient and limited-memory issues inherent to recurrence-driven models. Multi-layer stacking further enables hierarchical, high-order temporal dependency modeling (Girdhar et al., 2021).
4. Anticipative Training Objectives
AVT employs multi-task, joint training with three distinct losses. Let $z_1, \dots, z_T$ be the backbone features of the observed frames, $\hat{z}_{t+1}$ the decoder's prediction of the feature at step $t{+}1$, and $c_t$ the action label at step $t$ (where available):
- Next-action cross-entropy loss: $\mathcal{L}_{\text{next}} = -\log p(c_{T+1} \mid \hat{z}_{T+1})$.
Optimizes the final decoder representation to predict the action at $T{+}1$.
- Predictive feature regression loss: $\mathcal{L}_{\text{feat}} = \sum_{t=1}^{T-1} \lVert \hat{z}_{t+1} - z_{t+1} \rVert_2^2$.
Enforces that decoder outputs at each time step are predictive of the true future frame features.
- Intermediate classification loss: $\mathcal{L}_{\text{cls}} = -\sum_{t \le T} \log p(c_t \mid \hat{z}_t)$.
Directly encourages correct action labeling from intermediate predictions where labels are available.
The final objective is the sum $\mathcal{L} = \mathcal{L}_{\text{next}} + \mathcal{L}_{\text{feat}} + \mathcal{L}_{\text{cls}}$. Empirically, both auxiliary losses are essential for maximizing anticipative gains: ablation studies confirm that including $\mathcal{L}_{\text{feat}}$ and $\mathcal{L}_{\text{cls}}$ improves recall@5 by 1.3–3.5% over naïve next-step-only training (Girdhar et al., 2021).
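The combined objective can be sketched as follows. This is a hedged illustration, not the paper's exact implementation: uniform loss weights, mean rather than sum reductions, and the helper `avt_joint_loss` are all assumptions made for compactness.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def avt_joint_loss(logits, labels, z_hat, z_true):
    """Sum of the three anticipative losses (uniform weights; a sketch).

    logits: (T, C) per-step class logits; labels: (T,) int action labels,
            where labels[-1] is the next (future) action.
    z_hat:  (T-1, D) predicted features for steps 2..T; z_true: their targets.
    """
    probs = softmax(logits)
    L_next = -np.log(probs[-1, labels[-1]])                  # next-action CE
    L_feat = np.mean((z_hat - z_true) ** 2)                  # feature regression
    L_cls = -np.mean(np.log(probs[np.arange(len(labels) - 1), labels[:-1]]))
    return L_next + L_feat + L_cls                           # joint objective

rng = np.random.default_rng(1)
T, C, D = 6, 4, 8
loss = avt_joint_loss(rng.standard_normal((T, C)),
                      rng.integers(0, C, size=T),
                      rng.standard_normal((T - 1, D)),
                      rng.standard_normal((T - 1, D)))
assert loss > 0 and np.isfinite(loss)
```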
5. Experimental Protocols and Results
AVT was evaluated on the following benchmarks:
| Dataset | Frames / obs. window | #Actions | Anticipation time τₐ | Metric | AVT Result |
|---|---|---|---|---|---|
| EpicKitchens-100 | 10/10s | 3,807 | 1s | Recall@5 | 14.9% (ViT, RGB only); 16.7% (AVT⁺ fusion) |
| EpicKitchens-55 | 10/10s | 2,513 | 1s | Top-1, Top-5 action | 14.4% (irCSN152+IG65M) |
| EGTEA Gaze+ | 10/5s | 106 | 0.5s | Top-1/class-mean | 43.0% (top-1), 35.2% (class-mean) |
| 50-Salads | 10/10s | 17 | 1s | Top-1 action | 48.0% |
AVT consistently outperformed RULSTM, ActionBanks, and baseline fusion models. It was the top performer in the EpicKitchens-100 CVPR’21 action anticipation challenge, with a class-mean recall@5 of 16.7% for RGB+OBJ fusion, compared to previous bests of 14.0% (Girdhar et al., 2021).
6. Empirical Analysis and Interpretability
AVT’s temporal attention heads exhibited interpretable patterns: spatial attention in the ViT backbone localized hands and objects without supervision, while temporal attention emphasized long-range dependencies for actions requiring extended context (e.g., “open fridge” after “gather items”), and short-range attention for actions triggered by immediate cues (e.g., “turn off tap”). Rollout experiments, wherein the model recursively predicts multiple future action steps, demonstrated emergent chaining (“action schema” discovery, e.g. wash → dry hand → put → close), indicating the model's capacity to synthesize plausible extended action sequences (Girdhar et al., 2021).
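The rollout procedure described above can be sketched generically: each predicted feature is appended to the sequence and fed back in, so the model recursively conditions on its own predictions. The `toy_step` function here is a stand-in (an exponential moving average), not AVT's decoder; only the feedback loop is the point.

```python
import numpy as np

def rollout(decoder_step, z_obs, n_future):
    """Greedy multi-step rollout: feed each predicted feature back as input.

    decoder_step: maps a feature sequence (t, D) -> predicted next feature (D,).
    z_obs: (T, D) observed features; returns (n_future, D) predicted features.
    """
    seq = list(z_obs)
    preds = []
    for _ in range(n_future):
        z_next = decoder_step(np.stack(seq))
        preds.append(z_next)
        seq.append(z_next)          # recursively condition on own prediction
    return np.stack(preds)

# Toy "decoder": exponential moving average of the sequence (stand-in only).
toy_step = lambda seq: 0.5 * seq[-1] + 0.5 * seq.mean(axis=0)

rng = np.random.default_rng(2)
future = rollout(toy_step, rng.standard_normal((4, 8)), n_future=3)
assert future.shape == (3, 8)
```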
7. Extensions, Limitations, and Future Directions
AVT's design is agnostic to the specific visual backbone and can be extended to incorporate multiple modalities (audio, object detections), as demonstrated in follow-on work using mid-level modality fusion transformers (Zhong et al., 2022). Open challenges identified include scaling to dense future prediction in continuous streams without dense labels, leveraging self-supervised or predictive objectives for pretraining on large video corpora, extending to spatio-temporal localization tasks, and deeper integration of multi-modal signals. A plausible implication is that future research may further improve anticipative modeling by unifying multi-modal information at earlier stages and exploring richer predictive self-attention objectives (Girdhar et al., 2021; Zhong et al., 2022).