
Action Anticipation at a Glimpse (AAG)

Updated 5 February 2026
  • The paper presents AAG, which predicts future actions using a single image frame and multimodal inputs, significantly reducing computation compared to video-based methods.
  • Action Anticipation at a Glimpse (AAG) is a method that combines visual appearance, depth geometry, and action history using bidirectional cross-attention for robust prediction in structured tasks.
  • Key insights include effective blur-based keyframe selection and stochastic corruption of history embeddings to enhance model resilience against noisy inputs.

Action Anticipation at a Glimpse (AAG) refers to a class of methods in activity understanding that predict future human actions by leveraging limited perceptual information—typically a single image frame enriched with multimodal context. The AAG paradigm challenges the assumption that dense temporal video modeling is necessary for effective action anticipation, demonstrating that much of the predictive signal can be harnessed from a single spatial “glimpse” plus appropriate geometric and semantic context. The AAG framework, along with its improved variant AAG⁺, achieves performance competitive with video-based models on complex, multi-step procedural datasets while substantially reducing computation and latency (Benavent-Lledo et al., 2 Dec 2025, Benavent-Lledo et al., 29 Jan 2026).

1. Foundations and Problem Statement

Action anticipation—predicting the label of the next action $\delta$ seconds before it occurs—has traditionally relied on aggregating features over multiple video frames. The AAG approach reformulates this as a single-frame classification task: at time $t$, given an observed RGB frame $f_t$ and potentially auxiliary modalities $\mathcal{M}_t$ (e.g., depth, prior action history), predict the forthcoming action label $y \in \{1, \ldots, C\}$:

$$p(y \mid \mathcal{M}_t) = \mathrm{softmax}(h(\mathcal{M}_t))$$

where $h(\cdot)$ is the model head and $\mathcal{M}_t$ includes all available modalities. The cross-entropy loss is used for training:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{c=1}^C \mathbf{1}[y = c] \log p(y = c \mid \mathcal{M}_t)$$
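The objective above can be sketched in a few lines. This is a minimal numpy illustration, not the paper's architecture: the linear head, feature dimension, and class count are illustrative assumptions standing in for $h(\cdot)$ over the fused multimodal embedding $\mathcal{M}_t$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 16, 5                        # feature dim, number of action classes
W = rng.normal(size=(C, D)) * 0.1   # toy linear head h(.)
m_t = rng.normal(size=D)            # fused multimodal embedding M_t

def softmax(z):
    z = z - z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = W @ m_t
p = softmax(logits)                 # p(y | M_t), a distribution over C classes
y = 2                               # ground-truth next-action label
loss = -np.log(p[y])                # L_CE with the one-hot indicator
print(p.sum(), loss)
```

With a one-hot target, the sum in $\mathcal{L}_{\mathrm{CE}}$ collapses to the negative log-probability of the true class, which is what the last line computes.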

Empirical work on IKEA-ASM, Meccano, and Assembly101 demonstrates that AAG can rival or surpass video-based anticipators, particularly for structured, low-variance tasks (Benavent-Lledo et al., 2 Dec 2025, Benavent-Lledo et al., 29 Jan 2026).

2. Multimodal Components and Feature Extraction

AAG models integrate three principal information streams:

  • RGB Appearance: Extracted from a single frame using a self-supervised vision transformer backbone (e.g., DINOv2 or DINOv3 for AAG⁺); outputs CLS-token embeddings.
  • Depth Geometry: Estimated via monocular depth models (e.g., Depth Anything v2), color-mapped to pseudo-RGB and processed through the same transformer, offering robustness in spatial reasoning, especially for exocentric and structured scenes.
  • Action History: Encodes semantic memory from the last $N$ atomic actions. Action history is represented either by ground-truth/predicted labels as class embeddings or via text encodings (DistilBERT). The most effective encoding is per-action embedding concatenation.
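The per-action embedding concatenation mentioned above can be sketched as follows; the vocabulary size, embedding dimension, and lookup table are illustrative assumptions, not the paper's trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
num_actions, emb_dim, N = 10, 8, 3            # toy vocabulary and dims
embedding_table = rng.normal(size=(num_actions, emb_dim))  # class embeddings

history = [4, 1, 7]                           # last N atomic-action labels, oldest first
# Per-action embedding concatenation: one vector per past action, joined
# into a single (N * emb_dim)-dimensional history representation.
history_vec = np.concatenate([embedding_table[a] for a in history])
print(history_vec.shape)
```

Concatenation (rather than pooling) preserves the order and identity of each past action, which matters when the upcoming action depends on the exact procedural sequence.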

AAG systematically demonstrated that depth cues primarily boost exocentric performance, while action history provides critical context where appearance is ambiguous or the upcoming action is under-determined (Benavent-Lledo et al., 2 Dec 2025, Benavent-Lledo et al., 29 Jan 2026).

3. Cross-Modal Fusion and Model Architecture

Multimodal fusion in AAG involves a two-stage transformer-based pipeline:

  1. Visual Fusion: RGB and depth embeddings are fused by cross-attention, where RGB features $X_{\mathrm{RGB}}$ act as queries and depth features $X_{\mathrm{D}}$ as keys/values:

$$Q = W_Q X_{\mathrm{RGB}}, \quad K = W_K X_{\mathrm{D}}, \quad V = W_V X_{\mathrm{D}}$$

$$X_V = \mathrm{softmax}\!\left(Q K^\top / \sqrt{D}\right) V$$

  2. Full Model Fusion: Visual and action-history embeddings are concatenated and jointly processed by a self-attention transformer (AAG) or by bidirectional cross-attention with gated fusion (AAG⁺):

$$X_v' = \mathrm{CA}(Q = X_v, K = X_t, V = X_t); \quad X_t' = \mathrm{CA}(Q = X_t, K = X_v, V = X_v)$$

$$Y = g \odot X_v' + (1 - g) \odot X_t', \quad g \in [0,1]^D$$
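The bidirectional cross-attention and gated fusion equations can be sketched in numpy. This is a single-head, untrained illustration under assumed dimensions: shared projection matrices, mean-pooled tokens, and a sigmoid gate stand in for the learned components.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attn(Xq, Xkv, Wq, Wk, Wv):
    # CA(Q=Xq, K=Xkv, V=Xkv): queries from one modality, keys/values from the other
    Q, K, V = Xq @ Wq, Xkv @ Wk, Xkv @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V

rng = np.random.default_rng(2)
D, Tv, Tt = 8, 4, 3                    # dim, visual tokens, history tokens
Xv = rng.normal(size=(Tv, D))          # visual (RGB+depth) embeddings X_v
Xt = rng.normal(size=(Tt, D))          # action-history embeddings X_t
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))

Xv_p = cross_attn(Xv, Xt, Wq, Wk, Wv)  # X_v': vision attends to history
Xt_p = cross_attn(Xt, Xv, Wq, Wk, Wv)  # X_t': history attends to vision

# Gate g in [0,1]^D (here a fixed sigmoid of random logits; learned in AAG+),
# fusing pooled tokens as Y = g * X_v' + (1 - g) * X_t'
g = 1.0 / (1.0 + np.exp(-rng.normal(size=D)))
Y = g * Xv_p.mean(axis=0) + (1 - g) * Xt_p.mean(axis=0)
print(Y.shape)
```

Because $g$ is elementwise, each fused dimension can lean on vision or history independently, which is what allows the gate to down-weight a noisy modality.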

Bidirectional cross-attention with learnable gating offers superior adaptability, especially under noisy or imperfect action histories (Benavent-Lledo et al., 29 Jan 2026).

4. Keyframe Selection and Robustness

AAG employs keyframe selection to maximize input informativeness—essential in egocentric or cluttered settings. The optimal policy is blurriness-based selection: within a temporal window, the first frame whose Laplacian variance exceeds a dataset-tuned threshold $T$ is chosen. For IKEA-ASM/Assembly101, $T = 100$; for Meccano, $T = 50$. This method outperforms alternatives such as distance-to-centroid and naive last-frame selection (Benavent-Lledo et al., 29 Jan 2026).
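The blur-based policy can be sketched with a plain numpy Laplacian filter; the 3×3 kernel and variance-of-Laplacian sharpness measure are standard, while the synthetic frames and the fallback to the last frame are illustrative assumptions.

```python
import numpy as np

LAP = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)  # 3x3 Laplacian

def laplacian_variance(gray: np.ndarray) -> float:
    # Convolve (valid region only) and return the variance of the response:
    # blurry frames have low high-frequency content, hence low variance.
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = (gray[i:i+3, j:j+3] * LAP).sum()
    return float(out.var())

def select_keyframe(frames, T=100.0):
    # First frame in the window whose sharpness exceeds the threshold T.
    for idx, f in enumerate(frames):
        if laplacian_variance(f) > T:
            return idx
    return len(frames) - 1  # fallback if every frame is blurry (an assumption)

rng = np.random.default_rng(3)
blurry = np.full((32, 32), 128.0)                      # flat: zero Laplacian variance
sharp = rng.integers(0, 256, (32, 32)).astype(float)   # high-frequency content
print(select_keyframe([blurry, blurry, sharp], T=100.0))  # -> 2
```

In practice the same measure is available as the variance of `cv2.Laplacian` output; the per-dataset threshold $T$ plays the role tuned in the paper.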

To enhance resilience to imperfect action histories, AAG⁺ introduces stochastic corruption—dropout, Gaussian noise, and random action swaps—of history embeddings during training. This yields robustness to recognition errors and history prediction noise without reliance on unrealistically clean context (Benavent-Lledo et al., 29 Jan 2026).
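A training-time corruption pipeline along these lines can be sketched as follows. The corruption probabilities, noise scale, and the specific order of operations are illustrative assumptions; "swaps" are implemented here as random label replacements.

```python
import numpy as np

rng = np.random.default_rng(4)

def corrupt_history(history, table, p_swap=0.2, p_drop=0.1, sigma=0.05):
    num_actions = table.shape[0]
    # 1) Random action swaps: replace some labels with random other labels,
    #    simulating recognition errors in the observed history.
    labels = [rng.integers(num_actions) if rng.random() < p_swap else a
              for a in history]
    emb = np.stack([table[a] for a in labels])
    # 2) Dropout on embedding dimensions.
    mask = rng.random(emb.shape) >= p_drop
    emb = emb * mask
    # 3) Additive Gaussian noise on the surviving activations.
    return emb + rng.normal(scale=sigma, size=emb.shape)

table = rng.normal(size=(10, 8))       # toy class-embedding table
corrupted = corrupt_history([4, 1, 7], table)
print(corrupted.shape)
```

Training against such corrupted histories means the fusion gate never sees a perfectly clean context, so at test time the model tolerates predicted (rather than ground-truth) histories.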

5. Quantitative Evaluation and Modal Impact

On large-scale procedural datasets, single-frame AAG⁺ approaches or surpasses video-based approaches both in accuracy and sample efficiency:

| Method | Modalities | # Frames | IKEA-ASM Top-1/5 | Meccano Top-1/5 | Assembly101 Mean Recall@5 |
|---|---|---|---|---|---|
| RULSTM | RGB | 14 | 26.4 / 70.1 | 24.1 / 58.2 | 1.00 |
| AVT | RGB | 10 | 27.1 / 69.7 | 27.4 / 53.4 | 17.0 |
| VLMAH | RGB+AH | 8 | 52.3 / 85.4 | 29.1 / 57.1 | 25.1 |
| AAG⁺ | RGB-D+AH | 1 | 51.3 / 88.9 | 27.2 / 60.4 | 13.0 |

Ablations confirm that while RGB+depth suffice for low-variance assembly, in high-variance scenarios, semantic action history is decisive. For example, on IKEA-ASM, predicted history + RGB + depth achieves 44.7/82.9 Top-1/5; ground-truth histories further boost performance. On Assembly101, even the best single-frame setup lags behind long-horizon video models, highlighting the remaining role of temporal modeling (Benavent-Lledo et al., 2 Dec 2025, Benavent-Lledo et al., 29 Jan 2026).

6. Relationship to Glimpse-based and Video Aggregation Methods

AAG occupies a distinct point in the action anticipation spectrum. Whereas glimpse-based models such as GliTr (Rangrej et al., 2022) actively select informative spatial patches across time, observing small regions per frame but always leveraging multiple frames, AAG relies on a single selected frame plus multimodal cues. GliTr employs spatiotemporal consistency losses to distill global knowledge into a glimpse-only agent, whereas AAG/AAG⁺ capitalize on cross-modal fusion and semantic priors to obviate the need for explicit sequence modeling—except in high-entropy, long-horizon contexts.

AAG achieves superior compute efficiency: one frame per anticipation (202M total parameters, 24M trainable) versus 8–37 frames in temporal aggregation baselines, with comparable or better accuracy in structured settings (Benavent-Lledo et al., 2 Dec 2025).

7. Limitations, Insights, and Future Directions

AAG's superiority is most pronounced in domains with structured task flow and low ambiguity, such as IKEA-ASM and Meccano, where much of the action anticipation task reduces to spatial context recognition and semantic sequence priors. Video aggregation becomes indispensable as action order entropy increases or visual cues become less informative (e.g., Assembly101). Key insights:

  • Multimodal complementarity: Depth aids spatial disambiguation; action history encodes procedural context when future actions are under-determined.
  • Fusion design: Adaptive, bidirectional cross-attention with gating outperforms naive concatenation or self-attention.
  • Keyframe selection: Blur-based selection mitigates the impact of occlusion and motion artifacts in real-world and egocentric data.
  • Robustness: Stochastic corruption of history embeddings precludes overfitting to idealized context, yielding practical resilience.

Potential extensions include end-to-end fine-tuning of the depth and text encoders, dynamic glimpse or keyframe selection modules, and incorporating further modalities (pose estimation, audio signals). For truly unstructured or highly ambiguous scenarios, future research points toward hybrid frameworks that combine glimpse-based sequence modeling with robust multimodal fusion (Rangrej et al., 2022, Benavent-Lledo et al., 2 Dec 2025, Benavent-Lledo et al., 29 Jan 2026).


References: (Rangrej et al., 2022, Benavent-Lledo et al., 2 Dec 2025, Benavent-Lledo et al., 29 Jan 2026)
