Action Genome: Compositional Video Analysis
- Action Genome is a framework that formalizes video actions as dynamic configurations of human–object interactions using spatio-temporal scene graphs.
- It integrates scene graph features with 3D CNNs, enhancing action recognition accuracy and establishing benchmarks for complex video analysis.
- The approach supports tasks such as abductive past action inference and few-shot learning, offering robust and interpretable insights into temporal behaviors.
Action Genome formalizes the compositional structure of actions in video by representing activities as dynamic configurations of human–object interactions within spatio-temporal scene graphs. Motivated by cognitive and neuroscience findings that humans perceive activities hierarchically and parse them into interacting “action–object” units, Action Genome restructures standard action recognition pipelines by decomposing everyday household video into frame-level semantic graphs, enabling fine-grained understanding and inference of complex, temporally extended behaviors. The approach not only enhances supervised learning for action recognition but also establishes a benchmark for scene-graph prediction in the video domain and enables new models and tasks such as abductive past action inference and few-shot compositional recognition.
1. Action Genome Dataset and Spatio-temporal Representation
Action Genome introduces an annotation and modeling paradigm where each frame at time $t$ is represented by a scene graph $G_t = (O_t, R_t)$, with $O_t$ the set of detected object nodes (including “person” and up to 35 interactable classes) and $R_t$ the set of edges (subject–predicate–object relationships) drawn from a 25-predicate vocabulary. The predicate set is partitioned into attention (e.g., “looking at”), spatial (e.g., “in front of”), and contact (e.g., “holding,” “sitting on”) types. Temporal dynamics are encoded by linking consecutive frame graphs and computing the changed relationship sets $R_{t+1} \setminus R_t$ (started) and $R_t \setminus R_{t+1}$ (ended) to capture atomic relational transitions.
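The per-frame graph and its transitions can be sketched with plain set operations. This is a minimal illustrative sketch, not the dataset's actual annotation format; the class names and the set-difference definition of a transition are assumptions consistent with the description above.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Relationship:
    subject: str    # usually "person"
    predicate: str  # attention / spatial / contact predicate
    obj: str        # one of the 35 interactable object classes

@dataclass
class FrameGraph:
    objects: set = field(default_factory=set)
    relations: set = field(default_factory=set)  # set of Relationship

def transitions(g_prev: FrameGraph, g_next: FrameGraph):
    """Atomic relational transitions between consecutive frame graphs."""
    started = g_next.relations - g_prev.relations
    ended = g_prev.relations - g_next.relations
    return started, ended

g1 = FrameGraph({"person", "cup"},
                {Relationship("person", "looking_at", "cup")})
g2 = FrameGraph({"person", "cup"},
                {Relationship("person", "looking_at", "cup"),
                 Relationship("person", "holding", "cup")})
started, ended = transitions(g1, g2)  # "holding" starts, nothing ends
```

Because `Relationship` is a frozen dataclass, instances are hashable and set differences work directly.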
Dataset construction leverages the Charades corpus (10,000 videos, 157 action classes). Action intervals are uniformly sampled at 5 frames per interval, producing 234,253 annotated frames with a total annotation count of 476,229 bounding boxes (35 object classes plus “person”) and 1,715,568 relationships. The frequency distribution of objects (≥10,000 occurrences per class) and predicates (≥1,000 occurrences) ensures statistical density suitable for learning and evaluation (Ji et al., 2019).
2. Model Architectures and Scene Graph Feature Integration
To leverage the compositional structure, standard 3D CNN pipelines (e.g., SlowFast, I3D+Non-local) are augmented with a Scene Graph Feature Bank (SGFB). For each frame, a per-frame detector (Faster R-CNN) and a relational prediction head (RelDN) infer object and relationship scores. These are arranged into a confidence matrix over object classes and predicates, which is flattened into a per-frame feature vector. An SGFB consisting of these temporally stacked vectors is extracted for each video.
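A sketch of SGFB construction, assuming per-frame confidence scores over 36 object classes (35 objects plus “person”) and 25 predicates; the outer-product layout of the confidence matrix and the flattening scheme are illustrative assumptions, not the paper's exact encoding.

```python
import numpy as np

N_OBJ, N_PRED = 36, 25  # 35 object classes + "person"; 25 predicates

def frame_feature(obj_scores, rel_scores):
    """Confidence matrix (objects x predicates) flattened to a vector."""
    conf = obj_scores[:, None] * rel_scores[None, :]  # (36, 25)
    return conf.reshape(-1)                           # (900,)

def build_sgfb(per_frame_scores):
    """Stack per-frame vectors into a (T, 900) feature bank."""
    return np.stack([frame_feature(o, r) for o, r in per_frame_scores])

rng = np.random.default_rng(0)
frames = [(rng.random(N_OBJ), rng.random(N_PRED)) for _ in range(8)]
sgfb = build_sgfb(frames)  # feature bank of shape (8, 900)
```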
Fusion with the 3D CNN backbone is achieved via feature-bank operators (e.g., non-local, pooling), aggregating SGFBs into context vectors and concatenating or adding these to pooled spatio-temporal CNN features for final action prediction. This hierarchical aggregation pools atomic action–object interactions to high-level activity encodings (Ji et al., 2019).
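The fusion step above can be sketched as an attention-style feature-bank operator followed by concatenation. This simplified pooling stands in for the non-local operator in the paper; the projection matrix `w_query` and all dimensions are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse(cnn_feat, sgfb, w_query):
    """Attend from the clip-level CNN feature over the SGFB, then concat."""
    query = cnn_feat @ w_query                  # project into bank dimension
    attn = softmax(sgfb @ query)                # one weight per frame, (T,)
    context = attn @ sgfb                       # weighted sum over frames
    return np.concatenate([cnn_feat, context])  # fused action feature

rng = np.random.default_rng(1)
cnn_feat = rng.random(128)       # pooled spatio-temporal CNN feature
sgfb = rng.random((8, 900))      # bank of 8 per-frame scene graph vectors
w_query = rng.random((128, 900))
fused = fuse(cnn_feat, sgfb, w_query)  # 128 + 900 = 1028 dims
```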
3. Training Objectives and Loss Functions
Action Genome models optimize multi-task objectives:
- Classification loss: for $C$ action classes, softmax cross-entropy is used: $\mathcal{L}_{\text{cls}} = -\sum_{c=1}^{C} y_c \log p_c$, with $y_c$ the ground-truth indicator and $p_c$ the predicted class probability.
- Graph prediction loss: $\mathcal{L}_{\text{graph}} = \mathcal{L}_{\text{det}} + \mathcal{L}_{\text{rel}}$, where $\mathcal{L}_{\text{det}}$ is the standard detection loss for object boxes/classification and $\mathcal{L}_{\text{rel}}$ is the multi-class loss for relationship classification per edge.
- Combined loss: $\mathcal{L} = \lambda_{\text{cls}} \mathcal{L}_{\text{cls}} + \lambda_{\text{graph}} \mathcal{L}_{\text{graph}}$, with the weights $\lambda$ adjusted for each task (Ji et al., 2019).
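The multi-task objective above can be sketched as follows. The weighting scheme, the per-edge averaging, and treating the detection loss as a given scalar are simplifying assumptions for illustration.

```python
import numpy as np

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example (numerically stable)."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def combined_loss(action_logits, action_label,
                  rel_logits_per_edge, rel_labels,
                  det_loss, lam_cls=1.0, lam_graph=1.0):
    """L = lam_cls * L_cls + lam_graph * (L_det + L_rel)."""
    l_cls = cross_entropy(action_logits, action_label)
    l_rel = np.mean([cross_entropy(l, y)
                     for l, y in zip(rel_logits_per_edge, rel_labels)])
    l_graph = det_loss + l_rel
    return lam_cls * l_cls + lam_graph * l_graph

# toy example: 3 action classes, 2 relationship edges with 2 predicates
loss = combined_loss(np.array([2.0, 0.5, -1.0]), 0,
                     [np.array([1.0, 0.2]), np.array([0.1, 0.9])], [0, 1],
                     det_loss=0.3)
```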
4. Experimental Outcomes: Recognition and Benchmarking
On the Charades dataset, the SGFB-enhanced model achieves 44.3% mAP, exceeding the baseline LFB/SlowFast+NL benchmark at 42.5% mAP. Oracle experiments with ground truth scene graphs push this to 60.3% mAP, evidencing the upper bound impact of accurate relational modeling. In the few-shot regime, decomposing activity into compositional units is particularly advantageous: SGFB attains 42.7% mAP in the 10-shot case, compared to LFB’s 39.6%. The composition-driven approach, by pooling object–predicate interactions, enables significant data efficiency (Ji et al., 2019).
Action Genome also establishes spatio-temporal scene graph prediction as a video understanding benchmark. Using the RelDN baseline, video-level Recall@20/50 is 48.8/48.98 (predicate classification), 46.19/48.76 (scene-graph classification), and 34.92/36.49 (scene-graph detection). The modest drop in recall (1–2%) relative to static-image approaches suggests temporal modeling adds complexity, leaving room for future improvement (Ji et al., 2019).
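Recall@K for scene-graph prediction measures the fraction of ground-truth triples recovered among the top-K scored predictions. The sketch below simplifies the evaluation protocol (e.g., it ignores per-video averaging and box-matching details).

```python
def recall_at_k(scored_triples, gt_triples, k):
    """scored_triples: list of (score, triple); gt_triples: set of triples."""
    topk = {t for _, t in sorted(scored_triples, key=lambda x: -x[0])[:k]}
    if not gt_triples:
        return 1.0
    return len(topk & gt_triples) / len(gt_triples)

gt = {("person", "holding", "cup"), ("person", "looking_at", "cup")}
preds = [(0.9, ("person", "holding", "cup")),
         (0.6, ("person", "in_front_of", "table")),
         (0.4, ("person", "looking_at", "cup"))]
r2 = recall_at_k(preds, gt, 2)  # only "holding" lands in the top-2 -> 0.5
r3 = recall_at_k(preds, gt, 3)  # both ground-truth triples recovered -> 1.0
```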
5. Extensions: Abductive Reasoning and Relational Inference
New research leverages Action Genome for abductive past action inference: given an observed set of human–object relations from a single frame (“snapshot”), models predict the most likely set, sequence, or membership status of past actions that could have led to the present scene (Tan et al., 2022). Whereas the forward direction models how actions produce the observed relations, abduction inverts this to infer the latent past actions from the relational evidence. Relational models such as R-GNNED and RBP, as well as fused bilinear graph models (BiGED), operationalize this problem.
- Past Action Set Prediction: BiGED achieves mAP of 35.8% and Recall@10 of 60.6%, outperforming video-only, vision-language, and rule-based competitors (which all remain below 28% mAP). Human annotators achieve R@10 ≈ 80.6%, indicating significant headroom for model improvement.
- Sequence Prediction: All models, including BiGED+GRU, reach ~10.5% accuracy, against a human upper bound of ~14%.
- Verification: BiGED approach achieves mAP of 34.1% and R@10 of 57.4%.
These results reinforce that explicit scene graph representations substantially outperform holistic vision backbones for inferring latent causal structure in human activities (Tan et al., 2022).
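Past action set prediction can be framed as multi-label classification over relation features from the snapshot. The sketch below is a bare-bones stand-in for models like BiGED; the mean-pooling encoder, the 157-class output (matching Charades), and the sigmoid threshold are illustrative assumptions.

```python
import numpy as np

def predict_past_actions(relation_feats, w, b, threshold=0.5):
    """Pool snapshot relation features, score each past-action class."""
    pooled = relation_feats.mean(axis=0)   # aggregate the snapshot's relations
    logits = pooled @ w + b                # one logit per action class
    probs = 1.0 / (1.0 + np.exp(-logits))  # independent sigmoids (multi-label)
    return probs, probs >= threshold       # scores and predicted set membership

rng = np.random.default_rng(2)
feats = rng.random((4, 64))        # 4 observed relations, 64-d features each
w = rng.random((64, 157)) * 0.1    # 157 action classes, as in Charades
b = np.zeros(157)
probs, members = predict_past_actions(feats, w, b)
```

Set prediction is then evaluated with mAP over `probs` and Recall@10 over the top-scored actions, as in the numbers above.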
6. Generalization: Home Action Genome and Multimodal, Hierarchical Compositions
The Home Action Genome (HOMAGE) dataset extends these ideas to multi-view, multi-modal recordings in realistic home environments, capturing 5,700 videos with 27 participants, 86 object categories, and 29 relation types, densely annotated in scene graphs. Hierarchical annotation includes both coarse activity (75 classes) and atomic action (453 categories) labels, supporting analysis of compositional structure at multiple temporal granularities (Rai et al., 2021).
Cooperative Compositional Action Understanding (CCAU) is a multi-modal framework trained with a combination of cross-modal alignment and compositional multi-task objectives. Across audio, ego-view, and third-person modalities, CCAU improves atomic-action mAP (ego: 29.3%, audio: 21.7%) over single-modality baselines and yields large gains (+6.2% in 1-shot and +8.8% in 20-shot few-shot recognition). This supports the assertion that compositionality, multi-modality, and spatial/temporal structure are complementary for robust action understanding (Rai et al., 2021).
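Cross-modal alignment of the kind CCAU uses can be sketched as an InfoNCE-style contrastive loss that pulls paired embeddings from two modalities together within a batch. This is an illustrative assumption in the spirit of the framework, not the paper's exact formulation.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """z_a, z_b: (B, D) L2-normalized embeddings of paired clips."""
    sim = (z_a @ z_b.T) / temperature           # (B, B) similarity matrix
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))         # match i-th a with i-th b

rng = np.random.default_rng(3)
z = rng.normal(size=(16, 32))
z /= np.linalg.norm(z, axis=1, keepdims=True)
loss_aligned = info_nce(z, z)                    # correctly paired modalities
loss_mismatch = info_nce(z, np.roll(z, 1, axis=0))  # shuffled pairing
```

Correctly paired embeddings should yield a much lower loss than a shuffled pairing, which is what drives the modalities toward a shared space.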
7. Impact, Limitations, and Future Directions
Action Genome formalizes human action understanding as the analysis of spatio-temporal scene graph dynamics, delivering practical advances in supervised and few-shot recognition, abductive inference, and compositional generalization. Current limitations include reliance on ground truth or high-quality scene graph predictions; performance deteriorates when object/relationship annotations are noisy or automatically generated. Sequencing and abductive inference remain challenging, with human performance indicating a natural upper bound that models have not yet reached.
Prospective directions include:
- Extending compositional graph reasoning to multi-frame and multi-agent scenarios
- Exploiting unsupervised or semi-supervised annotation pipelines for scalability
- Applying scene graph abduction to robotics, ambient intelligence, and explainable AI
- Deepening the predicate ontology and temporal event logic to approach open-domain activity understanding
The Action Genome line has significantly influenced benchmarks, models, and methodological thinking in video understanding, providing a factual, compositional, and relationally grounded framework for action recognition research (Ji et al., 2019, Tan et al., 2022, Rai et al., 2021).