HOMAGE: Home Action Genome Dataset

Updated 7 June 2026

HOMAGE is a large-scale, multi-view, multi-modal dataset that captures detailed home activities through synchronized videos and sensor data.
It provides hierarchical annotations including high-level activities, fine-grained atomic actions, and dense scene graphs for precise modeling.
The dataset supports tasks such as activity recognition and few-shot action learning, with baselines provided by the CCAU framework.

The Home Action Genome (HOMAGE) is a large-scale, multi-view, multi-modal video dataset designed for fine-grained modeling of activities in home environments through hierarchical, compositional, and spatio-temporal representations. HOMAGE provides synchronized videos from multiple viewpoints and modalities, annotated with high-level activity labels, fine-grained temporally localized atomic actions, and dense dynamic scene graphs encoding interacting objects and their relationships. This resource supports research in hierarchical action understanding, compositional representation learning, few-shot action recognition, and spatio-temporal scene graph prediction, and serves as the testbed for the Cooperative Compositional Action Understanding (CCAU) framework (Rai et al., 2021).

1. Dataset Composition, Structure, and Modalities

HOMAGE comprises 1,752 “action sequences” recorded by 27 human actors performing daily activities in two fully furnished homes, covering kitchens, bedrooms, living rooms, bathrooms, and laundry rooms. Each sequence generates at least three temporally synchronized video streams: one egocentric (head-mounted) RGB view and two to four fixed third-person RGB cameras per sequence. RGB data are captured at 30 fps and 1280×720 resolution (downsampled to 128×128 for experiments).

Beyond video, a total of 12 sensor modalities are provided, all time-synchronized via I²C: 8×8 infrared thermal, audio (log-Mel spectrogram), ambient light, RGB light spectrum, passive infrared (PIR) human presence detection, environmental (CO₂, humidity, pressure, temperature), and a full 3-axis IMU suite (accelerometer, gyroscope, magnetometer), each sampled at a sensor-specific rate. This sensing suite supports research into multi-modal, multi-view, and joint-modality action representations (Rai et al., 2021).

2. Annotation Protocols and Hierarchical Labeling

Each action sequence in HOMAGE is densely annotated along a hierarchical, compositional axis:

High-level activities: One sequence-level label per clip, selected from 75 classes (e.g., “do laundry”, “make bed”).
Atomic actions: 453 distinct atomic-action classes are temporally localized in each sequence, each with explicit start and end frames. Atomic actions can overlap in time, allowing the same frame to carry multiple labels. The dataset contains 20,039 atomic-action instances (train), 2,062 (test1), and 2,468 (test2).
Scene composition (scene graphs): For each sequence, one third-person view is densely annotated with dynamic scene graphs. Each atomic-action segment is annotated on 3 uniformly sampled frames if it is shorter than 3 s, and on 5 frames otherwise. Each annotated frame includes bounding boxes for the actor and every object they interact with (86 object categories, excluding “person”) and one of 29 relationship predicates (e.g., “holding”, “in front of”) per actor-object pair. The dataset contains approximately 497,534 object bounding boxes and 583,481 relationship annotations.
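
The frame-selection rule and the overlapping atomic-action labels described above can be illustrated with a small sketch. This is not the authors' annotation tooling: the `AtomicSegment` type, the function names, the example labels, and the choice to include both segment endpoints when spacing frames uniformly are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AtomicSegment:
    label: str   # atomic-action class name (example labels below are hypothetical)
    start: int   # first frame of the segment, inclusive
    end: int     # last frame of the segment, inclusive


def frames_to_annotate(seg, fps=30.0):
    """Pick 3 uniformly spaced frames if the segment is shorter than 3 s, else 5.

    Assumption: "uniformly sampled" is read as evenly spaced, endpoints included.
    """
    n_frames = seg.end - seg.start + 1
    k = 3 if n_frames / fps < 3.0 else 5
    k = min(k, n_frames)          # very short segments cannot yield k distinct frames
    if k == 1:
        return [seg.start]
    step = (n_frames - 1) / (k - 1)
    return [seg.start + round(i * step) for i in range(k)]


def labels_at(frame, segments):
    """Return every atomic-action label active at `frame` (segments may overlap)."""
    return sorted(s.label for s in segments if s.start <= frame <= s.end)


segments = [
    AtomicSegment("open fridge", 0, 60),   # about 2 s at 30 fps, so 3 frames
    AtomicSegment("hold cup", 30, 150),    # about 4 s at 30 fps, so 5 frames
]
short_frames = frames_to_annotate(segments[0])
long_frames = frames_to_annotate(segments[1])
both_active = labels_at(45, segments)
```

Frame 45 falls inside both segments, so it carries two labels, which is why the atomic-action task is treated as multi-label.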

Annotations are subject to detailed protocols: each frame is annotated with both bounding boxes and relationship labels, utilizing context from the 5s video clip to disambiguate repeated or similar objects (“which cup is being drunk from”). The protocols are modeled on prior large-scale scene graph datasets but carried out specifically for the multi-view, home-action context (Rai et al., 2021).

3. Dataset Splits, Access, and Preprocessing

HOMAGE is split into 1,388 training sequences and two test sets of 198 and 166 sequences, with all modalities and locations represented in every split. RGB data are processed at 128×128. Video frames are sampled at one third of the 30 fps capture rate (10 fps) and grouped into eight blocks of five consecutive frames each, so each clip of 40 sampled frames spans ≈4 seconds; this targets 5,700 multi-view clips in total. Preprocessing includes concurrency control, frame alignment, and modality synchronization.
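
The clip arithmetic above (every third frame of a 30 fps stream, eight blocks of five frames) can be checked with a short sketch. The function names are hypothetical, and the assumption that consecutive blocks do not overlap is ours rather than something the source states.

```python
def block_frame_indices(start_frame, stride=3, num_blocks=8, block_len=5):
    """Return `num_blocks` lists of source-frame indices, taking every `stride`-th frame.

    Assumption: blocks are contiguous and non-overlapping.
    """
    blocks = []
    for b in range(num_blocks):
        first = start_frame + b * block_len * stride
        blocks.append([first + i * stride for i in range(block_len)])
    return blocks


def window_seconds(fps=30.0, stride=3, num_blocks=8, block_len=5):
    """Duration of source video covered by one clip, in seconds."""
    return num_blocks * block_len * stride / fps


blocks = block_frame_indices(start_frame=0)
duration = window_seconds()
```

With the defaults, 8 × 5 = 40 sampled frames cover 120 source frames, which is 4 seconds at 30 fps.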

Scene graphs are constructed per sampled frame, with a flattened binary matrix representing objects, relationships, and their co-occurrence, forming the input feature for models leveraging graph context. The licensing model is CC BY 4.0, with dataset access at https://action-genome.stanford.edu (Rai et al., 2021).
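
One way to picture the flattened binary scene-graph feature is as an object-by-relationship grid that is unrolled into a vector. The sketch below assumes that layout; the source does not specify the exact encoding, and the function name and the toy vocabularies (apart from "holding" and "in front of") are hypothetical.

```python
def encode_scene_graph(pairs, objects, relationships):
    """Encode (object, relationship) pairs for one frame as a flat 0/1 vector.

    Assumed layout: row-major over an objects x relationships grid, so the entry
    for (object o, relationship r) sits at index o * len(relationships) + r.
    """
    obj_index = {name: i for i, name in enumerate(objects)}
    rel_index = {name: i for i, name in enumerate(relationships)}
    vec = [0] * (len(objects) * len(relationships))
    for obj, rel in pairs:
        vec[obj_index[obj] * len(relationships) + rel_index[rel]] = 1
    return vec


# Toy vocabularies; the full dataset has 86 object categories and 29 predicates,
# which under this layout would give a vector of length 86 * 29 = 2494.
objects = ["cup", "table", "towel"]
relationships = ["holding", "in front of"]
frame_pairs = [("cup", "holding"), ("table", "in front of")]
feature = encode_scene_graph(frame_pairs, objects, relationships)
```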

4. Model Architectures: CCAU Framework

The Cooperative Compositional Action Understanding (CCAU) framework is a joint, cooperative, multi-modal, multi-view encoder responsible for learning compositional representations of activities and atomic actions:

Modality-specific encoder backbones:
- RGB (egocentric and third-person views): 3D-ResNet-18, with only the last two residual stages employing 3D convolutions. Input is sampled as 8 blocks of 5 frames, with each block encoded as a feature map $z_j \in \mathbb{R}^{4 \times 4 \times 256}$.
- Audio: log-Mel spectrogram through a VGG-19–style ConvNet, yielding $c_\text{audio} \in \mathbb{R}^{256}$.
- (Optional) Scene graphs: flattened object-relationship matrices are passed through a small MLP.
Temporal aggregation: For each block, a ConvGRU (kernel size 1×1; shared spatially) computes temporally aggregated features $c_j = g(z_1, \dots, z_j)$, with dropout $p=0.1$.
Contrastive alignment: Cooperative contrastive learning between all modality pairs is achieved through a Noise-Contrastive Estimation loss:

$L_\text{align}^{(m, m')} = -\sum_i \log \frac{\exp(c_i^m \cdot c_i^{m'})}{\sum_j \exp(c_i^m \cdot c_j^{m'})}$

with $L_\text{align} = \sum_{m \neq m'} L_\text{align}^{(m, m')}$.
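
As a sanity check on the alignment loss, the formula can be evaluated directly on small embedding lists. The sketch below is a plain-Python transcription of the equation, not the authors' implementation; the function names and the toy two-dimensional embeddings are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def align_loss(c_m, c_mp):
    """L_align^(m,m') = -sum_i log( exp(c_i^m . c_i^m') / sum_j exp(c_i^m . c_j^m') )."""
    total = 0.0
    for i, c_i in enumerate(c_m):
        logits = [dot(c_i, c_j) for c_j in c_mp]
        mx = max(logits)  # subtract the max before exponentiating for numerical stability
        log_denominator = mx + math.log(sum(math.exp(l - mx) for l in logits))
        total -= logits[i] - log_denominator
    return total


def total_align_loss(embeddings):
    """Sum the pairwise loss over all ordered modality pairs m != m'."""
    names = list(embeddings)
    return sum(
        align_loss(embeddings[m], embeddings[mp])
        for m in names
        for mp in names
        if m != mp
    )


ego = [[1.0, 0.0], [0.0, 1.0]]
third_aligned = [[1.0, 0.0], [0.0, 1.0]]   # sample i matches sample i
third_shuffled = [[0.0, 1.0], [1.0, 0.0]]  # positives point at the wrong sample
loss_aligned = align_loss(ego, third_aligned)
loss_shuffled = align_loss(ego, third_shuffled)
```

The aligned pair gives the lower loss, which is the behavior the contrastive objective rewards: embeddings of the same sample in different modalities should score higher than mismatched pairs.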

Compositional classification heads:
- High-level activity classifier: one-hot cross-entropy loss $L_v$ on the pooled $c_N$.
- Atomic action classifier: multi-label, binary cross-entropy loss $L_a$ (over all 453 classes, per block).
- Combined loss: $L_\text{composition} = L_v + \lambda L_a$, where $\lambda$ weights the atomic-action term (uncertainty-weighted alternatives are also considered).
Spatial attention module (optional): Predicts attention scores over the $4 \times 4$ spatial locations of each feature map, then aggregates features by softmax-weighted pooling over the grid.
Full loss: combines $L_\text{composition}$ and $L_\text{align}$. Knowledge distillation ablations are also included.
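
The composition loss $L_v + \lambda L_a$ can be sketched in a few lines. This is an illustrative transcription under stated assumptions: the function names are hypothetical, `lam=1.0` is a placeholder rather than a value taken from the source, and averaging $L_a$ over classes and blocks is our choice, since the source does not say whether the terms are summed or averaged.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for a single-label target (the L_v term)."""
    mx = max(logits)
    log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_z - logits[target]


def binary_cross_entropy(logits, targets):
    """Mean sigmoid binary cross-entropy over classes, in a numerically stable form."""
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def composition_loss(activity_logits, activity_target,
                     atomic_logits_per_block, atomic_targets_per_block, lam=1.0):
    """L_composition = L_v + lam * L_a, with L_a averaged over blocks (assumption)."""
    l_v = cross_entropy(activity_logits, activity_target)
    per_block = [
        binary_cross_entropy(logits, targets)
        for logits, targets in zip(atomic_logits_per_block, atomic_targets_per_block)
    ]
    l_a = sum(per_block) / len(per_block)
    return l_v + lam * l_a


# Two activity classes, one block, two atomic classes, all logits at zero.
loss = composition_loss([0.0, 0.0], 0, [[0.0, 0.0]], [[1, 0]])
loss_activity_only = composition_loss([0.0, 0.0], 0, [[0.0, 0.0]], [[1, 0]], lam=0.0)
```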

Only a single modality is required at inference; all modalities are used for cooperative training. Dropout is applied before the task heads. Optimization follows standard practices (Adam), and self-supervised Dense Predictive Coding on multi-view data provides additional pretraining signal (Rai et al., 2021).
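
The softmax pooling used by the optional spatial attention module can be sketched as a weighted average over grid locations. The function name and the toy inputs are hypothetical, and how the attention scores themselves are predicted is outside this sketch.

```python
import math


def attention_pool(features, scores):
    """Softmax the per-location scores, then take the weighted sum of the features.

    `features` holds one vector per grid location (16 for a 4x4 grid);
    `scores` holds one scalar attention score per location.
    """
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights


# Two locations with equal scores reduce to plain average pooling.
pooled, weights = attention_pool([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0])
```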

5. Supported Tasks and Evaluation Metrics

HOMAGE enables and standardizes a variety of tasks:

Hierarchical Activity Recognition: Predict both the high-level activity label (single-label classification; accuracy metric) and sets of temporally-overlapping atomic action labels (multi-label classification; support-weighted mAP).
Few-shot Action Recognition: Novel classes (15 held out) are used for few-shot protocols. Linear classifiers are trained on CCAU embeddings of a small number of labeled examples per class (e.g., the 1-shot and 20-shot settings); mAP is reported for novel classes.
Multi-modal, Multi-view Representation Learning: Comparative performance of ego, third-person, and audio modalities, both independently and with cooperative training, is reported for both classification accuracy and atomic-action mAP.
Scene Graph-based Understanding: Dense compositional scene graphs allow models to utilize object-relationship features for both standard and “oracle” reasoning (where ground-truth graphs are given to the recognizer).
Self-supervised Representation Learning: Multiview Dense Predictive Coding–style pretraining is evaluated for mAP boost vs. fully supervised counterparts.
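
Support-weighted mAP, the atomic-action metric named above, can be sketched as per-class average precision weighted by the number of positive examples in each class. That reading of "support-weighted" is an assumption (it matches the common "weighted" averaging convention), and the function names are hypothetical.

```python
def average_precision(scores, labels):
    """Average precision for one class: mean of precision at each positive's rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def support_weighted_map(scores_by_class, labels_by_class):
    """Weight each class's AP by its support (number of positives); skip empty classes."""
    total_support, weighted = 0, 0.0
    for scores, labels in zip(scores_by_class, labels_by_class):
        support = sum(labels)
        if support == 0:
            continue
        weighted += support * average_precision(scores, labels)
        total_support += support
    return weighted / total_support if total_support else 0.0


# Class 0 has two positives ranked first and second; class 1 has one positive ranked second.
scores_by_class = [[0.9, 0.1, 0.8], [0.2, 0.9, 0.1]]
labels_by_class = [[1, 0, 1], [1, 0, 0]]
metric = support_weighted_map(scores_by_class, labels_by_class)
```

Weighting by support means frequent atomic actions dominate the score, which matters for a label set of 453 classes with uneven frequencies.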

The following table summarizes selected evaluation metrics:

Task	Metric	Typical result (ego view unless stated)
High-level activity	Accuracy (%)	34.9% (CCAU)
Atomic action	Support-weighted mAP (%)	29.3% (CCAU)
Few-shot (atomic, 1-shot)	mAP (%)	28.6% (CCAU)
Few-shot (atomic, 20-shot)	mAP (%)	49.4% (CCAU)
Audio only (high-level)	Accuracy (%)	33.3% (CCAU, vs. 28.5% base)
Oracle scene-graph (top-1)	Accuracy (%)	≈76%
Oracle scene-graph (top-3)	Accuracy (%)	≈91.7%

The combination of compositional and cooperative objectives yields the strongest performance across modalities and both full-data and few-shot protocols. The scene-graph “oracle” results indicate the possible upper bound of accuracy if scene graphs were perfectly predicted (Rai et al., 2021).
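
The top-1 and top-3 accuracies quoted for the oracle setting follow the usual top-k definition, sketched below. The function name and the toy scores are hypothetical.

```python
def top_k_accuracy(score_rows, targets, k):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    correct = 0
    for scores, target in zip(score_rows, targets):
        top = sorted(range(len(scores)), key=lambda c: -scores[c])[:k]
        correct += target in top
    return correct / len(targets)


score_rows = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
targets = [1, 1, 0]
top1 = top_k_accuracy(score_rows, targets, k=1)
top2 = top_k_accuracy(score_rows, targets, k=2)
```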

6. Experimental Findings and Significance

Experimental results indicate that cooperative multi-modal learning improves activity and atomic-action recognition across all modalities, with ego-view accuracy increasing from 31.3% to 37.7% and atomic-action ego mAP from 20.5% to 28.5%. Audio and third-person modalities also benefit (+4.8 and +2.9 percentage points in high-level accuracy, respectively). Adding spatial attention further improves performance (e.g., ego accuracy from 32.5% to 34.8%).

Compositional modeling (jointly training high-level and atomic-action heads) yields modest gains over non-compositional models, while the combination of compositional and cooperative learning (CCAU) consistently yields the highest performance (e.g., ego accuracy: 34.9% vs 32.1% without cooperation).

In few-shot recognition of atomic actions (ego-view), CCAU provides strong gains over baselines, with 1-shot mAP improving from 22.4% (baseline) to 28.6%, and 20-shot from 40.6% to 49.4%. Self-supervised pretraining gives an additional 1.3–3.0% boost in downstream mAP when fine-tuned. An “oracle” CCAU supplied with gold scene graph features achieves ≈76% (top-1) and 91.7% (top-3) high-level accuracy, demonstrating the practical utility of rich scene-graph annotation (Rai et al., 2021).

7. Role in the Research Landscape

HOMAGE is the first dataset to jointly provide: (1) multi-view (ego + third-person), (2) multi-modal (12 sensor types), (3) hierarchical (sequence-level and compositional atomic actions), and (4) dense, temporally-localized spatio-temporal scene graph annotations for activities in the home environment. Its unique structure enables research into cooperative, compositional, and multi-modal representation learning, as well as benchmarking progress in scene graph prediction in video. The CCAU framework establishes baselines and design patterns for leveraging cooperative alignment and compositional modeling, both in standard and data-sparse settings (Rai et al., 2021).

A plausible implication is that future datasets and models for activity analysis in other domains (e.g., industrial, healthcare) will draw on HOMAGE’s integration of dense scene composition and cooperative multi-modal modeling, as well as its rigorous task definitions and evaluation protocols.


References

Rai et al. (2021). Home Action Genome: Cooperative Compositional Action Understanding.
