EPIC-KITCHENS: Egocentric Video Dataset
- EPIC-KITCHENS is a large-scale, richly annotated egocentric video dataset capturing unscripted kitchen activities with over 55 hours of footage and dense action labels.
- It employs unique annotation protocols including live narration and crowdsourcing to ensure precise temporal and spatial alignment of actions and objects.
- The dataset underpins benchmarks for action recognition, object detection, and anticipation, driving research in multimodal fusion and real-world model deployment.
EPIC-KITCHENS is a large-scale, richly annotated egocentric video dataset focused on everyday object interactions and fine-grained human actions in kitchen environments. First released in 2018 and subsequently extended, it supports research in first-person computer vision with an emphasis on unscripted activities, multimodal perception, and task generalization. Its design, annotation protocols, and evolving challenge structure have made it a primary benchmark both for fine-grained action understanding and for robust, real-world model deployment across egocentric vision tasks.
1. Dataset Construction and Annotation Protocol
EPIC-KITCHENS was collected over a six-month period (May–November 2017) by 32 volunteers of ten different nationalities across four cities—Bristol (UK), Toronto (Canada), Catania (Italy), and Seattle (USA). Participants wore head-mounted GoPro cameras (linear FOV, full HD, ≈60 fps), activating recording upon kitchen entry and stopping upon departure, with no imposed scripts or recipes. The dataset captures naturalistic daily activities, including meal preparation, cleaning, and multi-tasking, resulting in 55 hours of continuous video (11.5 million frames), 39,596 temporally annotated action segments, and 454,158 “active” object bounding boxes (Damen et al., 2020).
Annotation employed a two-stage pipeline:
- Narration for Intention Capture: Immediately post-recording, participants provided spoken "live commentary" of actions (e.g., “take cup,” “pour milk”), in one of five languages, emphasizing present-tense, verb–object phrases to reflect intention.
- Crowdsourcing: Transcription and translation tasks on Amazon Mechanical Turk produced time-stamped captions. Separate annotation phases provided (i) precise start/end times for each narrated segment and (ii) bounding boxes for objects in up to two seconds before/after each action. Inter-annotator agreement was enforced via intersection-over-union (IoU) thresholds, with final segment boundaries consolidated for IoU > 0.5.
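The agreement check on segment boundaries can be illustrated with a temporal IoU computation. This is a minimal sketch, not the authors' pipeline: the `consolidate` rule (take the largest group of mutually agreeing annotations and average their boundaries) is an assumption made for illustration, and only the IoU > 0.5 threshold comes from the description above.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def consolidate(segments, threshold=0.5):
    """Hypothetical consolidation rule (an assumption, not the published one).

    For each candidate segment, collect every annotation whose IoU with it
    exceeds the threshold, keep the largest such group, and average its
    boundaries. Returns None when no two annotators agree.
    """
    best_group = []
    for ref in segments:
        group = [s for s in segments if temporal_iou(ref, s) > threshold]
        if len(group) > len(best_group):
            best_group = group
    if len(best_group) < 2:
        return None
    start = sum(s[0] for s in best_group) / len(best_group)
    end = sum(s[1] for s in best_group) / len(best_group)
    return (start, end)


# Three annotators roughly agree; the fourth marked a different moment.
annotations = [(10.0, 12.0), (10.2, 12.2), (10.1, 11.9), (15.0, 16.0)]
print(consolidate(annotations))
```

Averaging is only one possible choice; a median of the boundaries would be more robust to a single outlying annotator.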
Classes are defined by manual clustering and disambiguation, resulting in 125 verb classes and 331 noun classes, forming the basis for verb–noun action composites.
Subsequent major releases, most notably EPIC-KITCHENS-100 (Damen et al., 2020), expanded the dataset to 100 hours, 20 million frames, ~90,000 actions, and 700 variable-length videos across 45 kitchens, employing a "pause-and-talk" narration protocol to densify and correct narrations (+54% actions per minute and +128% action segments relative to the prior version). The unseen-kitchen test split shares no environments with the training data.
2. Dataset Structure and Multimodal Extensions
The dataset is organized as temporally continuous, untrimmed video sequences with dense action and object labels. Key modalities include:
- Visual: Full-resolution RGB, TV-L1 optical flow, and segmented frames.
- Audio: Audio streams aligned with video, facilitating acoustic action cues; supplemental labeling provided in the EPIC-SOUNDS extension (Huh et al., 2023).
- Pixel-level Masks: EPIC-KITCHENS VISOR augments the dataset with ~272K manually drawn semantic masks, 9.9M interpolated dense masks, and 67K hand–object relations across 257 classes (including left–right hands, gloves, feet), supporting rigorous video object segmentation and relation reasoning (Darkhalil et al., 2022).
Annotation files are provided in standardized formats (JSON, .csv, per-frame boxes, per-action segments), with all splits and protocols strictly enforced to support reproducibility and fair benchmarking.
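As an illustration of working with per-action segment files, the sketch below parses a small CSV into typed records. The column names, the `HH:MM:SS.ff` timestamp format, and the sample rows (including the class IDs) are assumptions made for this example; check the released annotation files for the actual schema.

```python
import csv
import io
from dataclasses import dataclass

# Made-up sample rows; column names are assumed, not taken from this article.
SAMPLE_CSV = """narration_id,participant_id,video_id,start_timestamp,stop_timestamp,narration,verb_class,noun_class
P01_01_0,P01,P01_01,00:00:01.50,00:00:03.25,take cup,0,13
P01_01_1,P01,P01_01,00:00:04.00,00:00:07.10,pour milk,9,27
"""


@dataclass
class ActionSegment:
    video_id: str
    start: float  # seconds from the start of the video
    stop: float
    narration: str
    verb_class: int
    noun_class: int


def to_seconds(timestamp):
    """Convert an 'HH:MM:SS.ff' string to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def load_segments(handle):
    """Read action segments from an open CSV file handle."""
    return [
        ActionSegment(
            video_id=row["video_id"],
            start=to_seconds(row["start_timestamp"]),
            stop=to_seconds(row["stop_timestamp"]),
            narration=row["narration"],
            verb_class=int(row["verb_class"]),
            noun_class=int(row["noun_class"]),
        )
        for row in csv.DictReader(handle)
    ]


segments = load_segments(io.StringIO(SAMPLE_CSV))
for seg in segments:
    print(seg.video_id, seg.narration, round(seg.stop - seg.start, 2))
```

To read a real file, pass `open(path, newline="")` in place of the `io.StringIO` wrapper.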
3. Core Benchmarks and Evaluation Methodologies
Three canonical supervised benchmarks, with consistent evaluation protocols, underpin EPIC-KITCHENS research:
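- Object Detection: Active object localization across all noun classes, with mean Average Precision (mAP) computed at IoU thresholds {0.05, 0.50, 0.75}. The metric is defined as $\mathrm{mAP} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{AP}_c$, where $\mathrm{AP}_c$ is the area under the precision–recall curve for class $c$ and $C$ is the number of evaluated classes. Separate reporting for many-shot (≥100 boxes) and few-shot (10–99 boxes) classes highlights data imbalance effects (Damen et al., 2020).
- Action Recognition: Classification of trimmed segments to (verb, noun) pairs. Aggregate top-k accuracy and per-class mean precision and recall are the standard metrics. Top-k accuracy is $\frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\,y_i \in \hat{Y}_i^{k}\,]$, where $y_i$ is the ground-truth label of segment $i$, $\hat{Y}_i^{k}$ is the set of the $k$ highest-scoring predicted classes, and $N$ is the number of test segments.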
- Action Anticipation: Prediction of the upcoming (verb, noun) given a preceding video window ending τ_a=1s before onset. Performance is measured as in action recognition.
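The recognition metrics named above can be sketched in a few lines. This is an illustrative implementation, not the official evaluation script: it assumes each prediction is a list of per-class scores, and it averages recall over every class present in the labels, whereas a benchmark may restrict this to a subset such as many-shot classes.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


def mean_per_class_recall(scores, labels):
    """Top-1 recall computed per class, then averaged over classes.

    Assumption: all classes that occur in `labels` are included.
    """
    per_class = {}  # label -> (correct, total)
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=lambda c: row[c])
        correct, total = per_class.get(label, (0, 0))
        per_class[label] = (correct + (predicted == label), total + 1)
    return sum(c / t for c, t in per_class.values()) / len(per_class)


# Toy scores for four segments over three classes.
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.3, 0.5],
]
labels = [0, 1, 0, 2]
print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
print(mean_per_class_recall(scores, labels))
```

The same functions apply separately to verb scores, noun scores, and joint action scores.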
EPIC-KITCHENS-100 introduced additional tasks: weakly-supervised recognition (using single timestamps), temporal action detection (mAP at multiple IoU), cross-modal retrieval (text–video ranking; NDCG), and unsupervised domain adaptation between datasets collected in different years (Damen et al., 2020).
Test splits are defined by kitchen. The Seen split (S1) draws its test sequences from environments that also appear in training, while the Unseen split (S2) holds out every sequence from a set of new kitchens. S2 is considerably harder because of domain shift and "zero-shot" verb–noun combinations that never occur in training.
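A kitchen-based split of this kind can be sketched as follows. The video ID format (`P01_01`, with the prefix identifying the kitchen) and the rule of reserving the last video of each seen kitchen for S1 are assumptions for illustration; the official splits are fixed and distributed with the dataset.

```python
from collections import defaultdict


def kitchen_of(video_id):
    """Assumed ID format: '<kitchen>_<index>', e.g. 'P01_03' -> 'P01'."""
    return video_id.split("_")[0]


def make_splits(video_ids, unseen_kitchens):
    """Build train / S1 (seen) / S2 (unseen) lists of video IDs.

    Hypothetical rule: every video from an unseen kitchen goes to S2; for
    each remaining kitchen the last video (in sorted order) goes to S1 and
    the rest to train.
    """
    by_kitchen = defaultdict(list)
    for vid in sorted(video_ids):
        by_kitchen[kitchen_of(vid)].append(vid)

    splits = {"train": [], "S1": [], "S2": []}
    for kitchen, vids in by_kitchen.items():
        if kitchen in unseen_kitchens:
            splits["S2"].extend(vids)
        elif len(vids) > 1:
            splits["train"].extend(vids[:-1])
            splits["S1"].append(vids[-1])
        else:
            splits["train"].extend(vids)
    return splits


videos = ["P01_01", "P01_02", "P01_03", "P02_01", "P02_02", "P03_01", "P03_02"]
splits = make_splits(videos, unseen_kitchens={"P03"})
print(splits)
```

The property worth checking in any such split is that the set of S2 kitchens is disjoint from the set of training kitchens, while S1 kitchens are a subset of it.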
4. Baseline Models, Quantitative Results, and Model Insights
Strong baselines are established across tasks:
- Detection: On the original dataset, Faster R-CNN (ResNet-101) achieves an mAP of 67.6% on S1 and 62.9% on S2 at a looser IoU threshold, but few-shot classes perform below 10% AP. At the strict threshold (IoU > 0.75), overall mAP saturates near 20% (Damen et al., 2020).
- Recognition: Temporal modeling methods consistently outperform per-frame or purely segmental approaches. On S1 (EPIC-KITCHENS-100), TRN yields top-1 accuracy of 76% (verb), 56% (noun), and 46% (action); weakly-supervised models perform substantially worse (Damen et al., 2020). Temporal Shift Module (TSM), What-Where-When attention (W3), and video transformers like ViViT (Huang et al., 2021) further improve accuracy, most notably in noun prediction. For ViViT-B/16×2, top-1 action is 47.4%, noun 59.6%, verb 68.4% on the validation set.
- Anticipation: Anticipating from past context remains difficult, with top-1 action accuracy typically under 8% on S1, and halved on S2; performance may deteriorate as context length increases beyond 1s (Damen et al., 2020).
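The anticipation setup can be made concrete by computing which part of the video a model is allowed to observe. In this sketch the anticipation gap `tau_a` of 1 s comes from the task definition above, while the observation length `tau_o`, the function names, and the use of a fixed 60 fps frame rate are assumptions for illustration.

```python
def observed_window(action_start, tau_a=1.0, tau_o=1.0):
    """Return (start, end) in seconds of the clip shown to the model.

    The clip ends tau_a seconds before the action starts and spans at most
    tau_o seconds, clipped at the beginning of the video.
    """
    end = action_start - tau_a
    if end <= 0:
        raise ValueError("action starts too early to leave any observed video")
    start = max(0.0, end - tau_o)
    return start, end


def window_frames(action_start, fps=60.0, tau_a=1.0, tau_o=1.0):
    """Same window expressed as (first_frame, last_frame) indices."""
    start, end = observed_window(action_start, tau_a, tau_o)
    return int(round(start * fps)), int(round(end * fps))


# An action starting at t = 10 s: the model sees 8 s to 9 s.
print(observed_window(10.0))
print(window_frames(10.0))
# A longer context near the start of the video is clipped at 0 s.
print(observed_window(1.5, tau_o=2.0))
```

Varying `tau_o` while keeping `tau_a` fixed is one way to probe how context length affects anticipation accuracy.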
Model analysis reveals that transformers excel at object (noun) classification via global self-attention but lag on verbs, which depend on fine-grained motion. Ensembles of CNNs and transformers, or explicit context fusion (e.g., Long-term Feature Bank), improve performance (Huang et al., 2021). The compositional action label space exhibits severe class imbalance, with the majority of verb–noun pairs being long-tail or unseen in test environments (Price et al., 2019).
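The long-tail and unseen-composition effect can be quantified directly from the label lists. The sketch below is illustrative only: the `tail_threshold` cutoff and the toy (verb, noun) pairs are assumptions, not dataset statistics.

```python
from collections import Counter


def composition_report(train_pairs, test_pairs, tail_threshold=2):
    """Summarise how test (verb, noun) pairs relate to the training set.

    tail_threshold is a hypothetical cutoff: training pairs seen fewer
    times than this are reported as long-tail.
    """
    train_counts = Counter(train_pairs)
    unseen = sorted(set(test_pairs) - set(train_counts))
    tail = sorted(p for p, n in train_counts.items() if n < tail_threshold)
    zero_shot = sum(p not in train_counts for p in test_pairs)
    return {
        "unseen_pairs": unseen,
        "tail_pairs": tail,
        "zero_shot_rate": zero_shot / len(test_pairs),
    }


train = [("take", "cup")] * 3 + [("pour", "milk")] + [("open", "tap")] * 2
test = [("take", "cup"), ("pour", "cup"), ("open", "tap"), ("pour", "cup")]
print(composition_report(train, test))
```

Here "pour cup" never occurs in training even though both "pour" and "cup" do, which is the kind of zero-shot combination that a joint verb–noun classifier cannot predict without compositional reasoning.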
5. Multimodal and Pixel-level Reasoning
EPIC-KITCHENS explicitly encourages multimodal research:
- Audio–Visual Fusion: Wearable audio streams and narrated intention are crucial for action disambiguation (e.g., "opening tap" vs. "closing tap" may be visually similar but diverge in sound). Baseline results show that fusing RGB, flow, and audio achieves higher accuracy than any single modality (Damen et al., 2020).
- EPIC-SOUNDS: Provides 78K categorized audio event segments mapped to 44 classes, enabling dedicated audio-only and multimodal action understanding; fine-tuned transformers (SSAST) reach 53.7% top-1 accuracy on these tasks (Huh et al., 2023).
- VISOR: Enables dense segmentation with ~272K manual masks and 9.9M interpolated masks, covering 257 entity classes and 67K hand–object relations. Three VISOR benchmarks (video object segmentation, hand–object segmentation, and the "Where Did This Come From?" tracked-object provenance task) extend evaluation to pixel-level and temporal relation tasks (Darkhalil et al., 2022).
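One simple way to combine modalities is late fusion, where each modality's classifier produces class scores that are averaged after a softmax. The text above does not specify how the baselines fuse RGB, flow, and audio, so this is a sketch of a common technique rather than the published method; the logits and the two-way "open tap" / "close tap" framing are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(modality_logits, weights=None):
    """Weighted average of per-modality class probabilities.

    modality_logits maps a modality name to its list of class logits.
    weights (optional) maps the same names to non-negative floats;
    equal weighting is used when omitted.
    """
    names = list(modality_logits)
    weights = weights or {name: 1.0 for name in names}
    norm = sum(weights[name] for name in names)
    num_classes = len(modality_logits[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(modality_logits[name])
        for c, p in enumerate(probs):
            fused[c] += weights[name] * p / norm
    return fused


# Classes: 0 = "open tap", 1 = "close tap", 2 = other (toy values).
logits = {
    "rgb": [2.0, 1.9, 0.0],    # visually ambiguous
    "flow": [1.0, 1.1, 0.0],   # motion is also ambiguous
    "audio": [0.0, 3.0, 0.0],  # running water stopping is distinctive
}
fused = late_fusion(logits)
print(max(range(len(fused)), key=lambda c: fused[c]))
```

In this toy case RGB alone would pick class 0, while the fused scores pick class 1 because the audio stream is confident where the visual streams are not.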
6. Applications, Limitations, and Research Opportunities
EPIC-KITCHENS supports research in assistive robotics (anticipating required utensils), health and nutrition monitoring, augmented reality cooking assistance, and safety-aware smart-home systems (Damen et al., 2020).
Limitations include:
- Domain Bias: All activity is confined to home kitchens; cross-domain generalization is untested.
- Annotation Noise: Some narrations are incomplete or belated, and only "active" objects (in manipulation) are annotated.
- Long-tail Distribution: Many verb–noun classes are extremely rare; current models underperform in few-shot and zero-shot settings.
- Environmental Shift: Generalization to unseen kitchens remains unsolved, as evidenced by steep accuracy drops from S1 to S2.
Future work is motivated by these limitations: domain adaptation, robust long-tail learning, few-shot and zero-shot action recognition, incorporation of richer modalities (audio, pixel masks), explicit modeling of intention, and more sophisticated context and relational reasoning (Darkhalil et al., 2022, Damen et al., 2020, Price et al., 2019).
7. Data Access, Community Standards, and Reproducibility
The dataset is distributed under CC BY-NC 4.0, with raw videos, frames, and annotations available via the University of Bristol data repository and epic-kitchens.github.io. All research is expected to follow prescribed train/test splits, explicit reporting standards (including metric breakdowns and class-specific results), and citation protocols. Pre-training on external datasets (e.g., ImageNet, COCO, Kinetics) is permitted but must be clearly stated, and real-time inference considerations are encouraged for deployable systems (Damen et al., 2018, Damen et al., 2020).
Dataset structure and leaderboards are tightly versioned, with ongoing coordinated "challenges" and public baselines fostering transparent, reproducible progress across all facets of egocentric video and multimodal action understanding.