Ego4D-Haystack: Episodic Memory Video Benchmark
- The Ego4D-Haystack dataset is a richly annotated benchmark of egocentric videos designed to support episodic memory retrieval through visual, natural language, and action-based queries.
- It comprises approximately 1,000 annotated hours from a multi-modal 3,670-hour corpus, partitioned into disjoint train, validation, and test splits across diverse real-world scenarios.
- The dataset establishes baseline metrics for retrieval tasks and drives advances in first-person video understanding while adhering to strict privacy and de-identification protocols.
The Ego4D-Haystack dataset is a large-scale episodic memory retrieval benchmark built on a richly annotated egocentric video corpus. It is explicitly designed to address the problem of querying long, first-person video records for information about the camera wearer's past visual experience. Drawn from the broader Ego4D initiative, Haystack offers structured benchmarks for visual, language, and action-based queries. The dataset and its suite of tasks enable the academic community to systematically evaluate methods for retrospective retrieval within multi-hour lifelog video streams, a substantial advance in the resources available for first-person vision research (Grauman et al., 2021).
1. Motivation and Task Formulation
The central objective of Ego4D-Haystack is to support the development of systems capable of indexing and retrieving from a camera wearer's egocentric video "memory." The challenge is cast as a retrieval problem over continuous, multi-hour video, where queries take three canonical forms:
- Visual Query (VQ): "Where did I last see this object?" The input comprises a static image crop of the object and the video up to the query time; the desired output is a contiguous track of bounding boxes corresponding to the last appearance of the object.
- Natural Language Query (NLQ): "What did I put in the drawer?" The input is a free-form natural language question, and the output is a temporal segment where the answer appears.
- Moments Query (MQ): "When did I set the table?" The input is an action category from a fixed taxonomy; the output consists of all video segments where the action occurs.
This retrieval paradigm is intended to benchmark progress toward AI systems that support human-like episodic memory access in wearable devices (Grauman et al., 2021).
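The three query forms above can be summarized as input and output types. The following is a minimal sketch only: every class, field, and alias name is a hypothetical chosen for illustration and does not reflect the official Ego4D annotation schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# NOTE: all names below are illustrative assumptions, not the Ego4D schema.

@dataclass
class BoundingBox:
    frame: int   # frame index within the clip
    x: float     # top-left corner, pixels
    y: float
    w: float     # width and height, pixels
    h: float

@dataclass
class VisualQuery:
    query_frame: int          # the search covers the video before this frame
    object_crop: BoundingBox  # static crop showing the query object

@dataclass
class NaturalLanguageQuery:
    question: str             # free-form question

@dataclass
class MomentsQuery:
    action_category: str      # one label from the fixed taxonomy

# Expected output per query type.
VisualResponse = List[BoundingBox]           # one contiguous track (last appearance)
LanguageResponse = Tuple[float, float]       # a single (start_s, end_s) window
MomentsResponse = List[Tuple[float, float]]  # every segment containing the action

nlq = NaturalLanguageQuery(question="What did I put in the drawer?")
answer: LanguageResponse = (312.0, 318.5)    # made-up times, for illustration only
```

The asymmetry in the output types is the point of the sketch: VQ returns a spatio-temporal track, NLQ a single temporal window, and MQ a list of windows.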
2. Dataset Composition
Haystack is assembled from the narrated portions of the full Ego4D corpus, spanning approximately 1,000 annotated hours out of the larger 3,670-hour Ego4D video dataset. The composition by query type is as follows:
| Task | Annotated Hours | # Clips | Other Properties |
|---|---|---|---|
| Visual Query 2D | 432.9 h | 5,831 | 54 scenarios; +3D loc. (13 h/159 clips) |
| NLQ | 227.1 h | 1,659 | 34 scenarios |
| Moments Q | 328.7 h | 2,522 | 5 scenarios, 110 action categories |
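As a quick arithmetic check on the "approximately 1,000 annotated hours" figure, the per-task hours in the table can be summed. This sketch assumes the per-task hours are additive, that is, it ignores any footage shared between tasks.

```python
# Annotated hours per task, copied from the table above.
annotated_hours = {"Visual Query 2D": 432.9, "NLQ": 227.1, "Moments Q": 328.7}

total_hours = sum(annotated_hours.values())
share_of_corpus = total_hours / 3670  # full Ego4D corpus size in hours

print(f"{total_hours:.1f} h annotated, {share_of_corpus:.1%} of the corpus")
# 988.7 h annotated, 26.9% of the corpus
```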
The footage comes from 931 unique camera wearers, with annotated clips spanning 10 partner institutions, 74 cities, and 9 countries. Modalities include: RGB video (full 3,670 h), audio (2,535 h), 3D environmental meshes from 7 locations (491 h), stereo (80 h), eye-gaze (45 h), IMU (836 h), synchronized multi-cam (224 h), face-unblurred (612 h), and precomputed SlowFast features (3,670 h). The sampled scenarios reflect diverse daily activities in household, outdoor, workplace, leisure, and occupational contexts (Grauman et al., 2021).
3. Annotation Protocol and Indexing Structure
Annotation protocols are tailored per retrieval task type:
- Visual Queries: For each 5–16 minute clip, annotators select three non-trivially "interesting" objects, designate a query frame, provide an object crop, and annotate the last occurrence with a temporal response track, each element of which is a bounding box. The 3D localization extension adds Matterport-derived 3D bounding boxes.
- Natural Language Queries: On 8–20 minute clips, annotators select and paraphrase one of 13 query templates (e.g., "Where did I put X before event Y?"), then mark a single contiguous response window denoting when the answer occurs.
- Moments Queries: On 8-minute clips, annotators exhaustively label all instances of 110 action categories mined from narrations. Each instance is specified by its temporal extent and its action category.
This design grounds Haystack in a structured, high-coverage search space for episodic retrieval (Grauman et al., 2021).
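The constraints on a visual-query response track (contiguous in time, located before the query frame) lend themselves to a simple sanity check. The sketch below assumes consecutive integer frame indices and a hypothetical `(frame, x, y, w, h)` tuple layout; the released annotation files may use a different sampling rate and format.

```python
from typing import List, Tuple

# (frame_index, x, y, w, h); this tuple layout is an assumption for illustration.
Box = Tuple[int, float, float, float, float]

def is_valid_response_track(track: List[Box], query_frame: int) -> bool:
    """Check that a track is non-empty, temporally contiguous, and precedes the query frame."""
    if not track:
        return False
    frames = [box[0] for box in track]
    # Contiguous means each frame index follows the previous one directly.
    contiguous = all(b - a == 1 for a, b in zip(frames, frames[1:]))
    return contiguous and frames[-1] < query_frame

track = [
    (100, 10.0, 20.0, 50.0, 40.0),
    (101, 12.0, 20.0, 50.0, 40.0),
    (102, 14.0, 21.0, 50.0, 40.0),
]
print(is_valid_response_track(track, query_frame=450))  # True
```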
4. Dataset Partitioning
All tasks utilize disjoint train, validation, and test splits at the video level. No clip appears in more than one partition. The partitioning is as follows:
Visual Query (VQ-2D):
| Split | Hours | Clips | Queries |
|---|---|---|---|
| Train | 262 | 3,600 | 13,600 |
| Val | 87 | 1,200 | 4,500 |
| Test | 84 | 1,100 | 4,400 |
3D localization subset: train (19 h/164 clips), val (5 h/44 clips), test (9 h/69 clips).
Natural Language Query:
| Split | Hours | Clips | Queries |
|---|---|---|---|
| Train | 136 | 1,000 | 11,300 |
| Val | 45 | 300 | 3,900 |
| Test | 46 | 300 | 4,000 |
Moments Query:
| Split | Hours | Clips | Instances |
|---|---|---|---|
| Train | 195 | 1,486 | 13,600 |
| Val | 68 | 521 | 4,300 |
| Test | 63 | 481 | 4,300 |
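Video-level disjointness means that all clips cut from the same source video must land in the same split. A minimal check could look like the following; the identifiers and the two mapping dictionaries are hypothetical and stand in for whatever metadata a loader provides.

```python
from typing import Dict, Set

def check_disjoint_splits(clip_to_video: Dict[str, str],
                          clip_to_split: Dict[str, str]) -> bool:
    """Return True if no source video contributes clips to more than one split."""
    video_splits: Dict[str, Set[str]] = {}
    for clip_id, split in clip_to_split.items():
        video_splits.setdefault(clip_to_video[clip_id], set()).add(split)
    return all(len(splits) == 1 for splits in video_splits.values())

# Hypothetical identifiers, for illustration only.
clip_to_video = {"clip_a": "video_1", "clip_b": "video_1", "clip_c": "video_2"}
clip_to_split = {"clip_a": "train", "clip_b": "train", "clip_c": "val"}
print(check_disjoint_splits(clip_to_video, clip_to_split))  # True
```

Checking at the video level rather than the clip level matters because two clips from the same recording share the scene, the objects, and the camera wearer, so placing them in different splits would leak information into evaluation.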
5. Baseline Benchmarks and Metric Results
Baseline experiments reported for the Haystack tasks employ both detection-based trackers and span/localizer models. Key results include:
- Visual Query (2D Localization):
- Siam-RCNN+PF: Succ@tIoU>0.05: 32.4%; tAP@0.25: 0.14; stAP@0.25: 0.06; Rec: 13.2%
- Siam-RCNN+KYS (simple head): Succ: 33.0%; tAP@0.25: 0.15; stAP@0.25: 0.08; Rec: 27.2%
- Siam-RCNN+KYS (residual head): Succ: 39.8%; tAP@0.25: 0.20; stAP@0.25: 0.12; Rec: 32.2% (Test: Succ: 41.6%; tAP@0.25: 0.21; Rec: 34.0%)
- Visual Query (3D Localization):
- Siam-RCNN+KYS+DPT (depth): RMSE ≈6.0 m, angular error ≈1.60 rad, Success@6×inter-annotator: 30–36%
- Natural Language Query:
- 2D-TAN: Recall@1, tIoU=0.3 ≈5.0%; Recall@5: 12.9%. Recall@1, tIoU=0.5: 2.0%; Recall@5: 5.9%
- VSLNet: Recall@1, tIoU=0.3: 5.5%; Recall@5: 10.7%. Recall@1, tIoU=0.5: 3.1%; Recall@5: 6.6%
- Moments Query:
- mAP@tIoU = {.1, .2, ..., .5} (ActivityNet style): Val {9.1, 7.2, 5.8, 4.6, 3.4} (avg 6.0%); Test {8.6, 6.5, 5.4, 4.3, 3.6} (avg 5.7%)
- Recall@1×, tIoU=0.3: Val 33.5%, Test 33.6%; Recall@1×, tIoU=0.5: Val 25.2%, Test 24.3%
These benchmarks indicate the current performance gap relative to human-level episodic recall and provide a baseline for future algorithmic advances (Grauman et al., 2021).
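The temporal metrics quoted above rest on temporal intersection-over-union (tIoU) between a predicted and a ground-truth segment. The sketch below shows tIoU, a simple Recall@k at a tIoU threshold, and the averaging of per-threshold mAP values. It is an illustration under stated assumptions, not the official evaluation code: the function names are hypothetical, each query is assumed to have one ground-truth segment, and a hit is counted with `>=`.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s)

def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_predictions: List[List[Segment]],
                ground_truths: List[Segment],
                k: int, threshold: float) -> float:
    """Fraction of queries with at least one top-k prediction reaching the tIoU threshold."""
    hits = sum(
        any(temporal_iou(p, gt) >= threshold for p in preds[:k])
        for preds, gt in zip(ranked_predictions, ground_truths)
    )
    return hits / len(ground_truths)

# Two toy queries with ranked predictions (made-up numbers).
ground_truths = [(10.0, 20.0), (30.0, 40.0)]
ranked_predictions = [
    [(12.0, 22.0), (0.0, 5.0)],   # top-1 overlaps the ground truth
    [(0.0, 5.0), (31.0, 41.0)],   # only the second prediction overlaps
]
print(recall_at_k(ranked_predictions, ground_truths, k=1, threshold=0.3))  # 0.5
print(recall_at_k(ranked_predictions, ground_truths, k=2, threshold=0.3))  # 1.0

# Average mAP over tIoU thresholds 0.1 to 0.5, using the Moments Query values above.
val_map = [9.1, 7.2, 5.8, 4.6, 3.4]
test_map = [8.6, 6.5, 5.4, 4.3, 3.6]
print(round(sum(val_map) / len(val_map), 1))    # 6.0
print(round(sum(test_map) / len(test_map), 1))  # 5.7
```

The two averages reproduce the 6.0% and 5.7% figures reported for the Moments Query validation and test splits.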
6. Privacy, Consent, and De-identification Protocols
Data capture for Ego4D-Haystack strictly adheres to institutional IRB protocols or equivalents, including informed consent from all camera wearers. Participants retain the option to redact or withdraw footage. The annotation pipeline systematically omits private spaces and sensitive activities.
There are two data access tiers:
- Faces-unblurred clips: Provided only for participants who explicitly consented to share unblurred visuals, used in audio-visual and social tasks.
- Public/interim clips: All incidental persons or sensitive PII are processed for anonymization through a semi-automatic pipeline combining automatic face and license plate detection (brighter.ai, Primloc Secure Redact) with manual review and pixel-level blurring.
All egocentric streams are de-identified to GDPR-like standards and licensed for non-commercial, research-exclusive use.
7. Significance and Research Trajectory
Ego4D-Haystack establishes the first rigorously annotated, large-scale benchmark suite for episodic memory retrieval in egocentric video, supporting research in multimodal search, long-term activity localization, and first-person audiovisual understanding. By encompassing diverse global contexts and strictly governed privacy protocols, Haystack enables direct progress toward robust, ethical wearable AI systems capable of sophisticated memory indexing and retrieval. A plausible implication is that future advances in core video understanding tasks, in particular those at the intersection of retrieval, natural language, and action segmentation, will be measurable and comparable thanks to the structured challenge format and public data availability (Grauman et al., 2021).