Ego4D-Haystack: Episodic Memory Video Benchmark

Updated 3 April 2026

The Ego4D-Haystack dataset is a richly annotated benchmark of egocentric videos designed to support episodic memory retrieval through visual, natural language, and action-based queries.
It comprises approximately 1,000 annotated hours from a multi-modal 3,670-hour corpus, partitioned into disjoint train, validation, and test splits across diverse real-world scenarios.
The dataset establishes baseline metrics for retrieval tasks and drives advances in first-person video understanding while adhering to strict privacy and de-identification protocols.

The Ego4D-Haystack dataset is a large-scale episodic memory retrieval benchmark built on a richly annotated egocentric video corpus. It targets the problem of querying long first-person video records for information about the camera wearer's past visual experience. Drawn from the broader Ego4D initiative, Haystack offers structured benchmarks for visual, language, and action-based queries. The dataset and its suite of tasks let researchers systematically evaluate methods for retrospective retrieval within multi-hour lifelog video streams, a substantial addition to the resources available for first-person vision research (Grauman et al., 2021).

1. Motivation and Task Formulation

The central objective of Ego4D-Haystack is to support the development of systems capable of indexing and retrieving from a camera wearer's egocentric video "memory." The challenge is cast as a retrieval problem over continuous, multi-hour video, where queries take three canonical forms:

Visual Query (VQ): "Where did I last see this object?" The input comprises a static image crop $v$ of the object and the video up to query time $q$; the desired output is a contiguous track $r = \{r_s, \ldots, r_e\}$ of bounding boxes corresponding to the last appearance of the object.
Natural Language Query (NLQ): "What did I put in the drawer?" The input is a free-form natural language question $Q$, and the output is a temporal segment $[t_s, t_e]$ where the answer appears.
Moments Query (MQ): "When did I set the table?" The input is an action category $c$ from a fixed taxonomy; the output consists of all video segments where the action occurs.
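
The three query forms and their outputs can be sketched as plain data containers. This is a minimal illustration, not the dataset's actual schema: every class and field name is hypothetical, and the use of frame indices for tracks and seconds for temporal segments is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple

# All class and field names below are hypothetical, chosen for illustration.
BBox = Tuple[float, float, float, float]  # (x, y, w, h)


@dataclass
class VisualQuery:
    crop_box: BBox      # location of the object crop v in its source frame
    query_frame: int    # query time q, as a frame index (assumed unit)


@dataclass
class ResponseTrack:
    start_frame: int    # frame index of r_s
    boxes: List[BBox]   # one box per frame, r_s through r_e

    @property
    def end_frame(self) -> int:
        # Frame index of r_e, which follows because the track is contiguous.
        return self.start_frame + len(self.boxes) - 1


@dataclass
class NaturalLanguageQuery:
    question: str       # free-form question Q


@dataclass
class TemporalSegment:
    t_start: float      # t_s, in seconds (assumed unit)
    t_end: float        # t_e, in seconds (assumed unit)

    def __post_init__(self) -> None:
        if self.t_end < self.t_start:
            raise ValueError("t_end must not precede t_start")


@dataclass
class MomentsQuery:
    category: str       # action category c from the fixed taxonomy


# Output types per task: VQ -> ResponseTrack, NLQ -> TemporalSegment,
# MQ -> List[TemporalSegment].
track = ResponseTrack(start_frame=120, boxes=[(10.0, 20.0, 50.0, 40.0)] * 3)
print(track.end_frame)  # 122
```

Keeping the VQ output as a contiguous list of boxes, rather than a start and end frame plus a single box, mirrors the definition of $r$ as a per-frame track.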

This retrieval paradigm is intended to benchmark progress toward AI systems that support human-like episodic memory access in wearable devices (Grauman et al., 2021).

2. Dataset Composition

Haystack is assembled from the narrated portions of the full Ego4D corpus, spanning approximately 1,000 annotated hours out of the larger 3,670-hour Ego4D video dataset. The composition by query type is as follows:

Task	Annotated Hours	# Clips	Other Properties
Visual Query (2D)	432.9 h	5,831	54 scenarios; plus 3D localization subset (13 h/159 clips)
Natural Language Query	227.1 h	1,659	34 scenarios
Moments Query	328.7 h	2,522	5 scenarios; 110 action categories
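
As a quick check on the "approximately 1,000 hours" figure, the per-task hours in the table can be summed and compared with the 3,670-hour corpus. The sketch below assumes the three tasks annotate non-overlapping footage, which the source does not state; if clips are shared across tasks, the true total is lower.

```python
# Annotated hours per task, copied from the table above.
annotated_hours = {"VQ2D": 432.9, "NLQ": 227.1, "MQ": 328.7}
corpus_hours = 3670

# Assumes no footage is counted under more than one task.
total = sum(annotated_hours.values())
share = total / corpus_hours

print(f"{total:.1f} h annotated, {share:.1%} of the corpus")
```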

The footage comes from 931 unique camera wearers, with annotated clips spanning 10 partner institutions, 74 cities, and 9 countries. Modalities include RGB video (full 3,670 h), audio (2,535 h), 3D environmental meshes from 7 locations (491 h), stereo (80 h), eye gaze (45 h), IMU (836 h), synchronized multi-camera (224 h), face-unblurred video (612 h), and precomputed SlowFast features (3,670 h). The sampled scenarios reflect diverse daily activities in household, outdoor, workplace, leisure, and occupational contexts (Grauman et al., 2021).

3. Annotation Protocol and Indexing Structure

Annotation protocols are tailored per retrieval task type:

Visual Queries: For each 5–16 minute clip, annotators select three non-trivially "interesting" objects, designate a query frame $q$, provide an object crop $v$, and annotate the last occurrence with a temporal response track $r = \{r_s, \ldots, r_e\}$, where each $r_i = (x, y, w, h)$ is a bounding box. The 3D localization extension adds Matterport-derived 3D bounding boxes.
Natural Language Queries: On 8–20 minute clips, annotators select and paraphrase one of 13 query templates (e.g., "Where did I put X before event Y?"), then mark a single contiguous response window $[t_s, t_e]$ denoting when the answer occurs.
Moments Queries: Over 8-minute clips, annotators exhaustively label all instances of 110 action categories mined from narrations. Each instance is specified as $(t_s, t_e, c)$: a start time, an end time, and an action category.
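
Because each box in a response track is stored as $(x, y, w, h)$, comparing a predicted box with an annotated one comes down to intersection over union (IoU) in that format. The following is a minimal sketch; it assumes $(x, y)$ is the top-left corner in pixel coordinates, which the source does not specify, and `box_iou` is an illustrative name rather than part of any official toolkit.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x, y, w, h); (x, y) assumed top-left


def box_iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes in (x, y, w, h) form."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes offset by one unit overlap in a 1x1 square: IoU = 1 / 7.
print(box_iou((0, 0, 2, 2), (1, 1, 2, 2)))
```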

This design grounds Haystack in a structured, high-coverage search space for episodic retrieval (Grauman et al., 2021).

4. Dataset Partitioning

All tasks utilize disjoint train, validation, and test splits at the video level. No clip appears in more than one partition. The partitioning is as follows:

Visual Query (VQ-2D):

Split	Hours	Clips	Queries
Train	262	3,600	13,600
Val	87	1,200	4,500
Test	84	1,100	4,400

3D localization subset: train (19 h/164 clips), val (5 h/44 clips), test (9 h/69 clips).

Natural Language Query:

Split	Hours	Clips	Queries
Train	136	1,000	11,300
Val	45	300	3,900
Test	46	300	4,000

Moments Query:

Split	Hours	Clips	Instances
Train	195	1,486	13,600
Val	68	521	4,300
Test	63	481	4,300

(Grauman et al., 2021)
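
The video-level disjointness property can be verified mechanically. The sketch below assumes a mapping from clip IDs to parent video IDs and a mapping from split names to clip IDs; both structures and the function name `find_split_leaks` are hypothetical and do not reflect the dataset's released file format.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def find_split_leaks(
    clip_to_video: Dict[str, str],
    split_to_clips: Dict[str, Set[str]],
) -> List[Tuple[str, str, Set[str]]]:
    """Return (split_a, split_b, shared_video_ids) for each overlapping pair."""
    # Lift each split from clip IDs to the set of parent video IDs.
    split_to_videos = {
        split: {clip_to_video[clip] for clip in clips}
        for split, clips in split_to_clips.items()
    }
    leaks = []
    for (a, vids_a), (b, vids_b) in combinations(split_to_videos.items(), 2):
        shared = vids_a & vids_b
        if shared:
            leaks.append((a, b, shared))
    return leaks


# Toy example: clips c1 and c2 come from the same video but sit in
# different splits, so the check reports a train/val leak on v1.
clip_to_video = {"c1": "v1", "c2": "v1", "c3": "v2", "c4": "v3"}
splits = {"train": {"c1", "c3"}, "val": {"c2"}, "test": {"c4"}}
print(find_split_leaks(clip_to_video, splits))
```

Checking at the video level rather than the clip level matters because two distinct clips cut from the same recording would pass a clip-level check while still sharing scenes and objects across splits.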

5. Baseline Benchmarks and Metric Results

Baseline experiments reported for the Haystack tasks employ both detection-based trackers and span/localizer models. Key results include:

Visual Query (2D Localization):
- Siam-RCNN+PF: Succ@tIoU>0.05: 32.4%; tAP@0.25: 0.14; stAP@0.25: 0.06; Rec: 13.2%
- Siam-RCNN+KYS (simple head): Succ: 33.0%; tAP@0.25: 0.15; stAP@0.25: 0.08; Rec: 27.2%
- Siam-RCNN+KYS (residual head): Succ: 39.8%; tAP@0.25: 0.20; stAP@0.25: 0.12; Rec: 32.2% (Test: Succ: 41.6%; tAP@0.25: 0.21; Rec: 34.0%)
Visual Query (3D Localization):
- Siam-RCNN+KYS+DPT (depth): RMSE ≈6.0 m, angular error ≈1.60 rad, Success@6×inter-annotator: 30–36%
Natural Language Query:
- 2D-TAN: Recall@1, tIoU=0.3 ≈5.0%; Recall@5: 12.9%. Recall@1, tIoU=0.5: 2.0%; Recall@5: 5.9%
- VSLNet: Recall@1, tIoU=0.3: 5.5%; Recall@5: 10.7%. Recall@1, tIoU=0.5: 3.1%; Recall@5: 6.6%
Moments Query:
- mAP at tIoU thresholds {0.1, 0.2, ..., 0.5} (ActivityNet style): Val {9.1, 7.2, 5.8, 4.6, 3.4} (avg 6.0%); Test {8.6, 6.5, 5.4, 4.3, 3.6} (avg 5.7%)
- Recall@1x, tIoU=0.3: Val 33.5%, Test 33.6%; Recall@1x, tIoU=0.5: Val 25.2%, Test 24.3%

These benchmarks indicate the current performance gap relative to human-level episodic recall and provide a baseline for future algorithmic advances (Grauman et al., 2021).
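
The temporal metrics above rest on temporal IoU (tIoU) between a predicted and an annotated segment, with Recall@k counting a query as answered when any of its top-k predictions clears the threshold. The sketch below illustrates that idea for the single-ground-truth case, as in NLQ. It is not the official evaluation code: the function names are illustrative, and treating the threshold as inclusive (`>=`) is an assumption.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (t_start, t_end)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    ranked_preds: Sequence[Sequence[Segment]],
    gts: Sequence[Segment],
    k: int,
    tiou_threshold: float,
) -> float:
    """Fraction of queries whose top-k predictions contain a hit.

    ranked_preds[i] holds the predictions for query i, best first;
    gts[i] is that query's single ground-truth segment.
    """
    if not gts:
        return 0.0
    hits = sum(
        any(temporal_iou(p, gt) >= tiou_threshold for p in preds[:k])
        for preds, gt in zip(ranked_preds, gts)
    )
    return hits / len(gts)


# Toy example with two queries.
preds = [[(0.0, 10.0), (4.0, 14.0)], [(100.0, 110.0)]]
gts = [(5.0, 15.0), (0.0, 5.0)]
print(recall_at_k(preds, gts, k=1, tiou_threshold=0.3))  # 0.5
print(recall_at_k(preds, gts, k=1, tiou_threshold=0.5))  # 0.0
print(recall_at_k(preds, gts, k=5, tiou_threshold=0.5))  # 0.5
```

In the toy example, the first query's top prediction has tIoU 1/3 with the ground truth, so it counts at threshold 0.3 but not at 0.5; its second prediction has tIoU 9/11 and rescues the query once k is large enough to include it.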

6. Privacy, Consent, and De-identification Protocols

Data capture for Ego4D-Haystack strictly adheres to institutional IRB protocols or equivalents, including informed consent from all camera wearers. Participants retain the option to redact or withdraw footage. The annotation pipeline systematically omits private spaces and sensitive activities.

There are two data access tiers:

Faces-unblurred clips: Provided only for participants who explicitly consented to share unblurred visuals, used in audio-visual and social tasks.
Public/interim clips: Incidental persons and sensitive PII are anonymized by a semi-automatic pipeline that combines automatic face and license plate detection (brighter.ai, Primloc Secure Redact) with manual review and pixel-level blurring.

All egocentric streams are de-identified to GDPR-like standards and licensed for non-commercial, research-exclusive use.

7. Significance and Research Trajectory

Ego4D-Haystack establishes the first rigorously annotated, large-scale benchmark suite for episodic memory retrieval in egocentric video, supporting research in multimodal search, long-term activity localization, and first-person audiovisual understanding. By combining diverse global contexts with strictly governed privacy protocols, Haystack supports progress toward robust, ethical wearable AI systems capable of memory indexing and retrieval. A plausible implication is that future advances in core video understanding tasks, particularly those at the intersection of retrieval, natural language, and action segmentation, will be measurable and comparable thanks to the structured challenge format and public data availability (Grauman et al., 2021).

References

Grauman, K., et al. (2021). Ego4D: Around the World in 3,000 Hours of Egocentric Video.
