
Ego4D-Haystack: Episodic Memory Video Benchmark

Updated 3 April 2026
  • The Ego4D-Haystack dataset is a richly annotated benchmark of egocentric videos designed to support episodic memory retrieval through visual, natural language, and action-based queries.
  • It comprises approximately 1,000 annotated hours from a multi-modal 3,670-hour corpus, partitioned into disjoint train, validation, and test splits across diverse real-world scenarios.
  • The dataset establishes baseline metrics for retrieval tasks and drives advances in first-person video understanding while adhering to strict privacy and de-identification protocols.

The Ego4D-Haystack dataset is a large-scale episodic memory retrieval benchmark built on a richly annotated egocentric video corpus. It addresses the problem of querying long, first-person video records for information about the past visual experience of the camera wearer. Drawn from the broader Ego4D initiative, Haystack offers structured benchmarks for visual, language, and action-based queries. The dataset and its suite of tasks let researchers systematically evaluate methods for retrospective retrieval within multi-hour lifelog video streams, a substantial addition to the resources available for first-person vision research (Grauman et al., 2021).

1. Motivation and Task Formulation

The central objective of Ego4D-Haystack is to support the development of systems capable of indexing and retrieving from a camera wearer's egocentric video "memory." The challenge is cast as a retrieval problem over continuous, multi-hour video, where queries take three canonical forms:

  • Visual Query (VQ): "Where did I last see this object?" The input comprises a static image crop $v$ of the object together with the video up to query time $q$; the desired output is a contiguous track $r = \{r_s, \dots, r_e\}$ of bounding boxes corresponding to the last appearance of the object.
  • Natural Language Query (NLQ): "What did I put in the drawer?" The input is a free-form natural language question $Q$, and the output is a temporal segment $[t_s, t_e]$ where the answer appears.
  • Moments Query (MQ): "When did I set the table?" The input is an action category $c$ from a fixed taxonomy; the output consists of all video segments where the action occurs.
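The three query forms above can be sketched as plain data records. This is only an illustrative layout: every class and field name below is hypothetical and does not reflect the actual Ego4D annotation schema. Frame indices and seconds are assumed as the units for VQ and NLQ/MQ respectively.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical types for illustration; not the Ego4D file format.
BBox = Tuple[float, float, float, float]   # (x, y, w, h)
Segment = Tuple[float, float]              # (t_s, t_e), assumed to be in seconds


@dataclass
class VisualQuery:
    crop_frame: int      # frame from which the object crop v is taken
    crop_box: BBox       # location of the crop v inside that frame
    query_frame: int     # query time q, as a frame index


@dataclass
class VisualResponse:
    start_frame: int     # frame index of r_s
    boxes: List[BBox]    # r_s ... r_e, one box per consecutive frame

    @property
    def end_frame(self) -> int:
        # Frame index of r_e, given one box per consecutive frame.
        return self.start_frame + len(self.boxes) - 1


@dataclass
class NaturalLanguageQuery:
    question: str        # free-form question Q


@dataclass
class MomentsQuery:
    category: str        # action category c from the fixed taxonomy


# NLQ returns one Segment; MQ returns every Segment where the action occurs.
nlq_response: Segment = (12.0, 17.5)
mq_response: List[Segment] = [(3.0, 9.0), (40.0, 52.5)]
```

The asymmetry is the point of the sketch: VQ answers with a spatio-temporal track, NLQ with a single window, and MQ with a set of windows.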

This retrieval paradigm is intended to benchmark progress toward AI systems that support human-like episodic memory access in wearable devices (Grauman et al., 2021).

2. Dataset Composition

Haystack is assembled from the narrated portions of the full Ego4D corpus, spanning approximately 1,000 annotated hours out of the larger 3,670-hour Ego4D video dataset. The composition by query type is as follows:

| Task | Annotated hours | Clips | Other properties |
|---|---|---|---|
| Visual Query 2D | 432.9 h | 5,831 | 54 scenarios; +3D loc. (13 h/159 clips) |
| NLQ | 227.1 h | 1,659 | 34 scenarios |
| Moments Query | 328.7 h | 2,522 | 5 scenarios, 110 action categories |
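As a quick arithmetic check on the "approximately 1,000 annotated hours" figure, the per-task numbers from the table can be summed. The dictionary names are illustrative, and the sum assumes each task's hours are counted separately (the source does not say whether the same footage is annotated for more than one task).

```python
# Per-task figures copied from the composition table (variable names are illustrative).
annotated_hours = {"VQ-2D": 432.9, "NLQ": 227.1, "MQ": 328.7}
clip_counts = {"VQ-2D": 5831, "NLQ": 1659, "MQ": 2522}

total_hours = round(sum(annotated_hours.values()), 1)
total_clips = sum(clip_counts.values())
share_of_corpus = total_hours / 3670  # fraction of the full 3,670-hour corpus

print(f"{total_hours} h across {total_clips} clips "
      f"({share_of_corpus:.1%} of the corpus)")
```

The total comes to 988.7 hours over 10,012 clips, which is consistent with the rounded figure quoted in the text.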

The footage comes from 931 unique camera wearers, with annotated clips spanning 10 partner institutions, 74 cities, and 9 countries. Modalities include: RGB video (full 3,670 h), audio (2,535 h), 3D environmental meshes from 7 locations (491 h), stereo (80 h), eye-gaze (45 h), IMU (836 h), synchronized multi-cam (224 h), face-unblurred (612 h), and precomputed SlowFast features (3,670 h). The sampled scenarios reflect diverse daily activities in household, outdoor, workplace, leisure, and occupational contexts (Grauman et al., 2021).

3. Annotation Protocol and Indexing Structure

Annotation protocols are tailored per retrieval task type:

  • Visual Queries: For each 5–16 minute clip, annotators select three non-trivially "interesting" objects, designate a query frame $q$, provide an object crop $v$, and annotate the last occurrence with a temporal response track $r = \{r_s, \dots, r_e\}$, where each $r_i = (x, y, w, h)$ is a bounding box. The 3D localization extension adds Matterport-derived 3D bounding boxes.
  • Natural Language Queries: On 8–20 minute clips, annotators select and paraphrase one of 13 query templates (e.g., "Where did I put X before event Y?"), then mark a single contiguous response window $[t_s, t_e]$ denoting when the answer occurs.
  • Moments Queries: On 8-minute clips, annotators exhaustively label all instances of 110 action categories mined from narrations. Each instance is specified by its temporal extent and action category, $(t_s, t_e, c)$.
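A minimal sanity check for a visual-query annotation might look like the sketch below. The function name and the track layout (a list of `(frame, box)` pairs) are hypothetical. The rule that the track must end at or before the query frame is an assumption drawn from the "video up to time q" wording, not a documented constraint of the annotation tool.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x, y, w, h)
TrackEntry = Tuple[int, Box]              # (frame index, box); hypothetical layout


def validate_response_track(track: List[TrackEntry], query_frame: int) -> List[str]:
    """Return a list of problems found in a response track (empty if none)."""
    if not track:
        return ["track is empty"]

    problems = []
    frames = [frame for frame, _ in track]

    # A contiguous track has exactly one entry per consecutive frame.
    if any(b - a != 1 for a, b in zip(frames, frames[1:])):
        problems.append("frames are not contiguous")

    # Every box needs a positive width and height.
    if any(w <= 0 or h <= 0 for _, (_, _, w, h) in track):
        problems.append("box with non-positive width or height")

    # Assumption: the last appearance must not extend past the query frame q.
    if frames[-1] > query_frame:
        problems.append("track extends past the query frame")

    return problems


good = [(10, (5.0, 5.0, 20.0, 30.0)), (11, (6.0, 5.0, 20.0, 30.0))]
gappy = [(10, (5.0, 5.0, 20.0, 30.0)), (13, (6.0, 5.0, 20.0, 30.0))]
print(validate_response_track(good, query_frame=100))
print(validate_response_track(gappy, query_frame=100))
```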

This design grounds Haystack in a structured, high-coverage search space for episodic retrieval (Grauman et al., 2021).

4. Dataset Partitioning

All tasks utilize disjoint train, validation, and test splits at the video level. No clip appears in more than one partition. The partitioning is as follows:

Visual Query (VQ-2D):

| Split | Hours | Clips | Queries |
|---|---|---|---|
| Train | 262 | 3,600 | 13,600 |
| Val | 87 | 1,200 | 4,500 |
| Test | 84 | 1,100 | 4,400 |

3D localization subset: train (19 h/164 clips), val (5 h/44 clips), test (9 h/69 clips).

Natural Language Query:

| Split | Hours | Clips | Queries |
|---|---|---|---|
| Train | 136 | 1,000 | 11,300 |
| Val | 45 | 300 | 3,900 |
| Test | 46 | 300 | 4,000 |

Moments Query:

| Split | Hours | Clips | Instances |
|---|---|---|---|
| Train | 195 | 1,486 | 13,600 |
| Val | 68 | 521 | 4,300 |
| Test | 63 | 481 | 4,300 |

(Grauman et al., 2021)
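Video-level disjointness can be verified mechanically. The sketch below assumes two hypothetical mappings, one from clip ID to source video ID and one from split name to clip IDs; neither corresponds to a real Ego4D metadata file. It reports any source video whose clips land in more than one split.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def find_split_leaks(
    clip_to_video: Dict[str, str],
    split_to_clips: Dict[str, List[str]],
) -> Dict[Tuple[str, str], Set[str]]:
    """Map each pair of splits to the source videos they share (if any)."""
    videos_per_split = {
        split: {clip_to_video[clip] for clip in clips}
        for split, clips in split_to_clips.items()
    }
    leaks = {}
    for a, b in combinations(sorted(videos_per_split), 2):
        shared = videos_per_split[a] & videos_per_split[b]
        if shared:
            leaks[(a, b)] = shared
    return leaks


# Toy data: clip c4 comes from the same video as training clip c1.
clip_to_video = {"c1": "v1", "c2": "v1", "c3": "v2", "c4": "v1", "c5": "v3"}
clean = {"train": ["c1", "c2"], "val": ["c3"], "test": ["c5"]}
leaky = {"train": ["c1", "c2"], "val": ["c3"], "test": ["c4", "c5"]}

print(find_split_leaks(clip_to_video, clean))
print(find_split_leaks(clip_to_video, leaky))
```

Checking at the video level rather than the clip level matters because two different clips cut from one recording share scenes and objects, so a clip-level check alone would miss that kind of leakage.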

5. Baseline Benchmarks and Metric Results

Baseline experiments reported for the Haystack tasks employ both detection-based trackers and span/localizer models. Key results include:

  • Visual Query (2D Localization):
  • Visual Query (3D Localization):
    • Siam-RCNN+KYS+DPT (depth): RMSE ≈6.0 m, angular error ≈1.60 rad, Success@6×inter-annotator: 30–36%
  • Natural Language Query:
    • 2D-TAN: Recall@1, tIoU=0.3 ≈5.0%; Recall@5: 12.9%. Recall@1, tIoU=0.5: 2.0%; Recall@5: 5.9%
    • VSLNet: Recall@1, tIoU=0.3: 5.5%; Recall@5: 10.7%. Recall@1, tIoU=0.5: 3.1%; Recall@5: 6.6%
  • Moments Query:
    • mAP@tIoU = {.1, .2, ..., .5} (ActivityNet style): Val {9.1, 7.2, 5.8, 4.6, 3.4} (avg 6.0%); Test {8.6, 6.5, 5.4, 4.3, 3.6} (avg 5.7%)
    • Recall@1x, tIoU=0.3: Val 33.5%, Test 33.6%; Recall@1x, tIoU=0.5: Val 25.2%, Test 24.3%

These benchmarks indicate the current performance gap relative to human-level episodic recall and provide a baseline for future algorithmic advances (Grauman et al., 2021).
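For readers unfamiliar with the temporal metrics quoted above, the following sketch shows how temporal IoU and Recall@k at a tIoU threshold are commonly computed for single-window tasks such as NLQ. It is a generic illustration with hypothetical function names, not the official Ego4D evaluation code, and it does not cover the per-category Recall@kx variant used for Moments Queries.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1D time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_k(
    ranked_predictions: Sequence[List[Segment]],
    ground_truths: Sequence[Segment],
    k: int,
    tiou_threshold: float,
) -> float:
    """Fraction of queries whose top-k predictions contain at least one hit."""
    hits = sum(
        any(temporal_iou(p, gt) >= tiou_threshold for p in preds[:k])
        for preds, gt in zip(ranked_predictions, ground_truths)
    )
    return hits / len(ground_truths)


# Two toy queries, each with predictions ranked by confidence.
predictions = [[(0.0, 10.0), (5.0, 15.0)], [(20.0, 30.0)]]
truths = [(5.0, 15.0), (0.0, 5.0)]

print(recall_at_k(predictions, truths, k=1, tiou_threshold=0.3))  # 0.5
print(recall_at_k(predictions, truths, k=1, tiou_threshold=0.5))  # 0.0
print(recall_at_k(predictions, truths, k=2, tiou_threshold=0.5))  # 0.5

# Average mAP over the five tIoU thresholds reported for the Moments Query val split.
val_map = [9.1, 7.2, 5.8, 4.6, 3.4]
print(round(sum(val_map) / len(val_map), 1))  # 6.0
```

Raising the tIoU threshold or lowering k can only reduce recall, which matches the ordering of the reported numbers.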

6. Privacy and Data Access

Data capture for Ego4D-Haystack strictly adheres to institutional IRB protocols or equivalents, including informed consent from all camera wearers. Participants retain the option to redact or withdraw footage. The annotation pipeline systematically omits private spaces and sensitive activities.

There are two data access tiers:

  1. Faces-unblurred clips: Provided only for participants who explicitly consented to share unblurred visuals, used in audio-visual and social tasks.
  2. Public/interim clips: Incidental persons and sensitive PII are anonymized through a semi-automatic pipeline that combines automatic face and license plate detection (brighter.ai, Primloc Secure Redact) with manual review and pixel-level blurring.

All egocentric streams are de-identified to GDPR-like standards and licensed for non-commercial, research-exclusive use.

7. Significance and Research Trajectory

Ego4D-Haystack establishes the first rigorously annotated, large-scale benchmark suite for episodic memory retrieval in egocentric video, supporting research in multimodal search, long-term activity localization, and first-person audiovisual understanding. By encompassing diverse global contexts and strictly governed privacy protocols, Haystack enables direct progress toward robust, ethical wearable AI systems capable of sophisticated memory indexing and retrieval. A plausible implication is that future advancements in core video understanding tasks—in particular, those at the intersection of retrieval, natural language, and action segmentation—will be measurable and comparable due to the structured challenge format and public data availability (Grauman et al., 2021).

References (1)
