
EgoEditData: Egocentric Video Edit Dataset

Updated 11 December 2025
  • EgoEditData is a curated collection of first-person video clips paired with natural language editing instructions, designed for AR applications.
  • It comprises 1700 source-prompt pairs from 100 unique video clips, capturing dynamic egomotion and intricate hand-object interactions.
  • The dataset’s evaluation protocols and real-time streaming benchmarks support robust assessments of instruction fidelity, visual quality, and temporal consistency.

EgoEditData is a carefully designed and manually curated dataset focused on instruction-guided editing of egocentric videos. Conceived to address challenges unique to first-person, interactive AR applications—such as rapid egomotion and pervasive hand-object interactions—EgoEditData forms the core data foundation for benchmarking and advancing egocentric video editing methods. Its development is part of the EgoEdit ecosystem, which also comprises a real-time, instruction-following egocentric video editor (EgoEdit) and a comprehensive evaluation suite (EgoEditBench) specifically tailored for the domain (Li et al., 5 Dec 2025).

1. Motivation and Distinction from Prior Video Editing Datasets

Egocentric video editing presents substantial domain gaps compared to third-person video data, notably due to highly dynamic camera motions and the central role of hand-object interactions in typical AR workflows. Most prior AI video editors and editing datasets predominantly address static or third-person views, leading to degraded performance or brittle results when transitioning to egocentric footage. Offline pipelines lack the necessary low-latency interaction for AR scenarios. EgoEditData was designed to explicitly capture these challenges, offering a curated testbed for instruction-following, temporally stable editing under first-person conditions (Li et al., 5 Dec 2025).

2. Dataset Structure and Content

The dataset underlying EgoEditBench, termed here "EgoEditData" (Editor's term), comprises 1700 source–prompt pairs constructed from 100 unique egocentric video clips, each 5 seconds long at 16 fps and 512×384 px resolution. Each video is annotated with 15 distinct editing tasks, with additional source clips provided for specific object-editing variations. No training, validation, or test split is defined; EgoEditBench is strictly an evaluation-only suite. Each sample includes:

  • Source video frames
  • A natural-language instruction prompt generated via GPT-5 Mini, conditioned on the clip caption and visible objects
  • Auxiliary conditioning signals (as relevant): Canny edge maps, DWpose-generated 2D skeletons, and depth maps for cross-modal (X-to-Video) tasks

All source clips originate from Ego4D but are drawn from data outside any training exposure for EgoEdit models, ensuring strict held-out evaluation (Li et al., 5 Dec 2025).
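
In code, one source–prompt pair can be thought of as a record bundling the clip, the instruction, and any auxiliary conditioning signals. The dataclass below is a minimal sketch under that reading; the field names, array shapes, and the EgoEditSample type itself are assumptions for exposition, not the released schema.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class EgoEditSample:
    """Illustrative layout of one source-prompt pair.

    Field names and array shapes are assumptions for exposition only;
    they are not the released schema.
    """
    clip_id: str                                  # identifier of the Ego4D-derived source clip
    frames: np.ndarray                            # (80, 384, 512, 3) uint8: 5 s at 16 fps, 512x384 px
    instruction: str                              # natural-language editing prompt (GPT-5 Mini generated)
    task_type: str                                # e.g. "add_object", "change_camera_pose"
    canny_edges: Optional[np.ndarray] = None      # per-frame edge maps, for X-to-Video tasks
    pose_skeletons: Optional[np.ndarray] = None   # DWpose 2D keypoints per frame
    depth_maps: Optional[np.ndarray] = None       # per-frame depth maps
```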

3. Editing Task Taxonomy

EgoEditData encompasses a broad editing task taxonomy specifically crafted for challenges present in egocentric vision:

| Task Type | Description / Example | Domain Challenge Emphasis |
| --- | --- | --- |
| Add Object | Insert a specified object | Occlusion, interaction |
| Remove Object | Remove an object handled by the hand | Hand-object interaction |
| Change Object | Replace an object, with or without effects | Contextual manipulation |
| Change Background | Alter the scene background | Viewpoint/occlusion |
| Change Camera Pose | Induce pan/tilt/dolly/zoom | Egomotion, temporal consistency |
| Add Effect / Stylization | Global or style-based effects | Consistent global transforms |
| Reasoning | Context/temporal logic and disambiguation | Ambiguous references |
| Depth/Sketch/Pose-to-Video | Generate RGB from auxiliary input (depth, edge, pose) | Cross-modal generation |
| Video-to-Depth/Sketch/Pose | Invert RGB to an auxiliary modality | Modality inversion |
| Combined | Multi-task prompts | Composite challenge |

Tasks such as “Change Object” and “Remove Object” focus on objects actively manipulated by hands, while “Reasoning” tasks necessitate resolving spatial/temporal ambiguity, exemplified by instructions like “replace the cup I just set down.” This structure directly targets the motion patterns, viewpoint shifts, and occlusions typical in egocentric AR contexts (Li et al., 5 Dec 2025).
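
For tooling that iterates over the benchmark, the taxonomy above maps naturally onto a small set of task identifiers. The enum below is an illustrative encoding; the identifier strings are assumptions, not the dataset's official labels.

```python
from enum import Enum


class EgoEditTask(Enum):
    """Task identifiers mirroring the taxonomy above (names are illustrative)."""
    ADD_OBJECT = "add_object"
    REMOVE_OBJECT = "remove_object"
    CHANGE_OBJECT = "change_object"
    CHANGE_BACKGROUND = "change_background"
    CHANGE_CAMERA_POSE = "change_camera_pose"
    ADD_EFFECT_STYLIZATION = "add_effect_stylization"
    REASONING = "reasoning"
    X_TO_VIDEO = "x_to_video"    # depth/sketch/pose -> RGB
    VIDEO_TO_X = "video_to_x"    # RGB -> depth/sketch/pose
    COMBINED = "combined"
```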

4. Evaluation Protocols and Metrics

EgoEditBench leverages four quantitative metrics adapted from EditVerseBench and computed per task, then averaged across tasks:

  • Instruction Faithfulness (VLM score):

VLM(v, c) = \frac{1}{T} \sum_{t=1}^{T} \cos\big( E_{vid}(v_t), E_{txt}(c) \big)

where E_{vid} and E_{txt} are CLIP-based video-frame and text encoders, v_t is frame t, and c is the instruction prompt.

  • Overall Quality (PickScore, PS): computed per frame and aggregated across the clip.
  • Text Alignment (TA): CLIPScore between a pooled video representation and the instruction.
  • Temporal Consistency (TC):

TC(v) = 1 - \frac{1}{T-1} \sum_{t=1}^{T-1} E_{uv} \left[ \| Flow_{t \to t+1}(v_t, u) - v_{t+1}(u) \|_2 \right]

employing an optical-flow estimator (RAFT) to assess frame-to-frame coherence. Scores approaching 100% indicate robust temporal alignment under motion. Illustrative sketches of the VLM and TC computations follow.
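
To make the instruction-faithfulness score concrete, the sketch below computes the per-frame cosine similarity between CLIP frame embeddings and the prompt embedding using off-the-shelf encoders from the transformers library. The checkpoint choice and any rescaling of the raw cosine values are assumptions; the benchmark's exact CLIP-based encoders may differ.

```python
import torch
from transformers import CLIPModel, CLIPProcessor


def instruction_faithfulness(frames, prompt, ckpt="openai/clip-vit-base-patch32"):
    """Average cosine similarity between frame and prompt embeddings (VLM score sketch).

    frames: list of PIL images (the T frames of the edited clip); prompt: the instruction c.
    The checkpoint is an assumption, not necessarily the benchmark's encoder.
    """
    model = CLIPModel.from_pretrained(ckpt).eval()
    processor = CLIPProcessor.from_pretrained(ckpt)
    with torch.no_grad():
        inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
        img = model.get_image_features(pixel_values=inputs["pixel_values"])        # (T, D)
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])     # (1, D)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (img @ txt.T).mean().item()   # (1/T) * sum_t cos(E_vid(v_t), E_txt(c))
```

The temporal-consistency term can likewise be sketched as a flow-warped photometric residual: warp frame t+1 back onto frame t's pixel grid using the estimated forward flow and measure the remaining per-pixel difference. Here flow_fn stands in for a RAFT-style estimator, and the normalization of frames to [0, 1] plus any final percentage scaling are assumptions about the benchmark's convention.

```python
import torch
import torch.nn.functional as F


def temporal_consistency(frames: torch.Tensor, flow_fn) -> float:
    """TC sketch: 1 minus the mean flow-aligned residual between consecutive frames.

    frames:  (T, 3, H, W) float tensor in [0, 1].
    flow_fn: callable (frame_t, frame_t1) -> (2, H, W) forward flow in pixels (dx, dy),
             e.g. a thin wrapper around a RAFT estimator (an assumption, not prescribed).
    """
    T, _, H, W = frames.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    residuals = []
    for t in range(T - 1):
        flow = flow_fn(frames[t], frames[t + 1])                       # (2, H, W)
        # Normalized sampling grid: pixel u in frame t looks up u + flow(u) in frame t+1.
        grid_x = (xs + flow[0]) / (W - 1) * 2 - 1
        grid_y = (ys + flow[1]) / (H - 1) * 2 - 1
        grid = torch.stack((grid_x, grid_y), dim=-1).unsqueeze(0)      # (1, H, W, 2)
        warped = F.grid_sample(frames[t + 1].unsqueeze(0), grid, align_corners=True)
        # Per-pixel L2 residual between the flow-aligned frames, averaged over pixels.
        residuals.append((warped.squeeze(0) - frames[t]).norm(dim=0).mean())
    return float(1.0 - torch.stack(residuals).mean())                  # near 1 = temporally stable
```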

Interactivity is also assessed: streaming inference is profiled on a single NVIDIA H100 GPU (512×384 px, 16 fps). First-frame latency for EgoEdit-RT is 855 ms, with streaming throughput of 38.1 fps for a 9-frame chunk (Li et al., 5 Dec 2025). A minimal timing sketch for this protocol is given below.
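
The streaming measurements above can, in principle, be reproduced with a simple wall-clock harness. The editor.process_chunk interface below is hypothetical (the released EgoEdit-RT API is not assumed here), and first-chunk time is used as a proxy for first-frame latency.

```python
import time


def measure_streaming(editor, frames, chunk_size=9):
    """Time a hypothetical chunked streaming editor.

    Returns (first-chunk latency in ms, steady-state throughput in fps).
    `editor.process_chunk` is an assumed interface for illustration only.
    """
    chunks = [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]

    start = time.perf_counter()
    editor.process_chunk(chunks[0])                       # proxy for first-frame latency
    first_latency_ms = (time.perf_counter() - start) * 1e3

    t0, n_frames = time.perf_counter(), 0
    for chunk in chunks[1:]:
        editor.process_chunk(chunk)
        n_frames += len(chunk)
    throughput_fps = n_frames / (time.perf_counter() - t0)
    return first_latency_ms, throughput_fps
```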

5. Baselines and Comparative Results

Evaluation on EgoEditBench covers methods from the attention-manipulation, frame-propagation, instruction-guided, and streaming families. The table below summarizes core results (VLM/PS/TA/TC averaged across tasks):

| Method | VLM | PS | TA | TC | Family |
| --- | --- | --- | --- | --- | --- |
| TokenFlow | 4.99 | 18.91 | 15.89 | 95.04 | Attention manipulation |
| Señorita-2M | 7.52 | 18.85 | 16.25 | 95.86 | Frame-propagation* |
| EgoEdit | 7.76 | 19.21 | 16.89 | 96.70 | Instruction-guided |
| EgoEdit-RT | 7.71 | 19.13 | 16.34 | 96.41 | Streaming, autoregressive |

*Frame-propagation methods receive EgoEdit’s first edited frame during evaluation for fairness. EgoEdit and EgoEdit-RT achieve the highest scores on instruction faithfulness, visual quality, and temporal consistency, particularly excelling in egocentric scenarios compared to established baselines. Streaming variants maintain competitive accuracy while offering real-time speeds (Li et al., 5 Dec 2025).
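
As described in the evaluation protocol above, each metric is computed per task and then averaged across the 15 tasks to produce the table entries. A minimal aggregation sketch follows; the input layout and the numbers in the example are placeholders, not published per-task results.

```python
from statistics import mean


def aggregate_across_tasks(per_task_scores: dict) -> dict:
    """Average each metric over the per-task scores.

    per_task_scores maps task name -> {"VLM": ..., "PS": ..., "TA": ..., "TC": ...};
    the layout is an assumption for illustration.
    """
    metrics = sorted({m for scores in per_task_scores.values() for m in scores})
    return {m: mean(scores[m] for scores in per_task_scores.values()) for m in metrics}


# Example with placeholder values (not published per-task numbers):
print(aggregate_across_tasks({
    "add_object":    {"VLM": 7.9, "PS": 19.3, "TA": 16.8, "TC": 96.5},
    "remove_object": {"VLM": 7.6, "PS": 19.1, "TA": 16.9, "TC": 96.9},
}))
```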

6. Qualitative Observations and User Alignment

Qualitative outcomes indicate that EgoEdit-RT robustly preserves hand identity and geometry, adapts to rapid egomotion, and manages complex, real-world manipulations such as inserting synthetic effects or augmenting objects in fast motion. Common baseline failures include inadequate or excessive edits, instability under occlusion, and flickering. A user-preference study across 450 samples spanning all 15 tasks reports approximately 86% agreement between VLM-derived choices and human judgments, supporting the VLM score as a reliable automatic proxy for instruction faithfulness (Li et al., 5 Dec 2025).

7. Significance and Implications for AR Editing Systems

EgoEditData, as deployed in EgoEditBench, enables standardized, comprehensive evaluation of egocentric video editing methods under real-world AR constraints. Its explicit focus on manual curation, hand-object interaction, and held-out protocol addresses the inherent domain gap limiting performance in prior AI video editors. The combination of diverse, instruction-driven manipulation types and automated, interpretable metrics facilitates targeted benchmarking of instruction fidelity, visual realism, and temporal coherence. The superior performance of EgoEdit and its real-time variant on this suite demonstrates the necessity of egocentric-specific data and streaming-optimized architectures for responsive AR video editing (Li et al., 5 Dec 2025).

A plausible implication is that the availability of EgoEditData, combined with public release of its accompanying evaluation and editing pipelines, is likely to accelerate progress in egocentric video understanding, AR workflow automation, and downstream applications where interactive, instruction-following video manipulation is essential.
