
EgoEditData: Egocentric Video Editing Dataset

Updated 9 December 2025
  • EgoEditData is a specialized dataset that bridges the gap between third-person and egocentric video editing by focusing on high-motion, hand-centric AR scenarios.
  • It employs a multi-stage curation pipeline with automated filtering, precise hand and object segmentation, and rigorous human verification to ensure high annotation fidelity.
  • The dataset offers paired 'before/after' edits with detailed metrics for temporal consistency and visual quality, enabling robust benchmarking of egocentric video editing models.

EgoEditData is a large-scale, manually curated dataset designed specifically for the development and benchmarking of instruction-guided egocentric video editing systems. Targeting the domain gap between traditional third-person video editing corpora and the complex, high-motion, hand-centric scenarios of first-person augmented reality (AR) applications, EgoEditData provides a unique resource with dense, high-fidelity hand and object annotations, paired edit examples, and natural language instructions. The dataset forms the foundation for advances in rapid, instruction-aligned, and temporally stable video editing within interactive AR environments, where hand-object interactions and egomotion present exceptional challenges (Li et al., 5 Dec 2025).

1. Motivation and Novelty

Egocentric video editing diverges significantly from exocentric (third-person) video editing, for which existing datasets such as Senorita-2M, InsV2V, and EditVerse consist mostly of static-camera, low-egomotion footage with minimal hand-object interaction. In contrast, AR settings place the camera on the user, resulting in rapid camera pans, head movement, strong self-occlusion from hands, and continuous, direct interaction with manipulated objects. A central obstacle is performing object-level edits (insertion, removal, replacement) while maintaining the integrity of hand regions and the realism of manipulated objects under challenging motion conditions. Prior datasets do not target these regimes and provide little or no footage of nuanced, hand-object-centric egocentric edits. EgoEditData is purpose-built to address this gap, with “before/after” pairs explicitly centered on hand-object manipulations and all source hand regions masked and preserved throughout the editing process (Li et al., 5 Dec 2025).

2. Data Collection and Curation Pipeline

EgoEditData is constructed through a multi-stage filtering, annotation, and augmentation pipeline, integrating both automation and manual oversight for high annotation fidelity (a rough cumulative-yield sketch follows the list):

  • Raw Footage Sources: Clips are sourced primarily from Ego4D (~3,000 hours, GoPro-rigged head-mounted cameras) and EgoExo4D (egocentric/exocentric pairs rectified to first-person view). Only high-quality monocular clips from specific GoPro models are retained, with automated jitter and blur filters reducing the raw pool to 1.8% survival.
  • Hand Detection and Segmentation: The WiLoR model identifies frames containing hands (confidence ≥ 0.75); the top three confident frames initialize SAM 2 segmentation for dense, temporally smooth hand masks. Human reviewers verify mask accuracy (49.6% retention).
  • Object Name Extraction: Frames are subsampled at 2 fps; Qwen2.5-VL is prompted to infer the object in direct hand contact. Clips without unambiguous hand-held objects are pruned.
  • Object Masking: Grounded-SAM produces object-conditioned coarse masks, with SAM 2 refinement applied per frame. False positives are removed according to mask-confidence thresholds (<0.4), spatial distance from WiLoR hand keypoints, and human adjudication (43.6% retention at this stage).
  • Edited Video Synthesis: For each retained clip, GPT-5 Mini suggests 4–5 diverse (ordinary/imaginary) target objects. Qwen-Image generates reference renders, and GPT-5 Mini produces scene descriptions reflecting new object integration. Wan 2.1 VACE 14B (“teacher” model) overlays edited objects onto the video at 1920×1104 px. Human quality control filters final clips (37.8% accept rate after editing).
  • Instruction Prompt Generation: All (source clip, edited clip) pairs are passed to GPT-5 Mini, which writes concise, descriptive instruction prompts (e.g., “Replace the black coffee mug with a frosted glass beaker that emits soft steam in the center of frame”).
  • Annotation Protocol: Each clip is annotated with high-resolution RGB frames, temporally consistent binary hand/object masks, WiLoR-derived hand skeleton keypoints (XYZ per frame), the edited-object name, and a single free-form instruction. Skeletons and masks are reviewed by humans; no bounding boxes are used.
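
As a rough sanity check on the pipeline's overall yield, the stage retention rates above can be compounded; the short Python sketch below does so, assuming each rate is measured relative to the output of the preceding stage (the source reports only per-stage rates, so the cumulative figure is an inference, not a published number).

    # Approximate end-to-end yield of the curation pipeline.
    # Per-stage retention rates are taken from the list above; treating them as
    # sequential (each relative to the previous stage's output) is an assumption.
    stage_retention = {
        "jitter/blur filtering":        0.018,
        "hand-mask review":             0.496,
        "object-mask review":           0.436,
        "edited-video quality control": 0.378,
    }

    cumulative = 1.0
    for stage, rate in stage_retention.items():
        cumulative *= rate
        print(f"{stage:<30s} retain {rate:6.1%}  cumulative {cumulative:8.3%}")
    # Final cumulative yield is roughly 0.15% of the raw clip pool.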

3. Dataset Composition and Descriptive Statistics

EgoEditData delivers extensive coverage in terms of task diversity, object categories, and annotational granularity:

Subset                                   | Value                         | Notes
Real (original) clips                    | 10,900                        | Sourced from real-world footage
Synthetic edits                          | 38,800                        | Model-assisted object manipulation
Total unique video clips                 | 49,700                        | ~70 hours of egocentric footage
Distinct (source, target, instruction)   | 99,700                        | All permutation pairs
Activity breakdown                       |                               |
  – Change Object (simple replacement)   | 54,164                        | Majority edit type
  – Change Object + Special Effects      | 39,465                        | Includes stylizations
  – Add Object                           | 3,651                         | Novel insertions
  – Remove Object                        | 2,379                         | Clean removals
Source corpus split                      | 94% Ego4D / 6% EgoExo4D       |
Unique source object names               | 3,199                         | Broad object diversity
Unique target object names               | 13,632                        | Extensive range of replacements
Instruction prompt length                | μ ≈ 378 chars, σ ≈ 120 chars  | Approximately log-normal
Data splits                              | 94% train / 3% val / 3% test  | Follow Ego4D/EgoExo4D proportions

Each video sample is accompanied by per-frame PNG masks for hands and edited objects, skeleton keypoints as JSON arrays (with (x, y, joint_id) structure), and natural-language instructions referencing the target transformation. No bounding boxes are used.
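
To make the per-frame annotation layout concrete, the following is a hypothetical skeleton record written as a Python literal; the field names and joint indices are illustrative assumptions, not the published schema, which should be taken from annotations.json and the released notebooks.

    # Hypothetical per-frame skeleton record (field names and joint ids are assumed).
    skeleton_frame = {
        "frame_index": 120,
        "hand": "right",
        "keypoints": [             # one (x, y, joint_id) triple per joint
            [812.4, 566.1, 0],     # e.g., wrist
            [840.2, 531.7, 1],     # e.g., thumb base
            # ... remaining joints
        ],
    }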

4. Data Formats, Access, and Tooling

EgoEditData is released as a structured directory, organized for efficient access in standard deep-learning workflows:

  • Videos: MP4/H.264 format, 1920×1104 px, 30 fps.
  • Masks and Skeletons: Hand/object masks (PNG per frame), hand skeleton keypoints (per-frame JSON).
  • Instructions and Metadata: JSON records linking each edit tuple to all relevant masks, skeletons, and instructions.
  • Directory Layout:

/EgoEditData
  /videos/source/…
  /videos/edited/…
  /masks/hands/…
  /masks/objects/…
  /skeletons/…
  annotations.json

  • Code and Scripts: Published utilities include PyTorch Dataset classes and Jupyter notebooks for overlaying masks, reading skeletons, and joining instructions to frame sequences. This facilitates direct integration with model-training workflows and visualization.
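
As a rough illustration of how such a loader might look, the minimal sketch below assumes the directory layout above and a flat list of records in annotations.json; record fields such as "source_video", "edited_video", "hand_masks", "instruction", and "split" are assumptions, not the published schema.

    import json
    from pathlib import Path

    import torch
    from torch.utils.data import Dataset
    from torchvision.io import read_image, read_video


    class EgoEditDataset(Dataset):
        """Minimal sketch of an EgoEditData loader (record fields are assumed)."""

        def __init__(self, root: str, split: str = "train"):
            self.root = Path(root)
            with open(self.root / "annotations.json") as f:
                records = json.load(f)
            # Assumes each record carries a "split" field; adapt to the real schema.
            self.records = [r for r in records if r.get("split", "train") == split]

        def __len__(self) -> int:
            return len(self.records)

        def __getitem__(self, idx: int) -> dict:
            rec = self.records[idx]
            # Source and edited clips as (T, H, W, C) uint8 tensors at 30 fps.
            src, _, _ = read_video(str(self.root / rec["source_video"]), pts_unit="sec")
            edt, _, _ = read_video(str(self.root / rec["edited_video"]), pts_unit="sec")
            # Per-frame binary hand masks stored as one PNG per frame.
            hand_masks = torch.stack(
                [read_image(str(self.root / p)) for p in rec["hand_masks"]]
            )
            return {
                "source": src,
                "edited": edt,
                "hand_masks": hand_masks,
                "instruction": rec["instruction"],
            }

    # Example usage:
    # dataset = EgoEditDataset("/EgoEditData", split="train")
    # sample = dataset[0]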

5. Evaluation: EgoEditBench Metrics and Protocol

EgoEditBench constitutes a held-out, manually curated benchmark suite for quantitatively assessing egocentric video editing models along task-specific and perceptual axes:

  • Clips and Instructions: 100 “unseen” egocentric clips, each with 15 instructions spanning tasks: Add Object, Remove Object, Change Object, Change Background, Camera-Pose Edits, Stylization, Reasoning, Depth→Video, Sketch→Video, Pose→Video, Video→Pose, Video→Depth, Video→Sketch, and Mixed.
  • Automatic Metrics:
    • VLM_Eval: Mean framewise CLIP image–text cosine similarity against the instruction C:
      \text{VLM} = \frac{1}{T}\sum_t \cos(\text{CLIP}(I_t), \text{CLIP}_\text{text}(C))
    • PickScore (PS): PickNet-based perceptual quality measure.
    • Text Alignment (TA): Average frame-level CLIP-text alignment (differently weighted).
    • Temporal Consistency (TC): One minus the mean distance between consecutive frame embeddings:
      \text{TC} = 1 - \frac{1}{T-1}\sum_t \|f(I_{t+1}) - f(I_t)\|_2
  • Mask Metrics:
    • Hand-Preservation IoU: Per-frame Intersection-over-Union (IoU) of ground-truth and predicted hand masks:
      \text{Hand-Preserve IoU} = \frac{1}{T}\sum_t \frac{|H_t \cap \bar{H}_t|}{|H_t \cup \bar{H}_t|}
    • Object-Change IoU: Analogous, computed on object masks.

All metrics are benchmarked against human judgment (over 85% agreement), with scripts provided for reproducible evaluation.
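
The temporal-consistency and hand-preservation metrics above translate directly into code. The sketch below assumes frame embeddings f(I_t) (e.g., CLIP image features) and boolean hand masks are already available as NumPy arrays; it is an illustration of the formulas, not the released evaluation script.

    import numpy as np


    def temporal_consistency(frame_embeddings: np.ndarray) -> float:
        """TC = 1 - mean L2 distance between consecutive frame embeddings.

        frame_embeddings: (T, D) array of per-frame features f(I_t).
        """
        diffs = np.linalg.norm(frame_embeddings[1:] - frame_embeddings[:-1], axis=1)
        return float(1.0 - diffs.mean())


    def hand_preservation_iou(gt_masks: np.ndarray, pred_masks: np.ndarray) -> float:
        """Mean per-frame IoU between ground-truth and edited-video hand masks.

        gt_masks, pred_masks: (T, H, W) boolean arrays.
        """
        inter = np.logical_and(gt_masks, pred_masks).sum(axis=(1, 2))
        union = np.logical_or(gt_masks, pred_masks).sum(axis=(1, 2))
        ious = inter / np.maximum(union, 1)  # guard against empty unions
        return float(ious.mean())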

6. Usage Guidelines and Best Practices

  • Licensing: EgoEditData is distributed under CC BY-NC 4.0, strictly for non-commercial research purposes.
  • Citation: The recommended citation is “EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing” by Li et al., CVPR 2025.
  • Training Protocols:
    • Sampling: Approximately 28% of edit-pair minibatches should come from EgoEditData, with the remainder from broader exocentric datasets, to maximize generalizability (see the mixing sketch after this list).
    • Supervision: Hand and object masks should always be included as auxiliary signals (e.g., for cross-attention guidance) to ensure hand-region fidelity during structural edits.
    • Latency Optimization: For latency-sensitive real-time applications, apply autoregressive distillation (Self Forcing) over latent chunks of three frames; this achieves sub-second first-frame latency on a single NVIDIA H100 GPU.
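
A minimal sketch of the suggested minibatch mixing referenced above (the 28% figure comes from the sampling guideline; the two DataLoader objects and the random-choice mechanism are illustrative assumptions):

    import random

    from torch.utils.data import DataLoader

    EGO_FRACTION = 0.28  # share of minibatches drawn from EgoEditData


    def mixed_batches(ego_loader: DataLoader, exo_loader: DataLoader):
        """Yield training batches, ~28% egocentric and the rest exocentric.

        Stops when either loader is exhausted; a production sampler would
        typically re-initialize the shorter iterator instead.
        """
        ego_iter, exo_iter = iter(ego_loader), iter(exo_loader)
        while True:
            chosen = ego_iter if random.random() < EGO_FRACTION else exo_iter
            try:
                yield next(chosen)
            except StopIteration:
                return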

EgoEditData supports research and development on next-generation egocentric video editors by supplying nearly 100,000 instruction-to-edit pairs, high-quality hand/object annotations, and a rigorous, multi-dimensional benchmark for real-time AR video editing challenges (Li et al., 5 Dec 2025).

References

Li et al., "EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing," 5 Dec 2025.