EgoEditData: Egocentric Video Editing Dataset
- EgoEditData is a specialized dataset that bridges the gap between third-person and egocentric video editing by focusing on high-motion, hand-centric AR scenarios.
- It employs a multi-stage curation pipeline with automated filtering, precise hand and object segmentation, and rigorous human verification to ensure high annotation fidelity.
- The dataset offers paired 'before/after' edits with detailed metrics for temporal consistency and visual quality, enabling robust benchmarking of egocentric video editing models.
EgoEditData is a large-scale, manually curated dataset designed specifically for the development and benchmarking of instruction-guided egocentric video editing systems. Targeting the domain gap between traditional third-person video editing corpora and the complex, high-motion, hand-centric scenarios of first-person augmented reality (AR) applications, EgoEditData provides a unique resource with dense, high-fidelity hand and object annotations, paired edit examples, and natural language instructions. The dataset forms the foundation for advances in rapid, instruction-aligned, and temporally stable video editing within interactive AR environments, where hand-object interactions and egomotion present exceptional challenges (Li et al., 5 Dec 2025).
1. Motivation and Novelty
Egocentric video editing diverges significantly from exocentric (third-person) editing: existing datasets such as Senorita-2M, InsV2V, and EditVerse consist mostly of static-camera, low-egomotion footage with minimal hand-object interaction. In contrast, AR settings place the camera on the user, resulting in rapid camera pans, head movement, strong self-occlusion from hands, and continuous, direct interaction with manipulated objects. A central obstacle is performing object-level edits (insertion, removal, replacement) while maintaining the integrity of hand regions and the realism of manipulated objects under these motion conditions. Prior datasets do not target this regime and provide little or no footage of nuanced, hand-object-centric egocentric edits. EgoEditData is purpose-built to address this gap, with “before/after” pairs explicitly centered on hand-object manipulations and all source hand regions masked and preserved throughout the editing process (Li et al., 5 Dec 2025).
2. Data Collection and Curation Pipeline
EgoEditData is constructed through a multi-stage filtering, annotation, and augmentation pipeline, integrating both automation and manual oversight for high annotation fidelity:
- Raw Footage Sources: Clips are sourced primarily from Ego4D (~3,000 hours, GoPro-rigged head-mounted cameras) and EgoExo4D (egocentric/exocentric pairs rectified to first-person view). Only high-quality monocular clips from specific GoPro models are retained, with automated jitter and blur filters reducing the raw pool to 1.8% survival.
- Hand Detection and Segmentation: The WiLoR model identifies frames containing hands (confidence ≥ 0.75); the three most confident frames initialize SAM 2 segmentation for dense, temporally smooth hand masks. Human reviewers verify mask accuracy (49.6% retention). A sketch of this frame-selection heuristic appears after this list.
- Object Name Extraction: Frames are subsampled at 2 fps; Qwen2.5-VL is prompted to infer the object in direct hand contact. Clips without unambiguous hand-held objects are pruned.
- Object Masking: Grounded-SAM produces object-conditioned coarse masks, with SAM 2 refinement applied per frame. False positives are removed via mask-confidence thresholding (< 0.4), spatial distance from WiLoR hand keypoints, and human adjudication (43.6% retention at this stage); a sketch of this pruning step also appears after this list.
- Edited Video Synthesis: For each retained clip, GPT-5 Mini suggests 4–5 diverse (ordinary/imaginary) target objects. Qwen-Image generates reference renders, and GPT-5 Mini produces scene descriptions reflecting new object integration. Wan 2.1 VACE 14B (“teacher” model) overlays edited objects onto the video at 1920×1104 px. Human quality control filters final clips (37.8% accept rate after editing).
- Instruction Prompt Generation: All (source clip, edited clip) pairs are passed to GPT-5 Mini, which writes concise, descriptive instruction prompts (e.g., “Replace the black coffee mug with a frosted glass beaker that emits soft steam in the center of frame”).
- Annotation Protocol: Each clip is annotated with high-resolution RGB frames, temporally consistent binary hand/object masks, WiLoR-derived hand skeleton keypoints (XYZ per frame), the edited-object name, and a single free-form instruction. Skeletons and masks are reviewed by humans; no bounding boxes are used.
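The confidence-based frame selection in the hand-detection step reduces to a simple heuristic. The sketch below is a minimal, hypothetical reconstruction in Python: `detect_hands` and `segment_video` are assumed wrappers around WiLoR and SAM 2 and are not names from the released tooling.

```python
# Minimal sketch of the hand-frame selection heuristic (Section 2).
# detect_hands() and segment_video() are hypothetical wrappers around
# WiLoR and SAM 2, respectively -- not part of the released code.

HAND_CONF_THRESHOLD = 0.75   # minimum WiLoR hand confidence (from the paper)
NUM_SEED_FRAMES = 3          # top-k confident frames used to initialize SAM 2

def select_seed_frames(frames, detect_hands):
    """Return indices of the three most confident hand frames (conf >= 0.75)."""
    scored = []
    for idx, frame in enumerate(frames):
        conf = detect_hands(frame)           # assumed: float confidence in [0, 1]
        if conf >= HAND_CONF_THRESHOLD:
            scored.append((conf, idx))
    # Keep the most confident frames; clips with no confident hand frames
    # are dropped from the pool entirely.
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:NUM_SEED_FRAMES]]

def build_hand_masks(frames, detect_hands, segment_video):
    seeds = select_seed_frames(frames, detect_hands)
    if not seeds:
        return None                          # clip rejected: no reliable hands
    # SAM 2 propagates dense, temporally smooth masks from the seed frames.
    return segment_video(frames, seed_frames=seeds)
```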
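Similarly, the object-mask pruning step combines a mask-confidence threshold with a spatial check against the WiLoR hand keypoints. The following is a hedged sketch assuming masks are boolean NumPy arrays and keypoints are 2D pixel coordinates projected to the image plane; the 200 px distance cutoff is an illustrative placeholder, not a value reported by the authors.

```python
# Sketch of object-mask false-positive pruning (Section 2). Masks are assumed
# to be boolean HxW NumPy arrays; hand_keypoints_2d is an (N, 2) array of
# pixel coordinates projected from the WiLoR skeleton. MAX_HAND_DISTANCE_PX
# is an illustrative placeholder, not a published value.
import numpy as np

MASK_CONF_THRESHOLD = 0.4       # masks below this confidence are discarded
MAX_HAND_DISTANCE_PX = 200.0    # hypothetical cutoff for "near the hand"

def keep_object_mask(mask, mask_conf, hand_keypoints_2d):
    """Return True if the candidate object mask survives automatic pruning."""
    if mask_conf < MASK_CONF_THRESHOLD:
        return False                         # low-confidence false positive
    if not mask.any():
        return False                         # empty mask
    # Centroid of the candidate object mask.
    ys, xs = np.nonzero(mask)
    centroid = np.array([xs.mean(), ys.mean()])
    # The edited object must lie close to the hand keypoints: a "hand-held"
    # object far from every keypoint is treated as a spurious detection.
    dists = np.linalg.norm(hand_keypoints_2d - centroid, axis=1)
    return bool(dists.min() <= MAX_HAND_DISTANCE_PX)
```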
3. Dataset Composition and Descriptive Statistics
EgoEditData provides broad coverage in terms of task diversity, object categories, and annotation granularity:
| Statistic | Value | Notes |
|---|---|---|
| Real (original) clips | 10,900 | Sourced from real-world footage |
| Synthetic edits | 38,800 | Model-assisted object manipulation |
| Total unique video clips | 49,700 | ~70 hours of egocentric footage |
| Distinct (source, target, instruction) | 99,700 | All permutation pairs |
| Activity breakdown | ||
| – Change Object (simple replacement) | 54,164 | Majority edit type |
| – Change Object + Special Effects | 39,465 | Includes stylizations |
| – Add Object | 3,651 | Novel insertions |
| – Remove Object | 2,379 | Clean removals |
| Source corpus split | 94% Ego4D / 6% EgoExo4D | |
| Unique source object names | 3,199 | Broad object diversity |
| Unique target object names | 13,632 | Extensive range of replacements |
| Instruction prompt length | μ ≈ 378 chars, σ ≈ 120 chars | Follows log-normal distribution |
| Data splits | 94% train / 3% val / 3% test | Per Ego4D/EgoExo4D proportions |
Each video sample is accompanied by per-frame PNG masks for hands and edited objects, per-frame hand-skeleton keypoints stored as JSON arrays, and a natural-language instruction describing the target transformation; bounding boxes are not provided.
4. Data Formats, Access, and Tooling
EgoEditData is released as a structured directory, organized for efficient access and direct integration with deep-learning workflows:
- Videos: MP4/H.264 format, 1920×1104 px, 30 fps.
- Masks and Skeletons: Hand/object masks (PNG per frame), hand skeleton keypoints (per-frame JSON).
- Instructions and Metadata: JSON records linking each edit tuple to all relevant masks, skeletons, and instructions.
- Directory Layout:
```
/EgoEditData
  /videos/source/…
  /videos/edited/…
  /masks/hands/…
  /masks/objects/…
  /skeletons/…
  annotations.json
```
- Code and Scripts: Published utilities include PyTorch Dataset classes and Jupyter notebooks for overlaying masks, reading skeletons, and joining instructions to frame sequences. This facilitates direct integration with model-training workflows and visualization.
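The released utilities include PyTorch Dataset classes; the snippet below is a simplified sketch of what loading one sample might look like, not the published loader. The annotations.json field names (`source_video`, `edited_video`, `hand_masks`, `object_masks`, `skeleton`, `instruction`) are assumptions for illustration and may differ from the released schema.

```python
# Hypothetical loader sketch for the EgoEditData directory layout above.
# Field names inside annotations.json are assumed for illustration.
import json
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class EgoEditDataset(Dataset):
    def __init__(self, root: str):
        self.root = Path(root)
        with open(self.root / "annotations.json") as f:
            self.records = json.load(f)      # assumed: list of edit tuples

    def __len__(self):
        return len(self.records)

    def _load_masks(self, rel_dir):
        """Stack per-frame PNG masks into a (T, H, W) float tensor."""
        paths = sorted((self.root / rel_dir).glob("*.png"))
        masks = [np.array(Image.open(p).convert("L")) > 0 for p in paths]
        return torch.from_numpy(np.stack(masks)).float()

    def __getitem__(self, idx):
        rec = self.records[idx]
        return {
            "source_video_path": str(self.root / rec["source_video"]),
            "edited_video_path": str(self.root / rec["edited_video"]),
            "hand_masks": self._load_masks(rec["hand_masks"]),
            "object_masks": self._load_masks(rec["object_masks"]),
            # Per-frame XYZ hand keypoints stored as nested JSON arrays.
            "skeleton": json.loads((self.root / rec["skeleton"]).read_text()),
            "instruction": rec["instruction"],
        }
```

A standard DataLoader can batch these records, although variable clip lengths typically require a custom collate function.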
5. Evaluation: EgoEditBench Metrics and Protocol
EgoEditBench constitutes a held-out, manually curated benchmark suite for quantitatively assessing egocentric video editing models along task-specific and perceptual axes:
- Clips and Instructions: 100 “unseen” egocentric clips, each with 15 instructions spanning tasks: Add Object, Remove Object, Change Object, Change Background, Camera-Pose Edits, Stylization, Reasoning, Depth→Video, Sketch→Video, Pose→Video, Video→Pose, Video→Depth, Video→Sketch, and Mixed.
- Automatic Metrics:
- VLM_Eval: Mean framewise CLIP-image/text cosine similarity vs. instruction.
- PickScore (PS): PickNet-based perceptual quality measure.
- Text Alignment (TA): Average frame-level CLIP-text alignment (differently weighted).
- Temporal Consistency (TC): One minus mean pairwise frame embedding distance.
- Mask Metrics:
- Hand-Preservation IoU: Per-frame Intersection-over-Union (IoU) of ground-truth and predicted hand masks.
- Object-Change IoU: Analogous, for object masks.
All metrics are benchmarked against human judgment (over 85% agreement), with scripts provided for reproducible evaluation.
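As a concrete illustration of the TC and IoU definitions above, the following is a minimal sketch assuming frame embeddings are L2-normalized CLIP image features and that “pairwise distance” means cosine distance between consecutive frames; the official evaluation scripts may make different choices.

```python
# Sketch of two EgoEditBench metrics. Assumes frame_embeddings is a (T, D)
# array of L2-normalized CLIP image features and that "pairwise distance"
# refers to cosine distance between consecutive frames; the released
# evaluation scripts may define these differently.
import numpy as np

def temporal_consistency(frame_embeddings: np.ndarray) -> float:
    """TC = 1 - mean cosine distance between consecutive frame embeddings."""
    a, b = frame_embeddings[:-1], frame_embeddings[1:]
    cos_sim = np.sum(a * b, axis=1)          # valid because rows are unit-norm
    return float(1.0 - np.mean(1.0 - cos_sim))

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between predicted and ground-truth binary masks for one frame."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                           # both empty: treat as perfect
    return float(np.logical_and(pred, gt).sum() / union)

def hand_preservation_iou(pred_masks, gt_masks) -> float:
    """Average the per-frame IoU over (T, H, W) mask stacks for a clip."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(pred_masks, gt_masks)]))
```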
6. Usage Guidelines and Best Practices
- Licensing: EgoEditData is distributed under CC BY-NC 4.0, strictly for non-commercial research purposes.
- Citation: The recommended citation is “EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing” by Li et al., CVPR 2025.
- Training Protocols:
- Sampling: Approximately 28% of edit-pair minibatches should come from EgoEditData, with the remainder drawn from broader exocentric datasets, to maximize generalization; a minimal mixing sketch follows this list.
- Supervision: Hand and object masks should always be included as auxiliary signals (e.g., for cross-attention guidance) to ensure hand-region fidelity during structural edits.
- Latency Optimization: For latency-sensitive real-time applications, apply autoregressive distillation (Self Forcing) over latent chunks of three frames; this achieves sub-second first-frame latency on a single NVIDIA H100 GPU.
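The recommended ~28% sampling ratio can be realized with a simple probabilistic mixer over two batch streams. The sketch below is an illustrative recipe, not the authors' training code; `ego_loader` and `exo_loader` stand in for DataLoaders over EgoEditData and the exocentric corpora.

```python
# Illustrative minibatch mixer for the ~28% EgoEditData sampling ratio.
# ego_loader / exo_loader are placeholders for DataLoaders over EgoEditData
# and the broader exocentric corpora; this is not the authors' training code.
import itertools
import random

EGO_FRACTION = 0.28   # fraction of minibatches drawn from EgoEditData

def mixed_batches(ego_loader, exo_loader, num_steps, seed=0):
    rng = random.Random(seed)
    # itertools.cycle replays each loader's batches so neither stream exhausts.
    ego_iter = itertools.cycle(ego_loader)
    exo_iter = itertools.cycle(exo_loader)
    for _ in range(num_steps):
        if rng.random() < EGO_FRACTION:
            yield next(ego_iter)             # egocentric edit-pair minibatch
        else:
            yield next(exo_iter)             # exocentric minibatch
```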
EgoEditData supports research and development on next-generation egocentric video editors by supplying nearly 100,000 instruction-to-edit pairs, high-quality hand/object annotations, and a rigorous, multi-dimensional benchmark for real-time AR video editing challenges (Li et al., 5 Dec 2025).