EgoEditData: Egocentric Video Editing Dataset
- EgoEditData is a specialized dataset that bridges the gap between third-person and egocentric video editing by focusing on high-motion, hand-centric AR scenarios.
- It employs a multi-stage curation pipeline with automated filtering, precise hand and object segmentation, and rigorous human verification to ensure high annotation fidelity.
- The dataset offers paired 'before/after' edits with detailed metrics for temporal consistency and visual quality, enabling robust benchmarking of egocentric video editing models.
EgoEditData is a large-scale, manually curated dataset designed specifically for the development and benchmarking of instruction-guided egocentric video editing systems. Targeting the domain gap between traditional third-person video editing corpora and the complex, high-motion, hand-centric scenarios of first-person augmented reality (AR) applications, EgoEditData provides a unique resource with dense, high-fidelity hand and object annotations, paired edit examples, and natural language instructions. The dataset forms the foundation for advances in rapid, instruction-aligned, and temporally stable video editing within interactive AR environments, where hand-object interactions and egomotion present exceptional challenges (Li et al., 5 Dec 2025).
1. Motivation and Novelty
Egocentric video editing diverges significantly from exocentric (third-person) editing: existing datasets such as Senorita-2M, InsV2V, and EditVerse consist mostly of static-camera, low-egomotion footage with minimal hand-object interaction. In contrast, AR settings place the camera on the user, resulting in rapid camera pans, head movement, strong self-occlusion from hands, and continuous, direct interaction with manipulated objects. A central obstacle is performing object-level edits (insertion, removal, replacement) while maintaining the integrity of hand regions and the realism of manipulated objects under these motion conditions. Prior datasets do not target this regime and provide little or no footage of nuanced, hand-object-centric egocentric edits. EgoEditData is purpose-built to address this gap, with “before/after” pairs explicitly centered on hand-object manipulations and all source hand regions masked and preserved throughout the editing process (Li et al., 5 Dec 2025).
2. Data Collection and Curation Pipeline
EgoEditData is constructed through a multi-stage filtering, annotation, and augmentation pipeline, integrating both automation and manual oversight for high annotation fidelity:
- Raw Footage Sources: Clips are sourced primarily from Ego4D (~3,000 hours, GoPro-rigged head-mounted cameras) and EgoExo4D (egocentric/exocentric pairs rectified to first-person view). Only high-quality monocular clips from specific GoPro models are retained, with automated jitter and blur filters reducing the raw pool to 1.8% survival.
- Hand Detection and Segmentation: The WiLoR model identifies frames containing hands (confidence ≥ 0.75); the three most confident frames initialize SAM 2 segmentation for dense, temporally smooth hand masks. Human reviewers verify mask accuracy (49.6% retention). A sketch of this frame-selection heuristic appears after this list.
- Object Name Extraction: Frames are subsampled at 2 fps; Qwen2.5-VL is prompted to infer the object in direct hand contact. Clips without unambiguous hand-held objects are pruned.
- Object Masking: Grounded-SAM produces object-conditioned coarse masks, with SAM 2 refinement applied per frame. False positives are removed via mask-confidence thresholding (< 0.4), spatial distance from WiLoR hand keypoints, and human adjudication (43.6% retention at this stage); a sketch of this pruning step also appears after this list.
- Edited Video Synthesis: For each retained clip, GPT-5 Mini suggests 4–5 diverse (ordinary/imaginary) target objects. Qwen-Image generates reference renders, and GPT-5 Mini produces scene descriptions reflecting new object integration. Wan 2.1 VACE 14B (“teacher” model) overlays edited objects onto the video at 1920×1104 px. Human quality control filters final clips (37.8% accept rate after editing).
- Instruction Prompt Generation: All (source clip, edited clip) pairs are passed to GPT-5 Mini, which writes concise, descriptive instruction prompts (e.g., “Replace the black coffee mug with a frosted glass beaker that emits soft steam in the center of frame”).
- Annotation Protocol: Each clip is annotated with high-resolution RGB frames, temporally consistent binary hand/object masks, WiLoR-derived hand skeleton keypoints (XYZ per frame), the edited-object name, and a single free-form instruction. Skeletons and masks are reviewed by humans; no bounding boxes are used.
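The confidence-based frame selection in the hand-detection step reduces to a simple heuristic. The sketch below is a minimal, hypothetical reconstruction in Python: `detect_hands` and `segment_video` are assumed wrappers around WiLoR and SAM 2 and are not names from the released tooling.

```python
# Minimal sketch of the hand-frame selection heuristic (Section 2).
# detect_hands() and segment_video() are hypothetical wrappers around
# WiLoR and SAM 2, respectively -- not part of the released code.

HAND_CONF_THRESHOLD = 0.75   # minimum WiLoR hand confidence (from the paper)
NUM_SEED_FRAMES = 3          # top-k confident frames used to initialize SAM 2

def select_seed_frames(frames, detect_hands):
    """Return indices of the three most confident hand frames (conf >= 0.75)."""
    scored = []
    for idx, frame in enumerate(frames):
        conf = detect_hands(frame)           # assumed: float confidence in [0, 1]
        if conf >= HAND_CONF_THRESHOLD:
            scored.append((conf, idx))
    # Keep the most confident frames; clips with no confident hand frames
    # are dropped from the pool entirely.
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:NUM_SEED_FRAMES]]

def build_hand_masks(frames, detect_hands, segment_video):
    seeds = select_seed_frames(frames, detect_hands)
    if not seeds:
        return None                          # clip rejected: no reliable hands
    # SAM 2 propagates dense, temporally smooth masks from the seed frames.
    return segment_video(frames, seed_frames=seeds)
```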
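Similarly, the object-mask pruning step combines a mask-confidence threshold with a spatial check against the WiLoR hand keypoints. The following is a hedged sketch assuming masks are boolean NumPy arrays and keypoints are 2D pixel coordinates projected to the image plane; the 200 px distance cutoff is an illustrative placeholder, not a value reported by the authors.

```python
# Sketch of object-mask false-positive pruning (Section 2). Masks are assumed
# to be boolean HxW NumPy arrays; hand_keypoints_2d is an (N, 2) array of
# pixel coordinates projected from the WiLoR skeleton. MAX_HAND_DISTANCE_PX
# is an illustrative placeholder, not a published value.
import numpy as np

MASK_CONF_THRESHOLD = 0.4       # masks below this confidence are discarded
MAX_HAND_DISTANCE_PX = 200.0    # hypothetical cutoff for "near the hand"

def keep_object_mask(mask, mask_conf, hand_keypoints_2d):
    """Return True if the candidate object mask survives automatic pruning."""
    if mask_conf < MASK_CONF_THRESHOLD:
        return False                         # low-confidence false positive
    if not mask.any():
        return False                         # empty mask
    # Centroid of the candidate object mask.
    ys, xs = np.nonzero(mask)
    centroid = np.array([xs.mean(), ys.mean()])
    # The edited object must lie close to the hand keypoints: a "hand-held"
    # object far from every keypoint is treated as a spurious detection.
    dists = np.linalg.norm(hand_keypoints_2d - centroid, axis=1)
    return bool(dists.min() <= MAX_HAND_DISTANCE_PX)
```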
3. Dataset Composition and Descriptive Statistics
EgoEditData provides broad coverage in terms of task diversity, object categories, and annotation granularity:
| Statistic | Value | Notes |
|---|---|---|
| Real (original) clips | 10,900 | Sourced from real-world footage |
| Synthetic edits | 38,800 | Model-assisted object manipulation |
| Total unique video clips | 49,700 | ~70 hours of egocentric footage |
| Distinct (source, target, instruction) | 99,700 | All permutation pairs |
| Activity breakdown | ||
| – Change Object (simple replacement) | 54,164 | Majority edit type |
| – Change Object + Special Effects | 39,465 | Includes stylizations |
| – Add Object | 3,651 | Novel insertions |
| – Remove Object | 2,379 | Clean removals |
| Source corpus split | 94% Ego4D / 6% EgoExo4D | |
| Unique source object names | 3,199 | Broad object diversity |
| Unique target object names | 13,632 | Extensive range of replacements |
| Instruction prompt length | μ ≈ 378 chars, σ ≈ 120 chars | Follows log-normal distribution |
| Data splits | 94% train / 3% val / 3% test | Per Ego4D/EgoExo4D proportions |
Each video sample is accompanied by per-frame PNG masks for hands and edited objects, per-frame hand-skeleton keypoints stored as JSON arrays, and a natural-language instruction describing the target transformation; bounding boxes are not provided.
4. Data Formats, Access, and Tooling
EgoEditData is released as a structured directory, organized for efficient access and direct integration with deep-learning workflows:
- Videos: MP4/H.264 format, 1920×1104 px, 30 fps.
- Masks and Skeletons: Hand/object masks (PNG per frame), hand skeleton keypoints (per-frame JSON).
- Instructions and Metadata: JSON records linking each edit tuple to all relevant masks, skeletons, and instructions.
- Directory Layout:
```
/EgoEditData
  /videos/source/…
  /videos/edited/…
  /masks/hands/…
  /masks/objects/…
  /skeletons/…
  annotations.json
```
- Code and Scripts: Published utilities include PyTorch Dataset classes and Jupyter notebooks for overlaying masks, reading skeletons, and joining instructions to frame sequences. This facilitates direct integration with model-training workflows and visualization.
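The released utilities include PyTorch Dataset classes; the snippet below is a simplified sketch of what loading one sample might look like, not the published loader. The annotations.json field names (`source_video`, `edited_video`, `hand_masks`, `object_masks`, `skeleton`, `instruction`) are assumptions for illustration and may differ from the released schema.

```python
# Hypothetical loader sketch for the EgoEditData directory layout above.
# Field names inside annotations.json are assumed for illustration.
import json
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class EgoEditDataset(Dataset):
    def __init__(self, root: str):
        self.root = Path(root)
        with open(self.root / "annotations.json") as f:
            self.records = json.load(f)      # assumed: list of edit tuples

    def __len__(self):
        return len(self.records)

    def _load_masks(self, rel_dir):
        """Stack per-frame PNG masks into a (T, H, W) float tensor."""
        paths = sorted((self.root / rel_dir).glob("*.png"))
        masks = [np.array(Image.open(p).convert("L")) > 0 for p in paths]
        return torch.from_numpy(np.stack(masks)).float()

    def __getitem__(self, idx):
        rec = self.records[idx]
        return {
            "source_video_path": str(self.root / rec["source_video"]),
            "edited_video_path": str(self.root / rec["edited_video"]),
            "hand_masks": self._load_masks(rec["hand_masks"]),
            "object_masks": self._load_masks(rec["object_masks"]),
            # Per-frame XYZ hand keypoints stored as nested JSON arrays.
            "skeleton": json.loads((self.root / rec["skeleton"]).read_text()),
            "instruction": rec["instruction"],
        }
```

A standard DataLoader can batch these records, although variable clip lengths typically require a custom collate function.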
5. Evaluation: EgoEditBench Metrics and Protocol
EgoEditBench constitutes a held-out, manually curated benchmark suite for quantitatively assessing egocentric video editing models along task-specific and perceptual axes:
- Clips and Instructions: 100 “unseen” egocentric clips, each with 15 instructions spanning tasks: Add Object, Remove Object, Change Object, Change Background, Camera-Pose Edits, Stylization, Reasoning, Depth→Video, Sketch→Video, Pose→Video, Video→Pose, Video→Depth, Video→Sketch, and Mixed.
- Automatic Metrics:
- VLM_Eval: Mean framewise CLIP-image/text cosine similarity vs. instruction.
- PickScore (PS): PickNet-based perceptual quality measure.
- Text Alignment (TA): Average frame-level CLIP-text alignment (differently weighted).
- Temporal Consistency (TC): One minus mean pairwise frame embedding distance.
- Mask Metrics:
- Hand-Preservation IoU: Per-frame Intersection-over-Union (IoU) of ground-truth and predicted hand masks.
- Object-Change IoU: Analogous, for object masks.
All metrics are benchmarked against human judgment (over 85% agreement), with scripts provided for reproducible evaluation.
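As a concrete illustration of the TC and IoU definitions above, the following is a minimal sketch assuming frame embeddings are L2-normalized CLIP image features and that “pairwise distance” means cosine distance between consecutive frames; the official evaluation scripts may make different choices.

```python
# Sketch of two EgoEditBench metrics. Assumes frame_embeddings is a (T, D)
# array of L2-normalized CLIP image features and that "pairwise distance"
# refers to cosine distance between consecutive frames; the released
# evaluation scripts may define these differently.
import numpy as np

def temporal_consistency(frame_embeddings: np.ndarray) -> float:
    """TC = 1 - mean cosine distance between consecutive frame embeddings."""
    a, b = frame_embeddings[:-1], frame_embeddings[1:]
    cos_sim = np.sum(a * b, axis=1)          # valid because rows are unit-norm
    return float(1.0 - np.mean(1.0 - cos_sim))

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between predicted and ground-truth binary masks for one frame."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                           # both empty: treat as perfect
    return float(np.logical_and(pred, gt).sum() / union)

def hand_preservation_iou(pred_masks, gt_masks) -> float:
    """Average the per-frame IoU over (T, H, W) mask stacks for a clip."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(pred_masks, gt_masks)]))
```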
6. Usage Guidelines and Best Practices
- Licensing: EgoEditData is distributed under CC BY-NC 4.0, strictly for non-commercial research purposes.
- Citation: The recommended citation is “EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing” by Li et al., CVPR 2025.
- Training Protocols:
- Sampling: Approximately 28% of edit-pair minibatches should come from EgoEditData, with the remainder drawn from broader exocentric datasets, to maximize generalization; a minimal mixing sketch follows this list.
- Supervision: Hand and object masks should always be included as auxiliary signals (e.g., for cross-attention guidance) to ensure hand-region fidelity during structural edits.
- Latency Optimization: For latency-sensitive real-time applications, apply autoregressive distillation (Self Forcing) over latent chunks of three frames; this achieves sub-second first-frame latency on a single NVIDIA H100 GPU.
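The recommended ~28% sampling ratio can be realized with a simple probabilistic mixer over two batch streams. The sketch below is an illustrative recipe, not the authors' training code; `ego_loader` and `exo_loader` stand in for DataLoaders over EgoEditData and the exocentric corpora.

```python
# Illustrative minibatch mixer for the ~28% EgoEditData sampling ratio.
# ego_loader / exo_loader are placeholders for DataLoaders over EgoEditData
# and the broader exocentric corpora; this is not the authors' training code.
import itertools
import random

EGO_FRACTION = 0.28   # fraction of minibatches drawn from EgoEditData

def mixed_batches(ego_loader, exo_loader, num_steps, seed=0):
    rng = random.Random(seed)
    # itertools.cycle replays each loader's batches so neither stream exhausts.
    ego_iter = itertools.cycle(ego_loader)
    exo_iter = itertools.cycle(exo_loader)
    for _ in range(num_steps):
        if rng.random() < EGO_FRACTION:
            yield next(ego_iter)             # egocentric edit-pair minibatch
        else:
            yield next(exo_iter)             # exocentric minibatch
```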
EgoEditData supports research and development on next-generation egocentric video editors by supplying nearly 100,000 instruction-to-edit pairs, high-quality hand/object annotations, and a rigorous, multi-dimensional benchmark for real-time AR video editing challenges (Li et al., 5 Dec 2025).