EgoEditBench: Egocentric Video Editing Benchmark
- EgoEditBench is a benchmark that provides quantitative evaluation of instruction-guided editing in egocentric video scenarios, emphasizing rapid egomotion, hand–object interactions, and real-time constraints.
- It measures performance across instruction faithfulness, hand & interaction preservation, and temporal stability using metrics from video-language models and feature extractors.
- The benchmark supports AR applications and advances research by offering diverse tasks and standardized evaluation protocols for first-person video editing.
EgoEditBench is a standardized evaluation suite designed to quantitatively assess instruction-guided editing of egocentric video, focusing on interactive augmented reality (AR) and first-person use cases. Unlike established benchmarks for third-person (exocentric) video, EgoEditBench explicitly targets the challenges of rapid egomotion, hand–object interactions, and real-time editing, and supports systematic comparison of models and methods on exactly these challenges. It enables reproducible measurement of model performance along the axes of instruction faithfulness, hand and interaction preservation, and temporal stability under camera motion (Li et al., 5 Dec 2025).
1. Motivation and Distinguishing Principles
EgoEditBench addresses domain-specific requirements of egocentric video editing absent in third-person benchmarks. Egocentric content is typified by:
- Rapid egomotion: First-person camera motion induces frequent viewpoint changes, leading to unique temporal and geometric artifacts.
- Hand–object interactions: The omnipresence of the user's hands, often manipulating or occluding objects, produces persistent spatial and semantic complexities for editors.
Existing benchmarks, including EditVerseBench, are built on exocentric footage, do not consider persistent hand appearance or manipulation context, and do not enforce real-time constraints. They therefore fail to measure critical dimensions for AR-style interactive editing. EgoEditBench was constructed to fill this evaluative gap with three objectives:
- Instruction Faithfulness: Quantify alignment between the edited video and the user's natural-language directive.
- Hand & Interaction Preservation: Measure the degree to which hand appearance, pose, and local manipulation context are retained post-edit.
- Temporal Stability under Egomotion: Assess frame-to-frame consistency in the presence of significant camera movement, in particular resilience to flicker and geometric artifacts induced by egomotion.
Standardization of these evaluation dimensions enables repeatable, quantitative comparison of egocentric editors for both research benchmarking and applied deployment (Li et al., 5 Dec 2025).
2. Core Metrics and Evaluation Protocols
EgoEditBench implements quantitative metrics, computed per video instance and averaged across the benchmark. Let $T$ denote the number of frames in the edited clip, $x_t$ the $t$-th frame, $c$ the instruction, and $X = \{x_t\}_{t=1}^{T}$ and $\hat{X} = \{\hat{x}_t\}_{t=1}^{T}$ the source and edited sequences.
Instruction Faithfulness (VLM Score):
Uses a pretrained video-LLM (CLIP-based) with image encoder $\phi_v$ and text encoder $\phi_{\text{txt}}$:

$$\mathrm{VLM}(\hat{X}, c) = \frac{1}{T}\sum_{t=1}^{T}\cos\!\big(\phi_v(\hat{x}_t),\, \phi_{\text{txt}}(c)\big),$$

where $\cos(\cdot,\cdot)$ denotes cosine similarity. Higher scores indicate edits closely aligned with the intended instruction.
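A minimal sketch of this per-frame scoring, assuming an off-the-shelf CLIP checkpoint from the `transformers` library as the video-language backbone (the benchmark's exact model is not specified here); frames are scored against the instruction and averaged:

```python
# Hedged sketch: per-frame CLIP similarity between the edited frames and the
# instruction, averaged over the clip. The checkpoint is illustrative only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def vlm_score(edited_frames: list[Image.Image], instruction: str) -> float:
    """Mean cosine similarity between each edited frame and the instruction."""
    inputs = processor(text=[instruction], images=edited_frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)  # (T, D)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)    # (1, D)
    return (img @ txt.T).mean().item()
```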
Hand & Interaction Preservation (PickScore):
Evaluates local appearance preservation for hands and manipulated objects in masked regions $M_t$:

$$\mathrm{PickScore}(X, \hat{X}) = \frac{1}{T}\sum_{t=1}^{T}\frac{1}{|P_t|}\sum_{p \in P_t}\cos\!\big(\psi(x_t^{(p)}),\, \psi(\hat{x}_t^{(p)})\big),$$

where $\psi$ is a feature extractor (VGG, CLIP), $x_t^{(p)}$ is a patch from $x_t$, $\hat{x}_t^{(p)}$ is the corresponding edited patch, and $P_t$ indexes patches sampled within the hand/object masks $M_t$.
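A sketch of this masked-region preservation measure, assuming frames as PIL images, binary hand/object masks as NumPy arrays, and a torchvision VGG16 backbone standing in for the feature extractor $\psi$:

```python
# Hedged sketch: cosine similarity of feature embeddings for hand/object
# regions in source vs. edited frames. VGG16 is a stand-in extractor; frames
# are PIL images and masks are binary (H, W) NumPy arrays.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

weights = VGG16_Weights.DEFAULT
extractor = vgg16(weights=weights).features.eval()
preprocess = weights.transforms()

def _embed(patch):
    x = preprocess(patch).unsqueeze(0)        # (1, 3, 224, 224)
    with torch.no_grad():
        return extractor(x).flatten(1)        # (1, D) feature vector

def preservation_score(src_frames, edited_frames, masks) -> float:
    """Average cosine similarity over mask bounding-box crops of paired frames."""
    sims = []
    for src, edit, mask in zip(src_frames, edited_frames, masks):
        ys, xs = mask.nonzero()
        if len(xs) == 0:                      # frame without hands/objects
            continue
        box = (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)
        sims.append(F.cosine_similarity(_embed(src.crop(box)),
                                        _embed(edit.crop(box))).item())
    return sum(sims) / max(len(sims), 1)
```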
Temporal Stability under Egomotion (TC):
Quantifies frame-to-frame embedding similarity:

$$\mathrm{TC}(\hat{X}) = \frac{1}{T-1}\sum_{t=1}^{T-1}\cos\!\big(\phi(\hat{x}_t),\, \phi(\hat{x}_{t+1})\big),$$

where $\phi$ is a frame encoder (e.g., CLIP-frame, DINO). A high TC marks temporal smoothness under egomotion.
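A sketch of the temporal-consistency computation, again using a CLIP image encoder as an illustrative choice of frame encoder $\phi$ (DINO features could be substituted):

```python
# Hedged sketch: mean cosine similarity between consecutive edited frames,
# embedded with a CLIP image encoder (an illustrative choice of frame encoder).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def temporal_consistency(edited_frames) -> float:
    """TC = average cosine similarity of adjacent frame embeddings."""
    inputs = processor(images=edited_frames, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)          # (T, D)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return (emb[:-1] * emb[1:]).sum(dim=-1).mean().item()
```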
The protocol evaluates every method on every task with the same prompts, and all metrics are averaged per task so that disproportionately large categories do not dominate (Li et al., 5 Dec 2025).
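This task-balanced aggregation amounts to a macro-average over task categories; a minimal sketch, assuming per-instance scores tagged with their task name:

```python
# Hedged sketch: macro-average over task categories so that large tasks
# (e.g., Change Object with 400 instances) do not dominate the overall score.
from collections import defaultdict

def benchmark_score(per_instance_scores):
    """per_instance_scores: iterable of (task_name, score) pairs."""
    by_task = defaultdict(list)
    for task, score in per_instance_scores:
        by_task[task].append(score)
    task_means = [sum(s) / len(s) for s in by_task.values()]
    return sum(task_means) / len(task_means)

# A 400-instance task and a 50-instance task contribute equally:
scores = [("Change Object", 0.8)] * 400 + [("Add Object", 0.4)] * 50
assert abs(benchmark_score(scores) - 0.6) < 1e-9
```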
3. Benchmark Composition and Task Diversity
EgoEditBench consists of 1,700 editing instances spanning 15 task categories, sampled from egocentric sequences withheld from the main EgoEditData corpus. Source and object diversity is enforced by clustering captions with BERT embeddings, yielding 10 clusters × 10 samples = 100 unique source videos.
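A sketch of this diversity-enforcing selection step, assuming [CLS] embeddings from `bert-base-uncased` and scikit-learn KMeans; the exact embedding model and per-cluster sampling rule used by the benchmark are not specified here:

```python
# Hedged sketch: cluster caption embeddings into 10 groups and sample 10
# captions (source videos) per cluster, yielding 100 diverse sources.
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed_captions(captions):
    batch = tok(captions, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**batch).last_hidden_state
    return hidden[:, 0].numpy()                # [CLS] embedding per caption

def select_source_videos(captions, n_clusters=10, per_cluster=10, seed=0):
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(embed_captions(captions))
    rng = np.random.default_rng(seed)
    picks = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        picks.extend(rng.choice(members, size=min(per_cluster, len(members)),
                                replace=False).tolist())
    return picks                               # indices of selected source videos
```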
Task taxonomy and instance counts:
| Task Category | Instances |
|---|---|
| Change Object | 400 |
| Add Object | 50 |
| Remove Object | 50 |
| Change Background | 100 |
| Change Camera Pose | 100 |
| Add Effect | 100 |
| Stylization | 100 |
| Reasoning | 100 |
| Depth-to-Video | 100 |
| Sketch-to-Video | 100 |
| Pose-to-Video | 100 |
| Video-to-Depth | 100 |
| Video-to-Sketch | 100 |
| Video-to-Pose | 100 |
| Combined Task (multi-edit) | 100 |
For inputs requiring auxiliary signals, per-frame Canny edges (OpenCV), depth maps (Depth Anything), and 2D poses (DWpose) are synthesized. Tasks requiring external reference images (e.g., Propagation, Inpainting) are excluded to ensure fairness (Li et al., 5 Dec 2025).
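The edge-map conditioning can be reproduced directly with OpenCV; the depth (Depth Anything) and pose (DWpose) signals depend on external model interfaces not detailed here, so only the Canny step is sketched, with illustrative thresholds:

```python
# Hedged sketch: per-frame Canny edge maps as the Sketch-to-Video condition.
# Thresholds (100, 200) are illustrative, not the benchmark's exact settings.
import cv2

def canny_condition(video_path: str, low: int = 100, high: int = 200):
    cap = cv2.VideoCapture(video_path)
    edge_maps = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        edge_maps.append(cv2.Canny(gray, low, high))   # single-channel edge map
    cap.release()
    return edge_maps
```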
4. Experimental Setup and Real-Time Protocol
Training:
Models are finetuned on EgoEditData (99.7k egocentric video-instruction pairs), supplemented with 1.3M external video-edit and 3.5M image-edit samples. Training uses AdamW (learning rate , weight decay 0.1).
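A minimal optimizer sketch matching the stated weight decay; the learning rate below is a placeholder, not the reported value:

```python
# Hedged sketch: AdamW with the stated weight decay of 0.1. The learning rate
# (1e-4) and the tiny stand-in model are placeholders, not the paper's values.
import torch

model = torch.nn.Linear(8, 8)   # stand-in for the video editing network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)
```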
Distillation and Streaming:
DMD distillation reduces the number of diffusion steps, and self-forcing enables autoregressive streaming generation with a chunk size of 3 latent frames.
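A schematic of the chunked autoregressive streaming loop described above; `few_step_denoiser` and the latent shape are hypothetical placeholders for the distilled model and its latent space:

```python
# Hedged sketch: stream edited latents in chunks of 3 frames, conditioning
# each chunk on previously generated latents (autoregressive streaming).
# `few_step_denoiser` and the latent shape are hypothetical placeholders.
import torch

CHUNK = 3  # latent frames generated per chunk

def few_step_denoiser(noisy_chunk, context):
    # Placeholder: a distilled (DMD) model would denoise the chunk in a small,
    # fixed number of steps, conditioned on the preceding latents.
    return noisy_chunk

def stream_edit(num_latent_frames: int, latent_shape=(16, 32, 32)):
    generated = []
    for start in range(0, num_latent_frames, CHUNK):
        n = min(CHUNK, num_latent_frames - start)
        noise = torch.randn(n, *latent_shape)
        context = generated[-CHUNK:] if generated else []
        chunk = few_step_denoiser(noise, context)
        generated.extend(chunk)
        yield chunk   # decode and display immediately ("watch-as-you-generate")
```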
Inference:
- Resolution: px, 16 fps.
- Classifier-free guidance: 7.5 (editing).
- Real-time streaming: processes 9 latent frames, then 12 per chunk, supporting "watch-as-you-generate" interactivity.
Latency constraints:
- Sub-second first-frame latency target.
- EgoEdit-RT achieves 855 ms first-frame latency, 38.1 fps end-to-end (single NVIDIA H100, including autoencoder). This enforces hardware-admissible real-time throughput as a primary design constraint (Li et al., 5 Dec 2025).
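A simple timing harness matching this protocol (first-frame latency and end-to-end fps); `stream_edit_frames` is a hypothetical generator yielding decoded frames:

```python
# Hedged sketch: measure first-frame latency and end-to-end throughput of a
# streaming editor. `stream_edit_frames` is a hypothetical frame generator.
import time

def profile_stream(stream_edit_frames, request):
    t0 = time.perf_counter()
    first_frame_latency = None
    frames = 0
    for _ in stream_edit_frames(request):
        frames += 1
        if first_frame_latency is None:
            first_frame_latency = time.perf_counter() - t0
    total = time.perf_counter() - t0
    return {"first_frame_ms": first_frame_latency * 1e3,
            "end_to_end_fps": frames / total}
```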
5. Quantitative Comparison and Results
EgoEditBench demonstrates substantial domain gaps and model variance:
| Method | VLM ↑ | PickScore ↑ | TextAlign ↑ | TempCons ↑ |
|---|---|---|---|---|
| TokenFlow | 4.99 | 18.91 | 15.89 | 95.04 |
| STDF | 4.59 | 18.69 | 15.64 | 93.96 |
| Señorita-2M | 7.52 | 18.85 | 16.25 | 95.86 |
| AnyV2V | 6.72 | 18.65 | 15.35 | 92.37 |
| InsV2V | 5.24 | 18.81 | 14.92 | 94.01 |
| LucyEdit | 5.44 | 18.87 | 15.03 | 94.41 |
| StreamDiffusion | 4.32 | 18.92 | 14.15 | 86.83 |
| StreamDiffV2 | 2.55 | 18.63 | 12.75 | 94.31 |
| EgoEdit | 7.76 | 19.21 | 16.89 | 96.70 |
| EgoEdit-RT | 7.71 | 19.13 | 16.34 | 96.41 |
Relative to LucyEdit, EgoEdit gains +2.32 VLM, +0.34 PickScore, +1.86 TextAlign, and +2.29 TempCons, with EgoEdit-RT almost matching non-streaming EgoEdit performance (a VLM drop of only 0.05). On exocentric benchmarks such as EditVerseBench, EgoEdit(-RT) remains competitive (–0.26 VLM vs EditVerse) while achieving clear gains in egocentric scenarios (Li et al., 5 Dec 2025).
6. Construction Innovations and Significance
Key construction features include:
- Video source hold-out: Source videos are strictly excluded from EgoEditData training to ensure generalization.
- Diversity by clustering: Use of BERT-driven object+scene clustering enforces task and scenario variety.
- GPT-driven prompt generation: Instruction prompts (including spatial and reasoning-based tasks) are formulated with GPT-5 Mini, enhancing naturalness and context sensitivity.
- Automated signal synthesis: Auxiliary signals for X-to-Video tasks (e.g., edge, pose, depth) are automatically generated, ensuring consistent conditioning.
- Task balancing: Equal-weight averaging precludes large tasks (e.g., Change Object) from dominating the overall score.
- Real-time enforced protocols: Protocol design includes explicit latency and streaming constraints, ensuring practical applicability in interactive AR contexts.
The holistic design of EgoEditBench underpins its suitability for driving progress in real-time, reliable first-person video editing and interactive AR applications, providing granularity on key axes not addressed by existing benchmarks (Li et al., 5 Dec 2025).
7. Implications and Future Directions
EgoEditBench provides a reproducible, quantitative platform for advancing egocentric video editors. Its metric structure foregrounds instruction following, preservation of manipulation context, and robustness to camera motion. By integrating hold-out video selection, clustering-based diversity, and real-time system-level constraints, EgoEditBench catalyzes research toward scalable and AR-admissible video editing methods.
A plausible implication is that continued expansion (e.g., broader objective coverage, more complex reasoning tasks, expanded sensor modalities) will enhance the utility of EgoEditBench as AR/VR hardware, editing workflows, and user-interaction paradigms evolve.