
EgoEditBench: Egocentric Video Editing Benchmark

Updated 11 December 2025
  • The paper introduces EgoEditBench, a comprehensive evaluation suite that quantifies instruction faithfulness, temporal consistency, and preservation of hand-object cues in egocentric video editing.
  • It comprises a diverse set of 15 editing tasks over 100 curated source videos, selected via k-means clustering of caption embeddings, with instructions generated by GPT-5 Mini to reduce annotation bias.
  • On the benchmark, EgoEdit variants outperform exocentric baselines by 2–3 points in VLM score while supporting real-time, AR-ready editing.

EgoEditBench is a standardized evaluation suite for instruction-guided editing of egocentric videos, specifically designed to address the distinct challenges encountered in first-person, wearable-camera scenarios. Egocentric video presents a significant domain gap from third-person footage due to high-frequency egomotion and pervasive hand–object interactions. Existing benchmarks do not account for these complexities, motivating the creation of EgoEditBench as part of the EgoEdit ecosystem. EgoEditBench focuses on assessing temporal stability, instruction faithfulness, and the accurate preservation of hand and interaction cues, underpinning the quantitative analysis of egocentric video editing methods (Li et al., 5 Dec 2025).

1. Motivation and Scope

EgoEditBench is designed to measure the capabilities of video editors under egocentric conditions, where both content manipulation and temporal coherence must occur amidst rapid camera motion and frequent foreground occlusions from the actor’s hands. The suite complements EgoEditData, a curated dataset emphasizing hand–object interactions, and supports the broader goal of enabling real-time, streaming, instruction-following AR video editing systems operating on a single GPU. It fills a critical evaluation gap, as previous benchmarks and editing pipelines are primarily oriented toward third-person, exocentric video domains, and exhibit elevated latency or degraded performance in egocentric contexts.

2. Task Coverage and Dataset Curation

EgoEditBench comprises 15 specific editing tasks spanning object-level modifications, inverse generation tasks, and compound, multi-task scenarios. These include Add/Remove/Change Object, Change Background, Change Camera Pose, Add Effect, Stylization, and Reasoning. Inverse tasks encompass mappings such as Depth→Video, Sketch→Video, Pose→Video, as well as the corresponding Video→Depth, Video→Sketch, and Video→Pose cases. The evaluation suite also contains multi-task compound edits to assess method generality.
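
For reference, the enumeration below collects the task names listed above into the three families used later in Section 4. It is an illustrative reading of this article, not an official task registry; in particular, placing Reasoning and Multi-task in the compound family follows the table in Section 4.

```python
# Illustrative enumeration of the 15 EgoEditBench tasks named in the text,
# grouped into the three families used in Section 4. The grouping is an
# assumption based on this article, not an official task list.
EGOEDITBENCH_TASKS = {
    "object_level": [
        "Add Object", "Remove Object", "Change Object", "Change Background",
        "Change Camera Pose", "Add Effect", "Stylization",
    ],
    "inverse_generation": [
        "Depth->Video", "Sketch->Video", "Pose->Video",
        "Video->Depth", "Video->Sketch", "Video->Pose",
    ],
    "compound": ["Multi-task", "Reasoning"],
}

assert sum(len(tasks) for tasks in EGOEDITBENCH_TASKS.values()) == 15
```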

To ensure content diversity and reduce overlap with training data, the 100 source videos for EgoEditBench are selected by k-means clustering of BERT-embedded object and scene captions. Instructions are generated programmatically with GPT-5 Mini, and for the X→Video tasks the required auxiliary representations are synthesized (Canny edges for Sketch, DWPose maps for Pose, and Depth Anything maps for Depth). None of the source videos appear in the training sets, ensuring fair evaluation (Li et al., 5 Dec 2025).
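
The selection step can be approximated with standard tooling. The sketch below is a minimal illustration, assuming caption embeddings are already precomputed as a NumPy array; it uses scikit-learn's KMeans to pick, for each of 100 clusters, the video whose caption embedding lies nearest the centroid. Function and variable names are hypothetical, not taken from the paper's codebase.

```python
# Minimal sketch of diversity-driven source-video selection, assuming
# `caption_embeddings` holds one BERT caption embedding per candidate video
# (shape: [num_videos, dim]). Names are illustrative, not from the paper.
import numpy as np
from sklearn.cluster import KMeans


def select_diverse_videos(caption_embeddings: np.ndarray, k: int = 100) -> list[int]:
    """Cluster caption embeddings and return one representative video per cluster."""
    kmeans = KMeans(n_clusters=k, random_state=0, n_init="auto")
    labels = kmeans.fit_predict(caption_embeddings)

    selected = []
    for cluster_id in range(k):
        members = np.where(labels == cluster_id)[0]
        # Representative = member closest to the cluster centroid.
        dists = np.linalg.norm(
            caption_embeddings[members] - kmeans.cluster_centers_[cluster_id], axis=1
        )
        selected.append(int(members[np.argmin(dists)]))
    return selected
```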

3. Evaluation Metrics and Formal Definitions

EgoEditBench adopts metrics paralleling those in EditVerseBench but adapts them for egocentric scenarios. Four principal quantitative measures are used:

  • VLM Score: Cosine similarity between a multimodal LLM embedding of the edited video and the embedding of the text instruction, quantifying instruction faithfulness:

\mathrm{VLM}(\hat{Y}, c) = \cos\big(\phi_{\mathrm{video}}(\hat{Y}),\, \phi_{\mathrm{text}}(c)\big)

  • PickScore (PS): A composite metric reflecting both generation quality and faithfulness.
  • Text Alignment (TA): CLIP similarity averaged over frames between generated visuals and the textual instruction:

\mathrm{TA} = \frac{1}{T} \sum_{t=1}^{T} \cos\big(\phi_{\mathrm{frame}}(y_t),\, \phi_{\mathrm{text}}(c)\big)

  • Temporal Consistency (TC): Mean framewise CLIP similarity across temporally adjacent frames:

\mathrm{TC} = \frac{1}{T-1} \sum_{t=1}^{T-1} \cos\big(\phi_{\mathrm{frame}}(y_t),\, \phi_{\mathrm{frame}}(y_{t+1})\big)

This metric suite jointly evaluates semantic alignment, intra-sequence coherence, and fine-grained faithfulness to human language instructions.
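
As a concrete reading of the TA and TC definitions above, the following sketch computes both from per-frame CLIP image embeddings and a CLIP text embedding. It assumes the embeddings have already been extracted into NumPy arrays and is a minimal illustration, not the benchmark's reference implementation.

```python
# Minimal sketch of Text Alignment (TA) and Temporal Consistency (TC),
# assuming `frame_embs` has shape [T, d] (one CLIP image embedding per frame)
# and `text_emb` has shape [d] (CLIP embedding of the instruction).
import numpy as np


def _cos(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two broadcastable embedding arrays."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)


def text_alignment(frame_embs: np.ndarray, text_emb: np.ndarray) -> float:
    """TA: mean cosine similarity between each frame and the instruction embedding."""
    return float(np.mean(_cos(frame_embs, text_emb[None, :])))


def temporal_consistency(frame_embs: np.ndarray) -> float:
    """TC: mean cosine similarity between temporally adjacent frames."""
    return float(np.mean(_cos(frame_embs[:-1], frame_embs[1:])))
```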

4. Reference Tasks and Distribution

The evaluation targets egocentric-specific editing skills, with tasks broadly grouped as follows:

| Task Class | Example Tasks | Input Modalities |
| --- | --- | --- |
| Object-level edit | Add, Remove, Change Object; Stylization | Video + Instruction |
| Inverse generation | Sketch→Video, Depth→Video, Pose→Video | Sketch, Depth, Pose |
| Compound | Multi-task, Reasoning | Mixed |

Object and background manipulations primarily test local editing with preservation of egocentric cues such as intact hands and plausible hand–object affordances. Inverse and compound edits probe the model’s ability to synthesize realistic video conditionally, under strong appearance and motion constraints.

5. Baseline Performance and Ablative Analysis

Quantitative results show that EgoEdit and its streaming variant EgoEdit-RT outperform exocentric baselines (Lucy Edit, InsV2V, TokenFlow, and prior real-time methods) by 2–3 points in VLM score on EgoEditBench:

  • EgoEdit: VLM = 7.76, PS = 19.21, TA = 16.89, TC = 96.70
  • EgoEdit-RT: VLM = 7.71, PS = 19.13, TA = 16.34, TC = 96.41

While exocentric editors achieve comparable results on third-person footage, they fail to match EgoEdit variants on egocentric benchmarks. Ablations substantiate the necessity of EgoEditData (removing it reduces VLM from 7.85 to 4.87) and show that streaming distillation preserves quality (VLM drops only from 7.76 to 7.71) while delivering substantial throughput gains. EditVerseBench results on third-person video confirm state-of-the-art generalization for both variants (Li et al., 5 Dec 2025).

6. Significance and Future Directions

EgoEditBench provides a rigorous, domain-specific testing ground for real-time, instruction-guided AR video editing under conditions reflective of practical AR deployment. By codifying metrics, establishing diverse egocentric editing scenarios, and offering a challenging set of tasks, EgoEditBench supports the targeted advancement of temporal coherence, content fidelity, and interaction-aware editing systems.

Limitations identified in current streaming architectures—such as first-frame latency, resolution constraints (512×384@16fps), and vulnerabilities to occlusion and out-of-distribution instructions—motivate avenues for further research. Suggested directions include multi-view, high-resolution editing, explicit modeling of hand-object cues, joint audio–visual generation, and hybrid pipelines integrating explicit egomotion or pose tracking modules. The public release of EgoEditBench and its companion dataset is intended to catalyze progress towards immersive, instruction-guided AR experiences (Li et al., 5 Dec 2025).
