EgoEdit: Egocentric Video Editing

Updated 9 December 2025
  • EgoEdit is a comprehensive ecosystem for first-person video editing, addressing rapid egomotion, hand-object interactions, and domain gaps.
  • It integrates EgoEditData, a large annotated corpus, with a multi-stage generative model using advanced distillation and streaming inference techniques.
  • EgoEditBench provides a standardized evaluation suite, on which EgoEdit demonstrates significant improvements in instruction faithfulness, temporal consistency, and text alignment over prior baselines.

EgoEdit defines a comprehensive ecosystem for instruction-guided editing of egocentric (first-person) video, uniquely addressing the challenges of rapid egomotion, frequent hand–object interactions, and the domain gap between egocentric and third-person footage. It comprises three primary components: EgoEditData, a purpose-built annotated dataset for egocentric editing; EgoEdit, a real-time instruction-following generative video editing model with streaming capabilities; and EgoEditBench, a standardized benchmark and evaluation suite for egocentric video editing performance. EgoEdit delivers temporally stable, instruction-faithful results in real time on a single GPU, demonstrating clear improvements over leading baselines specifically in egocentric settings while maintaining competitive performance on generic editing tasks (Li et al., 5 Dec 2025).

1. EgoEditData: A Dedicated Egocentric Video-Editing Corpus

EgoEditData is explicitly designed for egocentric video editing. It comprises 10,900 source videos (93.6% from Ego4D, 6.4% from EgoExo4D) together with 38,800 synthetically edited variants (mean ≈3.6 edits per source), reaching a total of 99,700 edit pairs (~70 hours; 1920×1104 px generation resolution, 512×384 px model input). Average clip length is 5 s at 16 fps.

Hand–object interactions serve as the organizing principle throughout, with annotation targeting four principal instruction categories:

  1. Change Object (54,164 pairs)
  2. Change Object + Special Effects (39,465 pairs)
  3. Add Object (3,651 pairs)
  4. Remove Object (2,379 pairs)

Additionally, auxiliary tasks are included for depth-generation, sketching, pose modification, background/camera changes, stylization, and compositional reasoning, as well as "combined" prompts that chain edits. The annotation protocol is highly structured: source videos are filtered for quality, hand masks are extracted via WiLoR→SAM2 and manually reviewed (49.6% retained), object names are extracted using Qwen2.5-VL (retaining clips with active interaction only), and object masks are generated by Grounded-SAM→SAM2 with geometric heuristics and further manual review (43.6% retained). Object editing is managed through VACE-14B, with edit proposals synthesized by GPT-5 Mini and further manual curation (37.8% retained overall). Natural language edit instructions are generated using GPT-5 Mini, emphasizing scene-awareness and precision.
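
This curation cascade can be summarized as the sketch below; every callable is a stand-in for the tools and manual-review steps named above, so the function is illustrative rather than a released implementation.

```python
from typing import Callable, Iterable, List, Tuple

Clip = dict                        # placeholder for however a clip record is represented
EditPair = Tuple[Clip, Clip, str]  # (source clip, edited clip, instruction)

def curate_edit_pairs(
    clips: Iterable[Clip],
    has_valid_hand_mask: Callable[[Clip], bool],    # WiLoR -> SAM2 + manual review (49.6% retained)
    has_active_object: Callable[[Clip], bool],      # Qwen2.5-VL object naming; drop passive clips
    has_valid_object_mask: Callable[[Clip], bool],  # Grounded-SAM -> SAM2 + heuristics + review (43.6% retained)
    propose_edits: Callable[[Clip], List[str]],     # GPT-5 Mini proposals (mean ~3.6 accepted edits per source)
    render_edit: Callable[[Clip, str], Clip],       # VACE-14B synthesis of the edited clip
    write_instruction: Callable[[Clip, str], str],  # GPT-5 Mini scene-aware instruction
    accept_edit: Callable[[Clip], bool],            # final manual curation (37.8% retained overall)
) -> List[EditPair]:
    """Sketch of the EgoEditData curation cascade; all callables stand in for the tools above."""
    pairs: List[EditPair] = []
    for clip in clips:
        if not (has_valid_hand_mask(clip)
                and has_active_object(clip)
                and has_valid_object_mask(clip)):
            continue
        for proposal in propose_edits(clip):
            edited = render_edit(clip, proposal)
            if accept_edit(edited):
                pairs.append((clip, edited, write_instruction(clip, proposal)))
    return pairs
```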

Statistically, EgoEditData offers 3,199 unique source objects and 13,632 unique target objects, reflecting the diversity of the GPT-5 Mini proposals, with a mean prompt length of ~378 characters and a scenario distribution balanced across the top ten Ego4D-defined tasks (e.g., "kitchen," "workbench," "street"). This dataset forms the empirical foundation upon which the model's egocentric editing specialization is built.
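
Taken together, a single EgoEditData sample might look like the record sketched below; the field names and category strings are illustrative assumptions, not a published schema.

```python
from dataclasses import dataclass

@dataclass
class EgoEditSample:
    """One edit pair: a ~5 s, 16 fps source clip and a synthetically edited target."""
    source_video: str    # clip drawn from Ego4D or EgoExo4D
    edited_video: str    # VACE-14B-generated edited variant
    instruction: str     # GPT-5 Mini edit instruction (~378 characters on average)
    category: str        # e.g. "change_object", "add_object", "remove_object"
    hand_mask: str       # per-frame hand masks (WiLoR -> SAM2, manually reviewed)
    object_mask: str     # per-frame object masks (Grounded-SAM -> SAM2)
    source_object: str   # interacted object named by Qwen2.5-VL
    target_object: str   # replacement object proposed by GPT-5 Mini
```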

2. EgoEdit: Model Architecture and Real-Time Streaming Inference

EgoEdit employs a multi-stage pipeline. Pretraining centers on a text-to-video generator (latent DiT with Rectified-Flow flow matching). The model is then finetuned as an editor on EgoEditData supplemented with mixed image/video editing corpora. Distillation takes two forms: bidirectional DMD yields a four-step model; autoregressive Self-Forcing enables real-time streaming inference.
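
The staged recipe can be summarized as the configuration sketch below; stage names and fields are descriptive labels, not hyperparameters from the paper.

```python
# Descriptive outline of EgoEdit's training stages (labels only; no hyperparameters implied).
TRAINING_STAGES = [
    {"stage": "pretrain_t2v",          # latent DiT trained as a text-to-video generator
     "objective": "rectified-flow flow matching"},
    {"stage": "edit_finetune",         # instruction-following editor
     "data": "EgoEditData + mixed image/video editing corpora"},
    {"stage": "distill_dmd",           # bidirectional DMD -> four-step sampler
     "steps": 4},
    {"stage": "distill_self_forcing",  # autoregressive Self-Forcing -> streaming EgoEdit-RT
     "mode": "chunked autoregressive streaming"},
]
```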

Core modules include a Wan 2.1 latent autoencoder that compresses 512×384×80 RGB inputs to 64×48×20 latent tokens. The backbone is a DiT-style Transformer: 32 layers, hidden size 4096, 32 attention heads, with both self- and cross-attention over T5+CLIP text tokens. Temporal position is encoded via MLP time embeddings; QK-normalization and FlashAttention ensure computational efficiency and stability. Conditioning occurs channel-wise, concatenating the source and noisy target latents to avoid quadratic self-attention scaling.
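
The shape bookkeeping behind channel-wise conditioning is sketched below; the latent channel count `C = 16` is an assumption, and the real DiT additionally projects to the 4096-dimensional hidden size and attends over T5+CLIP text tokens.

```python
import torch

B, C = 1, 16                           # batch size; C = 16 latent channels is an assumption
T, H, W = 20, 48, 64                   # 64x48x20 latent grid from the Wan 2.1 autoencoder

x_src = torch.randn(B, C, T, H, W)     # encoded source clip
x_noisy = torch.randn(B, C, T, H, W)   # noisy target latent at flow time t

# Channel-wise conditioning: concatenating along channels leaves the number of
# spatiotemporal positions (T*H*W) unchanged, so self-attention cost does not grow
# the way sequence-wise concatenation of source tokens would.
x_in = torch.cat([x_noisy, x_src], dim=1)      # (B, 2C, T, H, W)
tokens = x_in.flatten(2).transpose(1, 2)       # (B, T*H*W, 2C)
print(tokens.shape)                            # torch.Size([1, 61440, 32])
```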

The primary input consists of a noisy target latent $X_t\in\mathbb{R}^{C\times T'\times H'\times W'}$, a corresponding source latent $X^s$, and text tokens $c$. Output is the predicted "velocity" $v = dX/dt$, which is subsequently decoded to RGB. The key training objective is flow matching: for $X_t = (1-t)\cdot X + t\cdot n$ with $n\sim\mathcal{N}(0,I)$, the generator $G$ predicts $v_t \approx (n-X)$, minimizing

$$\mathcal{L}_{RF} = \mathbb{E}_{t,X,n}\left[\|G(X_t, t) - (n - X)\|_2^2\right].$$

During editing fine-tuning, this loss is conditioned on the paired source, prompt, and target latent.
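
A minimal sketch of this source-conditioned objective follows; the `generator` call signature (noisy latent, source latent, flow time, text embedding) is an assumed interface, not the model's actual API.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(generator, x_target, x_source, text_emb):
    """L_RF = E_{t,X,n} || G(X_t, t) - (n - X) ||_2^2, conditioned on source and prompt."""
    B = x_target.shape[0]
    t = torch.rand(B, device=x_target.device)          # flow time sampled in (0, 1)
    noise = torch.randn_like(x_target)                 # n ~ N(0, I)
    t_ = t.view(B, *([1] * (x_target.dim() - 1)))      # broadcast t over C, T, H, W
    x_t = (1.0 - t_) * x_target + t_ * noise           # X_t = (1 - t) X + t n
    v_pred = generator(x_t, x_source, t, text_emb)     # predicted velocity (assumed interface)
    return F.mse_loss(v_pred, noise - x_target)        # regress onto the target velocity n - X
```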

Temporal stability is an emergent property of the autoregressive sampling procedure in distillation; no explicit temporal loss is necessary.

Streaming techniques include chunked (three latent frames at a time) autoregressive inference with a four-step solver (4 NFEs per chunk), KV-caching for attention reuse, and channel-wise concatenation maintaining a constant token budget. On a single H100 GPU at 512×384 px, first-frame latency is 855 ms (model: 76 ms; autoencoder: 217 ms; recording: 562 ms), with a throughput of ~38 fps, enabling real-time interaction.
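
A schematic of the streaming loop is sketched below; `denoise_chunk` and `decode` are hypothetical stand-ins for one distilled generator call and the Wan 2.1 decoder, and the exact noise schedule of the four-step solver is not reproduced.

```python
import torch

def stream_edit(source_latent_chunks, denoise_chunk, decode, num_steps=4):
    """Chunked autoregressive editing: three latent frames per chunk, 4 NFEs per chunk.

    `denoise_chunk(x, src, step, kv_cache)` and `decode(x)` are placeholders; the KV cache
    carries attention state forward so earlier chunks condition later ones.
    """
    kv_cache = {}                                  # reused attention K/V across chunks
    for src in source_latent_chunks:               # each src: a 3-latent-frame source chunk
        x = torch.randn_like(src)                  # schematic: start each chunk from noise
        for step in range(num_steps):              # 4 solver steps (4 NFEs) per chunk
            x = denoise_chunk(x, src, step, kv_cache)
        yield decode(x)                            # RGB frames emitted as soon as the chunk is done
```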

3. EgoEditBench: Egocentric Editing Benchmark and Evaluation Suite

EgoEditBench serves as the primary standardized evaluation platform for egocentric video editing. It employs 100 held-out source videos from Ego4D, selected and clustered by object and scene. Fifteen diverse tasks span object addition/removal, effect insertion, object/background/camera-pose changes, stylization, reasoning, and various image-to-video mappings (depth, pose, sketch), with "combined" multi-task instructions. All instructions are generated synthetically via GPT-5 Mini; image-based tasks use depth/OpenCV pose/Canny sketch inputs.

Evaluation relies on four principal metrics:

  • Instruction Faithfulness (VLM Score):

$$\mathrm{VLM}(c, X_\text{out}) = \frac{1}{N}\sum_{i=1}^N\cos\big(\text{Embed}(c), \text{Embed}(f_i)\big)$$

where $f_i$ are generated frames and $c$ is the instruction.

  • PickScore (PS): CLIP-based realism metric, reference-free.
  • Text Alignment (TA): CLIP similarity between aggregate video-level text and composite frames.
  • Temporal Consistency (TC):

$$TC = \frac{1}{N-1}\sum_{i=1}^{N-1}\cos\big(\text{Embed}(f_i),\text{Embed}(f_{i+1})\big)$$
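
A minimal sketch of the two embedding-based scores above follows; `frame_embs` and `instruction_emb` are assumed to come from a CLIP-style encoder, whereas the actual VLM scorer may differ.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def vlm_score(instruction_emb, frame_embs):
    """Instruction faithfulness: mean cosine similarity between instruction and each frame."""
    return float(np.mean([cosine(instruction_emb, f) for f in frame_embs]))

def temporal_consistency(frame_embs):
    """TC: mean cosine similarity between consecutive frame embeddings."""
    return float(np.mean([cosine(frame_embs[i], frame_embs[i + 1])
                          for i in range(len(frame_embs) - 1)]))
```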

The benchmarking protocol includes comparison to attention-manipulation (TokenFlow, STDF), first-frame propagation (Señorita-2M, AnyV2V), instruction-guided (InsV2V, Lucy Edit, EditVerse), and leading streaming (StreamDiffusion, StreamDiffusionV2) methods.

Comparative Performance on EgoEditBench

| Method | VLM (↑) | PS (↑) | TA (↑) | TC (↑) |
| --- | --- | --- | --- | --- |
| TokenFlow | 4.99 | 18.91 | 15.89 | 95.0 |
| STDF | 4.59 | 18.69 | 15.64 | 93.9 |
| Señorita-2M | 7.52 | 18.85 | 16.25 | 95.9 |
| AnyV2V | 6.72 | 18.65 | 15.35 | 92.4 |
| InsV2V | 5.24 | 18.81 | 14.92 | 94.0 |
| Lucy Edit | 5.44 | 18.87 | 15.03 | 94.4 |
| StreamDiffusion | 4.32 | 18.92 | 14.15 | 86.8 |
| StreamDiffusionV2 | 2.55 | 18.63 | 12.75 | 94.3 |
| EgoEdit | 7.76 | 19.21 | 16.89 | 96.7 |
| EgoEdit-RT | 7.71 | 19.13 | 16.34 | 96.4 |

Among the instruction-guided baselines, EgoEdit improves the VLM score by +2.32 over Lucy Edit and posts the highest PS, TA, and TC of any compared method (19.21, 16.89, and 96.7, respectively); the streaming EgoEdit-RT retains nearly all of these gains (VLM 7.71, TC 96.4).

Qualitative analysis indicates EgoEdit robustly preserves hand structure, occlusions, and egomotion-consistent geometry, while baselines often fail to effect the target edit, hallucinate, or disrupt temporal consistency. Limited failure modes in EgoEdit-RT appear as "chunk boundary" artifacts or slight performance drops under out-of-distribution prompt conditions.

4. Advantages, Trade-Offs, and Generalization

EgoEdit's advantages in egocentric video editing are anchored in its domain-aligned dataset, channel-wise source conditioning (mitigating attention budget blow-up), and streaming distillation to maintain temporal coherence under significant egomotion. This enables instruction-faithful editing with strong temporal stability, even when hands and manipulated objects dominate the field of view. On broader (exocentric) edits, EgoEdit exhibits only a marginal VLM drop (0.24 points), compared to larger drops for other methods, reflecting successful training on a mix of domain-specific and generic corpora.

A plausible implication is that the strategy of domain-specialized pretraining, followed by judicious distillation and efficient attention mechanisms, could be extensible to other real-time, user-guided video editing domains with challenging spatial or temporal dynamics.

5. Open Challenges and Future Directions

Continued refinement targets ultra-low latency (sub-500 ms first-frame), higher spatial resolutions (720p+), and higher frame rates (30 fps). Advances in integrating explicit 3D geometry or SLAM are prioritized to further enhance parallax and physical occlusion handling. Multimodal extension is anticipated, including synchronized audio synthesis ("SonicDiffusion") and haptic feedback for AR/VR interaction.

Challenges persist in ensuring robustness to fast, transient hand occlusions; supporting highly interactive user interfaces and few-shot, human-in-the-loop correction; and enforcing privacy and safety constraints, especially to prevent unintended content insertion in live-streamed or AR editing scenarios.

EgoEdit, through its aligned dataset, real-time model, and benchmark suite, establishes a rigorous foundation for advancing interactive, first-person, instruction-guided video editing in AR/VR contexts (Li et al., 5 Dec 2025).
