EgoEdit: Real-Time Egocentric Video Editing
- The paper presents EgoEdit, a unified framework that integrates a bespoke dataset, a high-throughput streaming model, and a standardized benchmark for egocentric video editing.
- The model leverages latent-space concatenation and a flow-matching training objective to ensure temporal stability and instruction fidelity in dynamic AR scenarios.
- A dual-phase distillation pipeline reduces inference steps drastically, achieving a first-frame latency of 855 ms and a sustained performance of 38.1 fps on a single GPU.
EgoEdit is a real-time egocentric video editing model developed to address unique challenges in first-person video editing, such as rapid egomotion, frequent hand–object occlusions, and the requirements for interactive augmented reality (AR) applications. Existing video editing pipelines, primarily optimized for third-person footage, exhibit significant latency and lack robustness to the domain gap presented by egocentric data. EgoEdit introduces a unified ecosystem consisting of a bespoke dataset (EgoEditData), a high-throughput streaming model, and a standardized evaluation benchmark (EgoEditBench), collectively enabling temporally stable, instruction-faithful egocentric video editing at interactive latency on a single GPU (Li et al., 5 Dec 2025).
1. Model Architecture and Editing Framework
EgoEdit builds on a state-of-the-art text-to-video diffusion-transformer architecture, specifically incorporating a Wan 2.1 latent autoencoder paired with a 10.7B-parameter DiT (Diffusion Transformer) backbone composed of 32 transformer blocks (hidden dimension 4096; 32 heads per block). Each transformer block sequentially applies:
- Self-attention (with Rotary Position Embeddings and QK-normalization)
- Cross-attention over fused T5+CLIP text encoder tokens
- MLP layers, all modulated by timestep embeddings
The video editing paradigm is realized via latent-space operations. To adapt the foundation generator for editing, EgoEdit concatenates the source latent video and the noisy target latent along the channel dimension. This preserves computational efficiency—channel-wise concatenation leaves the token count unchanged, so self-attention cost remains quadratic in the spatial–temporal patch count of a single video—and facilitates edit conditioning. The model receives text instructions through cross-attention at each block.
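A minimal PyTorch sketch of this design is shown below. It is illustrative only: the module layout, the fused text-token interface, and the adaLN-style timestep modulation are assumptions, and RoPE, QK-normalization, and patchification are omitted for brevity.

```python
import torch
import torch.nn as nn


class EditDiTBlock(nn.Module):
    """Simplified DiT block: self-attention, text cross-attention, and an MLP,
    each modulated/gated by the diffusion-timestep embedding (adaLN-style).
    RoPE and QK-normalization are omitted for brevity."""

    def __init__(self, dim=4096, heads=32, text_dim=4096):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Timestep embedding -> shift/scale/gate for the two modulated branches.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x, text_tokens, t_emb):
        # x: [B, N, dim] spatiotemporal patch tokens; text_tokens: [B, L, text_dim]
        s1, b1, g1, s2, b2, g2 = self.ada(t_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.self_attn(h, h, h, need_weights=False)[0]
        x = x + self.cross_attn(self.norm2(x), text_tokens, text_tokens,
                                need_weights=False)[0]
        h = self.norm3(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)


def make_edit_input(source_latent, noisy_target):
    """Channel-wise concatenation of the clean source latent and the noisy
    target latent: the token count (and hence attention cost) is identical
    to generating a single video."""
    # Both tensors: [B, C, T, H, W] in the autoencoder's latent space.
    return torch.cat([source_latent, noisy_target], dim=1)  # -> [B, 2C, T, H, W]
```

Because the source latent is stacked along channels rather than appended as extra tokens, the attention cost of editing matches that of unconditional generation at the same resolution.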
The denoiser is trained to predict the instantaneous velocity along a linear flow between latent noise and the target edited latent, as specified by the flow-matching objective

$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\,\epsilon,\,z_1}\Big[\big\| v_\theta(z_t, t, c) - (z_1 - \epsilon) \big\|_2^2\Big], \qquad z_t = (1 - t)\,\epsilon + t\,z_1,$$

where $\epsilon$ is Gaussian latent noise, $z_1$ is the target edited latent, and $c$ denotes the channel-concatenated source latent together with the text instruction. This objective, a variant of Rectified Flow, is used for both pretraining and editing fine-tuning. Temporal stability is derived not from explicit egomotion-aware modules but from (a) channel-wise video conditioning, (b) training on egocentric data exhibiting rapid motion, and (c) the model's global spatiotemporal attention blocks.
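A corresponding training step can be written as a short sketch, assuming the channel-concatenation conditioning above and a `model(x, text_tokens, t)` signature (an illustrative interface, not the released one):

```python
import torch
import torch.nn.functional as F


def flow_matching_step(model, source_latent, target_latent, text_tokens):
    """One rectified-flow / flow-matching training step (sketch): interpolate
    linearly from noise to the edited-target latent and regress the constant
    velocity of that line."""
    b = target_latent.shape[0]
    eps = torch.randn_like(target_latent)               # z at t = 0 (pure noise)
    t = torch.rand(b, device=target_latent.device)      # uniform timesteps in [0, 1]
    t_ = t.view(b, 1, 1, 1, 1)                          # broadcast over [B, C, T, H, W]
    z_t = (1 - t_) * eps + t_ * target_latent           # linear interpolation
    v_target = target_latent - eps                      # velocity z_1 - eps
    x_in = torch.cat([source_latent, z_t], dim=1)       # channel-wise conditioning
    v_pred = model(x_in, text_tokens, t)                # assumed model signature
    return F.mse_loss(v_pred, v_target)
```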
2. Streaming Inference and Latency Reduction
Traditional diffusion-based editors exhibit frame-wise latency unsuitable for interactive scenarios (e.g., seconds per clip for 40–80 model evaluations on an H100 GPU). EgoEdit achieves real-time streaming via a dual-phase distillation pipeline:
- Bidirectional DMD Distillation: Compresses a 40-step bidirectional teacher into a 4-step student with distilled guidance, reducing the Number of Function Evaluations (NFEs) from 80 to 4 and substantially increasing model-only throughput.
- Autoregressive Self-Forcing: The distilled 4-step student is rolled out autoregressively in fixed-length frame chunks, mitigating exposure bias by correcting its own outputs with a DMD-style loss computed against the teacher's trajectory. Decoding occurs chunk-wise, yielding a first-frame latency of 855 ms (including 562 ms for recording, 217 ms for autoencoder encode/decode, and 75 ms for model inference) and a steady end-to-end throughput of 38.1 fps.
Streaming inference operates by encoding the initial source video chunk, denoising with instruction conditioning, decoding output frames, and sliding the latent window forward. No additional hand-mask or explicit temporal losses are introduced; stability must emerge from model capacity and training data.
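The streaming loop described above can be sketched as follows; the chunk size, the `student`/`vae` interfaces, and the way previously generated latents are passed back in are assumptions chosen for illustration:

```python
import torch


@torch.no_grad()
def stream_edit(vae, student, text_tokens, frame_source, num_steps=4):
    """Chunk-wise streaming editing (sketch): encode a source chunk, run the
    few-step distilled student conditioned on the instruction and on previously
    generated latents, decode, then slide forward to the next chunk."""
    history = []                                        # previously denoised latent chunks
    for frames in frame_source:                         # [B, 3, T_chunk, H, W] per chunk
        src = vae.encode(frames)                        # source latents for this chunk
        z = torch.randn_like(src)                       # t = 0: pure latent noise
        ts = torch.linspace(0.0, 1.0, num_steps + 1, device=z.device)
        for t0, t1 in zip(ts[:-1], ts[1:]):             # 4 student evaluations per chunk
            v = student(torch.cat([src, z], dim=1), text_tokens, history, t0)
            z = z + (t1 - t0) * v                       # Euler step along the learned flow
        history.append(z)
        yield vae.decode(z)                             # edited frames, streamed out
```

The first chunk carries the reported 855 ms first-frame latency (recording, autoencoder encode/decode, and the few model evaluations), after which chunks are emitted at the steady 38.1 fps throughput.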
3. EgoEditData: Egocentric Editing Dataset
EgoEditData is a curated corpus comprising 99.7K before/after edit pairs across 49.7K unique clips, designed to address the egocentric–third-person domain gap. Data acquisition and curation involve multiple automated and manual stages:
| Stage | Methodology/Toolchain | Retention Rate |
|---|---|---|
| Video Selection | Ego4D/EgoExo4D filtering (GoPro, quality) | 1.8% |
| Hand Masking | WiLoR hand detection + SAM2 | 49.6% |
| Object Naming | Qwen2.5-VL visual-LLM | — |
| Object Masking | Grounded-SAM + SAM2, filtered by skeleton | — |
| Object Editing | GPT-5 (objects/prompts), Qwen-Image, Wan-VACE, human filtering | 37.8% of generated edits retained |
| Instruction Synthesis | GPT-5 refinement | — |
Dataset statistics include 10.9K real videos and 38.8K synthetic edits, with an average of 3.6 edited variants per source. Edit types are distributed as follows: 54K Change Object, 39K Change Object + Effects, 3.6K Add Object, and 2.4K Remove Object. There are 13.6K distinct target objects and 3.2K unique source objects; natural language instructions average ≈378 characters.
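Concretely, a single EgoEditData example can be pictured as a before/after clip pair with its instruction and edit-type label. The field names and values below are hypothetical, chosen only to illustrate the structure produced by the curation pipeline above.

```python
# Hypothetical EgoEditData record; field names and values are illustrative only,
# not the released schema.
example = {
    "source_clip": "ego_clip_XXXX.mp4",          # original egocentric clip
    "edited_clip": "ego_clip_XXXX_edit2.mp4",    # one of ~3.6 edited variants per source
    "edit_type": "Change Object",                # Change Object / + Effects / Add / Remove
    "instruction": (
        "Replace the red mug held in the right hand with a clear glass teapot, "
        "keeping the fingers, table, and lighting unchanged."
    ),
    "hand_mask": "masks/XXXX_hands.mp4",         # from WiLoR + SAM2
    "object_mask": "masks/XXXX_object.mp4",      # from Grounded-SAM + SAM2
}
```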
4. Evaluation and Benchmarking with EgoEditBench
EgoEditBench standardizes evaluation across 15 egocentric editing tasks, encompassing Add/Remove Object, Change Object/Background/Camera Pose, Add Effect/Stylization/Reasoning, X-to-Video (Depth, Sketch, Pose), Video-to-X tasks, and Multi-Task combinations.
Primary evaluation metrics are aligned with EditVerseBench:
- VLM-Eval: Visual-LLM score of instruction alignment (primary metric)
- PickScore (PS): Learned human-preference score for visual quality
- Text Alignment (TA): CLIP-based frame–instruction match
- Temporal Consistency (TC): CLIP feature similarity between adjacent frames (a generic approximation of TA and TC is sketched below)
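The two CLIP-based metrics can be approximated as below using a public Hugging Face CLIP checkpoint; the benchmark's official scoring code, CLIP variant, and score scaling may differ.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")


@torch.no_grad()
def clip_metrics(frames, instruction):
    """frames: list of PIL images sampled from the edited clip.
    Returns (text alignment, temporal consistency) as raw cosine similarities."""
    inputs = proc(text=[instruction], images=frames, return_tensors="pt",
                  padding=True, truncation=True)
    img = F.normalize(clip.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt = F.normalize(clip.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"]), dim=-1)
    ta = (img @ txt.T).mean().item()                                   # frame-instruction match
    tc = F.cosine_similarity(img[:-1], img[1:], dim=-1).mean().item()  # adjacent-frame similarity
    return ta, tc
```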
VLM-Eval exhibits strong agreement (≈86%) with human preference judgments in head-to-head comparisons against LucyEdit and InsV2V, making it the principal ranking signal.
| Method | VLM-Eval ↑ | PickScore ↑ | TA ↑ | TC ↑ |
|---|---|---|---|---|
| EgoEdit | 7.76 | 19.21 | 16.89 | 96.70 |
| LucyEdit | 5.44 | 18.87 | 15.03 | 94.41 |
| InsV2V | 5.24 | 18.81 | 14.92 | 94.01 |
| TokenFlow/STDF | ~5.0 | 18.9 | 15.7 | — |
| SENORITA-2M | ~7.5 | — | — | — |
| AnyV2V | ~7.5 | — | — | — |
| StreamDiffusion | 2.5–4.3 | — | — | — |
EgoEdit approaches state-of-the-art performance even on third-person EditVerseBench (e.g., VLM 8.00 vs. baseline 8.26). Ablation studies indicate that distillation (EgoEdit → DMD → RT) preserves VLM ≈ 7.7 at vastly reduced inference steps, and that fine-tuning data quantity is critical (VLM 4.87 → 7.85 for 0% → 100% dataset fractions).
5. Training Objectives and Optimization
Training stages—comprising pretraining, editing fine-tuning, and distillation—all follow variants of the flow-matching objective. The core loss is the velocity-matching objective $\mathcal{L}_{\text{FM}}$ given in Section 1, applied with the appropriate conditioning at each stage. For DMD distillation, a few-step student generator $G_\theta$ is optimized to match the bidirectional teacher by minimizing a reverse KL divergence between the student's sample distribution and the teacher's,

$$\mathcal{L}_{\text{DMD}} = \mathbb{E}_{t}\, D_{\mathrm{KL}}\!\big(p^{G_\theta}_{t} \,\big\|\, p^{\text{teacher}}_{t}\big),$$

whose gradient is estimated from the difference between the teacher's score and the score of an auxiliary model fit to the student's own samples. Autoregressive self-forcing is implemented by rolling out the student on chunks and comparing the denoising trajectory with the teacher's bidirectional path. No additional explicit hand-mask or temporal losses are applied; rather, temporal coherence is afforded by model design and the diversity of training data.
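A compressed sketch of the DMD update is given below. It follows the standard distribution-matching-distillation recipe (a teacher-derived "real" score versus an auxiliary "fake" score fit to student samples), with timestep weighting omitted and all interfaces assumed for illustration.

```python
import torch


def dmd_student_loss(student, real_score, fake_score, source_latent, text_tokens):
    """One DMD-style update for the few-step student (sketch). The auxiliary
    `fake_score` model is trained separately on the student's own samples;
    `real_score` is derived from the frozen bidirectional teacher."""
    z = student.sample(source_latent, text_tokens)       # few-step student generation
    t = torch.rand(z.shape[0], device=z.device)
    t_ = t.view(-1, 1, 1, 1, 1)
    eps = torch.randn_like(z)
    z_t = (1 - t_) * eps + t_ * z                        # re-noise the student's samples
    with torch.no_grad():
        s_real = real_score(torch.cat([source_latent, z_t], dim=1), text_tokens, t)
        s_fake = fake_score(torch.cat([source_latent, z_t], dim=1), text_tokens, t)
    # Reverse-KL gradient estimate: (s_fake - s_real) is the per-sample descent
    # direction for the generator; the surrogate MSE below reproduces it under autograd.
    target = (z - (s_fake - s_real)).detach()
    return 0.5 * ((z - target) ** 2).mean()
```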
6. Limitations and Future Directions
EgoEdit demonstrates interactive latency (first-frame 855 ms, steady 38.1 fps at 512×384 px/16 fps), but certain limitations exist:
- Resolution and frame rate, while sufficient for AR prototypes, are below standard consumer video (480p+).
- Slight temporal seams can appear at chunk boundaries in streaming mode.
- Diminished performance is observed for highly out-of-distribution instructions or heavy object occlusions.
Suggested future research directions include the development of higher-resolution, higher-frame-rate streaming pipelines (potentially with further distillation or compact model experts), exploration of multi-view or stereo egocentric editing for 3D AR, explicit incorporation of motion cues such as optical flow, and extension of the instruction set toward multi-stage or interactive in-loop editing.
In sum, EgoEdit, together with EgoEditData and EgoEditBench, provides a cohesive framework for real-time, instruction-following editing of egocentric video. The architecture and benchmark set a new standard for research on live AR video editing in first-person contexts (Li et al., 5 Dec 2025).