EgoEdit: Real-Time Egocentric Video Editing
- The paper presents EgoEdit, a unified framework that integrates a bespoke dataset, a high-throughput streaming model, and a standardized benchmark for egocentric video editing.
- The model leverages latent-space concatenation and a flow-matching training objective to ensure temporal stability and instruction fidelity in dynamic AR scenarios.
- A dual-phase distillation pipeline reduces inference steps drastically, achieving a first-frame latency of 855 ms and a sustained performance of 38.1 fps on a single GPU.
EgoEdit is a real-time egocentric video editing model developed to address unique challenges in first-person video editing, such as rapid egomotion, frequent hand–object occlusions, and the requirements for interactive augmented reality (AR) applications. Existing video editing pipelines, primarily optimized for third-person footage, exhibit significant latency and lack robustness to the domain gap presented by egocentric data. EgoEdit introduces a unified ecosystem consisting of a bespoke dataset (EgoEditData), a high-throughput streaming model, and a standardized evaluation benchmark (EgoEditBench), collectively enabling temporally stable, instruction-faithful egocentric video editing at interactive latency on a single GPU (Li et al., 5 Dec 2025).
1. Model Architecture and Editing Framework
EgoEdit builds on a state-of-the-art text-to-video diffusion-transformer architecture, specifically incorporating a Wan 2.1 latent autoencoder paired with a 10.7B-parameter DiT (Diffusion Transformer) backbone composed of 32 transformer blocks (hidden dimension 4096; 32 heads per block). Each transformer block sequentially applies:
- Self-attention (with Rotary Position Embeddings and QK-normalization)
- Cross-attention over fused T5+CLIP text encoder tokens
- MLP layers, all modulated by timestep embeddings
The video editing paradigm is realized via latent-space operations. To adapt the foundation generator for editing, EgoEdit concatenates the source latent video and the noisy target latent along the channel dimension. This preserves computational efficiency—channel-wise concatenation leaves the token count unchanged, so self-attention cost remains quadratic in the spatial–temporal patch count of a single video—and facilitates edit conditioning. The model receives text instructions through cross-attention at each block.
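A minimal PyTorch sketch of this design is shown below. It is illustrative only: the module layout, the fused text-token interface, and the adaLN-style timestep modulation are assumptions, and RoPE, QK-normalization, and patchification are omitted for brevity.

```python
import torch
import torch.nn as nn


class EditDiTBlock(nn.Module):
    """Simplified DiT block: self-attention, text cross-attention, and an MLP,
    each modulated/gated by the diffusion-timestep embedding (adaLN-style).
    RoPE and QK-normalization are omitted for brevity."""

    def __init__(self, dim=4096, heads=32, text_dim=4096):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Timestep embedding -> shift/scale/gate for the two modulated branches.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x, text_tokens, t_emb):
        # x: [B, N, dim] spatiotemporal patch tokens; text_tokens: [B, L, text_dim]
        s1, b1, g1, s2, b2, g2 = self.ada(t_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.self_attn(h, h, h, need_weights=False)[0]
        x = x + self.cross_attn(self.norm2(x), text_tokens, text_tokens,
                                need_weights=False)[0]
        h = self.norm3(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)


def make_edit_input(source_latent, noisy_target):
    """Channel-wise concatenation of the clean source latent and the noisy
    target latent: the token count (and hence attention cost) is identical
    to generating a single video."""
    # Both tensors: [B, C, T, H, W] in the autoencoder's latent space.
    return torch.cat([source_latent, noisy_target], dim=1)  # -> [B, 2C, T, H, W]
```

Because the source latent is stacked along channels rather than appended as extra tokens, the attention cost of editing matches that of unconditional generation at the same resolution.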
The denoiser is trained to predict the instantaneous velocity along a linear flow between latent noise and the target edited latent, as specified by the flow-matching objective

$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\,\epsilon,\,z_1}\Big[\big\| v_\theta(z_t, t, c) - (z_1 - \epsilon) \big\|_2^2\Big], \qquad z_t = (1 - t)\,\epsilon + t\,z_1,$$

where $\epsilon$ is Gaussian latent noise, $z_1$ is the target edited latent, and $c$ denotes the channel-concatenated source latent together with the text instruction. This objective, a variant of Rectified Flow, is used for both pretraining and editing fine-tuning. Temporal stability is derived not from explicit egomotion-aware modules but from (a) channel-wise video conditioning, (b) training on egocentric data exhibiting rapid motion, and (c) the model's global spatiotemporal attention blocks.
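A corresponding training step can be written as a short sketch, assuming the channel-concatenation conditioning above and a `model(x, text_tokens, t)` signature (an illustrative interface, not the released one):

```python
import torch
import torch.nn.functional as F


def flow_matching_step(model, source_latent, target_latent, text_tokens):
    """One rectified-flow / flow-matching training step (sketch): interpolate
    linearly from noise to the edited-target latent and regress the constant
    velocity of that line."""
    b = target_latent.shape[0]
    eps = torch.randn_like(target_latent)               # z at t = 0 (pure noise)
    t = torch.rand(b, device=target_latent.device)      # uniform timesteps in [0, 1]
    t_ = t.view(b, 1, 1, 1, 1)                          # broadcast over [B, C, T, H, W]
    z_t = (1 - t_) * eps + t_ * target_latent           # linear interpolation
    v_target = target_latent - eps                      # velocity z_1 - eps
    x_in = torch.cat([source_latent, z_t], dim=1)       # channel-wise conditioning
    v_pred = model(x_in, text_tokens, t)                # assumed model signature
    return F.mse_loss(v_pred, v_target)
```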
2. Streaming Inference and Latency Reduction
Traditional diffusion-based editors exhibit frame-wise latency unsuitable for interactive scenarios (e.g., seconds per clip for 40–80 model evaluations on an H100 GPU). EgoEdit achieves real-time streaming via a dual-phase distillation pipeline:
- Bidirectional DMD Distillation: Compresses a 40-step bidirectional teacher into a 4-step student with distilled guidance, reducing the Number of Function Evaluations (NFEs) from 80 to 4 and substantially increasing model-only throughput.
- Autoregressive Self-Forcing: The distilled 4-step student is rolled out autoregressively in fixed-length frame chunks, mitigating exposure bias by correcting its own outputs with a DMD-style loss computed against the teacher's trajectory. Decoding occurs chunk-wise, yielding a first-frame latency of 855 ms (including 562 ms for recording, 217 ms for autoencoder encode/decode, and 75 ms for model inference) and a steady end-to-end throughput of 38.1 fps.
Streaming inference operates by encoding the initial source video chunk, denoising with instruction conditioning, decoding output frames, and sliding the latent window forward. No additional hand-mask or explicit temporal losses are introduced; stability must emerge from model capacity and training data.
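The streaming loop described above can be sketched as follows; the chunk size, the `student`/`vae` interfaces, and the way previously generated latents are passed back in are assumptions chosen for illustration:

```python
import torch


@torch.no_grad()
def stream_edit(vae, student, text_tokens, frame_source, num_steps=4):
    """Chunk-wise streaming editing (sketch): encode a source chunk, run the
    few-step distilled student conditioned on the instruction and on previously
    generated latents, decode, then slide forward to the next chunk."""
    history = []                                        # previously denoised latent chunks
    for frames in frame_source:                         # [B, 3, T_chunk, H, W] per chunk
        src = vae.encode(frames)                        # source latents for this chunk
        z = torch.randn_like(src)                       # t = 0: pure latent noise
        ts = torch.linspace(0.0, 1.0, num_steps + 1, device=z.device)
        for t0, t1 in zip(ts[:-1], ts[1:]):             # 4 student evaluations per chunk
            v = student(torch.cat([src, z], dim=1), text_tokens, history, t0)
            z = z + (t1 - t0) * v                       # Euler step along the learned flow
        history.append(z)
        yield vae.decode(z)                             # edited frames, streamed out
```

The first chunk carries the reported 855 ms first-frame latency (recording, autoencoder encode/decode, and the few model evaluations), after which chunks are emitted at the steady 38.1 fps throughput.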
3. EgoEditData: Egocentric Editing Dataset
EgoEditData is a curated corpus comprising 99.7K before/after edit pairs across 49.7K unique clips, designed to address the egocentric–third-person domain gap. Data acquisition and curation involve multiple automated and manual stages:
| Stage | Methodology/Toolchain | Retention Rate |
|---|---|---|
| Video Selection | Ego4D/EgoExo4D filtering (GoPro, quality) | 1.8% |
| Hand Masking | WiLoR hand detection + SAM2 | 49.6% |
| Object Naming | Qwen2.5-VL visual-LLM | — |
| Object Masking | Grounded-SAM + SAM2, filtered by skeleton | — |
| Object Editing | GPT-5 (objects/prompts), Qwen-Image, Wan-VACE, human filtering | 37.8% of generated edits retained |
| Instruction Synthesis | GPT-5 refinement | — |
Dataset statistics include 10.9K real videos and 38.8K synthetic edits, with an average of 3.6 edited variants per source. Edit types are distributed as follows: 54K Change Object, 39K Change Object + Effects, 3.6K Add Object, and 2.4K Remove Object. There are 13.6K distinct target objects and 3.2K unique source objects; natural language instructions average ≈378 characters.
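Concretely, a single EgoEditData example can be pictured as a before/after clip pair with its instruction and edit-type label. The field names and values below are hypothetical, chosen only to illustrate the structure produced by the curation pipeline above.

```python
# Hypothetical EgoEditData record; field names and values are illustrative only,
# not the released schema.
example = {
    "source_clip": "ego_clip_XXXX.mp4",          # original egocentric clip
    "edited_clip": "ego_clip_XXXX_edit2.mp4",    # one of ~3.6 edited variants per source
    "edit_type": "Change Object",                # Change Object / + Effects / Add / Remove
    "instruction": (
        "Replace the red mug held in the right hand with a clear glass teapot, "
        "keeping the fingers, table, and lighting unchanged."
    ),
    "hand_mask": "masks/XXXX_hands.mp4",         # from WiLoR + SAM2
    "object_mask": "masks/XXXX_object.mp4",      # from Grounded-SAM + SAM2
}
```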
4. Evaluation and Benchmarking with EgoEditBench
EgoEditBench standardizes evaluation across 15 egocentric editing tasks, encompassing Add/Remove Object, Change Object/Background/Camera Pose, Add Effect/Stylization/Reasoning, X-to-Video (Depth, Sketch, Pose), Video-to-X tasks, and Multi-Task combinations.
Primary evaluation metrics are aligned with EditVerseBench:
- VLM-Eval: Visual-LLM score of instruction alignment (primary metric)
- PickScore (PS): Learned human-preference score for visual quality
- Text Alignment (TA): CLIP-based frame–instruction match
- Temporal Consistency (TC): CLIP feature similarity between adjacent frames (a generic approximation of TA and TC is sketched below)
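The two CLIP-based metrics can be approximated as below using a public Hugging Face CLIP checkpoint; the benchmark's official scoring code, CLIP variant, and score scaling may differ.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")


@torch.no_grad()
def clip_metrics(frames, instruction):
    """frames: list of PIL images sampled from the edited clip.
    Returns (text alignment, temporal consistency) as raw cosine similarities."""
    inputs = proc(text=[instruction], images=frames, return_tensors="pt",
                  padding=True, truncation=True)
    img = F.normalize(clip.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt = F.normalize(clip.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"]), dim=-1)
    ta = (img @ txt.T).mean().item()                                   # frame-instruction match
    tc = F.cosine_similarity(img[:-1], img[1:], dim=-1).mean().item()  # adjacent-frame similarity
    return ta, tc
```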
VLM-Eval exhibits strong agreement (≈86%) with human preference judgments in head-to-head comparisons against LucyEdit and InsV2V, making it the principal ranking signal.
| Method | VLM-Eval ↑ | PickScore ↑ | TA ↑ | TC ↑ |
|---|---|---|---|---|
| EgoEdit | 7.76 | 19.21 | 16.89 | 96.70 |
| LucyEdit | 5.44 | 18.87 | 15.03 | 94.41 |
| InsV2V | 5.24 | 18.81 | 14.92 | 94.01 |
| TokenFlow/STDF | ~5.0 | 18.9 | 15.7 | — |
| SENORITA-2M | ~7.5 | — | — | — |
| AnyV2V | ~7.5 | — | — | — |
| StreamDiffusion | 2.5–4.3 | — | — | — |
EgoEdit approaches state-of-the-art performance even on third-person EditVerseBench (e.g., VLM 8.00 vs. baseline 8.26). Ablation studies indicate that distillation (EgoEdit → DMD → RT) preserves VLM ≈ 7.7 at vastly reduced inference steps, and that fine-tuning data quantity is critical (VLM 4.87 → 7.85 for 0% → 100% dataset fractions).
5. Training Objectives and Optimization
Training stages—comprising pretraining, editing fine-tuning, and distillation—all follow variants of the flow-matching objective. The core loss is the velocity-matching objective $\mathcal{L}_{\text{FM}}$ given in Section 1, applied with the appropriate conditioning at each stage. For DMD distillation, a few-step student generator $G_\theta$ is optimized to match the bidirectional teacher by minimizing a reverse KL divergence between the student's sample distribution and the teacher's,

$$\mathcal{L}_{\text{DMD}} = \mathbb{E}_{t}\, D_{\mathrm{KL}}\!\big(p^{G_\theta}_{t} \,\big\|\, p^{\text{teacher}}_{t}\big),$$

whose gradient is estimated from the difference between the teacher's score and the score of an auxiliary model fit to the student's own samples. Autoregressive self-forcing is implemented by rolling out the student on chunks and comparing the denoising trajectory with the teacher's bidirectional path. No additional explicit hand-mask or temporal losses are applied; rather, temporal coherence is afforded by model design and the diversity of training data.
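A compressed sketch of the DMD update is given below. It follows the standard distribution-matching-distillation recipe (a teacher-derived "real" score versus an auxiliary "fake" score fit to student samples), with timestep weighting omitted and all interfaces assumed for illustration.

```python
import torch


def dmd_student_loss(student, real_score, fake_score, source_latent, text_tokens):
    """One DMD-style update for the few-step student (sketch). The auxiliary
    `fake_score` model is trained separately on the student's own samples;
    `real_score` is derived from the frozen bidirectional teacher."""
    z = student.sample(source_latent, text_tokens)       # few-step student generation
    t = torch.rand(z.shape[0], device=z.device)
    t_ = t.view(-1, 1, 1, 1, 1)
    eps = torch.randn_like(z)
    z_t = (1 - t_) * eps + t_ * z                        # re-noise the student's samples
    with torch.no_grad():
        s_real = real_score(torch.cat([source_latent, z_t], dim=1), text_tokens, t)
        s_fake = fake_score(torch.cat([source_latent, z_t], dim=1), text_tokens, t)
    # Reverse-KL gradient estimate: (s_fake - s_real) is the per-sample descent
    # direction for the generator; the surrogate MSE below reproduces it under autograd.
    target = (z - (s_fake - s_real)).detach()
    return 0.5 * ((z - target) ** 2).mean()
```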
6. Limitations and Future Directions
EgoEdit demonstrates interactive latency (first-frame 855 ms, steady 38.1 fps at 512×384 px/16 fps), but certain limitations exist:
- Resolution and frame rate, while sufficient for AR prototypes, are below standard consumer video (480p+).
- Slight temporal seams can appear at chunk boundaries in streaming mode.
- Diminished performance is observed for highly out-of-distribution instructions or heavy object occlusions.
Suggested future research directions include the development of higher-resolution, higher-frame-rate streaming pipelines (potentially with further distillation or compact model experts), exploration of multi-view or stereo egocentric editing for 3D AR, explicit incorporation of motion cues such as optical flow, and extension of the instruction set toward multi-stage or interactive in-loop editing.
In sum, EgoEdit, together with EgoEditData and EgoEditBench, provides a cohesive framework for real-time, instruction-following editing of egocentric video. The architecture and benchmark set a new standard for research on live AR video editing in first-person contexts (Li et al., 5 Dec 2025).