EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing

Published 16 Feb 2026 in cs.CV | (2602.15031v1)

Abstract: High-fidelity generative video editing has seen significant quality improvements by leveraging pre-trained video foundation models. However, their computational cost is a major bottleneck, as they are often designed to inefficiently process the full video context regardless of the inpainting mask's size, even for sparse, localized edits. In this paper, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is then guided by a lightweight temporal global context embedder that ensures video-wide context consistency with minimal overhead. Not only is EditCtrl 10 times more compute efficient than state-of-the-art generative editing methods, it even improves editing quality compared to methods designed with full-attention. Finally, we showcase how EditCtrl unlocks new capabilities, including multi-region editing with text prompts and autoregressive content propagation.

Abstract PDF Upgrade to Chat

Authors (8)

Summary

The paper introduces a framework that disentangles local and global contexts to enable efficient real-time video editing with diffusion models.
It employs a sparse local context module and a lightweight temporal global embedder to achieve 10× compute efficiency and 50× faster frame generation.
Empirical results demonstrate high-fidelity inpainting and temporal consistency, with performance validated on benchmarks like VPBench-Edit and DAVIS.

Disentangled Local and Global Control in Real-Time Generative Video Editing with EditCtrl

Overview and Motivation

This essay examines "EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing" (2602.15031), a work addressing efficiency and quality challenges in prompt-guided video inpainting and editing with diffusion models. Existing methods, particularly those leveraging full-attention pretrained video foundation models, suffer from prohibitive computational cost by unnecessarily processing the entire video context regardless of the spatial extent of edits. EditCtrl introduces a framework that disentangles localized generation from global video context integration, yielding compute that scales with edit mask size while preserving high-fidelity generative priors and temporal consistency.

Figure 1: EditCtrl supports efficient prompt-guided edits on 4K videos, dynamically allocating compute proportional to mask size and propagating object edits with temporal consistency.

Methodological Innovations

EditCtrl diverges from prior dense full-attention generative editing paradigms through architectural disentangling of local and global context in the editing process. The system comprises two principal components:

Sparse Local Context Module: Operates exclusively on tokens within user-defined edit masks, applying a fine-tuned local context encoder to sparse regions. Mask dilation is employed to address inadequate blending at mask boundaries, mitigating artifacts caused by spatial sparsity.
Lightweight Temporal Global Context Embedder: Obtains a compact representation of downsampled background video using a VAE encoder; global context tokens encode high-level temporal cues (structure, lighting, motion, camera dynamics) and are fused into generation through cross-attention modulation, enhancing coherence across frames.

This design yields editing compute proportional to edit mask area, independent of input resolution, and adapters are trained without altering base diffusion model weights, ensuring compatibility with autoregressive content propagation.

Figure 2: The EditCtrl video diffusion framework: global context from downsampled background and local context from edit masks are injected via trainable adapters into a pretrained text-to-video diffusion model.

Figure 3: Disentangled local and global modules extract background tokens and masked regions, modulating transformer layers for localized generation aligned with video-wide context.

Interactive and Real-Time Editing Capabilities

EditCtrl’s computational decoupling enables broad interactive editing capabilities, including:

Arbitrary-resolution Real-time Editing: Throughput remains invariant to input size, facilitating prompt-guided edits even on 4K video in real-time.
Multi-region Inpainting: Supports multiple concurrent edit masks and prompts, with each region processed independently and results merged post-diffusion.
Autoregressive Content Propagation: Can propagate edits temporally using distilled autoregressive diffusion models, extending edits to future frames and enabling deployment in latency-sensitive settings such as augmented reality.
Figure 4: EditCtrl propagates edits in augmented reality applications, providing low-latency content that matches user movement.

Empirical Results and Validation

EditCtrl achieves a $10 \times$ improvement in compute efficiency versus full-attention generative baselines, while matching or exceeding editing and inpainting quality. Performance is demonstrated on VPBench-Edit, DAVIS, and VPBench-Inp benchmarks—scenarios with diverse prompts and masks:

Masked region preservation outperforms or matches state-of-the-art, with strong PSNR, SSIM, and LPIPS metrics.
Semantic alignment with text prompts demonstrates robust integration of high-level generative cues.
Temporal coherence scores are competitive, facilitated by the global context embedder.

EditCtrl's edit throughput (FPS) is notably 50 $\times$ faster than generative baselines operating with full-attention, without perceptual degradation.

Figure 5: Video inpainting comparison: EditCtrl generates high-fidelity content with scene alignment using less compute; baselines struggle despite full-attention.

Ablation studies confirm necessity of both local and global adapters: naïve sparse local editing without global context or proper fine-tuning degrades visual coherence and prompt-responsiveness; combined adapters restore and enhance edit quality.

Figure 6: Removing local or global adapters undermines editing performance; EditCtrl with both matches full-attention quality at higher efficiency.

Qualitative Demonstrations

Qualitative results reveal EditCtrl’s ability to:

Perform localized semantic edits (e.g., “change flower color to blue”).
Edit multiple regions or objects within the same frame using distinct prompts.
Inpaint content that blends structurally and visually, rather than generic patch filling.
Maintain object and background integrity over complex temporal dynamics.

Figure 7: Prompt-guided local edit: EditCtrl accurately changes flower color while preserving scene texture.

Figure 8: Additional comparisons: EditCtrl edits scenes ("Add a hot air balloon", "Turn the sunglasses shiny red") with high spatial and temporal consistency.

Figure 9: EditCtrl enables semantic object replacement or augmentation across frames ("Turn the donkey into a zebra", "Add a bandana to the dog").

Figure 10: EditCtrl supports simultaneous multi-prompt generation for distinct regions.

Figure 11: Content propagation visualizations, maintaining temporal integrity during autoregressive edit extension.

Implications and Future Directions

EditCtrl’s disentangled framework critically addresses the scalability bottlenecks in video generative editing and inpainting. The architectural separation sets the stage for future developments in:

Integration of motion-aware and geometry-aware context cues.
Efficient cross-frame edit propagation in unconstrained or dynamic video feeds.
Plug-and-play augmentation for domain-specific editing models (e.g., AR, VR, cinema post-production).
Adaptation of the local/global paradigm to multi-modal generative control (e.g., joint audio/video/text editing).

Theoretical implications include the efficient exploitation of pretrained generative priors under localized supervision, modularity in control parameterization, and minimal perturbation of large foundation models during task adaptation.

Conclusion

EditCtrl presents a principled solution to high-fidelity, interactive, and real-time generative video editing with diffusion models, leveraging the disentanglement of local and global context for prompt-guided inpainting. Empirical evidence supports both efficiency and quality claims: compute scales with mask size, adapters guarantee generative priors are preserved, and content propagation unlocks practical deployment scenarios. With this method’s modularity and extension potential, EditCtrl is positioned as a robust foundation for advanced, scalable video editing frameworks.

Markdown Report Issue