Motion-Injected Inpainting Model
- Motion-injected inpainting decouples appearance reconstruction from motion propagation to achieve coherent inpainting across video frames or views.
- Models in this class employ explicit optical flow, temporal attention, and dual-stream fusion to maintain consistent motion dynamics and spatial fidelity in masked regions.
- Dual-branch designs blend spatial anchors with motion cues, enabling video object removal, multi-view editing, and dynamic texture restoration.
A motion-injected inpainting model refers to any inpainting paradigm—predominantly in video or sequential domains—where explicit mechanisms are introduced to encode, propagate, or synthesize motion information alongside spatial inpainting. The central objective is the restoration or editing of spatiotemporally coherent content in regions with missing or masked data, while maintaining consistent motion dynamics, appearance identity, and scene integrity across frames or views. This class of models integrates motion cues through a variety of technical strategies, including explicit optical flow propagation, motion attention modules, flow-supervised objectives, latent space gating, and hybrid architectural designs, situating itself at the intersection of generative modeling, video understanding, and controllable synthesis. Below, the technical principles, methods, and representative architectures of motion-injected inpainting are systematically described.
1. Decoupling Motion and Appearance: Foundational Principles
The essence of motion-injected inpainting is the decoupling of appearance/identity reconstruction from the propagation or synthesis of motion in masked/spatially ambiguous regions. Several prominent pipelines employ this separation:
- Spatial Anchor Construction: High-fidelity inpainting is performed on a subset of key frames or views to establish reliable spatial anchors for subsequent propagation.
- Motion Propagation Module: Dedicated modules (e.g., temporal diffusion transformers, optical flow propagators, attention-based correlation mechanisms) propagate these anchors temporally or across views, modeling plausible dynamics within masked zones.
- Dual-Stream Conditioning: Concurrent spatial and motion cues ("context priors") are fused, e.g., via control tokens, gating networks, or context-aware latent fusion, ensuring spatial fidelity and temporal coherence (Xie et al., 24 Oct 2025, Gu et al., 2023).
This architecture enables controlled, coherent content synthesis even under severe data degradation, misalignment, large occlusions, or perspective change.
2. Motion Injection Mechanisms and Model Classes
Motion injection has been operationalized via multiple approaches:
2.1 Explicit Optical Flow-Guided Propagation
Several models compute, complete, and inject optical flow fields to directly model inter-frame or inter-view correspondences.
- Flow-Guided Latent Propagation: Features are propagated in a single step by warping them with estimated (and completed) optical flow fields, typically within a pre-trained VAE latent space. Flow-guided deformable convolution layers then refine latent features in masked regions, integrating motion cues before further denoising steps (Gu et al., 2023).
- Slot Attention on Flow Embeddings: Discrete "motion slots" are derived from dense optical flow vectors using slot-attention mechanisms, producing compact motion embeddings that condition the denoising U-Net or transformer. This enables pose-free and camera-pose–agnostic multi-view inpainting (Cao et al., 2024).
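The core of flow-guided propagation can be sketched in a few lines of NumPy. This is an illustrative toy, not the FGDVI implementation: the function names (`warp_with_flow`, `propagate_into_mask`) are invented here, and the actual pipeline operates on VAE latents with deformable-convolution refinement rather than plain bilinear warping.

```python
import numpy as np

def warp_with_flow(feat, flow):
    """Backward-warp a feature map (C, H, W) with a dense flow field (2, H, W).

    flow[0] and flow[1] give per-pixel (dx, dy) displacements pointing into
    the source frame; sampling is bilinear with border clamping.
    """
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sx = np.clip(xs + flow[0], 0, W - 1)  # source x coordinates
    sy = np.clip(ys + flow[1], 0, H - 1)  # source y coordinates
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    wx, wy = sx - x0, sy - y0             # bilinear weights
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bot = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bot * wy

def propagate_into_mask(target_feat, source_feat, flow, mask):
    """Fill masked positions (mask: H x W bool) with flow-warped source features."""
    warped = warp_with_flow(source_feat, flow)
    return np.where(mask[None], warped, target_feat)
```

Because the warp is a backward lookup, every target pixel receives a value and no holes appear in the propagated region, which is why completed (hole-free) flow fields are a prerequisite.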
2.2 Temporal Attention and Motion Capture
Temporal consistency is enforced via architectural inflation (extending 2D layers to handle temporal sequences) and injection of motion-specific attention:
- Temporal Self-Attention (TSA) and Damped Global Attention (DGA): Layers performing (i) pixel-aligned self-attention across frames (TSA) and (ii) downsampled global attention across the entire spatio-temporal block (DGA) are introduced into the backbone (Zi et al., 2024). These modules implicitly capture motion without computing flow.
- Motion Module Inflation: Each standard convolutional/attention layer is augmented with temporal attention (inflated weights or dedicated QKV heads), permitting simultaneous multi-frame processing where temporal information is modeled at every spatial location (Zhang et al., 2023).
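The inflation idea can be sketched minimally: attention is computed over the time axis independently at each spatial location, so the T frames act as the token sequence and a 2D backbone gains temporal reasoning without any explicit flow. The projection matrices `Wq`, `Wk`, `Wv` below stand in for the inflated QKV heads; this is an assumption-laden illustration, not the TSA/DGA code from CoCoCo.

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def temporal_self_attention(feats, Wq, Wk, Wv):
    """Pixel-aligned temporal self-attention over a clip.

    feats: (T, H, W, C). Attention is computed independently at each spatial
    location, treating the T frames as the token sequence.
    """
    T, H, W, C = feats.shape
    tokens = feats.reshape(T, H * W, C).transpose(1, 0, 2)        # (HW, T, C)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = _softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C))        # (HW, T, T)
    out = attn @ v                                                # (HW, T, C)
    return out.transpose(1, 0, 2).reshape(T, H, W, C)
```

DGA follows the same pattern but first downsamples spatially and attends over the full spatio-temporal token set rather than per-pixel columns.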
2.3 Gating and Context Fusion in Latent Space
- Learnable Context/Control Fusion: Motion priors extracted from spaced temporal sampling or context duplication are encoded and fused into early stages of a diffusion transformer via additive or multiplicative gating, often with learned scalar gates per layer (Xie et al., 24 Oct 2025, Zhang et al., 30 May 2025).
- Residual Fusion: Motion-specific attention outputs are added to the main feature stream at multiple depths, functioning as residual connections that modulate activations by learned motion correlations (Zi et al., 2024).
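Scalar-gated injection can be illustrated in a few lines. The parameter name `gate_logit` is hypothetical; the point of the sketch (not the VidSplice or DreamDance code) is that a zero-initialized gate makes the injection a no-op at the start of training, ControlNet-style, so the pre-trained backbone is undisturbed until the gate learns to open.

```python
import numpy as np

def gated_context_fusion(hidden, context, gate_logit):
    """Additively inject an encoded motion/context prior into a hidden state,
    scaled by a learned per-layer scalar gate.

    With gate_logit initialized to 0, tanh(gate_logit) == 0 and the layer
    returns the unmodified backbone activation (zero-init gating).
    """
    gate = np.tanh(gate_logit)
    return hidden + gate * context
```

A multiplicative variant replaces the sum with `hidden * (1 + gate * context)`; both leave the backbone intact at initialization.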
2.4 Dual or Multi-Guide Attention
- Dual-branch Attention Fusion: Spatial (mirrored or symmetric) and temporal (past frame) references are encoded in parallel, producing dual key/value sets for each masked frame. The denoising network conditions on a fused attention context (via learnable weights), enforcing both spatial plausibility and temporally coherent synthesis (Zhang et al., 2024).
3. Representative Architectures
The following table outlines distinguishing features of major motion-injected inpainting models:
| Model | Motion Injection Method | Principal Architecture |
|---|---|---|
| VidSplice (Xie et al., 24 Oct 2025) | CoSpliced spatial anchors, context controller with flow-matching supervision | Two-branch (spatial + motion), DiT backbone, ControlNet-style fusion |
| FGDVI (Gu et al., 2023) | Optical-flow guided latent propagation and U-Net spatiotemporal attention | LDM backbone with spatiotemporal attention, Deformable Conv |
| MVInpainter (Cao et al., 2024) | Slot-attention on RAFT-derived flow embeddings | SD1.5-inpainting U-Net with temporal transformer + Ref-KV |
| CoCoCo (Zi et al., 2024) | TSA + DGA + textual cross-attention (no flow) | UNet backbone, attention-injected motion correlations, cross-attention |
| AVID (Zhang et al., 2023) | Inflated per-layer temporal attention modules | Latent diffusion UNet with 3D/temporal attention per block |
| DiffMVR (Zhang et al., 2024) | Dual-guide fused cross-attention plus temporal latent consistency loss | SD-v1.5 U-Net, fused CLIP-guided cross-attention |
| DreamDance (Zhang et al., 30 May 2025) | Latent gating on pose/appearance with time-varying gating weights | DiT backbone, per-step gates, CLIP guidance |
| MaskedMimic (Tessler et al., 2024) | Masked motion inpainting via VAE conditioned on masked kinematics | Transformer prior + VAE policy, multi-modal embedding |
All approaches align on the necessity of explicitly or implicitly introducing motion-aware priors, whether via optical flow, temporal attention, slot-based aggregation, or context-aware gating.
4. Optimization and Loss Functions
Motion-injected inpainting models typically combine spatial reconstruction objectives with temporal consistency supervision:
- Diffusion Denoising MSE: Ubiquitous noise-prediction MSE as the core training target, computed on masked pixels and/or masked latents.
- ID-Alignment and Flow-Matching Loss: When explicit motion (e.g., flow, slots) is estimated, additional L2 losses match network-predicted flow or propagated appearance against ground truth or computed reference (Xie et al., 24 Oct 2025, Zhang et al., 2019).
- Image-Flow Consistency Loss: For architectures hallucinating both appearance and motion (e.g., DIP-Flow (Zhang et al., 2019)), warping-based consistency constraints enforce agreement between inpainted frames and hallucinated flows.
- Motion Consistency Loss: Latent similarity penalties across consecutive frames regularize framewise outputs to preserve motion smoothness, even in the absence of explicit flow (Zhang et al., 2024).
- No Adversarial Loss: Most modern pipelines (e.g., AVID, CoCoCo) abstain from adversarial objectives or explicit temporal discriminators, relying instead on architectural or attention-level design to induce coherence.
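Under these conventions, a typical composite objective can be sketched as follows. The weighting hyperparameter `lam` and the function names are hypothetical; the terms correspond to the masked denoising MSE and the latent motion-consistency penalty described above, not to any one paper's exact loss.

```python
import numpy as np

def masked_diffusion_mse(eps_pred, eps_true, mask):
    """Noise-prediction MSE restricted to masked latent positions.

    eps_pred, eps_true: predicted / ground-truth noise, mask: same shape,
    1.0 inside the hole and 0.0 elsewhere.
    """
    sq = (eps_pred - eps_true) ** 2 * mask
    return sq.sum() / np.maximum(mask.sum(), 1.0)

def motion_consistency(latents):
    """Penalize latent change between consecutive frames (T, C, H, W)."""
    return np.mean((latents[1:] - latents[:-1]) ** 2)

def total_loss(eps_pred, eps_true, mask, latents, lam=0.1):
    """Composite objective: masked denoising MSE + weighted temporal term."""
    return masked_diffusion_mse(eps_pred, eps_true, mask) + lam * motion_consistency(latents)
```

In practice the consistency term is often applied only inside the mask, and flow-matching or ID-alignment losses are added when explicit motion estimates are available.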
5. Applications, Capabilities, and Empirical Results
Motion-injected inpainting methodologies have demonstrated state-of-the-art results across a wide set of video restoration, content removal, editing, and controllable synthesis tasks:
- Video/Object Removal: Removal of real-world humans, objects, or pedestrians with robust temporal consistency, identity preservation, and artifact suppression (Xie et al., 24 Oct 2025, Sun et al., 3 Apr 2025).
- Multi-View Editing and 3D Consistency: Pose-free, NVS-aligned multi-view object inpainting, inpainting-based 3D scene editing, and consistent object replacement via motion slot-injected diffusion (Cao et al., 2024).
- Text-Guided Inpainting: High-quality, text-aligned identity and action-preserving inpainting even in dynamic, ambiguous, or open-vocabulary settings (Zi et al., 2024, Zhang et al., 2023).
- Dynamic Texture Restoration: Internal, video-specific adaptation for the restoration of stochastic, non-deterministic dynamic backgrounds without external corpus training (Cherel et al., 2023).
- Physics-Based Control: Unification of controllable character animation modalities by masking kinematic constraints and learning a conditional VAE over physical trajectories (Tessler et al., 2024).
Empirical benchmarks consistently report gains in PSNR, SSIM, FVD, LPIPS, CLIP-based semantic alignment, and temporal consistency metrics. Flow-guided architectures such as FGDVI achieve roughly 10% lower flow warping error (E_warp) than prior state of the art, and dual-guidance designs realize multi-point improvements in both perceptual quality and temporal stability (Xie et al., 24 Oct 2025, Zhang et al., 2024).
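For reference, the flow warping error E_warp used in such benchmarks can be sketched as below (nearest-neighbor sampling for brevity; published implementations use bilinear sampling and occlusion masks estimated from forward-backward flow checks):

```python
import numpy as np

def warping_error(frame_t, frame_t1, flow, valid):
    """Flow warping error: backward-warp frame t+1 into frame t's coordinates
    using the forward flow (H, W, 2), then average the absolute difference
    over valid (non-occluded) pixels given by the 0/1 mask `valid`."""
    H, W = frame_t.shape[:2]
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sx = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, W - 1)
    sy = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, H - 1)
    warped = frame_t1[sy, sx]
    err = np.abs(frame_t - warped)
    if err.ndim == 3:          # average over color channels if present
        err = err.mean(axis=-1)
    return (err * valid).sum() / np.maximum(valid.sum(), 1.0)
```

A perfectly temporally consistent output scores zero wherever the flow and occlusion mask are accurate, which is why the metric is only as trustworthy as the flow estimator that backs it.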
6. Limitations and Open Problems
Despite broad successes, several intrinsic and practical complexities persist:
- Flow Dependency: Approaches reliant on explicit flow injection are bounded by flow estimation accuracy and robustness to occlusion or scene ambiguity (Gu et al., 2023).
- Temporal Hallucination: Models lacking physical or learned priors can propagate artifacts or introduce non-causal trajectories, particularly with long or corrupted sequences.
- Scale and Efficiency: Latent attention inflation, slot aggregation, or dual-branch conditioning impose additional computation. Latent interpolation and anchor-frame strategies partially alleviate runtime cost (Gu et al., 2023).
- Controllability/Steerability: Balancing fine-grained user-injected motion cues against data-driven priors remains a challenge; gating and dual-guide weighting are the main mechanisms for control/creativity trade-off (Zhang et al., 30 May 2025, Zhang et al., 2024).
Research continues toward adaptive flow scheduling, scaling to high-resolution and long-duration video, and generalized, multi-modal motion synthesis.
7. Historical Evolution and Variants
The motion-injected inpainting paradigm has evolved through several stages:
- Variational/Optimization-Based Methods: Early joint spatial-motion energy formulations, alternating between inpainting and flow estimation with PDE solvers (Lauze et al., 2018).
- Internal Learning and DIP-Flow: Extension of deep image prior schemes to learn both flow and appearance jointly via restricted-internal training (Zhang et al., 2019).
- Diffusion-Based Pipelines: Modern approaches center on diffusion models, where motion is injected using architectural, attention-driven, or flow-guided augmentations, and loss functions enforced at the latent and feature level (Gu et al., 2023, Zhang et al., 2023, Xie et al., 24 Oct 2025).
Recent developments have shifted toward learned, modular, and plug-and-play injection methods (slot attention, dual-fusion, motion modules) that are dataset-agnostic and flexibly adapt to user control or diverse scene priors.
References:
- VidSplice (Xie et al., 24 Oct 2025)
- Flow-Guided Diffusion (Gu et al., 2023)
- MVInpainter (Cao et al., 2024)
- CoCoCo (Zi et al., 2024)
- AVID (Zhang et al., 2023)
- MaskedMimic (Tessler et al., 2024)
- DreamDance (Zhang et al., 30 May 2025)
- Infusion (Cherel et al., 2023)
- Motion-Compensated Variational Inpainting (Lauze et al., 2018)
- DIP-Flow (Zhang et al., 2019)
- DiffMVR (Zhang et al., 2024)
For precise equations, ablation results, and implementation details, consult the respective arXiv primary sources.