Motion-Injected Inpainting Model
- Motion-injected inpainting decouples appearance reconstruction from motion propagation to achieve coherent inpainting across video frames or views.
- Models in this class employ explicit optical flow, temporal attention, and dual-stream fusion to maintain consistent motion dynamics and spatial fidelity in masked regions.
- Dual-branch designs blend spatial anchors with motion cues, enabling video object removal, multi-view editing, and dynamic texture restoration.
A motion-injected inpainting model refers to any inpainting paradigm—predominantly in video or sequential domains—where explicit mechanisms are introduced to encode, propagate, or synthesize motion information alongside spatial inpainting. The central objective is the restoration or editing of spatiotemporally coherent content in regions with missing or masked data, while maintaining consistent motion dynamics, appearance identity, and scene integrity across frames or views. This class of models integrates motion cues through a variety of technical strategies, including explicit optical flow propagation, motion attention modules, flow-supervised objectives, latent space gating, and hybrid architectural designs, situating itself at the intersection of generative modeling, video understanding, and controllable synthesis. Below, the technical principles, methods, and representative architectures of motion-injected inpainting are systematically described.
1. Decoupling Motion and Appearance: Foundational Principles
The essence of motion-injected inpainting is the decoupling of appearance/identity reconstruction from the propagation or synthesis of motion in masked/spatially ambiguous regions. Several prominent pipelines employ this separation:
- Spatial Anchor Construction: High-fidelity inpainting is performed on a subset of key frames or views to establish reliable spatial anchors for subsequent propagation.
- Motion Propagation Module: Dedicated modules (e.g., temporal diffusion transformers, optical flow propagators, attention-based correlation mechanisms) propagate these anchors temporally or across views, modeling plausible dynamics within masked zones.
- Dual-Stream Conditioning: Concurrent spatial and motion cues ("context priors") are fused, e.g., via control tokens, gating networks, or context-aware latent fusion, ensuring spatial fidelity and temporal coherence (Xie et al., 24 Oct 2025, Gu et al., 2023).
This architecture enables controlled, coherent content synthesis even under severe data degradation, misalignment, large occlusions, or perspective change.
2. Motion Injection Mechanisms and Model Classes
Motion injection has been operationalized via multiple approaches:
2.1 Explicit Optical Flow-Guided Propagation
Several models compute, complete, and inject optical flow fields to directly model inter-frame or inter-view correspondences.
- Flow-Guided Latent Propagation: Features are propagated in a single step by warping them with estimated (and completed) optical flow fields, typically within a pre-trained VAE latent space. Flow-guided deformable convolution layers then refine latent features in masked regions, integrating motion cues before further denoising steps (Gu et al., 2023).
- Slot Attention on Flow Embeddings: Discrete "motion slots" are derived from dense optical flow vectors using slot-attention mechanisms, producing compact motion embeddings that condition the denoising U-Net or transformer. This enables pose-free and camera-pose–agnostic multi-view inpainting (Cao et al., 2024).
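The core of flow-guided propagation can be sketched in a few lines of NumPy. This is an illustrative toy, not the FGDVI implementation: the function names (`warp_with_flow`, `propagate_into_mask`) are invented here, and the actual pipeline operates on VAE latents with deformable-convolution refinement rather than plain bilinear warping.

```python
import numpy as np

def warp_with_flow(feat, flow):
    """Backward-warp a feature map (C, H, W) with a dense flow field (2, H, W).

    flow[0] and flow[1] give per-pixel (dx, dy) displacements pointing into
    the source frame; sampling is bilinear with border clamping.
    """
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sx = np.clip(xs + flow[0], 0, W - 1)  # source x coordinates
    sy = np.clip(ys + flow[1], 0, H - 1)  # source y coordinates
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    wx, wy = sx - x0, sy - y0             # bilinear weights
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bot = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bot * wy

def propagate_into_mask(target_feat, source_feat, flow, mask):
    """Fill masked positions (mask: H x W bool) with flow-warped source features."""
    warped = warp_with_flow(source_feat, flow)
    return np.where(mask[None], warped, target_feat)
```

Because the warp is a backward lookup, every target pixel receives a value and no holes appear in the propagated region, which is why completed (hole-free) flow fields are a prerequisite.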
2.2 Temporal Attention and Motion Capture
Temporal consistency is enforced via architectural inflation (extending 2D layers to handle temporal sequences) and injection of motion-specific attention:
- Temporal Self-Attention (TSA) and Damped Global Attention (DGA): Layers performing (i) pixel-aligned self-attention across frames (TSA) and (ii) downsampled global attention across the entire spatio-temporal block (DGA) are introduced into the backbone (Zi et al., 2024). These modules implicitly capture motion without computing flow.
- Motion Module Inflation: Each standard convolutional/attention layer is augmented with temporal attention (inflated weights or dedicated QKV heads), permitting simultaneous multi-frame processing where temporal information is modeled at every spatial location (Zhang et al., 2023).
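The inflation idea can be sketched minimally: attention is computed over the time axis independently at each spatial location, so the T frames act as the token sequence and a 2D backbone gains temporal reasoning without any explicit flow. The projection matrices `Wq`, `Wk`, `Wv` below stand in for the inflated QKV heads; this is an assumption-laden illustration, not the TSA/DGA code from CoCoCo.

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def temporal_self_attention(feats, Wq, Wk, Wv):
    """Pixel-aligned temporal self-attention over a clip.

    feats: (T, H, W, C). Attention is computed independently at each spatial
    location, treating the T frames as the token sequence.
    """
    T, H, W, C = feats.shape
    tokens = feats.reshape(T, H * W, C).transpose(1, 0, 2)        # (HW, T, C)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = _softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C))        # (HW, T, T)
    out = attn @ v                                                # (HW, T, C)
    return out.transpose(1, 0, 2).reshape(T, H, W, C)
```

DGA follows the same pattern but first downsamples spatially and attends over the full spatio-temporal token set rather than per-pixel columns.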
2.3 Gating and Context Fusion in Latent Space
- Learnable Context/Control Fusion: Motion priors extracted from spaced temporal sampling or context duplication are encoded and fused into early stages of a diffusion transformer via additive or multiplicative gating, often with learned scalar gates per layer (Xie et al., 24 Oct 2025, Zhang et al., 30 May 2025).
- Residual Fusion: Motion-specific attention outputs are added to the main feature stream at multiple depths, functioning as residual connections that modulate activations by learned motion correlations (Zi et al., 2024).
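Scalar-gated injection can be illustrated in a few lines. The parameter name `gate_logit` is hypothetical; the point of the sketch (not the VidSplice or DreamDance code) is that a zero-initialized gate makes the injection a no-op at the start of training, ControlNet-style, so the pre-trained backbone is undisturbed until the gate learns to open.

```python
import numpy as np

def gated_context_fusion(hidden, context, gate_logit):
    """Additively inject an encoded motion/context prior into a hidden state,
    scaled by a learned per-layer scalar gate.

    With gate_logit initialized to 0, tanh(gate_logit) == 0 and the layer
    returns the unmodified backbone activation (zero-init gating).
    """
    gate = np.tanh(gate_logit)
    return hidden + gate * context
```

A multiplicative variant replaces the sum with `hidden * (1 + gate * context)`; both leave the backbone intact at initialization.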
2.4 Dual or Multi-Guide Attention
- Dual-branch Attention Fusion: Spatial (mirrored or symmetric) and temporal (past frame) references are encoded in parallel, producing dual key/value sets for each masked frame. The denoising network conditions on a fused attention context (via learnable weights), enforcing both spatial plausibility and temporally coherent synthesis (Zhang et al., 2024).
3. Representative Architectures
The following table outlines distinguishing features of major motion-injected inpainting models:
| Model | Motion Injection Method | Principal Architecture |
|---|---|---|
| VidSplice (Xie et al., 24 Oct 2025) | CoSpliced spatial anchors, context controller with flow-matching supervision | Two-branch (spatial + motion), DiT backbone, ControlNet-style fusion |
| FGDVI (Gu et al., 2023) | Optical-flow guided latent propagation and U-Net spatiotemporal attention | LDM backbone with spatiotemporal attention, Deformable Conv |
| MVInpainter (Cao et al., 2024) | Slot-attention on RAFT-derived flow embeddings | SD1.5-inpainting U-Net with temporal transformer + Ref-KV |
| CoCoCo (Zi et al., 2024) | TSA + DGA + textual cross-attention (no flow) | UNet backbone, attention-injected motion correlations, cross-attention |
| AVID (Zhang et al., 2023) | Inflated per-layer temporal attention modules | Latent diffusion UNet with 3D/temporal attention per block |
| DiffMVR (Zhang et al., 2024) | Dual-guide fused cross-attention plus temporal latent consistency loss | SD-v1.5 U-Net, fused CLIP-guided cross-attention |
| DreamDance (Zhang et al., 30 May 2025) | Latent gating on pose/appearance with time-varying gating weights | DiT backbone, per-step gates, CLIP guidance |
| MaskedMimic (Tessler et al., 2024) | Masked motion inpainting via VAE conditioned on masked kinematics | Transformer prior + VAE policy, multi-modal embedding |
All approaches align on the necessity of explicitly or implicitly introducing motion-aware priors, whether via optical flow, temporal attention, slot-based aggregation, or context-aware gating.
4. Optimization and Loss Functions
Motion-injected inpainting models typically combine spatial reconstruction objectives with temporal consistency supervision:
- Diffusion Denoising MSE: Ubiquitous noise-prediction MSE as the core training target, computed on masked pixels and/or masked latents.
- ID-Alignment and Flow-Matching Loss: When explicit motion (e.g., flow, slots) is estimated, additional L2 losses match network-predicted flow or propagated appearance against ground truth or computed reference (Xie et al., 24 Oct 2025, Zhang et al., 2019).
- Image-Flow Consistency Loss: For architectures hallucinating both appearance and motion (e.g., DIP-Flow (Zhang et al., 2019)), warping-based consistency constraints enforce agreement between inpainted frames and hallucinated flows.
- Motion Consistency Loss: Latent similarity penalties across consecutive frames regularize framewise outputs to preserve motion smoothness, even in the absence of explicit flow (Zhang et al., 2024).
- No Adversarial Loss: Most modern pipelines (e.g., AVID, CoCoCo) abstain from adversarial objectives or explicit temporal discriminators, relying instead on architectural or attention-level design to induce coherence.
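Under these conventions, a typical composite objective can be sketched as follows. The weighting hyperparameter `lam` and the function names are hypothetical; the terms correspond to the masked denoising MSE and the latent motion-consistency penalty described above, not to any one paper's exact loss.

```python
import numpy as np

def masked_diffusion_mse(eps_pred, eps_true, mask):
    """Noise-prediction MSE restricted to masked latent positions.

    eps_pred, eps_true: predicted / ground-truth noise, mask: same shape,
    1.0 inside the hole and 0.0 elsewhere.
    """
    sq = (eps_pred - eps_true) ** 2 * mask
    return sq.sum() / np.maximum(mask.sum(), 1.0)

def motion_consistency(latents):
    """Penalize latent change between consecutive frames (T, C, H, W)."""
    return np.mean((latents[1:] - latents[:-1]) ** 2)

def total_loss(eps_pred, eps_true, mask, latents, lam=0.1):
    """Composite objective: masked denoising MSE + weighted temporal term."""
    return masked_diffusion_mse(eps_pred, eps_true, mask) + lam * motion_consistency(latents)
```

In practice the consistency term is often applied only inside the mask, and flow-matching or ID-alignment losses are added when explicit motion estimates are available.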
5. Applications, Capabilities, and Empirical Results
Motion-injected inpainting methodologies have demonstrated state-of-the-art results across a wide set of video restoration, content removal, editing, and controllable synthesis tasks:
- Video/Object Removal: Removal of real-world humans, objects, or pedestrians with robust temporal consistency, identity preservation, and artifact suppression (Xie et al., 24 Oct 2025, Sun et al., 3 Apr 2025).
- Multi-View Editing and 3D Consistency: Pose-free, NVS-aligned multi-view object inpainting, inpainting-based 3D scene editing, and consistent object replacement via motion slot-injected diffusion (Cao et al., 2024).
- Text-Guided Inpainting: High-quality, text-aligned identity and action-preserving inpainting even in dynamic, ambiguous, or open-vocabulary settings (Zi et al., 2024, Zhang et al., 2023).
- Dynamic Texture Restoration: Internal, video-specific adaptation for the restoration of stochastic, non-deterministic dynamic backgrounds without external corpus training (Cherel et al., 2023).
- Physics-Based Control: Unification of controllable character animation modalities by masking kinematic constraints and learning a conditional VAE over physical trajectories (Tessler et al., 2024).
Empirical benchmarks consistently report gains in PSNR, SSIM, FVD, LPIPS, CLIP-based semantic alignment, and temporal consistency metrics. Flow-guided architectures such as FGDVI achieve roughly 10% lower flow warping error (E_warp) than prior state of the art, and dual-guidance designs realize multi-point improvements in both perceptual quality and temporal stability (Xie et al., 24 Oct 2025, Zhang et al., 2024).
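For reference, the flow warping error E_warp used in such benchmarks can be sketched as below (nearest-neighbor sampling for brevity; published implementations use bilinear sampling and occlusion masks estimated from forward-backward flow checks):

```python
import numpy as np

def warping_error(frame_t, frame_t1, flow, valid):
    """Flow warping error: backward-warp frame t+1 into frame t's coordinates
    using the forward flow (H, W, 2), then average the absolute difference
    over valid (non-occluded) pixels given by the 0/1 mask `valid`."""
    H, W = frame_t.shape[:2]
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sx = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, W - 1)
    sy = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, H - 1)
    warped = frame_t1[sy, sx]
    err = np.abs(frame_t - warped)
    if err.ndim == 3:          # average over color channels if present
        err = err.mean(axis=-1)
    return (err * valid).sum() / np.maximum(valid.sum(), 1.0)
```

A perfectly temporally consistent output scores zero wherever the flow and occlusion mask are accurate, which is why the metric is only as trustworthy as the flow estimator that backs it.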
6. Limitations and Open Problems
Despite broad successes, several intrinsic and practical complexities persist:
- Flow Dependency: Approaches reliant on explicit flow injection are bounded by flow estimation accuracy and robustness to occlusion or scene ambiguity (Gu et al., 2023).
- Temporal Hallucination: Models lacking physical or learned priors can propagate artifacts or introduce non-causal trajectories, particularly with long or corrupted sequences.
- Scale and Efficiency: Latent attention inflation, slot aggregation, or dual-branch conditioning impose additional computation. Latent interpolation and anchor-frame strategies partially alleviate runtime cost (Gu et al., 2023).
- Controllability/Steerability: Balancing fine-grained user-injected motion cues against data-driven priors remains a challenge; gating and dual-guide weighting are the main mechanisms for control/creativity trade-off (Zhang et al., 30 May 2025, Zhang et al., 2024).
Research continues toward adaptive flow scheduling, scaling to high-resolution and long-duration video, and generalized, multi-modal motion synthesis.
7. Historical Evolution and Variants
The motion-injected inpainting paradigm has evolved through several stages:
- Variational/Optimization-Based Methods: Early joint spatial-motion energy formulations, alternating between inpainting and flow estimation with PDE solvers (Lauze et al., 2018).
- Internal Learning and DIP-Flow: Extension of deep image prior schemes to learn both flow and appearance jointly via restricted-internal training (Zhang et al., 2019).
- Diffusion-Based Pipelines: Modern approaches center on diffusion models, where motion is injected using architectural, attention-driven, or flow-guided augmentations, and loss functions enforced at the latent and feature level (Gu et al., 2023, Zhang et al., 2023, Xie et al., 24 Oct 2025).
Recent developments have shifted toward learned, modular, and plug-and-play injection methods (slot attention, dual-fusion, motion modules) that are dataset-agnostic and flexibly adapt to user control or diverse scene priors.
References:
- VidSplice (Xie et al., 24 Oct 2025)
- Flow-Guided Diffusion (Gu et al., 2023)
- MVInpainter (Cao et al., 2024)
- CoCoCo (Zi et al., 2024)
- AVID (Zhang et al., 2023)
- MaskedMimic (Tessler et al., 2024)
- DreamDance (Zhang et al., 30 May 2025)
- Infusion (Cherel et al., 2023)
- Motion-Compensated Variational Inpainting (Lauze et al., 2018)
- DIP-Flow (Zhang et al., 2019)
- DiffMVR (Zhang et al., 2024)
For precise equations, ablation results, and implementation details, consult the respective arXiv primary sources.