Test-Time Natural Video Completion
- Test-time natural video completion is an inference-time approach that generates temporally and spatially coherent video frames from sparse observations using large-scale pre-trained models such as diffusion models and VAEs.
- It leverages in-context conditioning, uncertainty-aware fusion, and limited test-time fine-tuning to adapt generic priors to diverse scenes without scene-specific retraining.
- The approach enables practical applications including novel-view synthesis, video inpainting, and motion extension, achieving improved quantitative metrics such as PSNR and SSIM on benchmark datasets.
Test-time natural video completion refers to the process of synthesizing missing or hypothetical portions of a video sequence at inference time using prior knowledge from pre-trained models, often in contexts where conventional per-scene or sequence-specific training is not feasible. The goal is to generate temporally and spatially coherent frames, guided solely by sparse input observations, masks, patches, or control signals, to fill in spatial gaps, interpolate unobserved viewpoints, extend motion, or perform inpainting, typically under a zero-shot or lightly fine-tuned regime. This paradigm unifies various practical scenarios: novel-view synthesis from a few glimpses, arbitrary spatiotemporal conditioning, video inpainting under high uncertainty, and completing natural video sequences for both creative and scientific purposes.
1. Theoretical Foundations and Problem Formalization
Test-time natural video completion departs from earlier paradigms that treat video completion or synthesis as a purely supervised learning problem requiring extensive pre-collected data and scene-specific retraining. The approach instead relies on inference-time algorithms that exploit the inductive biases of large-scale, pre-trained video models, particularly diffusion models and hybrid causal VAEs, to hallucinate plausible content from partial observations without additional scene-specific learning.
Formally, the task is often defined as follows. Given sparse or incomplete observations $V_{\text{obs}} = M \odot V$ (which may include a subset of frames, spatial patches, or novel views) and possibly auxiliary conditions $c$ (such as text instructions or trajectory descriptors), the completion system synthesizes the full video $\hat{V} = f_\theta(V_{\text{obs}}, M, c)$, where the mask $M$ encodes a masking or conditioning strategy appropriate to the input space (Fu et al., 2022, Cai et al., 9 Oct 2025). This formulation subsumes prediction (from initial frames), infilling (from both ends), and arbitrary patch-based completion. The difficulty arises because the available evidence is highly ambiguous, and the output distribution must cover plausible natural video continuations respecting both semantic and physical constraints.
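To make this concrete, the following minimal Python sketch (illustrative only; tensor layouts and names are assumptions, not taken from the cited papers) shows how prediction, infilling, and arbitrary patch-based completion all reduce to choosing a spatio-temporal mask $M$:

```python
# Minimal sketch: prediction, infilling, and patch-based completion
# expressed as one masked-conditioning problem. Shapes are illustrative.
import numpy as np

def make_condition(video, patches):
    """video: (T, H, W, C) array; patches: list of (t, y, x, h, w) observed regions.
    Returns (observations, mask) where mask[t, y, x] == 1 marks observed pixels."""
    T, H, W, C = video.shape
    mask = np.zeros((T, H, W, 1), dtype=video.dtype)
    for (t, y, x, h, w) in patches:
        mask[t, y:y + h, x:x + w, 0] = 1.0
    return video * mask, mask

T, H, W, C = 16, 64, 64, 3
video = np.random.rand(T, H, W, C).astype(np.float32)

# Prediction: the first few frames are fully observed.
pred_obs, pred_mask = make_condition(video, [(t, 0, 0, H, W) for t in range(4)])
# Infilling: the first and last frames are observed.
infill_obs, infill_mask = make_condition(video, [(0, 0, 0, H, W), (T - 1, 0, 0, H, W)])
# Arbitrary patch completion: scattered spatio-temporal patches are observed.
patch_obs, patch_mask = make_condition(video, [(3, 8, 8, 16, 16), (10, 32, 20, 16, 24)])
```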
2. Key Methodological Advances
Test-time natural video completion leverages recent innovations in generative modeling, sequence conditioning, and geometric feedback:
- Pretrained Video Diffusion Models: Large-scale models, such as Stable Video Diffusion, provide powerful priors over video sequences and allow for flexible conditional sampling at inference without further scene-specific training (Xu et al., 22 Nov 2025, Lee et al., 21 Aug 2024).
- In-Context Conditioning (ICC) and Hybrid Conditioning: ICC wraps conditioning information (patches, frames, latent codes, or text) into the token sequence processed by a frozen transformer backbone. Spatial placement is encoded by zero-padding, temporal placement by fractional rotary positional embeddings (Temporal RoPE Interpolation), resolving the ambiguity introduced by temporally-compressed VAEs (Cai et al., 9 Oct 2025); a minimal sketch of the fractional temporal positioning appears after this list.
- Uncertainty-aware Fusion: When guidance from geometric priors (e.g., 3D Gaussian splatting models) is unreliable (in occluded or ambiguous regions), completion is regulated via uncertainty-weighted blending of guidance and generative predictions, ensuring naturalistic outputs even in under-observed zones (Xu et al., 22 Nov 2025).
- Test-time Finetuning Loops: For challenging settings (e.g., monocular dynamic scenes), a limited amount of on-the-fly fine-tuning of small network blocks (especially temporal transformers) can be used to adapt generic priors to the idiosyncrasies of the test sequence within a few hundred gradient steps (Chen et al., 16 Jul 2025, Lee et al., 21 Aug 2024).
- Latent Propagation and Deformable Noise Alignment: Specialized modules propagate and align information temporally within the latent space to fill missing regions in the initial frames based on future context, maintaining temporal coherence across completions (Lee et al., 21 Aug 2024).
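As a concrete illustration of ICC-style fractional temporal positioning, the sketch below maps pixel-space frame indices to fractional latent positions under a temporally-compressed VAE and applies standard rotary embeddings at those positions. The stride value and function names are assumptions for illustration, not the VideoCanvas implementation:

```python
# Minimal sketch, assuming a temporally-compressed VAE with temporal stride `stride`
# (e.g., 4 pixel frames per latent frame). Names are illustrative.
import torch

def fractional_temporal_positions(pixel_frame_indices, stride=4):
    """Map pixel-space frame indices to fractional latent-time positions, so a
    condition frame landing between latent slots gets an interpolated RoPE phase."""
    return torch.tensor(pixel_frame_indices, dtype=torch.float32) / stride

def rope_phases(positions, dim=64, base=10000.0):
    """Standard 1-D rotary embedding angles evaluated at (possibly fractional) positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions[:, None] * inv_freq[None, :]          # (N, dim/2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x, cos, sin):
    """Rotate consecutive feature pairs of x (N, dim) by the given phases."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# A condition frame at pixel time 6 with stride 4 sits at latent position 1.5,
# between latent slots 1 and 2, rather than being snapped to either one.
pos = fractional_temporal_positions([0, 6, 13], stride=4)
cos, sin = rope_phases(pos, dim=64)
tokens = torch.randn(3, 64)
tokens_rot = apply_rope(tokens, cos, sin)
```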
3. Representative Architectures and Pipelines
Several notable system architectures exemplify the state-of-the-art in test-time natural video completion:
| Architecture/Paper | Input Conditioning | Backbone | Output |
|---|---|---|---|
| VideoCanvas (Cai et al., 9 Oct 2025) | Arbitrary patches at any time/space | DiT + VAE, ICC | Consistent completed video |
| Test-Time Video Diffusion (Xu et al., 22 Nov 2025) | Sparse images at novel or in-between views | Stable Video Diff., 3D-GS feedback | Hallucinated video trajectory |
| FFF-VDI (Lee et al., 21 Aug 2024) | Masked input video, masks | 3D-U-Net (stable video diff.), FFF module | Inpainted video |
| CogNVS (Chen et al., 16 Jul 2025) | Partial co-visible renders in novel views | Video diffusion + VAE; test-time fine-tuning | Completed dynamic scene video |
VideoCanvas addresses the placement of arbitrary spatio-temporal conditions by hybrid conditioning: each patch or frame is encoded independently, spatially zero-padded, and temporally positioned via fractional RoPE; inference concatenates these tokens with noisy latent slots to enable free-form "video painting" (Cai et al., 9 Oct 2025). Test-Time Video Diffusion for novel view synthesis hallucinates intermediate views using pretrained diffusion, with uncertainty-aware blending of 2D guidance from geometric proxies (e.g., 3D-GS), iteratively improving 3D and 2D consistency (Xu et al., 22 Nov 2025). FFF-VDI combines DDIM inversion, latent propagation, deformable noise alignment, and slight per-video fine-tuning for robust inpainting (Lee et al., 21 Aug 2024).
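The uncertainty-aware blending step can be sketched as a per-pixel convex combination of the geometric render and the diffusion prediction. The weighting rule and the photometric uncertainty proxy below are illustrative assumptions, not the exact formulation of Xu et al. (22 Nov 2025):

```python
# Minimal sketch of uncertainty-weighted blending, assuming a per-pixel uncertainty
# map u in [0, 1] that is high where 3D-GS guidance is unreliable.
import torch

def blend_guidance(diffusion_pred, gs_render, uncertainty):
    """diffusion_pred, gs_render: (C, H, W) frames; uncertainty: (1, H, W) in [0, 1].
    Where uncertainty is low, trust the geometric render; where high, trust the generative prior."""
    return (1.0 - uncertainty) * gs_render + uncertainty * diffusion_pred

def uncertainty_map(gs_render, reference, tau=0.2):
    """Illustrative photometric uncertainty: a large color difference against a reference
    (e.g., a warped observed view) maps to high uncertainty via a soft threshold."""
    err = (gs_render - reference).abs().mean(dim=0, keepdim=True)  # (1, H, W)
    return torch.sigmoid((err - tau) / (0.25 * tau))

C, H, W = 3, 64, 64
gs_render = torch.rand(C, H, W)
reference = torch.rand(C, H, W)
diffusion_pred = torch.rand(C, H, W)
u = uncertainty_map(gs_render, reference)
fused = blend_guidance(diffusion_pred, gs_render, u)
```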
4. Evaluation Protocols and Empirical Benchmarks
Test-time natural video completion is evaluated on both canonical and newly designed benchmarks that stress spatial, temporal, and generalization capabilities:
- VideoCanvasBench: Covers arbitrary patch-to-video, image-to-video, and video-to-video transitions under both homologous (same scene) and non-homologous (cross-scene) configurations; metrics include PSNR, FVD, LAION-AES, MUSIQ, CLIP-CSCV, Dynamic Degree, and expert user studies (Cai et al., 9 Oct 2025).
- Novel View Synthesis Datasets: LLFF, DTU, DL3DV, and MipNeRF-360 offer evaluation in sparse-input, high-ambiguity scenarios, measuring PSNR, SSIM, and LPIPS (Xu et al., 22 Nov 2025).
- Standard Inpainting Metrics: PSNR, SSIM, VFID, and flow-warp error are used to quantify the fidelity and consistency of inpainted outputs in semantically rich videos (e.g., YouTube-VOS, DAVIS) (Lee et al., 21 Aug 2024).
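As a point of reference for the metrics above, the following sketch computes PSNR restricted to completed (previously unobserved) regions; benchmarks may instead average over full frames, so treat the masking convention as an assumption:

```python
# Minimal sketch of PSNR evaluated only over completed regions.
import numpy as np

def masked_psnr(pred, target, mask, max_val=1.0, eps=1e-12):
    """pred, target: (T, H, W, C) in [0, max_val]; mask: (T, H, W, 1), 1 = region to evaluate."""
    diff2 = ((pred - target) ** 2) * mask
    mse = diff2.sum() / (mask.sum() * pred.shape[-1] + eps)
    return 10.0 * np.log10((max_val ** 2) / (mse + eps))

T, H, W, C = 8, 32, 32, 3
target = np.random.rand(T, H, W, C)
pred = np.clip(target + 0.05 * np.random.randn(T, H, W, C), 0.0, 1.0)
mask = np.zeros((T, H, W, 1))
mask[:, 8:24, 8:24, :] = 1.0   # evaluate only the central completed region
print(masked_psnr(pred, target, mask))
```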
Test-time diffusion approaches consistently outperform conventional baselines, especially under extreme measurement sparsity. For example, on LLFF with 3 input views, the uncertainty-aware diffusion framework achieves 20.51 dB PSNR, 0.840 SSIM, and 0.137 LPIPS, compared to 20.44 dB / 0.702 / 0.207 for leading 3D-GS methods (Xu et al., 22 Nov 2025). On VideoCanvasBench, ICC conditioning achieves the best FVD and Dynamic Degree for arbitrary patch completion tasks (Cai et al., 9 Oct 2025). In inpainting, FFF-VDI reaches 35.06 dB PSNR and 0.9812 SSIM, surpassing optical-flow-based methods (Lee et al., 21 Aug 2024).
5. Closed-loop 3D–2D Feedback Mechanisms
A central innovation is coupling 3D scene reconstruction and 2D generative video synthesis in an iterative feedback loop. Methods such as "Novel View Synthesis from A Few Glimpses via Test-Time Natural Video Completion" (Xu et al., 22 Nov 2025) instantiate the following cycle:
- Initialize 3D-GS on sparse views.
- Generate intermediate ("pseudo-view") frames at novel camera poses using test-time diffusion, conditioned by 3D-GS guidance and uncertainty masking.
- Densify Gaussians in the 3D model using synthetic views, filtered for geometric reliability.
- Jointly retrain the 3D-GS with both observed and synthetic supervision, iterating the procedure.
This closed loop allows both the geometric proxy (for rendering and depth estimation) and the diffusion prior (for photo-realistic frame generation) to compensate for each other's weaknesses, producing consistent video interpolations even under severe observation sparsity. The use of uncertainty maps, which combine geometric (alignment) and photometric (color difference) measures to gate the influence of guidance, is critical; ablating this mechanism degrades PSNR by 0.75 dB or more on challenging datasets (Xu et al., 22 Nov 2025).
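A schematic of this closed loop is sketched below. All helper functions are hypothetical stubs standing in for 3D-GS fitting, uncertainty-aware rendering, and diffusion sampling; the sketch conveys control flow only, not the actual implementation of Xu et al. (22 Nov 2025):

```python
# Schematic sketch of the 3D-2D feedback loop with hypothetical stubs.
import numpy as np

def init_gaussians(views, poses):                # stub: fit 3D Gaussians on sparse views
    return {"points": np.random.rand(1000, 3)}

def render_with_uncertainty(gaussians, pose):    # stub: render a view plus per-pixel uncertainty
    return np.random.rand(64, 64, 3), np.random.rand(64, 64, 1)

def diffusion_complete(guidance, uncertainty, views):  # stub: uncertainty-gated diffusion sample
    return (1 - uncertainty) * guidance + uncertainty * np.random.rand(*guidance.shape)

def densify_and_retrain(gaussians, views, poses, pseudo_views):  # stub: add reliable Gaussians, refit
    return gaussians

def closed_loop_completion(sparse_views, poses, pseudo_poses, num_rounds=3):
    gaussians = init_gaussians(sparse_views, poses)        # 1. initialize 3D-GS on sparse views
    for _ in range(num_rounds):
        pseudo_views = []
        for pose in pseudo_poses:
            guidance, u = render_with_uncertainty(gaussians, pose)
            frame = diffusion_complete(guidance, u, sparse_views)  # 2. hallucinate pseudo-view
            if u.mean() < 0.5:                                     # 3. keep reliable views only
                pseudo_views.append((pose, frame))
        gaussians = densify_and_retrain(gaussians, sparse_views, poses, pseudo_views)  # 4. refit, iterate
    return gaussians

sparse_views = [np.random.rand(64, 64, 3) for _ in range(3)]
poses = [np.eye(4) for _ in range(3)]
pseudo_poses = [np.eye(4) for _ in range(5)]
_ = closed_loop_completion(sparse_views, poses, pseudo_poses)
```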
6. Limitations, Open Challenges, and Future Directions
Despite substantial progress, test-time natural video completion faces several persistent challenges:
- Conditioning Overhead and Latency: Encoding multiple conditioning patches independently increases transformer sequence length and inference time, which scales with the number of conditions (Cai et al., 9 Oct 2025).
- Sparse Region Ambiguity: Identity drift and blurriness can occur if observed regions are extremely sparse or irregularly distributed along time (Cai et al., 9 Oct 2025, Xu et al., 22 Nov 2025).
- Scene Generalization versus Per-Sequence Adaptation: Zero-shot approaches benefit from broad prior coverage, but can underperform on videos with idiosyncratic motion or lighting; test-time fine-tuning can recover specificity, but at the cost of computational efficiency (Chen et al., 16 Jul 2025, Lee et al., 21 Aug 2024).
- Out-of-Distribution Generalization: VAE or diffusion models trained on natural imagery may degrade on inputs with drastically different styles or domains (Cai et al., 9 Oct 2025).
- Limited Geometry Under Uncertainty: Even with uncertainty-aware fusion, regions with little or uninformative guidance still require generative "hallucination," which may not always yield physically correct or semantically valid content (Xu et al., 22 Nov 2025).
Open problems include developing adaptive token selection or attention-pruning for efficient test-time conditioning, tighter integration of 3D and 2D representations with mutual uncertainty estimation, expansion to streaming and prolonged-duration video, and enforcing stronger common-sense or physics-based priors for extrapolative generation. Future work also points towards variable-length decoding and deeper integration with diffusion backbones for increased sample diversity (Cai et al., 9 Oct 2025, Fu et al., 2022).
7. Relationship to Prior Paradigms
Test-time natural video completion generalizes and unifies several historical approaches:
- Classic video inpainting: Early methods typically framed completion as low-rank or TV-regularized tensor completion, iteratively reconstructing missing entries under global or local smoothness priors (Ko et al., 2018). These approaches scale poorly to semantic gaps or videos with dynamic scenes and generally lack strong generative priors.
- Propagation-based and optical-flow video inpainting: More recent methods introduced flow-based completion, but such techniques are sensitive to flow-estimation accuracy and suffer from error propagation, especially under large or moving masks (Lee et al., 21 Aug 2024).
- Feed-forward generative models: Pre-trained video diffusion and VQ-based transformers now offer highly expressive, conditional sequence generation, enabling the zero-shot, patch-based, and uncertainty-aware test-time algorithms detailed above (Fu et al., 2022, Cai et al., 9 Oct 2025, Xu et al., 22 Nov 2025).
The convergence of geometric proxies, large-scale generative models, and flexible test-time conditioning constitutes the state-of-the-art in natural video completion and reflects a broader trend in visual AI toward inference-time modular composition of priors and task signals.