- The paper presents a pose-free, trajectory-to-camera framework that decouples camera motion from scene dynamics using a view-conditional inpainting model.
- The paper details a spatiotemporal auto-regressive inference pipeline that employs a spatial-temporal transformer to ensure cross-view coherence and robust video inpainting.
- The paper demonstrates state-of-the-art 4D scene reconstruction performance on real-world datasets, enabling applications in robotics, autonomous driving, and immersive media.
SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting
Introduction and Motivation
SEE4D addresses the challenge of synthesizing spatiotemporal 4D content from monocular, casually captured videos without requiring explicit 3D or camera pose supervision. Existing video-to-4D methods typically rely on pose annotations or trajectory-conditioned models, which are impractical for in-the-wild footage and often entangle camera motion with scene dynamics, leading to unstable inference and limited scalability. SEE4D introduces a trajectory-to-camera formulation, decoupling camera control from scene modeling by rendering to a bank of fixed virtual cameras and leveraging a view-conditional video inpainting model. This approach eliminates the need for explicit 3D annotations and enables robust, temporally consistent 4D scene generation suitable for immersive applications.
Methodology
SEE4D replaces explicit trajectory prediction with rendering to a set of fixed virtual cameras. This trajectory-to-camera approach simplifies the learning problem by decoupling camera motion from scene dynamics, allowing the model to focus on generating consistent content across a bank of synchronized, fixed-view videos.
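The snippet below is a minimal sketch of the warp step this formulation relies on: forward-projecting a source frame into one of the fixed virtual cameras using monocular depth. The function name, the nearest-neighbour splatting, and the shared intrinsics are simplifying assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's implementation): forward-warp one source frame
# into a fixed virtual camera using monocular depth. All names are illustrative.
import numpy as np

def warp_to_virtual_camera(image, depth, K, R, t):
    """Forward-project `image` (H, W, 3) with per-pixel `depth` (H, W) into a
    virtual camera sharing intrinsics K, offset by rotation R and translation t
    from the source view. Returns the warped image and a binary mask marking
    pixels that received content (holes are left for the inpainting model)."""
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject to 3D in the source camera frame, then move into the virtual camera.
    cam_pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    cam_pts = R @ cam_pts + t.reshape(3, 1)

    # Project into the virtual view.
    proj = K @ cam_pts
    z = proj[2]
    u = np.round(proj[0] / np.clip(z, 1e-6, None)).astype(int)
    v = np.round(proj[1] / np.clip(z, 1e-6, None)).astype(int)

    warped = np.zeros_like(image)
    mask = np.zeros((H, W), dtype=bool)
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Nearest-neighbour splatting; a real pipeline would z-buffer or soft-splat.
    warped[v[valid], u[valid]] = image.reshape(-1, 3)[valid]
    mask[v[valid], u[valid]] = True
    return warped, mask
```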
View-Conditional Inpainting Model
The core of SEE4D is a view-conditional inpainting model based on a latent diffusion architecture. The model is conditioned on depth-warped images and associated masks, rather than explicit camera poses. The conditioning tensor fuses the warped image, mask, and step-dependent noisy latent, which is then processed by a spatial-temporal transformer backbone to enforce cross-view and cross-frame coherence.
Figure 1: Overview of the view-conditional inpainting model, showing the conditioning on depth-warped images and masks, and the spatial-temporal transformer backbone for denoising and inpainting.
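As a rough illustration of this conditioning, the sketch below concatenates the step-dependent noisy latent, the VAE-encoded warped frames, and the warp mask along the channel axis; the tensor layout and channel ordering are assumptions rather than the released model's interface.

```python
# Illustrative sketch of fusing the warp condition with the noisy latent.
# Channel layout and tensor shapes are assumptions, not the released model.
import torch

def build_condition(noisy_latent, warped_latent, warp_mask):
    """noisy_latent:  (B, T, C, h, w) step-dependent noisy latent
    warped_latent: (B, T, C, h, w) VAE-encoded depth-warped frames
    warp_mask:     (B, T, 1, h, w) 1 where warped content is valid
    Returns a channel-wise concatenation that the spatial-temporal
    transformer consumes in place of pose embeddings."""
    return torch.cat([noisy_latent, warped_latent, warp_mask], dim=2)
```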
Realistic Warp Synthesis
To simulate test-time artifacts, SEE4D perturbs target views by forward-projecting to randomly sampled virtual camera poses and back-projecting with jitter, introducing realistic warping noise and occlusion patterns. This process generates supervision signals that closely match the distribution of artifacts encountered during inference, improving robustness to depth noise and dynamic scene content.
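A simplified version of such a degradation is sketched below: the forward-backward round trip is collapsed into a single forward warp under a randomly jittered pose and multiplicatively noised depth. The jitter magnitudes are placeholders, and `warp_to_virtual_camera` refers to the helper from the earlier sketch.

```python
# Sketch of training-time warp degradation (parameters hypothetical); reuses the
# warp_to_virtual_camera helper from the earlier sketch.
import numpy as np

def synthesize_realistic_warp(image, depth, K, rot_jitter=0.02, trans_jitter=0.05,
                              depth_noise=0.03, rng=None):
    """Perturb the target view by forward-projecting to a randomly sampled
    virtual pose with jittered depth, so training conditions resemble the
    noisy, occluded warps seen at inference time."""
    rng = np.random.default_rng() if rng is None else rng

    # Small random rotation (axis-angle via Rodrigues) and translation.
    axis_angle = rng.normal(scale=rot_jitter, size=3)
    theta = np.linalg.norm(axis_angle) + 1e-12
    k = axis_angle / theta
    K_x = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    R = np.eye(3) + np.sin(theta) * K_x + (1 - np.cos(theta)) * (K_x @ K_x)
    t = rng.normal(scale=trans_jitter, size=3)

    # Multiplicative depth jitter emulates monocular-depth errors.
    noisy_depth = depth * (1.0 + rng.normal(scale=depth_noise, size=depth.shape))

    return warp_to_virtual_camera(image, noisy_depth, K, R, t)
```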
Noise-Adaptive Conditioning
The informativeness of the warp condition is modulated by the density of the warp mask, with unreliable warps down-weighted via increased noise injection. This prevents the model from overfitting to imperfect geometric cues and encourages robust inpainting, especially in regions with large occlusions or depth errors.
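One plausible way to implement this modulation is sketched below: the valid-pixel density of the warp mask sets the standard deviation of the noise injected into the warp condition. The linear schedule and its bounds are assumptions.

```python
# Sketch of noise-adaptive conditioning; the linear sigma schedule is an assumption.
import torch

def noise_adapted_condition(warped_latent, warp_mask, base_sigma=0.1, max_sigma=0.8):
    """Down-weight unreliable warps: sparse masks (large holes, heavy occlusion)
    receive stronger noise so the model does not overfit to bad geometric cues."""
    density = warp_mask.float().mean(dim=(-1, -2), keepdim=True)  # fraction of valid pixels
    sigma = base_sigma + (1.0 - density) * (max_sigma - base_sigma)
    return warped_latent + sigma * torch.randn_like(warped_latent)
```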
Spatial-Temporal Transformer Backbone
A lightweight spatial-temporal transformer backbone incorporates frame-time embeddings and spatial-temporal attention, enabling the model to maintain temporal consistency and cross-view coherence. This is critical for handling dynamic scenes and ensuring that generated content remains stable across both time and viewpoint changes.
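The block below is a minimal sketch of such a factorized design: additive frame-time embeddings, spatial attention within each frame, then temporal attention across frames. Dimensions and layer choices are placeholders, not the SEE4D architecture.

```python
# Minimal factorized spatial-temporal block; sizes are placeholders, not SEE4D's.
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Spatial attention within each frame, then temporal attention across
    frames at each spatial location, with additive frame-time embeddings."""
    def __init__(self, dim=320, heads=8, max_frames=64):
        super().__init__()
        self.frame_emb = nn.Embedding(max_frames, dim)   # frame-time embedding
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, T, N, C) tokens for T frames with N spatial tokens each.
        B, T, N, C = x.shape
        x = x + self.frame_emb(torch.arange(T, device=x.device))[None, :, None, :]

        s = self.norm1(x).reshape(B * T, N, C)            # attend within each frame
        x = x + self.spatial_attn(s, s, s, need_weights=False)[0].reshape(B, T, N, C)

        t = self.norm2(x).permute(0, 2, 1, 3).reshape(B * N, T, C)  # attend across frames
        t = self.temporal_attn(t, t, t, need_weights=False)[0]
        return x + t.reshape(B, N, T, C).permute(0, 2, 1, 3)
```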
Spatiotemporal Auto-Regressive Inference
SEE4D introduces a dual auto-regressive inference pipeline (sketched in the snippet below):
- Spatial auto-regression: novel viewpoints are reached through progressive view hops, with each hop conditioned on content generated at earlier hops.
- Temporal auto-regression: long sequences are produced with overlapping diffusion windows, sharing frames between consecutive windows to preserve continuity.
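The schematic loop below illustrates how the two auto-regressions could interleave; `warp_to_view` and `inpaint_window` are hypothetical callables supplied by the caller, and the window and overlap sizes are illustrative rather than the paper's settings.

```python
# Schematic outer loop for the dual auto-regressive inference; not the SEE4D API.
def generate_4d(frames, depths, virtual_cameras, warp_to_view, inpaint_window,
                window=16, overlap=4):
    """Temporal auto-regression over overlapping diffusion windows, with spatial
    auto-regression over progressive view hops inside each window."""
    outputs = {cam: [] for cam in virtual_cameras}  # per-camera generated frames
    start = 0
    while start < len(frames):
        chunk, chunk_depth = frames[start:start + window], depths[start:start + window]
        prev_view = chunk  # each hop chain starts from the observed input view
        for cam in virtual_cameras:  # progressive view hops, ordered near-to-far
            warped, mask = warp_to_view(prev_view, chunk_depth, cam)
            # Overlapping frames already generated for this camera act as context.
            context = outputs[cam][-overlap:] if outputs[cam] else None
            prev_view = inpaint_window(warped, mask, context)
            outputs[cam].extend(prev_view[overlap:] if context else prev_view)
        start += window - overlap
    return outputs
```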
Experimental Results
4D Scene Reconstruction
SEE4D achieves state-of-the-art performance on the iPhone dataset, outperforming prior methods such as TrajectoryCrafter, ReCamMaster, and Shape-of-Motion in PSNR, SSIM, and LPIPS metrics. The model demonstrates robustness to both rigid and deformable content, and its pose-free conditioning enables generalization beyond the training distribution.
Figure 3: Qualitative comparisons for 4D reconstruction on the iPhone dataset, showing sharper geometry and stable parallax.
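For reference, these reconstruction metrics are typically computed with off-the-shelf libraries as sketched below; this is standard evaluation code, not the paper's evaluation script.

```python
# Standard PSNR/SSIM/LPIPS evaluation for novel-view frames (common libraries;
# not the paper's evaluation code).
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_net = lpips.LPIPS(net="alex")  # perceptual distance

def evaluate_view(pred, gt):
    """pred, gt: float arrays in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_net(to_t(pred), to_t(gt)).item()  # LPIPS expects [-1, 1] NCHW
    return psnr, ssim, lp
```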
Cross-View Video Generation
On the VBench protocol, SEE4D leads on five of six metrics, including subject and background consistency, temporal flicker, and image quality, despite using a multi-view synthesis backbone rather than a dedicated video generation model. The spatial-temporal auto-regressive pipeline is critical for maintaining coherence over long sequences and large viewpoint shifts.
Figure 4: Qualitative comparisons with TrajectoryCrafter for 4D generation, highlighting smooth, occlusion-aware synthesis and seamless transitions.
Ablation Studies
Ablations confirm the necessity of each component:
- Replacing realistic warp synthesis with simpler schemes degrades all metrics.
- Omitting noise-adaptive conditioning or the spatial-temporal transformer backbone results in significant drops in reconstruction fidelity.
- Removing spatial or temporal auto-regression leads to noticeable quality declines, underscoring the importance of both progressive view hops and overlapping diffusion windows.
Downstream Applications
SEE4D enables a range of downstream applications by generating temporally aligned, viewpoint-varying video sequences from a single input:
- Robotics: Augments grasp planning with multi-view object sequences, improving surface geometry estimation and approach direction assessment.
- Autonomous Driving: Synthesizes additional perspectives from dash-cam footage, enhancing visual coverage for perception and tracking.
- Interactive Gaming and Virtual Environments: Transforms single-view gameplay into multi-angle sequences for immersive replay and camera switching.
- Cinematic Post-Production: Enables off-axis re-framing and stabilization from handheld footage, filling occlusions and maintaining motion continuity.
Figure 5: Illustrative examples of SEE4D for robotics, driving, games, and movies, demonstrating broad applicability.
Implications and Future Directions
SEE4D demonstrates that pose-free, trajectory-to-camera 4D generation is feasible and effective, removing the reliance on explicit pose annotations and enabling scalable, robust 4D scene modeling from monocular videos. The integration of depth-warped conditioning, noise-adaptive modulation, and spatiotemporal transformers provides a strong geometry prior and temporal consistency, advancing the state of the art in 4D content creation.
Future research directions include:
- Extending the approach to handle even wider baseline changes and more complex dynamic scenes.
- Integrating with downstream 4D representation learning frameworks for end-to-end VR content pipelines.
- Exploring self-supervised or unsupervised depth estimation to further reduce reliance on synthetic data.
Conclusion
SEE4D introduces a pose-free, trajectory-to-camera framework for 4D scene generation from monocular videos, leveraging a view-conditional inpainting model and a spatiotemporal auto-regressive inference pipeline. The method achieves superior reconstruction accuracy and generative quality, with demonstrated benefits across robotics, autonomous driving, gaming, and cinematic applications. The decoupling of camera control from scene modeling, combined with robust geometric and temporal priors, positions SEE4D as a practical solution for scalable 4D world modeling and immersive content creation.