Wide spatial coverage and stable temporal dynamics without curated multi-view data
Establish how to synthesize high-quality 4D novel-view videos that simultaneously achieve wide spatial coverage across large viewpoint changes and stable temporal dynamics over long sequences, using only single-view monocular inputs and without relying on curated multi-view training data.
References
Achieving wide spatial coverage and stable temporal dynamics without curated multi-view data, therefore, remains open, a gap our pose-free, auto-regressive framework seeks to close.
— SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting
(2510.26796 - Lu et al., 30 Oct 2025) in Section 2.2 (Generative Novel View Synthesis)