An Analysis of "Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model"
The paper "Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model" presents an approach to generating 3D scenes from a single image that addresses key limitations of prior methods. It introduces Scene Splatter, a framework that employs momentum-based video diffusion to improve both the fidelity and the consistency of the generated scene.
Key Contributions
The authors identify a fundamental limitation of existing video generation models: frames generated from a single input image tend to suffer from scene inconsistency and artifacts. The proposed solution, Scene Splatter, uses a momentum-based approach that incorporates both latent- and pixel-level momentum to keep the generated frames consistent while enhancing scene detail.
- Latent-Level Momentum: Noisy samples are constructed from the original latent features and used to guide the denoising process of the video diffusion model (see the sketch after this list). Re-anchoring each reverse diffusion step to the known content in this way preserves scene consistency, preventing existing regions from drifting over the course of denoising.
- Pixel-Level Momentum: For reconstructions that span both known and unknown regions, Scene Splatter applies momentum at the pixel level, combining the video generated with latent momentum and the one generated without it (see the second sketch below). This recovers richer detail in unseen regions while keeping known regions stable.
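To make the latent-level idea concrete, below is a minimal sketch of a momentum-guided denoising loop. It assumes a diffusers-style scheduler and an unconditional UNet; the `momentum` coefficient, the known-region `mask`, and the function name are illustrative choices, not taken from the paper, whose exact update rule may differ.

```python
import torch

def denoise_with_latent_momentum(unet, scheduler, z_anchor, mask,
                                 num_steps=50, momentum=0.7):
    """Momentum-guided reverse diffusion (illustrative sketch).

    z_anchor: latents encoding the known scene content, e.g. the encoding of
              a video rendered from the partial Gaussian scene.
    mask:     1 where content is known, 0 where it must be hallucinated.
    """
    scheduler.set_timesteps(num_steps)
    z = torch.randn_like(z_anchor)  # start from pure noise

    for t in scheduler.timesteps:
        # Re-noise the anchor latents to the current noise level so the
        # blend below happens at a matched point on the diffusion trajectory.
        z_anchor_t = scheduler.add_noise(z_anchor, torch.randn_like(z_anchor), t)

        # Momentum blend: pull the running sample toward the (noised) anchor
        # in known regions; unknown regions are left untouched.
        z = mask * (momentum * z_anchor_t + (1 - momentum) * z) + (1 - mask) * z

        # Standard reverse step (conditioning inputs omitted for brevity).
        eps = unet(z, t).sample
        z = scheduler.step(eps, t, z).prev_sample

    return z
```

The key design choice here is that the anchor latents are re-noised to the current timestep before blending, so the known content is injected at a noise level matching the running sample rather than as a clean signal.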
Together, these two mechanisms form a cascaded momentum framework that improves both detail and scene consistency in the novel views produced by the video diffusion model, yielding high-fidelity outputs that are validated through qualitative and quantitative comparisons with existing methods.
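The pixel-level half of the cascade can be pictured as a per-pixel blend of the two decoded videos. The sketch below is a simplified reading of that idea; the `visibility` weight (derived, for instance, from rendering the partial Gaussian scene) and the function name are assumptions, not the authors' implementation.

```python
import torch

def pixel_momentum_blend(video_with_m, video_without_m, visibility):
    """Per-pixel blend of the two decoded videos (illustrative sketch).

    video_with_m:    frames denoised WITH latent momentum -- consistent with
                     the known scene, but conservative in unseen regions.
    video_without_m: frames denoised WITHOUT momentum -- freer hallucination
                     of unseen regions, but less consistent overall.
    visibility:      per-pixel weight in [0, 1]; 1 where the pixel is covered
                     by known scene content, 0 where it is entirely novel.
    """
    w = visibility.clamp(0.0, 1.0)
    # Trust the momentum-guided video where content is known, and fall back
    # to the unconstrained generation in unseen regions.
    return w * video_with_m + (1.0 - w) * video_without_m
```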
Experimental Evaluation
The paper demonstrates the robustness of Scene Splatter through extensive experiments, reporting superior performance in generating high-fidelity, consistent 3D scenes from single images compared against both regression-based methods (e.g., Flash3D) and generation-based methods (e.g., CogVideoX, ViewCrafter). On benchmarks such as RealEstate10K, the method achieves higher PSNR and SSIM and lower LPIPS than the baselines, with the advantage most pronounced when the camera view range is large, supporting its claim to maintain scene consistency and fidelity across varying scene complexity.
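For context, these metrics follow the standard novel-view evaluation protocol: PSNR and SSIM are higher-is-better reconstruction measures, while LPIPS is a learned perceptual distance where lower is better. A minimal sketch of how such scores are typically computed (using skimage and the lpips package, not the paper's own evaluation code):

```python
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # learned perceptual metric; lower is better

def evaluate_view(pred, gt):
    """pred, gt: HxWx3 float32 arrays in [0, 1] (rendered vs. ground truth)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)                # higher is better
    ssim = structural_similarity(gt, pred, data_range=1.0, channel_axis=2)  # higher is better
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None] * 2.0 - 1.0
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp
```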
Practical and Theoretical Implications
The practical implications of this research are substantial, especially for virtual reality, augmented reality, and robotics applications that require accurate 3D reconstructions from limited visual data. By generating consistent, high-quality 3D scenes from a single 2D image, Scene Splatter can substantially streamline scene generation pipelines in these domains.
From a theoretical standpoint, the paper introduces a novel combination of latent- and pixel-level momentum within video diffusion models, paving the way for similar strategies in other areas of generative modeling. Its iterative optimization process and integration of Gaussian representations offer a new perspective on scene reconstruction, potentially inspiring extensions from static 3D scenes to dynamic scene generation.
Future Prospects
While the Scene Splatter framework advances the field of 3D scene generation, opportunities remain for further optimization, particularly regarding the computational demands of video diffusion models and their scalability to larger datasets and more complex scenes. Additionally, expanding this approach to accommodate dynamic scenes could open new avenues for realistic animation and real-time 3D environment simulation.
Overall, the paper underscores the potential of momentum-based video diffusion models for overcoming traditional limitations in 3D scene generation, suggesting a promising trajectory for future developments in AI-driven 3D content synthesis.