FreeVS: Generative View Synthesis on Free Driving Trajectory (2410.18079v1)

Published 23 Oct 2024 in cs.CV

Abstract: Existing reconstruction-based novel view synthesis methods for driving scenes focus on synthesizing camera views along the recorded trajectory of the ego vehicle. Their image rendering performance degrades severely on viewpoints falling outside the recorded trajectory, where camera rays are untrained. We propose FreeVS, a novel fully generative approach that can synthesize camera views on free new trajectories in real driving scenes. To control the generation results to be 3D consistent with the real scenes and accurate in viewpoint pose, we propose the pseudo-image representation of view priors to control the generation process. Viewpoint transformation simulation is applied on pseudo-images to simulate camera movement in each direction. Once trained, FreeVS can be applied to any validation sequence without a reconstruction process and can synthesize views on novel trajectories. Moreover, we propose two new challenging benchmarks tailored to driving scenes, novel camera synthesis and novel trajectory synthesis, emphasizing the freedom of viewpoints. Given that no ground truth images are available on novel trajectories, we also propose to evaluate the consistency of images synthesized on novel trajectories with 3D perception models. Experiments on the Waymo Open Dataset show that FreeVS achieves strong image synthesis performance on both the recorded trajectories and novel trajectories. Project Page: https://freevs24.github.io/

Authors (5)
  1. Qitai Wang
  2. Lue Fan
  3. Yuqi Wang
  4. Yuntao Chen
  5. Zhaoxiang Zhang

Summary

Overview of FreeVS: Generative View Synthesis on Free Driving Trajectories

The paper introduces FreeVS, a fully generative approach to novel view synthesis in dynamic driving scenes. Reconstruction-based methods are largely limited to rendering views along the recorded ego trajectory and degrade significantly when extrapolating to novel viewpoints. FreeVS removes this limitation with a generative paradigm that produces high-fidelity camera views for arbitrary trajectories without an explicit 3D reconstruction step.

Methodology

FreeVS uses a pseudo-image representation of view priors to encode the 3D scene information needed to generate realistic views. This representation addresses two common difficulties at once: keeping generated views geometrically consistent with the real 3D scene and controlling the camera pose precisely. By applying viewpoint transformations to the pseudo-images to simulate camera movement in each direction, FreeVS can synthesize views for trajectories outside the recorded path.
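
How the pseudo-image and the simulated viewpoint shift might be constructed is sketched below. This is a minimal illustration, assuming world-frame LiDAR points already colored from the recorded camera images, pinhole intrinsics, and a 4x4 world-to-camera pose for the target viewpoint; all function and parameter names are hypothetical rather than taken from the paper's code.

```python
import numpy as np

def render_pseudo_image(points_xyz, colors, K, T_world_to_cam, hw=(640, 960)):
    """Project colored LiDAR points into a (possibly shifted) target camera view.

    points_xyz: (N, 3) world-frame points; colors: (N, 3) uint8 RGB values.
    K: (3, 3) pinhole intrinsics; T_world_to_cam: (4, 4) target-view extrinsics.
    Returns an H x W x 3 sparse pseudo-image (black where no point projects).
    """
    H, W = hw
    pts_h = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)
    cam = (T_world_to_cam @ pts_h.T).T[:, :3]      # world frame -> camera frame
    keep = cam[:, 2] > 0.1                         # drop points behind the camera
    cam, rgb = cam[keep], colors[keep]

    uvw = (K @ cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]                  # perspective divide
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, cam, rgb = u[valid], v[valid], cam[valid], rgb[valid]

    # Crude z-buffer: draw far points first so nearer points overwrite them.
    order = np.argsort(-cam[:, 2])
    pseudo = np.zeros((H, W, 3), dtype=np.uint8)
    pseudo[v[order], u[order]] = rgb[order]
    return pseudo

def shift_viewpoint(T_world_to_cam, lateral_m=2.0):
    """Simulate a lateral camera shift by composing a translation (expressed in
    the camera frame) onto the recorded extrinsics; rendering the pseudo-image
    with this shifted pose mimics a viewpoint off the recorded trajectory."""
    offset = np.eye(4)
    offset[0, 3] = -lateral_m                      # camera moves +x by lateral_m
    return offset @ T_world_to_cam
```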

Pseudo-images are constructed by projecting LiDAR-derived colored 3D point clouds into the desired target view. FreeVS then employs a diffusion model conditioned on these pseudo-images to synthesize the corresponding camera view from pure noise. Once trained, the model can be applied to new sequences without any per-scene reconstruction; since no ground-truth images exist on unrecorded trajectories, the resulting views are instead evaluated for consistency with 3D perception models.
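
A hedged sketch of the inference step follows: a denoising loop that starts from Gaussian noise and conditions every step on the pseudo-image. The `denoiser` and `scheduler` interfaces are assumptions chosen to mirror common diffusion toolkits, not the paper's implementation.

```python
import torch

@torch.no_grad()
def synthesize_view(denoiser, scheduler, pseudo_image, shape=(1, 4, 80, 120)):
    """Conditional diffusion sampling sketch: start from pure noise and denoise
    step by step, feeding the pseudo-image as conditioning so the output stays
    consistent with the scene geometry and the requested camera pose.
    `denoiser` and `scheduler` are hypothetical interfaces, not the paper's code.
    """
    x = torch.randn(shape)                         # pure Gaussian noise
    for t in scheduler.timesteps:                  # reverse diffusion, high t -> low t
        eps = denoiser(x, t, cond=pseudo_image)    # predict noise given the view prior
        x = scheduler.step(eps, t, x)              # one denoising update
    return x                                       # decode with a VAE if latent-space
```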

Evaluation and Results

The paper assesses FreeVS across two novel benchmarks designed for driving scenes:

  1. Novel Camera Synthesis: This benchmark evaluates FreeVS's ability to synthesize unseen camera views by withholding the images of specific cameras during training and requiring those views to be generated at test time.
  2. Novel Trajectory Synthesis: Here, performance is evaluated on new driving paths where no ground truth exists. Synthesized views are instead validated by checking that existing 3D perception models still produce consistent detections on them (a toy version of such a check is sketched after this list).
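
For illustration only, a toy version of such a perception-based consistency check might look as follows; the paper's actual metric relies on 3D perception models and is not reproduced here, and every name below is hypothetical.

```python
import numpy as np

def detection_consistency(centers_synth, centers_ref, match_dist=2.0):
    """Toy score: fraction of reference objects a 3D detector still recovers
    (within match_dist metres) when run on the synthesized views. Illustrative
    only; the paper's actual perception-based metric is not reproduced here.
    """
    if len(centers_ref) == 0:
        return 1.0
    matched = 0
    for ref in centers_ref:
        if len(centers_synth) and np.linalg.norm(centers_synth - ref, axis=1).min() < match_dist:
            matched += 1
    return matched / len(centers_ref)
```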

FreeVS outperforms state-of-the-art reconstruction-based methods such as 3D Gaussian Splatting and street-scene NeRF variants, particularly in scenarios demanding high fidelity and geometric reliability. Because it requires no per-scene reconstruction, it can be applied directly to new sequences, and it remains robust to diverse camera movements.

Implications and Future Directions

FreeVS's ability to generate consistent, high-quality views from new viewpoints makes it a useful building block for autonomous driving simulation and embodied AI systems. By removing the need for per-scene reconstruction, it offers a scalable solution that can be adapted to various simulation environments.

Future research might refine how pseudo-images are combined with other sensor modalities, broaden the range of applicable scenarios, or further improve computational efficiency. The methodology is a step toward driving simulators that can render realistic views along arbitrary trajectories.

In conclusion, this work addresses the fidelity and viewpoint-flexibility demands of novel view synthesis for driving scenes, with promising implications for autonomous driving and robotics applications.