3D StreetUnveiler with Semantic-Aware 2DGS (2405.18416v2)

Published 28 May 2024 in cs.CV

Abstract: Unveiling an empty street from crowded observations captured by in-car cameras is crucial for autonomous driving. However, removing all temporarily static objects, such as stopped vehicles and standing pedestrians, presents a significant challenge. Unlike object-centric 3D inpainting, which relies on thorough observation of a small scene, street scenes involve long trajectories that differ from previous 3D inpainting tasks. The camera-centric, moving environment of the captured videos further complicates the task, because each object is observed from a limited range of angles for a limited time. To address these obstacles, we introduce StreetUnveiler, which learns a 3D representation of the empty street from crowded observations. Our representation is based on hard-label semantic 2D Gaussian Splatting (2DGS) for its scalability and its ability to identify the Gaussians to be removed. We inpaint the rendered images after removing unwanted Gaussians to provide pseudo-labels, then re-optimize the 2DGS. Given the temporally continuous camera movement, we divide the empty street scene into observed, partially observed, and unobserved regions, which we propose to locate through a rendered alpha map. This decomposition minimizes the regions that need to be inpainted. To enhance the temporal consistency of the inpainting, we introduce a novel time-reversal framework that inpaints frames in reverse order, using later frames as references for earlier frames to fully exploit the long-trajectory observations. Our experiments on a street scene dataset successfully reconstruct a 3D representation of the empty street, from which a mesh representation can be extracted for further applications. The project page and more visualizations can be found at: https://streetunveiler.github.io


Summary

  • The paper introduces a novel framework that uses semantic-aware 2D Gaussian Splatting and time-reversal inpainting to reconstruct empty street scenes for autonomous driving.
  • The method achieves lower LPIPS and competitive FID scores relative to state-of-the-art 2D and 3D inpainting baselines on the Waymo Open Dataset.
  • The approach ensures temporal consistency across long video trajectories, mitigating occlusions from transient objects and enhancing 3D scene clarity.

Reconstructing Empty Street Scenes for Autonomous Driving: Insights from "StreetUnveiler"

Introduction

The paper "StreetUnveiler" presents a methodological framework to reconstruct empty street scenes from in-car camera videos, addressing the need for autonomous vehicle systems to operate in a clear and unobstructed digital environment. Autonomous driving relies heavily on accurate 3D reconstructions of street scenes, but the presence of temporary static objects, such as parked cars and pedestrians, complicates this task. The proposed method involves novel approaches in 3D representation and inpainting to create a clean street scene, free from transient occlusions.

Methodology

2D Gaussian Splatting (2DGS)

The 2D Gaussian Splatting (2DGS) technique forms the cornerstone of the presented framework. Unlike conventional object-centric 3D inpainting methods, which work well within small and thoroughly observed environments, street scenes encompass long trajectories and limited object observation periods. The paper leverages 2DGS because of its scalability and editability, which are crucial for managing the extensive and dynamic nature of street data.

2D Gaussian Splatting represents scene geometry as a set of oriented planar Gaussian disks (surfels) embedded in 3D space; rendering projects these disks onto the image plane and alpha-blends them into coherent views. This formulation supports precise, efficient rendering as well as region-specific editing, since each primitive is fully described by a small set of parameters: its center position, two tangential vectors spanning the disk, and per-axis scaling factors.
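
To make the parameterization concrete, here is a minimal, illustrative sketch of such a primitive in Python. This is not the authors' code; the class and field names (e.g. `Splat2D`, `semantic_id`) are assumptions for exposition.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Splat2D:
    center: np.ndarray    # (3,) disk center p in world space
    t_u: np.ndarray       # (3,) first tangential direction of the disk plane
    t_v: np.ndarray       # (3,) second tangential direction
    scale: np.ndarray     # (2,) per-axis scaling factors (s_u, s_v)
    opacity: float        # alpha used when blending splats front to back
    color: np.ndarray     # (3,) RGB; view-dependent color omitted for brevity
    semantic_id: int      # hard, non-trainable semantic label

    def point_on_disk(self, u: float, v: float) -> np.ndarray:
        """Map local disk coordinates (u, v) to a 3D point on the splat."""
        return (self.center
                + self.scale[0] * u * self.t_u
                + self.scale[1] * v * self.t_v)

    def weight(self, u: float, v: float) -> float:
        """Standard Gaussian falloff in the local (u, v) frame."""
        return float(np.exp(-0.5 * (u * u + v * v)))
```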

Semantic Decomposition and Inpainting Mask Generation

Critical to removing occlusions is accurately distinguishing between observed, partially observed, and completely unobserved regions, which is achieved through semantic guidance and rendered alpha maps. The process begins by associating each 2D Gaussian with a non-trainable, hard semantic label, which makes it straightforward to gather all Gaussians of a given category and remove them in one step.
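
A minimal sketch of what this hard-label bookkeeping buys (the array layout and label ids below are hypothetical, not the paper's actual mapping):

```python
import numpy as np

REMOVABLE = {13, 14}  # e.g. "vehicle" and "pedestrian"; ids are hypothetical

def keep_mask(semantic_ids: np.ndarray) -> np.ndarray:
    """Boolean mask over all splats: True for Gaussians to keep.

    Because each splat stores one fixed label, removing a whole semantic
    category is plain boolean indexing, not per-pixel reasoning."""
    return ~np.isin(semantic_ids, list(REMOVABLE))

# Usage: splat_params = splat_params[keep_mask(semantic_ids)]
```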

The rendered alpha map identifies completely unobserved regions as those with low remaining opacity after object removal. This yields an inpainting mask restricted to only these regions, reducing the area the inpainter must hallucinate and improving the quality of the filled content.
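
A hedged sketch of this masking step, assuming the renderer returns an (H, W) accumulated-alpha map in [0, 1]; the threshold value is an assumption:

```python
import numpy as np

def inpainting_mask(alpha_after_removal: np.ndarray,
                    tau: float = 0.5) -> np.ndarray:
    """alpha_after_removal: (H, W) accumulated opacity rendered after the
    unwanted Gaussians are deleted. Pixels with little remaining opacity
    were only ever covered by the removed objects, i.e. they are
    unobserved and must be inpainted."""
    return alpha_after_removal < tau
```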

Time-Reversal Inpainting Framework

Maintaining temporal consistency across frames is particularly challenging in long-trajectory videos. The paper introduces a time-reversal inpainting framework in which frames are inpainted in reverse temporal order. Because the camera moves forward, later frames observe the revealed background most closely and in the greatest detail; using each already-inpainted later frame as the reference for earlier frames therefore propagates reliable content backwards through the sequence, minimizing discrepancies across the video.

The choice of a reference-based inpainting model, specifically the diffusion-based LeftRefill, further stabilizes the process. Conditioning on a reference view lets the higher-resolution observations (later, closer frames) guide the lower-resolution ones (earlier, farther frames), and the cross-view pixel-matching capability of diffusion models keeps the inpainted regions consistent with the surrounding scene when viewed from different angles.
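
Putting the two ideas together, a minimal sketch of the reverse-order loop (not the authors' implementation; `inpaint` and `inpaint_with_reference` are hypothetical stand-ins for the actual models):

```python
from typing import Callable, List
import numpy as np

def time_reversal_inpaint(
    frames: List[np.ndarray],          # rendered frames, oldest first
    masks: List[np.ndarray],           # per-frame masks of regions to fill
    inpaint: Callable,                 # plain inpainter for the seed frame
    inpaint_with_reference: Callable,  # reference-conditioned inpainter
) -> List[np.ndarray]:
    results: List[np.ndarray] = [None] * len(frames)
    # Seed with the last frame: in a forward-driving video it observes the
    # revealed background most closely, so it needs the least hallucination.
    results[-1] = inpaint(frames[-1], masks[-1])
    for t in range(len(frames) - 2, -1, -1):  # walk backwards in time
        # The already-inpainted later frame guides the earlier one.
        results[t] = inpaint_with_reference(frames[t], masks[t], results[t + 1])
    return results
```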

Experimental Results

The efficacy of StreetUnveiler was validated on real-world street scenes from the Waymo Open Dataset. Object-removal and reconstruction quality were assessed with standard perceptual metrics, including LPIPS and FID. StreetUnveiler outperformed state-of-the-art 2D and 3D inpainting baselines, achieving lower LPIPS and competitive FID values.
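
For readers who want to reproduce such measurements, here is a small sketch using publicly available metric implementations; the paper's exact evaluation protocol may differ.

```python
import torch
import lpips                                                  # pip install lpips
from torchmetrics.image.fid import FrechetInceptionDistance  # torchmetrics[image]

lpips_fn = lpips.LPIPS(net="alex")            # expects NCHW floats in [-1, 1]
fid = FrechetInceptionDistance(feature=2048)  # expects NCHW uint8 images

def lpips_score(pred: torch.Tensor, target: torch.Tensor) -> float:
    """Mean LPIPS over a batch of image pairs (lower is better)."""
    with torch.no_grad():
        return lpips_fn(pred, target).mean().item()

def fid_score(real_u8: torch.Tensor, fake_u8: torch.Tensor) -> float:
    """FID between batches of real and generated images (lower is better)."""
    fid.update(real_u8, real=True)
    fid.update(fake_u8, real=False)
    return fid.compute().item()
```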

Qualitative analysis showed that the proposed method produces clearer and more consistent inpainting across frames, whereas alternative methods suffer from significant blurring and inconsistency. The time-reversal inpainting scheme and the 2DGS representation were the pivotal elements behind this improvement.

Implications and Future Work

The successful implementation of StreetUnveiler holds substantial practical and theoretical implications. From a practical standpoint, the ability to reconstruct empty street scenes can streamline the development and deployment of autonomous driving systems, enhancing their reliability by removing transient occlusions that could interfere with navigation and sensor systems.

Theoretically, this work broadens the scope of 3D scene representation and inpainting frameworks, particularly in how they handle large-scale, dynamic environments with limited observation data. Future avenues may explore the integration of this methodology with real-time processing capabilities, further optimizing the 3D modeling pipeline for use in fast-paced and variable settings such as urban traffic.

Additionally, extending the framework to incorporate more sophisticated learning mechanisms for semantic labeling, perhaps through unsupervised or semi-supervised learning paradigms, could enhance its adaptability and accuracy. Subsequent research could also investigate more robust handling of dynamic, moving objects in addition to static occlusions for a more holistic enhancement of autonomous driving environments.

Conclusion

StreetUnveiler presents a significant advancement in the reconstruction of empty street scenes by introducing innovative uses of 2D Gaussian Splatting and a time-reversal inpainting framework. The method surmounts the challenges posed by long trajectories and limited observation periods inherent to in-car camera videos, providing a robust solution for environments crucial to autonomous driving. Future research and development in this domain can build upon these findings to further refine and expand the capabilities of autonomous driving systems.
