HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting (2403.12722v1)
Abstract: Holistic understanding of urban scenes based on RGB images is a challenging yet important problem. It encompasses understanding both the geometry and appearance to enable novel view synthesis, parsing semantic labels, and tracking moving objects. Despite considerable progress, existing approaches often focus on specific aspects of this task and require additional inputs such as LiDAR scans or manually annotated 3D bounding boxes. In this paper, we introduce a novel pipeline that utilizes 3D Gaussian Splatting for holistic urban scene understanding. Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians, where moving object poses are regularized via physical constraints. Our approach offers the ability to render new viewpoints in real time, yield 2D and 3D semantic information with high accuracy, and reconstruct dynamic scenes, even in scenarios where 3D bounding box detections are highly noisy. Experimental results on KITTI, KITTI-360, and Virtual KITTI 2 demonstrate the effectiveness of our approach.
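The abstract does not spell out how static and dynamic Gaussians are combined or which physical constraints are used. The following is a minimal illustrative sketch, not the authors' implementation. It assumes each moving object keeps its Gaussian centers in an object-local frame and is placed in the world by a per-frame rigid pose (position plus yaw), and it uses a unicycle-style motion step as one plausible example of a physical constraint. All function and parameter names here are hypothetical.

```python
import numpy as np

def yaw_rotation(yaw):
    """Rotation about the vertical (z) axis by `yaw` radians."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def compose_scene(static_means, object_means_local, position, yaw):
    """Return world-frame Gaussian centers for one frame.

    Only centers are handled here; a full Gaussian would also need its
    covariance rotated (R @ cov @ R.T). Assumed layout, not from the paper.
    """
    R = yaw_rotation(yaw)
    # Row-vector form of p_world = R @ p_local + position.
    object_means_world = object_means_local @ R.T + position
    return np.concatenate([static_means, object_means_world], axis=0)

def unicycle_residual(positions, yaws, speeds, dt):
    """Deviation of consecutive ground-plane positions from a unicycle step.

    positions: (T, 2), yaws: (T,), speeds: (T,). The predicted next position
    is the current one advanced by `speed * dt` along the current heading.
    """
    heading = np.stack([np.cos(yaws[:-1]), np.sin(yaws[:-1])], axis=1)
    predicted = positions[:-1] + speeds[:-1, None] * dt * heading
    return positions[1:] - predicted

def pose_regularizer(positions, yaws, speeds, dt):
    """Mean squared unicycle residual, usable as an extra loss term."""
    return float(np.mean(unicycle_residual(positions, yaws, speeds, dt) ** 2))
```

In a joint optimization, a term like `pose_regularizer` would be added to the rendering losses so that noisy per-frame object poses are pulled toward a physically plausible trajectory.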
- Building Rome in a day. Communications of the ACM, 54(10):105–112, 2011.
- Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields, 2021. arXiv:2103.13415 [cs].
- Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields, 2022. arXiv:2111.12077 [cs].
- Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields, 2023. arXiv:2304.06706 [cs].
- InverseForm: A Loss Function for Structured Boundary-Aware Segmentation, 2021. arXiv:2104.02745 [cs].
- Virtual KITTI 2, 2020. arXiv:2001.10773 [cs, eess].
- Category Level Object Pose Estimation via Neural Analysis-by-Synthesis, 2020. arXiv:2008.08145 [cs].
- Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation, 2020. arXiv:1911.10194 [cs].
- Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans, 2021.
- Panoptic NeRF: 3D-to-2D Label Transfer for Panoptic Urban Scene Segmentation, 2022. arXiv:2203.15224 [cs].
- Piecewise planar and non-planar stereo for urban scene reconstruction. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1418–1425, San Francisco, CA, USA, 2010. IEEE.
- Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361, Providence, RI, 2012. IEEE.
- Bayes’ Rays: Uncertainty Quantification for Neural Radiance Fields, 2023. arXiv:2309.03185 [cs].
- StreetSurf: Extending Multi-view Implicit Surface Reconstruction to Street Views, 2023. arXiv:2306.04988 [cs].
- Monocular Quasi-Dense 3D Object Tracking, 2021. arXiv:2103.07351 [cs].
- 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.
- Panoptic Neural Fields: A Semantic Object-Aware Neural Scene Representation, 2022. arXiv:2205.04334 [cs].
- A Hybrid Multiview Stereo Algorithm for Modeling Urban Scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):5–17, 2013.
- KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2022.
- Urban Radiance Field Representation with Deformable Neural Mesh Primitives, 2023. arXiv:2307.10776 [cs].
- Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis, 2023. arXiv:2308.09713 [cs].
- NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections, 2021. arXiv:2008.02268 [cs].
- NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, 2020. arXiv:2003.08934 [cs].
- Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics, 41(4):1–15, 2022.
- Neural Scene Graphs for Dynamic Scenes, 2021. arXiv:2011.10379 [cs].
- iDisc: Internal Discretization for Monocular Depth Estimation, 2023. arXiv:2304.06334 [cs].
- PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
- Urban Radiance Fields, 2021. arXiv:2111.14643 [cs].
- Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5565–5574, New Orleans, LA, USA, 2022. IEEE.
- Structure-from-Motion Revisited. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, Las Vegas, NV, USA, 2016. IEEE.
- Block-NeRF: Scalable Large Scene Neural View Synthesis, 2022. arXiv:2202.05263 [cs].
- Nerfstudio: A Modular Framework for Neural Radiance Field Development. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Proceedings, pages 1–12, 2023. arXiv:2302.04264 [cs].
- Hierarchical Multi-Scale Attention for Semantic Segmentation, 2020. arXiv:2005.10821 [cs].
- SUDS: Scalable Urban Dynamic Scenes, 2023. arXiv:2303.14536 [cs].
- NeSF: Neural Semantic Fields for Generalizable Semantic Segmentation of 3D Scenes, 2021. arXiv:2111.13260 [cs].
- F²-NeRF: Fast Neural Radiance Field Training with Free Camera Trajectories, 2023. arXiv:2303.15951 [cs].
- Behind the Scenes: Density Fields for Single View Reconstruction, 2023. arXiv:2301.07668 [cs].
- MARS: An Instance-aware, Modular and Realistic Simulator for Autonomous Driving, 2023. arXiv:2307.15058 [cs].
- Unifying Flow, Stereo and Depth Estimation, 2023. arXiv:2211.05783 [cs].
- 4K4D: Real-Time 4D View Synthesis at 4K Resolution, 2023. arXiv:2310.11448 [cs].
- Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 13759–13768, Montreal, QC, Canada, 2021. IEEE.
- EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision, 2023. arXiv:2311.02077 [cs].
- UniSim: A Neural Closed-Loop Sensor Simulator.
- Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction, 2023. arXiv:2309.13101 [cs].
- Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting, 2023. arXiv:2310.10642 [cs].
- MonoSDF: Exploring Monocular Geometric Cues for Neural Implicit Surface Reconstruction, 2022. arXiv:2206.00665 [cs].
- The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595, Salt Lake City, UT, 2018. IEEE.
- Nerflets: Local Radiance Fields for Efficient Structure-Aware 3D Scene Representation from 2D Supervision, 2023. arXiv:2303.03361 [cs].
- In-Place Scene Labelling and Understanding with Implicit Scene Representation, 2021. arXiv:2103.15875 [cs].
- Drivable 3D Gaussian Avatars, 2023. arXiv:2311.08581 [cs].
- EWA splatting. IEEE Transactions on Visualization and Computer Graphics, 8(3):223–238, 2002.