- The paper introduces a self-supervised method that extends 3D Gaussians to 4D to capture both dynamic and static scene elements.
- It utilizes a Multi-resolution Hexplane Encoder and a Multi-head Gaussian Decoder to efficiently decode spatial-temporal features.
- The approach outperforms benchmarks on the Waymo Open Dataset, including scenes curated by StreetGaussian, achieving high PSNR and SSIM scores while reducing the need for costly annotations.
Self-Supervised Street Scene Reconstruction Using 3D Gaussian Splatting
Introduction
The paper presents Self-Supervised Street Gaussian (S3Gaussian), a novel method for photorealistic 3D reconstruction of street scenes. Addressing a critical need of autonomous driving simulators, the work decomposes dynamic and static scene elements without the costly 3D annotations that conventional methods require. The authors position the approach against Neural Radiance Fields (NeRF), adopting 3D Gaussian Splatting (3DGS) instead for its faster rendering and explicit, precise scene representation.
Methodology
The proposed S3Gaussian approach introduces a robust solution for decomposing dynamic and static elements in street scenes. It leverages 3D Gaussian representations, enhancing these with a spatial-temporal field network that employs a Multi-resolution Hexplane Structure Encoder and a Multi-head Gaussian Decoder.
The methodology consists of the following core elements:
- 3D to 4D Gaussian Extension: The paper extends 3D Gaussians to 4D, so each Gaussian carries a position, covariance, opacity value, and spherical harmonic color coefficients as canonical attributes. Dynamic changes over time are captured as deformation offsets applied to these canonical attributes.
- Hexplane Structure Encoder: This encodes the 4D grid into multi-resolution feature planes, effectively aggregating spatial and temporal information.
- Multi-head Gaussian Decoder: It decodes the hexplane features into deformation offsets for the Gaussians' positions, covariances, and spherical harmonic coefficients, along with semantic features, allowing the scene representation to change over time.
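The encode-decode pipeline above can be illustrated with a minimal, single-resolution NumPy sketch: six feature planes are queried by bilinear interpolation and fused multiplicatively, and toy linear heads stand in for the decoder's MLP heads. All dimensions, head names, and the fusion rule are illustrative assumptions, not the paper's implementation, which uses multiple resolutions and learned networks.

```python
import numpy as np

def bilerp(plane, u, v):
    """Bilinearly interpolate an (H, W, C) feature plane at normalized (u, v) in [0, 1]."""
    H, W, _ = plane.shape
    x, y = u * (W - 1), v * (H - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[y0, x0] + wx * (1 - wy) * plane[y0, x1]
            + (1 - wx) * wy * plane[y1, x0] + wx * wy * plane[y1, x1])

def hexplane_encode(planes, point):
    """Query six planes -- (x,y), (x,z), (y,z), (x,t), (y,t), (z,t) -- at a
    normalized 4D point and fuse their features by element-wise product."""
    x, y, z, t = point
    pairs = [(x, y), (x, z), (y, z), (x, t), (y, t), (z, t)]
    feat = np.ones(planes[0].shape[-1])
    for plane, (u, v) in zip(planes, pairs):
        feat = feat * bilerp(plane, u, v)
    return feat

rng = np.random.default_rng(0)
feat_dim = 8
planes = [rng.standard_normal((16, 16, feat_dim)) for _ in range(6)]  # one resolution level

# Multi-head decode: toy linear heads (standing in for small MLPs) turn the shared
# feature into a position offset, a covariance offset, and SH color deltas.
heads = {
    "pos_offset": rng.standard_normal((feat_dim, 3)) * 0.01,
    "cov_offset": rng.standard_normal((feat_dim, 7)) * 0.01,  # 3 scales + 4 quaternion
    "sh_delta": rng.standard_normal((feat_dim, 48)) * 0.01,   # 16 SH coeffs x RGB
}

feat = hexplane_encode(planes, (0.3, 0.7, 0.5, 0.1))  # one Gaussian at time t = 0.1
offsets = {name: feat @ W for name, W in heads.items()}
print({k: v.shape for k, v in offsets.items()})
# {'pos_offset': (3,), 'cov_offset': (7,), 'sh_delta': (48,)}
```

In practice the spatial-temporal query and decoding run batched over all Gaussians on the GPU; the per-point loop here only makes the geometry of the six-plane factorization explicit.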
The approach employs a self-supervised optimization process, initializing the Gaussians from LiDAR priors and refining their attributes with multiple loss functions applied to the rendered scene.
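A minimal sketch of such a self-supervised objective, assuming an L1 photometric term plus a depth term masked to sparse LiDAR returns (the specific terms and weight are illustrative, not the paper's exact loss mix):

```python
import numpy as np

def reconstruction_loss(rgb_pred, rgb_gt, depth_pred, lidar_depth, lam_depth=0.1):
    """Illustrative self-supervised loss: an L1 photometric term on the rendered
    image plus an L1 depth term applied only where sparse LiDAR returns exist
    (encoded here as depth > 0)."""
    photo = np.abs(rgb_pred - rgb_gt).mean()
    mask = lidar_depth > 0  # sparse LiDAR coverage
    depth = np.abs(depth_pred[mask] - lidar_depth[mask]).mean() if mask.any() else 0.0
    return photo + lam_depth * depth

# Toy 4x4 frame: perfect color, 0.5 m depth error on two LiDAR-hit pixels.
rgb = np.zeros((4, 4, 3))
depth_gt = np.zeros((4, 4))
depth_gt[0, 0], depth_gt[1, 2] = 10.0, 12.0  # sparse LiDAR returns
depth_pred = depth_gt + 0.5
loss = reconstruction_loss(rgb, rgb, depth_pred, depth_gt)
print(round(loss, 3))  # 0.05: photometric term is 0, depth term is 0.1 * 0.5
```

Because no object boxes or segmentation labels enter the loss, every term is computable from raw sensor data alone, which is what makes the optimization self-supervised.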
Results
The S3Gaussian method was rigorously tested on the Waymo Open Dataset, particularly on its Dynamic32 (D32) and Static32 (S32) subsets, as well as on scenes curated by StreetGaussian. Quantitative metrics such as PSNR, SSIM, and LPIPS were used for evaluation.
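For reference, PSNR follows the standard definition sketched below; this is a generic metric implementation, not the paper's evaluation code (SSIM and LPIPS are typically computed with library implementations such as scikit-image and the lpips package).

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# A uniform +0.01 perturbation on a [0, 1] image lands near 40 dB.
img = np.linspace(0.0, 1.0, 48).reshape(4, 4, 3)
noisy = np.clip(img + 0.01, 0.0, 1.0)
print(round(psnr(noisy, img), 1))  # ~40.1 dB
```

Higher PSNR and SSIM (closer to 1.0) are better, while lower LPIPS is better, so the scores reported below should be read accordingly.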
Comparison with State-of-the-Art
On the Waymo-NOTR dataset, the results were:
- PSNR for scene reconstruction: Outperformed benchmarks with a score of 31.35 on D32 and 30.73 on S32.
- SSIM for scene reconstruction: Achieved 0.911 on D32, surpassing other methods.
- PSNR for novel view synthesis (NVS): Scored 27.44 on D32, indicating superior rendering quality.
On the scenes curated by StreetGaussian, S3Gaussian recorded a PSNR of 34.61 and an SSIM of 0.950, nearly matching state-of-the-art results even without explicit supervision.
Qualitative Analysis
The qualitative comparison demonstrated S3Gaussian's strength in accurate reconstruction and novel view synthesis. Improvements in rendering detail were evident, such as the elimination of ghosting and blurriness in dynamic scenes. The method also faithfully rendered fine temporal details, such as changing traffic-light colors, setting it apart from existing techniques.
Implications and Future Work
The development of S3Gaussian lays the groundwork for more realistic real-world simulators, which are critical for advancing autonomous driving technologies. By eliminating the reliance on costly data annotations, the method significantly broadens the applicability of 3D reconstruction across dynamic environments.
Future research could explore the following dimensions:
- Improving Hexplane encoding to manage even more complex and extensive geographic data.
- Enhancing the multi-head Gaussian decoder for even higher fidelity and real-time processing capabilities.
- Extending the framework to incorporate additional sensor data types, such as radar or thermal imaging, thus mimicking a more comprehensive sensory suite used in modern autonomous systems.
Conclusion
The S3Gaussian framework proposes a groundbreaking self-supervised method for reconstructing dynamic urban scenes that performs on par with, if not better than, methods requiring extensive annotations. This advancement in 3D Gaussian Splatting is pivotal for autonomous driving research, enhancing the realism and accuracy of real-world driving simulators.
By addressing the challenge of annotating dynamic 3D scenes, S3Gaussian stands as a significant contribution to the field, anticipating further developments that integrate multi-modal sensory data for even richer scene reconstruction and AI-driven decision-making.