- The paper introduces the Lift-Splat architecture, which lifts 2D images into 3D and collapses them into bird's-eye-view (BEV) maps for autonomous-vehicle perception.
- It uses a three-stage pipeline (Lift, Splat, Shoot) to fuse data from arbitrary camera rigs and to plan end to end, without relying on LiDAR.
- Empirical results on the nuScenes and Lyft datasets demonstrate robust segmentation performance and resilience to sensor dropouts.
Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D
Overview
The paper "Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D" by Jonah Philion and Sanja Fidler introduces a novel end-to-end architecture tailored for perception tasks in autonomous vehicles. This architecture, named "Lift-Splat," extracts bird's-eye-view (BEV) semantic representations directly from multi-view camera inputs, efficiently fusing these perspectives into a cohesive three-dimensional (3D) understanding of the surrounding environment. The methodology stands out in that it handles arbitrary camera rigs and does not necessitate depth information from sensors such as LiDAR, making it a versatile and potentially cost-effective solution for autonomous driving applications.
Methodology
The paper's primary contribution is the "Lift-Splat" architecture, which operates in three main stages: Lift, Splat, and Shoot.
- Lift: Each camera image is independently "lifted" into a frustum-shaped point cloud of features. For every pixel, the network predicts a context feature vector and a categorical distribution over a set of discrete depths; the feature is placed at every candidate depth, weighted by its predicted probability. This turns 2D image information into a 3D representation without committing to a single depth estimate.
- Splat: The frustum point clouds from all cameras are projected into a common BEV grid using the known camera intrinsics and extrinsics. Features that land in the same grid cell are accumulated by sum-pooling, and a cumulative-sum ("cumsum") trick keeps this pooling efficient for large numbers of points. A sketch of the Lift and Splat computations follows this list.
- Shoot: The resulting BEV representation can be used for planning by "shooting" a set of candidate trajectories and selecting the lowest-cost one under a cost map inferred from the BEV grid. This enables interpretable end-to-end motion planning without explicit depth information from additional sensors; a small trajectory-scoring sketch appears below as well.
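To make the Lift and Splat steps concrete, here is a minimal PyTorch-style sketch. It is not the authors' code: the tensor layout, function names, and toy constants (41 depth bins, 64 context channels, an 8x22 feature map) are assumptions chosen for illustration. `lift` forms the outer product of a per-pixel categorical depth distribution and a context vector; `splat` sum-pools frustum points into BEV cells with a cumulative-sum trick (the paper additionally pairs this pooling with an analytic backward pass, which is omitted here).

```python
import torch

D = 41          # assumed number of discrete depth bins
C = 64          # assumed context feature channels per pixel
H, W = 8, 22    # assumed downsampled feature-map resolution

def lift(image_features: torch.Tensor) -> torch.Tensor:
    """Turn per-pixel predictions into a frustum of features.

    image_features: (B, D + C, H, W) raw per-camera network output.
    Returns: (B, D, H, W, C), i.e. the context vector at every pixel
    scaled by its categorical depth probability.
    """
    depth_logits, context = image_features.split([D, C], dim=1)
    depth_prob = depth_logits.softmax(dim=1)                  # (B, D, H, W)
    # Outer product over the depth and channel axes.
    frustum = depth_prob.unsqueeze(-1) * context.permute(0, 2, 3, 1).unsqueeze(1)
    return frustum                                            # (B, D, H, W, C)

def splat(feats: torch.Tensor, bev_ids: torch.Tensor, n_cells: int) -> torch.Tensor:
    """Sum-pool frustum points into BEV cells with the cumulative-sum trick.

    feats:   (N, C) features of all frustum points from all cameras.
    bev_ids: (N,)   flat index of the BEV cell each point falls into.
    """
    order = bev_ids.argsort()
    feats, bev_ids = feats[order], bev_ids[order]
    csum = feats.cumsum(dim=0)
    # Keep only the last cumulative sum in each run of identical cell ids,
    # then take differences to recover per-cell sums.
    keep = torch.ones_like(bev_ids, dtype=torch.bool)
    keep[:-1] = bev_ids[1:] != bev_ids[:-1]
    csum, cells = csum[keep], bev_ids[keep]
    csum = torch.cat([csum[:1], csum[1:] - csum[:-1]])
    bev = torch.zeros(n_cells, feats.shape[1], device=feats.device, dtype=feats.dtype)
    bev[cells] = csum
    return bev
```

In a full rig, `lift` runs once per camera; the frustum points are transformed into the ego frame with the known extrinsics, assigned BEV cell indices, and then pooled together by `splat`.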
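For the Shoot step, a similarly hedged sketch is given below. It assumes the BEV head outputs a per-cell cost map and that planning reduces to scoring a fixed set of trajectory templates; the grid resolution, template format, and function name are illustrative assumptions rather than details taken from the paper.

```python
import torch

def shoot(cost_map: torch.Tensor, templates: torch.Tensor, resolution: float = 0.5) -> int:
    """Score fixed trajectory templates against an inferred BEV cost map.

    cost_map:  (H, W) per-cell cost inferred from the BEV features.
    templates: (K, T, 2) K candidate trajectories of T (x, y) waypoints in metres,
               in the ego frame with the ego vehicle assumed at the grid centre.
    Returns the index of the lowest-cost template.
    """
    H, W = cost_map.shape
    # Convert metric waypoints to grid indices.
    cols = (templates[..., 0] / resolution + W / 2).long().clamp(0, W - 1)
    rows = (templates[..., 1] / resolution + H / 2).long().clamp(0, H - 1)
    costs = cost_map[rows, cols].sum(dim=1)   # (K,) summed cost per trajectory
    return int(costs.argmin())
```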
Empirical Evaluation
The authors evaluate the Lift-Splat architecture on two major datasets: nuScenes and Lyft Level 5. They benchmark against several baselines, including standard convolutional neural networks, frozen-encoder variants, and the Orthographic Feature Transform (OFT). The Lift-Splat model consistently outperforms these baselines across tasks such as car segmentation, vehicle segmentation, lane boundary detection, and drivable area segmentation.
Results Highlights
- On the nuScenes dataset, the Lift-Splat model reaches an Intersection-over-Union (IoU) of 32.06% for car segmentation and 19.96% for lane boundary detection, clear improvements over the camera-only baselines.
- On the Lyft dataset it shows similar gains, demonstrating that the approach generalizes across different camera configurations.
- When evaluated against models using oracle depth from LiDAR, the Lift-Splat model shows competitive performance, particularly in drivable area detection, though it lags slightly behind in some object segmentation tasks.
Robustness and Generalization
The paper further explores the model's resilience to common real-world issues such as camera dropout and noise in the extrinsic calibration. Training with randomly dropped cameras significantly improves test-time performance when some cameras are unavailable (a minimal masking sketch follows this paragraph). The paper also demonstrates zero-shot generalization: performance improves when previously unseen cameras are added at test time, without any retraining.
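As an illustration of the camera-dropout idea, the following sketch randomly masks whole cameras during training. The masking mechanism (zeroing per-camera features), the drop probability, and the tensor layout are assumptions made for illustration, not details from the paper.

```python
import torch

def drop_cameras(cam_feats: torch.Tensor, p_drop: float = 0.2) -> torch.Tensor:
    """Zero out entire cameras with probability p_drop (assumed mechanism).

    cam_feats: (B, N_cams, C, H, W) per-camera feature maps (or images).
    """
    B, N, *_ = cam_feats.shape
    keep = (torch.rand(B, N, device=cam_feats.device) > p_drop).float()
    # Guarantee at least one live camera per sample.
    keep[keep.sum(dim=1) == 0, 0] = 1.0
    return cam_feats * keep.view(B, N, 1, 1, 1)
```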
Practical and Theoretical Implications
The practical implications of this research are substantial for the field of autonomous driving. The ability to generate accurate BEV semantic maps directly from camera inputs could reduce dependency on expensive LiDAR systems, potentially lowering the cost of deploying autonomous vehicles. Moreover, the robustness to camera calibration errors and missing sensors suggests greater resilience in diverse operational environments.
Theoretically, the framework pushes the boundaries of sensor fusion in autonomy, especially in how depth ambiguity is handled in monocular vision setups. By implicitly unprojecting images to 3D and leveraging effective feature pooling strategies, the Lift-Splat model paves the way for more adaptable and efficient perception systems.
Future Developments
Looking ahead, one of the most pressing future directions is the incorporation of temporal information from video sequences. Extending the Lift-Splat model to handle multiple time steps could further improve depth inference and overall scene understanding, helping close the gap with LiDAR-based systems.
Conclusion
The "Lift, Splat, Shoot" architecture represents a significant step forward in the pursuit of efficient and robust perception systems for autonomous driving. By providing a powerful yet flexible framework for BEV representation from arbitrary camera rigs, this research opens up numerous possibilities for both practical deployment and further academic exploration in the field of autonomous vehicles.