- The paper introduces a unified end-to-end deep learning architecture, the Pyramid Occupancy Network, for predicting BEV semantic maps directly from monocular images.
- It employs a Dense Transformer Layer and a Multiscale Transformer Pyramid to map image features into BEV space while handling scene content at varying depths.
- The approach achieves relative IoU improvements of 9.1% on NuScenes and 22.3% on Argoverse over prior methods, supporting efficient map construction for autonomous navigation.
Predicting Semantic Map Representations from Images Using Pyramid Occupancy Networks
The paper presents a significant contribution to the field of autonomous vehicles by introducing a robust end-to-end deep learning architecture for predicting semantic map representations directly from monocular images. Generating bird's-eye-view (BEV) maps, which are crucial for autonomous navigation, typically involves a complex multi-stage pipeline integrating tasks such as ground plane estimation, road segmentation, and 3D object detection; this approach collapses that pipeline into a single network.
Key Contributions
The authors propose a unified framework built on a deep convolutional neural network (CNN), the Pyramid Occupancy Network (PyrOccNet), whose structure comprises several notable components:
- Dense Transformer Layer: This layer converts image-based features into the BEV space by leveraging camera geometry. It condenses the vertical and channel dimensions of the image feature map into a bottleneck, then expands along the depth axis while preserving spatial context in the horizontal dimension through 1D convolution (see the first sketch after this list).
- Multiscale Transformer Pyramid: To manage variations in depth within the perspective image, the network applies multiple dense transformers to a pyramid of feature maps drawn from the backbone network's residual layers. Each transformer is responsible for a different range of depths, so the sampling resolution in the BEV grid stays matched to the resolution of the corresponding image features (second sketch below).
- Semantic Bayesian Occupancy Grids: A probabilistic framework allows information to be aggregated across sensors and over time, with the network outputting occupancy probabilities for each semantic class. This formulation naturally supports fusing predictions from multiple camera views and temporal frames through a Bayesian update (see the fusion sketch below).
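To make the dense transformer concrete, the following PyTorch-style sketch shows the core collapse-and-expand operation. It is a minimal illustration under assumed shapes and layer sizes (the class name, `bottleneck` width, and `depth_bins` parameter are my own), not the authors' implementation, and it omits the final resampling from the polar feature map into a Cartesian BEV grid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseTransformer(nn.Module):
    """Sketch of a dense transformer layer: collapse the vertical and channel
    dimensions of an image feature map into a bottleneck, then expand along
    the depth axis with a 1D convolution over the horizontal (width) axis."""

    def __init__(self, in_channels, in_height, bev_channels, depth_bins, bottleneck=128):
        super().__init__()
        # Condense each (channels x height) image column into a bottleneck vector.
        self.condense = nn.Conv1d(in_channels * in_height, bottleneck, kernel_size=1)
        # A 1D convolution along the width axis keeps horizontal spatial context
        # while predicting a feature vector for every discrete depth bin.
        self.expand = nn.Conv1d(bottleneck, bev_channels * depth_bins, kernel_size=3, padding=1)
        self.bev_channels = bev_channels
        self.depth_bins = depth_bins

    def forward(self, feats):
        b, c, h, w = feats.shape                       # (B, C, H, W) image features
        columns = feats.reshape(b, c * h, w)           # flatten vertical + channel dims
        bottleneck = F.relu(self.condense(columns))    # (B, bottleneck, W)
        polar = self.expand(bottleneck)                # (B, bev_channels * depth_bins, W)
        # Polar BEV features: one vector per (depth bin, image column) location.
        return polar.reshape(b, self.bev_channels, self.depth_bins, w)
```

In the paper, this polar map is then resampled into a regular Cartesian grid using the known camera geometry; the sketch stops at the polar stage.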
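Building on the sketch above, the multiscale pyramid can be approximated by instantiating one dense transformer per backbone scale and assigning each a different slice of the depth range. Again, the shapes, depth splits, and the width interpolation used to align the maps are illustrative assumptions rather than the paper's exact procedure.

```python
class TransformerPyramid(nn.Module):
    """Sketch: one dense transformer per feature-pyramid scale, each covering
    its own range of depths; the resulting polar maps are concatenated along
    the depth axis."""

    def __init__(self, in_channels, heights, bev_channels, depth_bins_per_scale):
        super().__init__()
        self.transformers = nn.ModuleList([
            DenseTransformer(c, h, bev_channels, d)
            for c, h, d in zip(in_channels, heights, depth_bins_per_scale)
        ])

    def forward(self, pyramid_feats):
        polar_maps = [t(f) for t, f in zip(self.transformers, pyramid_feats)]
        # Coarser scales produce narrower maps; resize widths so the maps align
        # before stacking (the paper instead resamples into a shared BEV grid).
        target_w = max(p.shape[-1] for p in polar_maps)
        polar_maps = [F.interpolate(p, size=(p.shape[2], target_w), mode="nearest")
                      for p in polar_maps]
        return torch.cat(polar_maps, dim=2)

# Hypothetical usage with three backbone scales:
feats = [torch.randn(1, 256, 28, 100),
         torch.randn(1, 256, 14, 50),
         torch.randn(1, 256, 7, 25)]
pyramid = TransformerPyramid(in_channels=[256, 256, 256], heights=[28, 14, 7],
                             bev_channels=64, depth_bins_per_scale=[24, 16, 8])
polar = pyramid(feats)   # -> (1, 64, 48, 100)
```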
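The Bayesian occupancy grid itself reduces to a per-cell, per-class log-odds update once every frame's prediction has been warped into a common world frame. The snippet below is a generic version of that update, not the paper's code; the prior and clamping range are assumed values.

```python
import torch

def logit(p, eps=1e-6):
    """Convert probabilities to log-odds, avoiding infinities at 0 and 1."""
    p = p.clamp(eps, 1.0 - eps)
    return torch.log(p) - torch.log1p(-p)

def fuse_occupancy(prob_maps, prior=0.5, clamp=10.0):
    """Fuse per-frame class occupancy maps, each of shape (num_classes, H, W)
    and already aligned to a shared world grid, with a standard Bayesian
    log-odds update. Returns fused occupancy probabilities."""
    log_odds = torch.zeros_like(prob_maps[0])
    prior_lo = logit(torch.tensor(prior))
    for p in prob_maps:
        # Each observation adds its evidence relative to the prior ...
        log_odds += logit(p) - prior_lo
        # ... and clamping keeps the grid able to react to dynamic objects.
        log_odds.clamp_(-clamp, clamp)
    return torch.sigmoid(log_odds)
```

With a uniform prior of 0.5 the update amounts to summing the per-frame logits, which is why predictions from multiple cameras and timesteps can be combined without any retraining.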
Results
The paper reports results from evaluations on the NuScenes and Argoverse datasets, demonstrating significant improvements over existing methods such as the Variational Encoder-Decoder (VED) and the View Parsing Network (VPN). Specifically, the proposed approach achieves relative improvements in Intersection over Union (IoU) of 9.1% on NuScenes and 22.3% on Argoverse, with gains on both large classes such as drivable area and smaller, finer-grained object classes.
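For context, the IoU behind these figures is the standard per-class ratio of intersection to union between predicted and ground-truth BEV occupancy masks. The snippet below is a minimal, assumed implementation with a hypothetical 0.5 decision threshold, not the paper's evaluation code.

```python
import torch

def class_iou(pred_probs, target, threshold=0.5, eps=1e-6):
    """Per-class intersection-over-union for BEV occupancy maps.
    pred_probs: (num_classes, H, W) predicted probabilities.
    target:     (num_classes, H, W) binary ground-truth masks."""
    pred = pred_probs > threshold
    target = target.bool()
    intersection = (pred & target).sum(dim=(1, 2)).float()
    union = (pred | target).sum(dim=(1, 2)).float()
    return intersection / (union + eps)
```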
Implications and Future Directions
This research holds considerable implications for the development of robust, cost-effective autonomous vehicles capable of real-time environmental mapping using common image sensors. The end-to-end nature of the proposed architecture simplifies the pipeline, enhancing computational efficiency and scalability.
The theoretical merits of this unified deep learning approach extend beyond autonomous driving. It illustrates the potential of BEV map prediction for other robotics applications and may prompt further exploration into generalized spatial reasoning from visual data streams. Future work could augment the model's capabilities to include lane detection and trajectory prediction, broadening its utility within the autonomous systems domain.
Conclusion
By forging a direct path from monocular image inputs to detailed BEV map outputs, this paper lays the groundwork for future innovations in the autonomous driving landscape. The combination of dense transformer layers and a multiscale pyramid architecture represents a considerable step forward, offering a refined perspective on how vision-based systems can understand and navigate their environments.