BEVSegFormer: Bird's Eye View Semantic Segmentation From Arbitrary Camera Rigs (2203.04050v3)

Published 8 Mar 2022 in cs.CV

Abstract: Semantic segmentation in bird's eye view (BEV) is an important task for autonomous driving. Though this task has attracted a large amount of research efforts, it is still challenging to flexibly cope with arbitrary (single or multiple) camera sensors equipped on the autonomous vehicle. In this paper, we present BEVSegFormer, an effective transformer-based method for BEV semantic segmentation from arbitrary camera rigs. Specifically, our method first encodes image features from arbitrary cameras with a shared backbone. These image features are then enhanced by a deformable transformer-based encoder. Moreover, we introduce a BEV transformer decoder module to parse BEV semantic segmentation results. An efficient multi-camera deformable attention unit is designed to carry out the BEV-to-image view transformation. Finally, the queries are reshaped according to the layout of grids in the BEV, and upsampled to produce the semantic segmentation result in a supervised manner. We evaluate the proposed algorithm on the public nuScenes dataset and a self-collected dataset. Experimental results show that our method achieves promising performance on BEV semantic segmentation from arbitrary camera rigs. We also demonstrate the effectiveness of each component via ablation study.

Citations (114)

Summary

  • The paper presents a novel transformer approach that uses multi-camera deformable cross-attention to implicitly transform 2D images into BEV space without explicit calibration.
  • It achieves state-of-the-art performance on the nuScenes dataset with an mIoU of 44.55, outperforming methods like HDMapNet and supporting flexible camera configurations.
  • Ablation studies confirm that components such as deformable self-attention and increased transformer layers significantly boost segmentation accuracy.

This paper introduces BEVSegFormer, a transformer-based model for Bird's Eye View (BEV) semantic segmentation using images from arbitrary camera configurations (single or multiple). The core challenge addressed is performing accurate view transformation from 2D image space to BEV space without relying heavily on explicit camera calibration parameters or intermediate representations like depth maps, which can be error-prone or inflexible.

The BEVSegFormer architecture consists of four main components:

  1. Shared Backbone: A standard CNN backbone (e.g., ResNet) is used to extract multi-scale features from each input camera image independently. The same backbone weights are shared across all cameras.
  2. Transformer Encoder: Inspired by Deformable DETR, this module enhances the extracted image features using multi-scale deformable self-attention. This allows the model to focus on salient image regions efficiently. Learnable scale-level position embeddings are added.
  3. BEV Transformer Decoder: This is the key component for view transformation and segmentation.
    • It initializes a grid of dense BEV queries, where each query corresponds to a location in the target BEV map.
    • A novel Multi-Camera Deformable Cross-Attention module is introduced. For each BEV query, this module learns:
      • Reference point coordinates directly on the feature maps of each camera. This differs from methods like DETR3D, which project 3D points onto images using camera extrinsics. BEVSegFormer's approach avoids the need for explicit camera parameters during view transformation.
      • Sampling offsets around these reference points.
      • Attention weights for the sampled features.
    • The module aggregates features from relevant locations across all camera views based on the learned attention weights, effectively performing the BEV-to-image transformation implicitly. The mathematical formulation is given as:

      $\text{MultiCameraDeformAttn}\big(\bm{z}_q, \hat{\bm{p}}_q, \left\{ x^{c} \right\}_{c=1}^{N_c}\big) = \sum_{m=1}^{M} \bm{W}_m \Big[ \sum_{c=1}^{N_c} \sum_{k=1}^{K} A_{mcqk} \cdot \bm{W}_{m}^{'} \, \bm{x}^{c}\big(\phi_c(\hat{\bm{p}}_q) + \Delta \bm{P}_{mcqk}\big) \Big]$

      where $\bm{z}_q$ is the BEV query, $\hat{\bm{p}}_q$ are its learned reference points, $x^{c}$ are the features of camera $c$ (of $N_c$ cameras in total), $\phi_c$ maps the normalized reference point onto the $c$-th camera's feature map, $A_{mcqk}$ are attention weights, $\Delta \bm{P}_{mcqk}$ are sampling offsets, and $\bm{W}_m$, $\bm{W}_{m}^{'}$ are learnable projections for attention head $m$ (with $M$ heads and $K$ sampling points per head). A simplified code sketch of this attention unit is given after the list below.

  4. BEV Semantic Decoder: The output query features from the transformer decoder are reshaped into a 2D spatial grid corresponding to the BEV map layout. This feature map is then upsampled using convolutional layers and bilinear interpolation to produce the final semantic segmentation map for the BEV space.
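
To make the multi-camera deformable cross-attention concrete, here is a minimal single-head, single-scale PyTorch sketch of the idea. This is not the authors' implementation: the class name, the number of sampling points, and the tensor shapes in the usage lines are illustrative assumptions, and the paper's actual module is multi-head and multi-scale as in the equation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiCameraDeformableCrossAttention(nn.Module):
    """Single-head, single-scale sketch of multi-camera deformable cross-attention.

    Each BEV query predicts a 2-D reference point on every camera's feature map,
    K sampling offsets around it, and attention weights over all cameras and
    points; features are gathered by bilinear sampling, so no camera extrinsics
    enter the view transformation.
    """

    def __init__(self, dim: int = 256, n_cams: int = 6, n_points: int = 4):
        super().__init__()
        self.n_cams, self.n_points = n_cams, n_points
        self.ref_points = nn.Linear(dim, n_cams * 2)          # \hat{p}_q per camera
        self.offsets = nn.Linear(dim, n_cams * n_points * 2)  # \Delta P_{cqk}
        self.attn = nn.Linear(dim, n_cams * n_points)         # A_{cqk}
        self.value_proj = nn.Linear(dim, dim)                 # W'
        self.out_proj = nn.Linear(dim, dim)                   # W

    def forward(self, queries: torch.Tensor, cam_feats: torch.Tensor) -> torch.Tensor:
        # queries: (B, Nq, C); cam_feats: (B, n_cams, C, H, W)
        B, Nq, C = queries.shape
        _, Nc, _, H, W = cam_feats.shape

        ref = self.ref_points(queries).sigmoid().view(B, Nq, Nc, 1, 2)    # in [0, 1]
        off = self.offsets(queries).view(B, Nq, Nc, self.n_points, 2)
        attn = self.attn(queries).view(B, Nq, Nc * self.n_points).softmax(-1)
        attn = attn.view(B, Nq, Nc, self.n_points)

        # Normalised sampling locations, mapped to grid_sample's [-1, 1] range.
        scale = queries.new_tensor([W, H])
        loc = (ref + off / scale) * 2.0 - 1.0                             # (B, Nq, Nc, K, 2)

        out = queries.new_zeros(B, Nq, C)
        for c in range(Nc):
            value = self.value_proj(cam_feats[:, c].flatten(2).transpose(1, 2))
            value = value.transpose(1, 2).view(B, C, H, W)
            sampled = F.grid_sample(value, loc[:, :, c], align_corners=False)  # (B, C, Nq, K)
            out = out + (sampled * attn[:, :, c].unsqueeze(1)).sum(-1).transpose(1, 2)
        return self.out_proj(out)


# Hypothetical usage: 6 cameras, a 25x25 grid of BEV queries, 256-d features.
attn = MultiCameraDeformableCrossAttention(dim=256, n_cams=6, n_points=4)
bev_queries = torch.randn(1, 25 * 25, 256)
cam_feats = torch.randn(1, 6, 256, 28, 48)
bev_feats = attn(bev_queries, cam_feats)                      # (1, 625, 256)
bev_grid = bev_feats.transpose(1, 2).reshape(1, 256, 25, 25)  # reshape queries to the BEV layout
```

The last usage line mirrors the BEV semantic decoder step: output queries are reshaped into the 2-D BEV grid and then upsampled with convolutions and bilinear interpolation to produce the final segmentation map.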

Experiments were conducted on the nuScenes dataset (using both 6 surrounding cameras and just the front camera) and a custom dataset ("NuLLMax Front Camera"). The model was trained using a weighted cross-entropy loss and the AdamW optimizer.
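
As a concrete illustration of the reported training recipe (weighted cross-entropy plus AdamW), here is a minimal, self-contained PyTorch sketch; the dummy decoder head, class weights, learning rate, and weight decay below are placeholder assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

# Stand-in for the BEV semantic decoder head; the real model consumes camera
# images, but a random BEV feature map is enough to show the loss/optimizer wiring.
n_classes, H_bev, W_bev = 4, 100, 100
head = nn.Sequential(
    nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(),
    nn.Conv2d(128, n_classes, 1),
)
class_weights = torch.tensor([1.0, 5.0, 5.0, 5.0])   # assumed weights favouring the rare map classes
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4, weight_decay=1e-2)  # assumed hyperparameters

bev_feats = torch.randn(2, 256, H_bev, W_bev)                 # decoder output features (random stand-in)
bev_labels = torch.randint(0, n_classes, (2, H_bev, W_bev))   # per-pixel BEV class ids

logits = head(bev_feats)                                      # (2, n_classes, H_bev, W_bev)
loss = criterion(logits, bev_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```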

Key results include:

  • BEVSegFormer achieved state-of-the-art results on the nuScenes validation set for BEV semantic segmentation (Divider, Ped Crossing, Boundary classes) without using temporal information, significantly outperforming methods like IPM, Lift-Splat-Shoot, VPN, and HDMapNet. For example, it achieved an mIoU of 44.55 compared to HDMapNet's 32.9.
  • The method demonstrated flexibility by performing well using only a single front camera.
  • Ablation studies confirmed the effectiveness of the proposed components:
    • The multi-camera deformable cross-attention provided significant gains over standard cross-attention and converged faster.
    • Using deformable self-attention in the encoder also improved performance.
    • Adding learnable camera position embeddings helped standard attention but offered marginal benefit with the deformable cross-attention, suggesting the latter inherently learns camera spatial relationships via the reference points.
    • Increasing the number of transformer layers and using a stronger backbone (ResNet-101 vs. ResNet-34) further boosted performance.
  • Visualizations of the attention maps showed that BEV queries correctly attended to relevant image regions in the corresponding camera views.

In conclusion, BEVSegFormer presents an effective transformer-based approach for BEV semantic segmentation that handles arbitrary camera rigs by learning the view transformation implicitly through a novel multi-camera deformable cross-attention mechanism, eliminating the need for explicit camera extrinsic parameters in the transformation process.
