- The paper introduces a novel transformer framework with Reference Positional Encoding (RefPE) that delivers high-quality BEV maps using only one decoder layer.
- It employs vertical feature compression along with auxiliary modules to reduce computational overhead while preserving critical information.
- Experiments on the nuScenes benchmark show that WidthFormer achieves 1.5 ms latency on an NVIDIA 3090 GPU, demonstrating its real-time applicability.
WidthFormer: Efficient BEV View Transformation for 3D Detection
In autonomous driving, 3D object detection is crucial for building systems that can understand complex road environments. The paper introduces WidthFormer, a transformer-based approach for efficient Bird's-Eye-View (BEV) view transformation, a critical step in camera-based 3D object detection pipelines. Aimed squarely at real-time applications, WidthFormer prioritizes computational efficiency while maintaining robust detection performance.
A key feature of WidthFormer is a novel 3D positional encoding mechanism termed Reference Positional Encoding (RefPE). By accurately encoding 3D geometric information into the attention computation, RefPE enables the model to generate high-quality BEV representations with just a single transformer decoder layer. Notably, RefPE also boosts the performance of existing sparse 3D object detectors, suggesting it is useful as a standalone, general-purpose component.
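To make the idea concrete, the sketch below shows one common way to build such a 3D-aware positional encoding: each pixel is lifted to a set of reference points along its camera ray, and the resulting point set is encoded by a small MLP. This is a minimal, PETR-style illustration of the general technique; the class name, depth-bin scheme, and MLP design are assumptions for exposition, not the paper's exact RefPE formulation.

```python
import torch
import torch.nn as nn

class RayPositionalEncoding(nn.Module):
    """Encode each pixel's 3D geometry by sampling reference points
    along its camera ray and passing them through an MLP.
    (Illustrative stand-in for RefPE, not the paper's exact design.)"""

    def __init__(self, embed_dim=256, num_depths=64, d_min=1.0, d_max=60.0):
        super().__init__()
        # Fixed depth bins shared by every pixel ray (an assumption).
        self.register_buffer("depths", torch.linspace(d_min, d_max, num_depths))
        self.mlp = nn.Sequential(
            nn.Linear(num_depths * 3, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, pixel_dirs, cam_to_world):
        # pixel_dirs: (N, 3) ray directions in camera coords (K^-1 @ [u, v, 1]).
        # cam_to_world: (4, 4) camera-to-world transform.
        pts_cam = pixel_dirs[:, None, :] * self.depths[None, :, None]  # (N, D, 3)
        ones = torch.ones_like(pts_cam[..., :1])
        pts_world = (torch.cat([pts_cam, ones], -1) @ cam_to_world.T)[..., :3]
        # One embedding per pixel, summarizing its 3D reference points.
        return self.mlp(pts_world.flatten(1))                          # (N, embed_dim)
```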
From a computational standpoint, WidthFormer vertically compresses the image features that serve as attention keys and values. Collapsing the height dimension means each camera contributes W tokens rather than H×W, which substantially reduces the cost of cross-attention over high-resolution image inputs. Two auxiliary modules compensate for the information lost during compression, preserving the quality of the resulting BEV representations; the core mechanism is sketched below.
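Here is a minimal sketch of that efficiency idea: collapsing the height axis shrinks each camera's key/value set from H×W tokens to W tokens before a single cross-attention layer produces the BEV features. Mean pooling stands in for the paper's learned compression, and the two auxiliary compensation modules are omitted.

```python
import torch
import torch.nn as nn

class WidthCrossAttention(nn.Module):
    """Single-layer BEV decoder over vertically compressed features:
    each camera contributes W keys/values instead of H*W."""

    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, bev_queries, img_feats):
        # bev_queries: (B, N_bev, C); img_feats: (B, C, H, W)
        # Vertical compression: mean over H (a stand-in for the paper's
        # learned compression) -> (B, W, C) keys and values.
        width_feats = img_feats.mean(dim=2).transpose(1, 2)
        out, _ = self.attn(bev_queries, width_feats, width_feats)
        return self.norm(bev_queries + out)

# Example: 200x200 BEV grid attending to one camera's 16x44 feature map.
bev = WidthCrossAttention()(torch.randn(2, 200 * 200, 256),
                            torch.randn(2, 256, 16, 44))
```

Because attention cost scales with the number of keys, pooling away the height axis cuts the cross-attention workload by a factor of H relative to attending over the full feature map.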
Experimental evaluations on the nuScenes 3D object detection benchmark show that WidthFormer surpasses previous view-transformation methods in both accuracy and efficiency across different 3D detection architectures. With 256x704 input images, the view transformation takes only 1.5 ms on an NVIDIA 3090 GPU, underscoring its suitability for real-time processing. WidthFormer is also robust to varying degrees of camera perturbation, a common concern in real-world deployments where sensor alignment can drift due to external factors.
The implications of this research are twofold. Practically, WidthFormer offers an efficient BEV transformation that can be deployed on edge-computing devices without special engineering effort. Theoretically, it enriches the ongoing work on transformers for 3D perception, opening pathways for further refinement of positional encoding mechanisms and feature compression techniques.
Looking forward, integrating WidthFormer into real-world autonomous driving systems is a natural next step, with potential explorations into adaptive learning strategies to further improve robustness. The versatility of RefPE may also inspire adaptations to other 3D spatial tasks beyond object detection. Through its methodological innovations and empirical validation, the paper lays foundational work for efficient BEV transformation and provides a reference point for future advances in autonomous driving and 3D perception.