
HFT: Lifting Perspective Representations via Hybrid Feature Transformation

Published 11 Apr 2022 in cs.CV (arXiv:2204.05068v1)

Abstract: Autonomous driving requires accurate and detailed Bird's Eye View (BEV) semantic segmentation for decision making, which is one of the most challenging tasks for high-level scene perception. Feature transformation from frontal view to BEV is the pivotal technology for BEV semantic segmentation. Existing works can be roughly classified into two categories, i.e., Camera model-Based Feature Transformation (CBFT) and Camera model-Free Feature Transformation (CFFT). In this paper, we empirically analyze the vital differences between CBFT and CFFT. The former transforms features based on the flat-world assumption, which may cause distortion of regions lying above the ground plane. The latter is limited in the segmentation performance due to the absence of geometric priors and time-consuming computation. In order to reap the benefits and avoid the drawbacks of CBFT and CFFT, we propose a novel framework with a Hybrid Feature Transformation module (HFT). Specifically, we decouple the feature maps produced by HFT for estimating the layout of outdoor scenes in BEV. Furthermore, we design a mutual learning scheme to augment hybrid transformation by applying feature mimicking. Notably, extensive experiments demonstrate that with negligible extra overhead, HFT achieves a relative improvement of 13.3% on the Argoverse dataset and 16.8% on the KITTI 3D Object datasets compared to the best-performing existing method. The codes are available at https://github.com/JiayuZou2020/HFT.

Citations (18)

Summary

  • The paper introduces HFT, a dual-branch framework that fuses geometric and global transformers to enhance BEV semantic segmentation.
  • It employs mutual learning between a geometric transformer and a global transformer to address the limitations of CBFT and CFFT methods.
  • Evaluations on Argoverse, KITTI 3D Object, and nuScenes show up to a 16.8% relative mIoU improvement over prior methods while using fewer parameters.

Lifting Perspective Representations via Hybrid Feature Transformation

This paper introduces a novel framework called Hybrid Feature Transformation (HFT) designed to enhance Bird's Eye View (BEV) semantic segmentation in autonomous driving applications. BEV semantic segmentation is crucial for the development of autonomous vehicles as it facilitates high-level scene perception necessary for decision making and navigation tasks such as motion prediction and obstacle avoidance.

The primary challenge addressed by the paper is transforming frontal-view image features into BEV representations. Previous approaches fall broadly into two types: Camera model-Based Feature Transformation (CBFT) and Camera model-Free Feature Transformation (CFFT). CBFT methods leverage geometric priors, typically Inverse Perspective Mapping (IPM), to project features from the frontal view to BEV; however, they rely on the flat-world assumption, which distorts regions lying above the ground plane. CFFT methods instead use neural architectures to learn the projection from frontal view to BEV without geometric assumptions, but the lack of geometric priors slows convergence and the learned mapping is computationally expensive, which limits segmentation performance.
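
To make the CBFT side concrete, the following is a minimal sketch (not the paper's implementation) of an IPM-style feature warp, assuming known camera intrinsics K, extrinsics (R, t), and a predefined metric BEV grid; the function names are hypothetical and PyTorch is used for illustration.

```python
import torch
import torch.nn.functional as F

def ipm_homography(K, R, t):
    """Homography mapping ground-plane (Z=0) world points to image pixels.

    Uses the projection p ~ K (R X + t) and the flat-world assumption, so
    anything above the ground plane will be distorted after warping.
    """
    return K @ torch.stack([R[:, 0], R[:, 1], t], dim=1)  # (3, 3)

def warp_features_to_bev(feat, H, bev_xy):
    """Sample frontal-view features at the image locations of BEV grid cells.

    feat:   (C, Hf, Wf) frontal-view feature map
    H:      (3, 3) ground-plane homography from ipm_homography
    bev_xy: (Hb, Wb, 2) metric (X, Y) ground coordinates of each BEV cell
    """
    Hb, Wb, _ = bev_xy.shape
    pts = torch.cat([bev_xy, torch.ones(Hb, Wb, 1)], dim=-1).reshape(-1, 3)
    pix = (H @ pts.T).T                              # project ground points
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)   # perspective divide
    _, Hf, Wf = feat.shape
    grid = torch.empty(Hb, Wb, 2)                    # grid_sample expects [-1, 1]
    grid[..., 0] = pix[:, 0].reshape(Hb, Wb) / (Wf - 1) * 2 - 1
    grid[..., 1] = pix[:, 1].reshape(Hb, Wb) / (Hf - 1) * 2 - 1
    return F.grid_sample(feat[None], grid[None], align_corners=True)[0]  # (C, Hb, Wb)
```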

The HFT framework seeks to combine the strengths of both CBFT and CFFT while mitigating their respective weaknesses. It incorporates a dual-branch architecture consisting of a Geometric Transformer and a Global Transformer. The Geometric Transformer processes features using geometric priors to achieve a rough BEV transformation. Meanwhile, the Global Transformer employs attention mechanisms to capture global spatial correlation without relying on the camera's geometrical constraints.
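
The dual-branch structure can be summarized schematically as follows. This is an illustrative skeleton under assumed module interfaces, not the released HFT code; the class and argument names are hypothetical.

```python
import torch.nn as nn

class HybridBEVHead(nn.Module):
    """Schematic dual-branch lifting head (illustrative, not the official HFT code)."""

    def __init__(self, geometric_branch, global_branch, in_channels, num_classes):
        super().__init__()
        self.geometric_branch = geometric_branch   # camera-based lifting, e.g. an IPM warp
        self.global_branch = global_branch         # camera-free lifting via attention
        # separate segmentation heads so each branch can be supervised on its own
        self.geo_head = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.glo_head = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, img_feat, cam_params):
        bev_geo = self.geometric_branch(img_feat, cam_params)  # uses geometric priors
        bev_glo = self.global_branch(img_feat)                  # ignores the camera model
        return self.geo_head(bev_geo), self.glo_head(bev_glo), bev_geo, bev_glo
```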

In a mutual learning scheme, the two branches learn from each other through feature mimicking, yielding a more robust BEV representation. This mutual learning is a key innovation: it blends geometric inference with the representational power of learned transformations, giving HFT a marked advantage over using either CBFT or CFFT in isolation.
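
One simple way to realise such feature mimicking is a symmetric distillation term between the two branches' BEV features, added to their segmentation losses. The sketch below shows that idea with an arbitrarily chosen loss weight; it is not the exact objective used in the paper.

```python
import torch.nn.functional as F

def mutual_mimicking_loss(bev_geo, bev_glo, logits_geo, logits_glo, target, mimic_weight=0.5):
    """Illustrative objective: per-branch segmentation loss plus a symmetric
    feature-mimicking term that pulls the two BEV feature maps together.
    Each branch mimics a stop-gradient copy of the other branch's features."""
    seg = F.cross_entropy(logits_geo, target) + F.cross_entropy(logits_glo, target)
    mimic = F.mse_loss(bev_geo, bev_glo.detach()) + F.mse_loss(bev_glo, bev_geo.detach())
    return seg + mimic_weight * mimic
```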

The effectiveness of HFT is demonstrated through extensive experiments on several datasets, including Argoverse, KITTI 3D Object, and nuScenes. The results show substantial gains in mIoU: relative improvements of 13.3% on Argoverse and 16.8% on KITTI 3D Object over the best-performing existing methods. In addition, HFT uses 31.2% fewer parameters than the View Parsing Network (VPN) while delivering a 21.1% relative improvement in performance.

This paper also contributes to the ongoing discourse on the trade-offs between leveraging camera geometric priors and learning flexible feature transformations. By proposing a framework that integrates these disparate techniques, the paper outlines a path forward for more efficacious applications in BEV semantic segmentation, enhancing the real-time capabilities crucial for autonomous driving.

Overall, the implications of this research are significant for both theory and practice in the autonomy domain. The hybrid approach offers promising directions for future developments in AI, particularly in refining the perceptual faculties of robots and autonomous systems in complex environments. Further research could explore the extension of HFT to incorporate temporal dependencies or collaborative mapping across multiple vehicles, potentially leading to more cohesive and reliable autonomous systems.
