
BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation (2004.02147v1)

Published 5 Apr 2020 in cs.CV

Abstract: The low-level details and high-level semantics are both essential to the semantic segmentation task. However, to speed up the model inference, current approaches almost always sacrifice the low-level details, which leads to a considerable accuracy decrease. We propose to treat these spatial details and categorical semantics separately to achieve high accuracy and high efficiency for realtime semantic segmentation. To this end, we propose an efficient and effective architecture with a good trade-off between speed and accuracy, termed Bilateral Segmentation Network (BiSeNet V2). This architecture involves: (i) a Detail Branch, with wide channels and shallow layers to capture low-level details and generate high-resolution feature representation; (ii) a Semantic Branch, with narrow channels and deep layers to obtain high-level semantic context. The Semantic Branch is lightweight due to reducing the channel capacity and a fast-downsampling strategy. Furthermore, we design a Guided Aggregation Layer to enhance mutual connections and fuse both types of feature representation. Besides, a booster training strategy is designed to improve the segmentation performance without any extra inference cost. Extensive quantitative and qualitative evaluations demonstrate that the proposed architecture performs favourably against a few state-of-the-art real-time semantic segmentation approaches. Specifically, for a 2,048x1,024 input, we achieve 72.6% Mean IoU on the Cityscapes test set with a speed of 156 FPS on one NVIDIA GeForce GTX 1080 Ti card, which is significantly faster than existing methods, yet we achieve better segmentation accuracy.

Authors (6)
  1. Changqian Yu (28 papers)
  2. Changxin Gao (76 papers)
  3. Jingbo Wang (138 papers)
  4. Gang Yu (114 papers)
  5. Chunhua Shen (404 papers)
  6. Nong Sang (86 papers)
Citations (1,053)

Summary

Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation

In the field of semantic segmentation, achieving a balance between high segmentation accuracy and efficient real-time processing remains a core challenge. The paper "BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation" introduces an innovative approach aimed at addressing this problem by leveraging a dual-pathway architecture that strategically separates low-level spatial details from high-level semantic information.

Architectural Design

The Bilateral Segmentation Network (BiSeNet V2) consists of two primary branches: the Detail Branch and the Semantic Branch. The Detail Branch is designed with wide channels and shallow layers to capture intricate spatial details, maintaining high-resolution features. In contrast, the Semantic Branch employs narrow channels and deep layers to obtain comprehensive high-level semantic context. This branch is further optimized with a fast-downsampling strategy to boost efficiency.
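
To make the bilateral design concrete, the PyTorch-style sketch below outlines the two branches. The specific channel widths, depths, and strides here are illustrative assumptions chosen to mirror the wide-and-shallow vs. narrow-and-deep contrast; the paper's actual Semantic Branch uses more specialized building blocks than the plain convolutions shown.

```python
import torch.nn as nn


def conv_bn_relu(in_ch, out_ch, stride=1):
    """3x3 convolution followed by batch normalization and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class DetailBranch(nn.Module):
    """Wide channels, shallow depth: preserves detail-rich features at 1/8 resolution."""

    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            conv_bn_relu(3, 64, stride=2),    # 1/2 resolution
            conv_bn_relu(64, 64),
            conv_bn_relu(64, 64, stride=2),   # 1/4
            conv_bn_relu(64, 64),
            conv_bn_relu(64, 128, stride=2),  # 1/8
            conv_bn_relu(128, 128),
        )

    def forward(self, x):
        return self.layers(x)


class SemanticBranch(nn.Module):
    """Narrow channels, deeper stack with fast downsampling to 1/32 resolution."""

    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            conv_bn_relu(3, 16, stride=2),    # 1/2
            conv_bn_relu(16, 16, stride=2),   # 1/4 (fast downsampling)
            conv_bn_relu(16, 32, stride=2),   # 1/8
            conv_bn_relu(32, 64, stride=2),   # 1/16
            conv_bn_relu(64, 128, stride=2),  # 1/32
            conv_bn_relu(128, 128),
        )

    def forward(self, x):
        return self.layers(x)
```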

To effectively merge the outputs of these branches, the authors introduce a Guided Aggregation Layer. This layer enhances mutual connections between the branches, facilitating a robust fusion of detailed and semantic features. Additionally, the paper introduces a booster training strategy, incorporating auxiliary prediction heads that aid in refining the segmentation performance during training without adding inference complexity.
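
Continuing the sketch above (reusing DetailBranch and SemanticBranch), the following is one simplified way the fusion and the booster could look in code. The cross-gating scheme and the prediction heads are assumptions for illustration, not the paper's exact Guided Aggregation Layer or head design; the key point is that the auxiliary heads are only active during training.

```python
import torch.nn as nn
import torch.nn.functional as F


class GuidedAggregation(nn.Module):
    """Fuses detail (1/8) and semantic (1/32) features; each path gates
    the other with sigmoid attention before the results are summed."""

    def __init__(self, channels=128):
        super().__init__()
        self.detail_gate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),
        )
        self.semantic_gate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, detail_feat, semantic_feat):
        # Bring the low-resolution semantic features up to the detail resolution.
        semantic_up = F.interpolate(
            semantic_feat, size=detail_feat.shape[2:],
            mode="bilinear", align_corners=False,
        )
        # Cross-gating: each representation is modulated by the other branch.
        fused = detail_feat * self.semantic_gate(semantic_up) \
              + semantic_up * self.detail_gate(detail_feat)
        return self.fuse(fused)


class SegHead(nn.Module):
    """Simple prediction head: 1x1 conv, then upsample to input resolution."""

    def __init__(self, in_ch, num_classes, scale):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, num_classes, 1)
        self.scale = scale

    def forward(self, x):
        return F.interpolate(self.conv(x), scale_factor=self.scale,
                             mode="bilinear", align_corners=False)


class BiSeNetV2Sketch(nn.Module):
    """Illustrative wrapper: the auxiliary head on the Semantic Branch is used
    only during training (the 'booster'), so inference cost is unchanged."""

    def __init__(self, num_classes=19):
        super().__init__()
        self.detail = DetailBranch()       # from the previous sketch
        self.semantic = SemanticBranch()   # from the previous sketch
        self.bga = GuidedAggregation(128)
        self.main_head = SegHead(128, num_classes, scale=8)
        self.aux_head = SegHead(128, num_classes, scale=32)

    def forward(self, x):
        d = self.detail(x)
        s = self.semantic(x)
        out = self.main_head(self.bga(d, s))
        if self.training:
            # Auxiliary supervision is dropped at inference time.
            return out, self.aux_head(s)
        return out
```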

Quantitative Performance

The proposed BiSeNet V2 architecture underwent extensive evaluation on several benchmarks, including Cityscapes, CamVid, and COCO-Stuff. The results are notable:

  • On the Cityscapes test set, with a 2048×1024 input, BiSeNet V2 achieves 72.6% Mean Intersection over Union (Mean IoU) at 156 frames per second (FPS) on an NVIDIA GeForce GTX 1080 Ti (a brief sketch of how Mean IoU is computed follows this list).
  • The larger variant, BiSeNetV2-Large, achieves 75.3% Mean IoU with a speed of 47.3 FPS.
  • On CamVid, the model achieves an impressive 72.4% Mean IoU at 124.5 FPS, with the larger variant achieving 73.2% Mean IoU at 32.7 FPS.
  • On the COCO-Stuff dataset, BiSeNet V2 achieves 25.2% Mean IoU and 60.5% pixel accuracy at 87.9 FPS.
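
For reference, Mean IoU averages the per-class intersection-over-union between predicted and ground-truth label maps. The snippet below is a minimal per-image illustration with hypothetical toy inputs; benchmark figures such as those above are computed by accumulating intersections and unions over the entire test set.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean Intersection over Union over integer label arrays of the same shape."""
    mask = target != ignore_index
    pred, target = pred[mask], target[mask]
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))


# Toy example: two 4x4 label maps with 3 classes.
pred = np.array([[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 2, 2], [2, 2, 2, 2]])
gt   = np.array([[0, 0, 1, 1], [0, 1, 1, 1], [2, 2, 2, 2], [0, 2, 2, 2]])
print(f"mIoU: {mean_iou(pred, gt, num_classes=3):.3f}")  # ~0.758
```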

These results underscore the effectiveness of BiSeNet V2 in delivering high segmentation accuracy while maintaining real-time processing capability, outperforming several state-of-the-art methods in both aspects.

Theoretical and Practical Implications

The bifurcated approach of BiSeNet V2 holds significant theoretical and practical implications:

  1. Theoretical Implications: From a theoretical perspective, the separation of detail and semantic processing pathways provides a compelling demonstration of how specialized branches can synergistically improve both processing efficiency and representation richness. This modular approach also offers flexibility in scaling the network's capacity to accommodate various application-specific requirements without fundamentally altering the architecture.
  2. Practical Implications: Practically, the high frame rate achieved by BiSeNet V2 paves the way for its deployment in latency-sensitive applications such as autonomous driving, real-time video surveillance, and human-machine interaction systems. The guided aggregation and booster training strategies ensure that the added performance benefits are attained without imposing additional computational loads during inference, which is critical for deployment on resource-constrained devices.

Future Directions

The BiSeNet V2 framework opens several avenues for future research:

  • Transfer Learning and Pretraining: Investigating transfer learning techniques and pretraining strategies on larger, more diverse datasets could further enhance the model's robustness and generalization capabilities across different scenes.
  • Optimization for Edge Devices: Adapting and optimizing BiSeNet V2 for edge computing environments can target applications needing on-device processing, ensuring real-time performance under constrained hardware conditions.
  • Extended Applications: Exploring the application of this architecture in 3D semantic segmentation tasks or integrating it with multi-modality inputs (e.g., LiDAR, radar) could provide comprehensive environment perception in automotive and robotics applications.

Conclusion

The BiSeNet V2 paper presents a well-founded and rigorously evaluated approach to real-time semantic segmentation, achieving an outstanding trade-off between segmentation accuracy and inference speed. By separating and then effectively combining the low-level and high-level information processing through a bilateral network, this work stands as a substantive contribution to both the theoretical advancement and practical application of semantic segmentation methods in real-time scenarios.