Rethinking BiSeNet For Real-time Semantic Segmentation (2104.13188v1)

Published 27 Apr 2021 in cs.CV

Abstract: BiSeNet has been proved to be a popular two-stream network for real-time segmentation. However, its principle of adding an extra path to encode spatial information is time-consuming, and the backbones borrowed from pretrained tasks, e.g., image classification, may be inefficient for image segmentation due to the deficiency of task-specific design. To handle these problems, we propose a novel and efficient structure named Short-Term Dense Concatenate network (STDC network) by removing structure redundancy. Specifically, we gradually reduce the dimension of feature maps and use the aggregation of them for image representation, which forms the basic module of STDC network. In the decoder, we propose a Detail Aggregation module by integrating the learning of spatial information into low-level layers in single-stream manner. Finally, the low-level features and deep features are fused to predict the final segmentation results. Extensive experiments on Cityscapes and CamVid dataset demonstrate the effectiveness of our method by achieving promising trade-off between segmentation accuracy and inference speed. On Cityscapes, we achieve 71.9% mIoU on the test set with a speed of 250.4 FPS on NVIDIA GTX 1080Ti, which is 45.2% faster than the latest methods, and achieve 76.8% mIoU with 97.0 FPS while inferring on higher resolution images.

Summary

The paper presents the STDC network, which reduces redundancy by integrating feature aggregation within a single-stream architecture.
It achieves 71.9% mIoU at 250.4 FPS on the Cityscapes test set (NVIDIA GTX 1080Ti), and 76.8% mIoU at 97.0 FPS on higher-resolution inputs, trading off accuracy against real-time speed.
The design paves the way for efficient segmentation in real-time applications such as autonomous driving and embedded systems.

An Overview of "Rethinking BiSeNet For Real-time Semantic Segmentation"

The paper "Rethinking BiSeNet For Real-time Semantic Segmentation" proposes a new architecture for real-time semantic segmentation. The original BiSeNet is a popular two-stream network, but its extra path for encoding spatial information is time-consuming, and backbones borrowed from pretrained tasks such as image classification may be inefficient for segmentation because they lack task-specific design. The paper's main contribution is a new network, the Short-Term Dense Concatenate network (STDC network), which addresses both problems by removing structural redundancy.

Methodological Contributions

The principal contribution is the STDC network, which is designed to remove structural redundancy. Its basic building block is the Short-Term Dense Concatenate module (STDC module), which gradually reduces the dimension of its feature maps and aggregates them to form the image representation. The gradual reduction limits the computational cost, while aggregating the intermediate maps keeps their information in the final representation.
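To make the gradual reduction concrete, the following minimal sketch computes per-block output widths for one module. It assumes a halving schedule that this summary does not spell out: each block outputs half the channels of the previous one, and the last block repeats the previous width so that the concatenated widths sum to the module's output width. The function name `stdc_block_widths` and the default of four blocks are illustrative only.

```python
from typing import List


def stdc_block_widths(out_channels: int, num_blocks: int = 4) -> List[int]:
    """Per-block output widths under an ASSUMED halving schedule.

    Assumption: block i outputs out_channels / 2**i channels, except the
    last block, which repeats the width of the block before it so that the
    concatenated widths sum to out_channels.
    """
    if num_blocks < 2:
        raise ValueError("need at least two blocks to concatenate")
    if out_channels % (2 ** (num_blocks - 1)) != 0:
        raise ValueError("out_channels must be divisible by 2**(num_blocks - 1)")
    # Blocks 1 .. num_blocks-1 halve the width each time.
    widths = [out_channels // 2 ** i for i in range(1, num_blocks)]
    # The last block keeps the previous width.
    widths.append(widths[-1])
    return widths


widths = stdc_block_widths(256)
print(widths, sum(widths))
```

Under this assumed schedule, most of the output width comes from the first, cheapest blocks, and the concatenation restores the full width without any single wide layer late in the module.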

In contrast to the bilateral structure of the original BiSeNet, the proposed architecture learns spatial detail directly within the main network stream. The Detail Aggregation module, proposed in the decoder, integrates the learning of spatial information into the low-level layers in a single-stream manner. This removes the need for an auxiliary spatial path; the low-level and deep features are then fused to predict the final segmentation.

Experimental Validation

Experiments are conducted on two widely used datasets, Cityscapes and CamVid, which are benchmarks for urban scene segmentation and road scene segmentation, respectively. On the Cityscapes test set, the STDC network achieves a mean Intersection over Union (mIoU) of 71.9% at 250.4 frames per second (FPS) on an NVIDIA GTX 1080Ti, which the authors report as 45.2% faster than the latest methods.
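For readers unfamiliar with the metric, the sketch below shows one common way to compute mIoU from a confusion matrix: per-class IoU is TP / (TP + FP + FN), averaged over the classes that occur. The label lists are toy data, and the helper names are illustrative; the official benchmark evaluation handles details such as ignored labels that this sketch omits.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Rows index ground-truth classes, columns index predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_iou(cm: List[List[int]]) -> float:
    """Average of per-class IoU, skipping classes with no pixels at all."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # true c, predicted other
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, true other
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six "pixels", three classes (flattened label maps).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(confusion_matrix(y_true, y_pred, num_classes=3))
print(f"mIoU = {score:.4f}")
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.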

Additionally, when inferring on higher-resolution images, the network achieves 76.8% mIoU at 97.0 FPS, trading some speed for accuracy while staying well within real-time rates.
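The reported throughput figures can be turned into per-frame latencies with simple arithmetic, sketched below. The second helper reads "45.2% faster" as a ratio of throughputs, which is an assumption; the summary does not name the baseline method or its measured FPS, so the implied baseline is only an illustration.

```python
def latency_ms(fps: float) -> float:
    """Average time per frame in milliseconds for a given throughput."""
    return 1000.0 / fps


def implied_baseline_fps(fps: float, percent_faster: float) -> float:
    """Baseline throughput if `fps` is `percent_faster` percent above it.

    ASSUMPTION: "X% faster" means fps = baseline * (1 + X / 100).
    """
    return fps / (1.0 + percent_faster / 100.0)


print(f"{latency_ms(250.4):.2f} ms per frame at 250.4 FPS")
print(f"{latency_ms(97.0):.2f} ms per frame at 97.0 FPS")
print(f"{implied_baseline_fps(250.4, 45.2):.1f} FPS implied baseline")
```

This gives roughly 4.0 ms per frame for the faster setting and roughly 10.3 ms per frame for the higher-resolution setting.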

Theoretical and Practical Implications

Rethinking the architecture in this way points toward more integrated and efficient feature processing for semantic segmentation. Fusing low-level and deep features in a single-stream, detail-augmented network moves the handling of spatial detail from an external path into the main network itself. This could encourage future segmentation models to favor detail-sensitive designs that avoid redundant pathways.

Moreover, the performance gains exhibited by the STDC networks highlight the potential for real-time applications in fields such as autonomous driving and video surveillance, where processing speed and accuracy are critical.

Future Prospects

Looking ahead, the design of the STDC network opens avenues for further work. Extending the approach to other tasks such as object detection would test how general the proposed structural changes are. The implications for lightweight model design, for example in mobile and embedded applications, are also noteworthy.

In summary, the paper re-evaluates BiSeNet and arrives at a more efficient semantic segmentation framework. The STDC network improves the trade-off between inference speed and accuracy and sets the stage for further work on computation-efficient neural architectures.
