
SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation (2312.17071v2)

Published 28 Dec 2023 in cs.CV

Abstract: Recent real-time semantic segmentation methods usually adopt an additional semantic branch to pursue rich long-range context. However, the additional branch incurs undesirable computational overhead and slows inference speed. To eliminate this dilemma, we propose SCTNet, a single-branch CNN with transformer semantic information for real-time segmentation. SCTNet enjoys the rich semantic representations of an inference-free semantic branch while retaining the high efficiency of a lightweight single-branch CNN. SCTNet utilizes a transformer as the training-only semantic branch, considering its superb ability to extract long-range context. With the help of the proposed transformer-like CNN block (CFBlock) and the semantic information alignment module, SCTNet can capture the rich semantic information from the transformer branch during training. During inference, only the single-branch CNN needs to be deployed. We conduct extensive experiments on Cityscapes, ADE20K, and COCO-Stuff-10K, and the results show that our method achieves new state-of-the-art performance. The code and models are available at https://github.com/xzz777/SCTNet


Summary

  • The paper demonstrates that SCTNet achieves real-time semantic segmentation by pairing an efficient single-branch CNN with transformer-based semantic alignment applied only during training.
  • It introduces an innovative Conv-Former block that mimics transformer capabilities through efficient convolutional operations and grouped normalization.
  • Extensive tests on Cityscapes, ADE20K, and COCO-Stuff-10K show SCTNet achieving 80.5% mIoU at 62.8 FPS, outperforming conventional architectures.

SCTNet: A Novel Approach for Real-Time Semantic Segmentation

The paper introduces SCTNet, an innovative single-branch convolutional neural network (CNN) enhanced with transformer-based semantic information, specifically designed for real-time semantic segmentation tasks. Traditional approaches in real-time semantic segmentation often incorporate an additional semantic branch to gather comprehensive long-range context, which can unfortunately increase computational burden and reduce inference speed. SCTNet addresses this challenge by maintaining the architectural simplicity and computational efficiency of a single-branch CNN while incorporating the rich semantic information typically held by transformer models.
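The training-only auxiliary branch pattern described above can be sketched as follows. This is a minimal illustrative version, not the authors' implementation; all function names and the toy "features" are hypothetical stand-ins:

```python
# Sketch of a training-only semantic branch (illustrative only;
# not SCTNet's actual code -- all names here are hypothetical).

def cnn_branch(x):
    # Stand-in for the lightweight single-branch CNN student.
    return [xi * 0.5 for xi in x]

def transformer_branch(x):
    # Stand-in for the transformer teacher used only during training.
    return [xi * 0.5 + 0.1 for xi in x]

def alignment_loss(student_feats, teacher_feats):
    # Mean squared difference pulls student features toward the teacher's.
    return sum((s - t) ** 2 for s, t in zip(student_feats, teacher_feats)) / len(student_feats)

def training_step(x):
    # Both branches run during training; only the alignment loss
    # transfers the transformer's semantic information to the CNN.
    student = cnn_branch(x)
    teacher = transformer_branch(x)
    return alignment_loss(student, teacher)

def inference(x):
    # At inference time, only the single-branch CNN is deployed.
    return cnn_branch(x)
```

The key property is that `transformer_branch` never appears on the inference path, so its cost is paid only during training.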

Key Contributions and Methodology

The primary contributions of the research are as follows:

  1. Single-Branch Real-Time Performance: SCTNet is presented as a single-branch framework, employing a transformer branch exclusively during training to align semantic understanding. By doing so, it achieves high accuracy without the inference-time overhead typically associated with a second branch. This efficiency is particularly advantageous in applications requiring high-speed processing.
  2. Conv-Former Block (CFBlock): Central to the architecture is the CFBlock, which emulates a transformer block using convolutions. This design retains the semantically rich, long-range contextual abilities of transformers while leveraging the compute-efficiency of convolutions. The CFBlock uses grouped normalization and clever kernel design choices, facilitating an efficient yet powerful feature extraction process.
  3. Semantic Information Alignment Module (SIAM): SCTNet introduces SIAM to bridge the gap between CNN and transformer representations. This module includes Backbone Feature Alignment and Shared Decoder Head Alignment, promoting effective feature consistency and quality semantic capture between the CNN and the training-only transformer branch.
  4. Extensive Evaluation: The paper details rigorous experimentation, demonstrating SCTNet's prowess on multiple challenging datasets—Cityscapes, ADE20K, and COCO-Stuff-10K. The results consistently show SCTNet outperforming existing real-time segmentation architectures in terms of the accuracy-speed trade-off.
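The feature-alignment idea behind SIAM can be illustrated with a channel-wise normalized alignment loss. This is a sketch in the spirit of channel-wise knowledge distillation, under the assumption of a per-channel spatial softmax; it is not the paper's exact SIAM formulation:

```python
import numpy as np

def channelwise_alignment_loss(student, teacher, tau=1.0):
    """KL divergence between per-channel spatial softmax distributions.

    student, teacher: feature maps of shape (C, H, W).
    A sketch of CWD-style alignment; SIAM aligns backbone and shared
    decoder-head features in a related spirit (illustrative only).
    """
    C = student.shape[0]
    s = student.reshape(C, -1) / tau
    t = teacher.reshape(C, -1) / tau
    # Spatial softmax per channel (numerically stabilized).
    s = np.exp(s - s.max(axis=1, keepdims=True))
    s /= s.sum(axis=1, keepdims=True)
    t = np.exp(t - t.max(axis=1, keepdims=True))
    t /= t.sum(axis=1, keepdims=True)
    # KL(teacher || student), averaged over channels.
    return float(np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=1)))
```

The loss is zero when student and teacher features induce identical per-channel distributions, and positive otherwise, so minimizing it drives the CNN toward the transformer's semantic responses.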

Numerical Results and Implications

The paper reports strong results across benchmarks. For example, SCTNet-B-Seg100 reaches 80.5% mIoU at 62.8 FPS on the Cityscapes dataset, establishing SCTNet as a state-of-the-art solution that balances accuracy and efficiency for real-time processing.
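For reference, mIoU (the accuracy metric quoted in these results) averages per-class intersection-over-union across classes. A minimal computation:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt.

    pred, gt: integer label maps of the same shape.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

Benchmark implementations (e.g. the Cityscapes evaluation scripts) additionally handle ignore labels and accumulate intersections and unions over the whole dataset before dividing, but the per-class averaging shown here is the core of the metric.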

Impact and Future Prospects

The implications of SCTNet are profound, particularly for fields requiring low-latency, high-accuracy segmentation like autonomous driving and real-time scene interpretation. By integrating transformer-like capabilities into CNN frameworks, SCTNet sets a precedent for future models attempting to balance computational load with semantic richness.

In terms of theoretical significance, this work exemplifies the intersection of CNN and transformer models, exploring new ways of distilling knowledge across differing architectures. It challenges the traditional boundaries of model design by advocating for a training-only transformer branch, thus redefining how neural architectures may be conceived in the context of practical constraints.

In the future, research could further explore variant implementations of the SCTNet architecture, potentially scaling the model or specializing its components for domain-specific tasks. Additionally, applying SCTNet's principles to other computer vision problems could unveil further performance enhancements.

Overall, SCTNet significantly advances the state-of-the-art in real-time semantic segmentation by innovatively pairing the efficiencies of CNN with the contextual depth of transformers, providing a robust framework that meets the demanding needs of real-time applications.
