SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation (2312.17071v2)
Abstract: Recent real-time semantic segmentation methods usually adopt an additional semantic branch to capture rich long-range context. However, the additional branch incurs undesirable computational overhead and slows inference. To resolve this dilemma, we propose SCTNet, a single-branch CNN with transformer semantic information for real-time segmentation. SCTNet enjoys the rich semantic representations of an inference-free semantic branch while retaining the high efficiency of a lightweight single-branch CNN. SCTNet uses a transformer as the training-only semantic branch because of its strong ability to extract long-range context. With the help of the proposed transformer-like CNN block, CFBlock, and the semantic information alignment module, SCTNet can capture the rich semantic information from the transformer branch during training. During inference, only the single-branch CNN needs to be deployed. We conduct extensive experiments on Cityscapes, ADE20K, and COCO-Stuff-10K, and the results show that our method achieves new state-of-the-art performance. The code and models are available at https://github.com/xzz777/SCTNet
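
The training/inference split described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the toy `cnn_branch` and `transformer_branch` functions, the `forward` helper, and the choice of mean squared error as the alignment objective are all assumptions made for illustration, since the abstract does not specify the alignment loss. The point it shows is structural: the transformer branch contributes a loss term during training and is never evaluated at inference.

```python
from typing import List, Optional, Tuple

Vector = List[float]


def cnn_branch(x: Vector) -> Vector:
    """Stand-in for the single-branch CNN (hypothetical toy: a fixed affine map)."""
    return [0.5 * v + 0.1 for v in x]


def transformer_branch(x: Vector) -> Vector:
    """Stand-in for the training-only transformer branch (hypothetical toy)."""
    mean = sum(x) / len(x)  # crude "global context": mix each value with the mean
    return [0.5 * v + 0.5 * mean for v in x]


def alignment_loss(student: Vector, teacher: Vector) -> float:
    """Assumed alignment objective: mean squared error between the two features."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def forward(x: Vector, training: bool) -> Tuple[Vector, Optional[float]]:
    """Return CNN features, plus an alignment loss only when training."""
    student = cnn_branch(x)
    if not training:
        # Inference path: the transformer branch is never evaluated.
        return student, None
    teacher = transformer_branch(x)
    return student, alignment_loss(student, teacher)


if __name__ == "__main__":
    print(forward([1.0, 2.0, 3.0], training=True))
    print(forward([1.0, 2.0, 3.0], training=False))
```

In a real system the alignment term would be added to the segmentation loss and back-propagated into the CNN only; here it is simply returned so the two code paths are easy to compare.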