
SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation (2312.17071v2)

Published 28 Dec 2023 in cs.CV

Abstract: Recent real-time semantic segmentation methods usually adopt an additional semantic branch to pursue rich long-range context. However, the additional branch incurs undesirable computational overhead and slows inference speed. To eliminate this dilemma, we propose SCTNet, a single-branch CNN with transformer semantic information for real-time segmentation. SCTNet enjoys the rich semantic representations of an inference-free semantic branch while retaining the high efficiency of a lightweight single-branch CNN. SCTNet utilizes a transformer as the training-only semantic branch, considering its superb ability to extract long-range context. With the help of the proposed transformer-like CNN block (CFBlock) and the semantic information alignment module, SCTNet can capture the rich semantic information from the transformer branch during training. During inference, only the single-branch CNN needs to be deployed. We conduct extensive experiments on Cityscapes, ADE20K, and COCO-Stuff-10K, and the results show that our method achieves new state-of-the-art performance. The code and models are available at https://github.com/xzz777/SCTNet


Summary

  • The paper demonstrates that SCTNet achieves real-time semantic segmentation by pairing an efficient single-branch CNN with transformer-based semantic alignment applied only during training.
  • It introduces an innovative Conv-Former block that mimics transformer capabilities through efficient convolutional operations and grouped normalization.
  • Extensive tests on Cityscapes, ADE20K, and COCO-Stuff-10K show SCTNet achieving 80.5% mIoU at 62.8 FPS, outperforming conventional architectures.

SCTNet: A Novel Approach for Real-Time Semantic Segmentation

The paper introduces SCTNet, an innovative single-branch convolutional neural network (CNN) enhanced with transformer-based semantic information, specifically designed for real-time semantic segmentation tasks. Traditional approaches in real-time semantic segmentation often incorporate an additional semantic branch to gather comprehensive long-range context, which can unfortunately increase computational burden and reduce inference speed. SCTNet addresses this challenge by maintaining the architectural simplicity and computational efficiency of a single-branch CNN while incorporating the rich semantic information typically held by transformer models.
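The training-only auxiliary branch pattern described above can be sketched as follows. This is a minimal illustrative version, not the authors' implementation; all function names and the toy "features" are hypothetical stand-ins:

```python
# Sketch of a training-only semantic branch (illustrative only;
# not SCTNet's actual code -- all names here are hypothetical).

def cnn_branch(x):
    # Stand-in for the lightweight single-branch CNN student.
    return [xi * 0.5 for xi in x]

def transformer_branch(x):
    # Stand-in for the transformer teacher used only during training.
    return [xi * 0.5 + 0.1 for xi in x]

def alignment_loss(student_feats, teacher_feats):
    # Mean squared difference pulls student features toward the teacher's.
    return sum((s - t) ** 2 for s, t in zip(student_feats, teacher_feats)) / len(student_feats)

def training_step(x):
    # Both branches run during training; only the alignment loss
    # transfers the transformer's semantic information to the CNN.
    student = cnn_branch(x)
    teacher = transformer_branch(x)
    return alignment_loss(student, teacher)

def inference(x):
    # At inference time, only the single-branch CNN is deployed.
    return cnn_branch(x)
```

The key property is that `transformer_branch` never appears on the inference path, so its cost is paid only during training.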

Key Contributions and Methodology

The primary contributions of the research are as follows:

  1. Single-Branch Real-Time Performance: SCTNet is presented as a single-branch framework, employing a transformer branch exclusively during training to align semantic understanding. By doing so, it achieves high accuracy without the inference-time overhead typically associated with a second branch. This efficiency is particularly advantageous in applications requiring high-speed processing.
  2. Conv-Former Block (CFBlock): Central to the architecture is the CFBlock, which emulates a transformer block using convolutions. This design retains the semantically rich, long-range contextual abilities of transformers while leveraging the compute-efficiency of convolutions. The CFBlock uses grouped normalization and clever kernel design choices, facilitating an efficient yet powerful feature extraction process.
  3. Semantic Information Alignment Module (SIAM): SCTNet introduces SIAM to bridge the gap between CNN and transformer representations. This module includes Backbone Feature Alignment and Shared Decoder Head Alignment, promoting effective feature consistency and quality semantic capture between the CNN and the training-only transformer branch.
  4. Extensive Evaluation: The paper details rigorous experimentation, demonstrating SCTNet's prowess on multiple challenging datasets—Cityscapes, ADE20K, and COCO-Stuff-10K. The results consistently show SCTNet outperforming existing real-time segmentation architectures in terms of the accuracy-speed trade-off.
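The feature-alignment idea behind SIAM can be illustrated with a channel-wise normalized alignment loss. This is a sketch in the spirit of channel-wise knowledge distillation, under the assumption of a per-channel spatial softmax; it is not the paper's exact SIAM formulation:

```python
import numpy as np

def channelwise_alignment_loss(student, teacher, tau=1.0):
    """KL divergence between per-channel spatial softmax distributions.

    student, teacher: feature maps of shape (C, H, W).
    A sketch of CWD-style alignment; SIAM aligns backbone and shared
    decoder-head features in a related spirit (illustrative only).
    """
    C = student.shape[0]
    s = student.reshape(C, -1) / tau
    t = teacher.reshape(C, -1) / tau
    # Spatial softmax per channel (numerically stabilized).
    s = np.exp(s - s.max(axis=1, keepdims=True))
    s /= s.sum(axis=1, keepdims=True)
    t = np.exp(t - t.max(axis=1, keepdims=True))
    t /= t.sum(axis=1, keepdims=True)
    # KL(teacher || student), averaged over channels.
    return float(np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=1)))
```

The loss is zero when student and teacher features induce identical per-channel distributions, and positive otherwise, so minimizing it drives the CNN toward the transformer's semantic responses.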

Numerical Results and Implications

The paper reports strong results across benchmarks. For example, SCTNet-B-Seg100 reaches 80.5% mIoU at 62.8 FPS on the Cityscapes dataset, establishing SCTNet as a state-of-the-art solution that balances accuracy and efficiency for real-time processing.
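For reference, mIoU (the accuracy metric quoted in these results) averages per-class intersection-over-union across classes. A minimal computation:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt.

    pred, gt: integer label maps of the same shape.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

Benchmark implementations (e.g. the Cityscapes evaluation scripts) additionally handle ignore labels and accumulate intersections and unions over the whole dataset before dividing, but the per-class averaging shown here is the core of the metric.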

Impact and Future Prospects

The implications of SCTNet are profound, particularly for fields requiring low-latency, high-accuracy segmentation like autonomous driving and real-time scene interpretation. By integrating transformer-like capabilities into CNN frameworks, SCTNet sets a precedent for future models attempting to balance computational load with semantic richness.

In terms of theoretical significance, this work exemplifies the intersection of CNN and transformer models, exploring new ways of distilling knowledge across differing architectures. It challenges the traditional boundaries of model design by advocating for a training-only transformer branch, thus redefining how neural architectures may be conceived in the context of practical constraints.

In the future, research could further explore variant implementations of the SCTNet architecture, potentially scaling the model or specializing its components for domain-specific tasks. Additionally, applying SCTNet's principles to other computer vision problems could unveil further performance enhancements.

Overall, SCTNet significantly advances the state-of-the-art in real-time semantic segmentation by innovatively pairing the efficiencies of CNN with the contextual depth of transformers, providing a robust framework that meets the demanding needs of real-time applications.
