- The paper introduces SegNeXt, a convolutional architecture that uses multi-scale convolutional attention (MSCA) to outperform transformer-based methods.
- The model employs an encoder-decoder framework with depth-wise and strip convolutions to effectively aggregate local and global features.
- Experiments demonstrate that SegNeXt achieves an mIoU of 90.6% on Pascal VOC 2012 with roughly one-tenth the parameters of the previous best method (EfficientNet-L2 w/ NAS-FPN), and improves on the state of the art on ADE20K by about 2.0% mIoU.
Overview: SegNeXt - Rethinking Convolutional Attention Design for Semantic Segmentation
This paper presents SegNeXt, a convolutional architecture for semantic segmentation. The researchers argue that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism typical of transformer-based models, and they back this claim with strong empirical results across multiple benchmarks.
Architecture and Design
The SegNeXt architecture centers on the Multi-Scale Convolutional Attention (MSCA) module. Whereas most prior CNN segmentation models use convolutions purely as feature extractors, SegNeXt builds the spatial attention itself out of convolutions within an encoder-decoder framework. Each MSCA block has three parts: a depth-wise convolution that aggregates local information, multi-branch depth-wise strip convolutions that capture multi-scale context (a pair of 1xk and kx1 strip kernels approximates a large kxk kernel at a fraction of the cost and suits strip-like objects), and a 1x1 convolution that models relationships between channels. The output of the 1x1 convolution then acts as attention weights via element-wise multiplication with the block's input, extending spatial attention while maintaining computational efficiency.
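As a concrete illustration, here is a minimal PyTorch sketch of an MSCA-style block written from the paper's description. The branch kernel sizes (5, 7, 11, 21) follow the scales reported in the paper; the class and variable names and the exact wiring are illustrative choices, not the authors' reference implementation.

```python
import torch.nn as nn

class MSCA(nn.Module):
    """Multi-Scale Convolutional Attention, sketched from the paper."""
    def __init__(self, dim):
        super().__init__()
        # 5x5 depth-wise conv aggregates local information.
        self.conv_local = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        # Multi-branch depth-wise strip convolutions: each 1xk + kx1 pair
        # approximates a kxk kernel cheaply and captures a different scale.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, (1, k), padding=(0, k // 2), groups=dim),
                nn.Conv2d(dim, dim, (k, 1), padding=(k // 2, 0), groups=dim),
            )
            for k in (7, 11, 21)
        ])
        # 1x1 conv models relationships between channels.
        self.conv_mix = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.conv_local(x)
        attn = attn + sum(branch(attn) for branch in self.branches)
        attn = self.conv_mix(attn)
        # Convolutional attention: reweight the input element-wise.
        return attn * x
```

In the encoder, the paper uses MSCA in place of the self-attention sublayer of a standard transformer block, so the cost grows linearly with image size rather than quadratically.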
The network relies almost entirely on convolutional operations; the decoder aggregates multi-scale features from the encoder's later stages and applies a Hamburger module, which extracts global context through matrix decomposition. This design keeps computational complexity low, a particular advantage for high-resolution imagery in remote sensing or urban environments.
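The Hamburger module originates in Geng et al.'s "Is Attention Better Than Matrix Decomposition?" and recovers global context by factorizing the flattened feature map into a low-rank product between two 1x1 "bread" convolutions. The sketch below is a simplified rendition assuming non-negative matrix factorization (NMF) with multiplicative updates as the decomposition step; the rank, iteration count, and equal-width bread convolutions are illustrative assumptions, and the original's one-step gradient trick is omitted.

```python
import torch
import torch.nn as nn

class Hamburger(nn.Module):
    """Simplified global-context block: 1x1 'bread' convs around a
    low-rank NMF 'ham'. Hyperparameters here are illustrative."""
    def __init__(self, dim, rank=64, steps=6):
        super().__init__()
        self.lower = nn.Conv2d(dim, dim, 1)   # lower bread
        self.upper = nn.Conv2d(dim, dim, 1)   # upper bread
        self.rank, self.steps = rank, steps

    def nmf(self, x):
        # Factorize V (B, C, HW) into bases D (B, C, r) and codes C
        # (B, r, HW) with multiplicative updates, then keep only the
        # low-rank reconstruction D @ C as the global-context signal.
        b, c, h, w = x.shape
        V = x.relu().flatten(2)  # ReLU keeps V non-negative for NMF
        D = torch.rand(b, c, self.rank, device=x.device)
        C = torch.rand(b, self.rank, h * w, device=x.device)
        for _ in range(self.steps):
            C = C * (D.transpose(1, 2) @ V) / (D.transpose(1, 2) @ D @ C + 1e-6)
            D = D * (V @ C.transpose(1, 2)) / (D @ C @ C.transpose(1, 2) + 1e-6)
        return (D @ C).view(b, c, h, w)

    def forward(self, x):
        # Residual connection around the bread-ham-bread stack.
        return x + self.upper(self.nmf(self.lower(x)))
```

Because the decomposition works through a rank-r approximation rather than a full HW x HW affinity matrix, the global-context step scales linearly with the number of pixels, which is what keeps the decoder cheap at high resolution.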
Performance and Evaluation
SegNeXt posts strong results across several standard datasets, including ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID. The architecture reaches 90.6% mIoU on Pascal VOC 2012 with only about one-tenth the parameters of EfficientNet-L2 w/ NAS-FPN, the previous best method on that benchmark. It also improves on the state of the art on ADE20K by roughly 2.0% mIoU while using the same or less computation.
Implications and Future Directions
The research makes a clear case for convolutional attention in semantic segmentation. It challenges the dominance of transformer-based segmentation models by demonstrating that convolutionally grounded designs can achieve competitive, and often superior, results through efficient and effective context encoding.
In the broader context of AI developments, SegNeXt suggests a promising shift back towards convolutions through innovations such as MSCA. The paper also opens pathways to hybrid architectures that combine convolutional strengths with attention mechanisms.
Future research might explore scaling SegNeXt to larger model sizes or adapting its multi-scale attention design to other vision tasks. The findings could rejuvenate interest in convolutional networks and temper the momentum of transformer models across other computer vision challenges.
Overall, SegNeXt represents a significant advancement for those seeking efficient, scalable solutions within semantic segmentation, suggesting continued exploration and optimization of convolutional methodologies.