Superpixel Transformers for Efficient Semantic Segmentation (2309.16889v2)
Abstract: Semantic segmentation, which aims to classify every pixel in an image, is a key task in machine perception, with many applications across robotics and autonomous driving. Due to the high dimensionality of this task, most existing approaches use local operations, such as convolutions, to generate per-pixel features. However, these methods are typically unable to effectively leverage global context information because of the high computational cost of operating on a dense image. In this work, we address this issue by leveraging the idea of superpixels, an over-segmentation of the image, within a modern transformer framework. In particular, our model learns to decompose the pixel space into a spatially low-dimensional superpixel space via a series of local cross-attentions. We then apply multi-head self-attention to the superpixels to enrich their features with global context and directly produce a class prediction for each superpixel. Finally, we project the superpixel class predictions back into the pixel space using the associations between the superpixels and the image pixel features. Reasoning in the superpixel space makes our method substantially more computationally efficient than convolution-based decoders, yet the rich superpixel features generated by the global self-attention mechanism allow it to achieve state-of-the-art performance in semantic segmentation. Our experiments on Cityscapes and ADE20K demonstrate that our method matches the state of the art in accuracy while outperforming it in model parameters and latency.
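The abstract describes a three-stage decoder: pixels are aggregated into superpixels via cross-attention, the superpixels are enriched with global self-attention and classified, and the superpixel predictions are projected back to pixels through soft associations. The PyTorch sketch below is a minimal, hypothetical illustration of that pipeline under stated simplifications: it uses a single global cross-attention in place of the paper's series of local cross-attentions, and the module names, dimensions, and dot-product association used for the final projection are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SuperpixelTransformerSketch(nn.Module):
    """Illustrative sketch of the superpixel-transformer idea from the abstract:
    pixels -> superpixels (cross-attention), global self-attention over
    superpixels, per-superpixel class logits, and a soft projection back to
    pixels. Sizes and structure are assumptions for illustration only."""

    def __init__(self, dim=256, num_superpixels=256, num_classes=19, num_heads=8):
        super().__init__()
        # Learned initial superpixel queries (hypothetical initialization).
        self.superpixel_queries = nn.Parameter(torch.randn(num_superpixels, dim))
        # Cross-attention: superpixel queries attend to dense pixel features.
        # (The paper uses a series of *local* cross-attentions; this is global.)
        self.pixel_to_superpixel = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Global self-attention over the small set of superpixels.
        self.superpixel_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Per-superpixel classifier.
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, pixel_features):
        # pixel_features: (B, H*W, dim) flattened per-pixel backbone features.
        b = pixel_features.shape[0]
        queries = self.superpixel_queries.unsqueeze(0).expand(b, -1, -1)

        # 1) Decompose the pixel space into superpixel features.
        superpixels, _ = self.pixel_to_superpixel(queries, pixel_features, pixel_features)

        # 2) Enrich superpixels with global context.
        superpixels, _ = self.superpixel_self_attn(superpixels, superpixels, superpixels)

        # 3) Predict a class for each superpixel.
        superpixel_logits = self.classifier(superpixels)            # (B, S, C)

        # 4) Soft pixel-superpixel association, then project logits back to pixels.
        association = torch.softmax(
            pixel_features @ superpixels.transpose(1, 2), dim=-1)   # (B, H*W, S)
        pixel_logits = association @ superpixel_logits               # (B, H*W, C)
        return pixel_logits


if __name__ == "__main__":
    model = SuperpixelTransformerSketch()
    feats = torch.randn(2, 64 * 64, 256)   # dummy backbone features for a 64x64 grid
    print(model(feats).shape)               # torch.Size([2, 4096, 19])
```

The key efficiency point the abstract makes is visible in the sketch: self-attention is applied over a few hundred superpixels rather than tens of thousands of pixels, so the quadratic attention cost is paid in the small superpixel space.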
- Alex Zihao Zhu
- Jieru Mei
- Siyuan Qiao
- Hang Yan
- Yukun Zhu
- Liang-Chieh Chen
- Henrik Kretzschmar