Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
The paper "Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers" presents a novel approach to semantic segmentation by utilizing transformers in place of traditional fully convolutional networks (FCNs). The authors assert that the reliance on convolutional encoders inherently limits the receptive field, which is problematic for effective contextual modeling. Instead, they propose treating semantic segmentation as a sequence-to-sequence prediction task, hypothesizing that transformers can address receptive field limitations more effectively.
Model Architecture
In traditional FCN-based approaches to semantic segmentation, the encoder progressively reduces spatial resolution through convolutional layers, which can hinder long-range dependency learning. The paper challenges this architecture by introducing the SEgmentation TRansformer (SETR). The proposed model reframes the problem by treating the input image as a sequence of patches, which a transformer encoder processes while capturing global context at every layer, thereby eliminating the need for progressive spatial resolution reduction.
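For concreteness, the shape bookkeeping implied by this design is sketched below, assuming a 512×512 input, 16×16 patches, and a 1024-dimensional embedding (illustrative values in the spirit of a ViT-Large-style backbone, not figures quoted in this summary):

```python
# Shape flow through the SETR encoder path (a sketch with assumed hyperparameters).
H = W = 512          # assumed input resolution
P = 16               # assumed patch size
C = 1024             # assumed embedding dimension
num_tokens = (H // P) * (W // P)        # 32 * 32 = 1024 patch tokens
# Encoder input:  (batch, num_tokens, C) -- the token count never shrinks,
# so every layer sees features at the same (H/P, W/P) spatial granularity.
# Encoder output: reshaped to (batch, C, H // P, W // P) for the decoder.
print(num_tokens)    # 1024
```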
Image Sequentialization
The initial step decomposes the image into a grid of fixed-size patches, flattens each patch into a vector, and maps it to a latent embedding space with a learned linear projection. Learnable position embeddings are added to each patch embedding so that spatial layout is preserved after flattening. The resulting sequence of vectors forms the input to the transformer encoder, which models global dependencies through self-attention mechanisms.
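A minimal PyTorch-style sketch of this sequentialization step is shown below; the patch size, embedding dimension, and module names are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches, project each patch to an
    embedding, and add learnable position embeddings (sizes are assumptions)."""
    def __init__(self, img_size=512, patch_size=16, in_chans=3, embed_dim=1024):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and
        # applying one shared linear projection to it.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, C, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)     # (B, N, C) sequence of patch tokens
        return x + self.pos_embed            # inject spatial position information
```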
Transformer Encoder
The core of SETR is a pure transformer encoder that operates on the sequence of patch embeddings. Each layer of the transformer consists of multi-head self-attention (MSA) and multilayer perceptron (MLP) blocks. The use of MSA allows the model to attend to different parts of the sequence with different heads, thereby capturing diverse contextual interactions. The output features from the transformer encoder can then be reshaped back into a spatial feature map suitable for segmentation.
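The layer structure described above can be sketched as follows; the pre-norm ordering, head count, and MLP expansion ratio are assumptions consistent with standard ViT-style encoders rather than details quoted from the paper.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One encoder layer: LayerNorm -> multi-head self-attention -> LayerNorm -> MLP,
    each wrapped in a residual connection (dimensions are illustrative)."""
    def __init__(self, dim=1024, num_heads=16, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                    # x: (B, N, C)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # global self-attention
        x = x + self.mlp(self.norm2(x))
        return x

def tokens_to_feature_map(x, h, w):
    """Reshape the (B, N, C) token sequence back to a (B, C, h, w) feature map."""
    B, N, C = x.shape
    return x.transpose(1, 2).reshape(B, C, h, w)
```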
Decoder Designs
The authors introduce three decoder architectures to evaluate their transformer-based encoder (a minimal sketch of the Naive and PUP variants follows the list):
- Naive Upsampling (Naive): This approach directly projects the transformer's output to the number of target classes and performs bilinear upsampling to generate the final segmentation map.
- Progressive Upsampling (PUP): This method restores resolution gradually by alternating convolutional layers with 2× bilinear upsampling steps, avoiding the noisy predictions that a single aggressive upsampling step can introduce.
- Multi-Level Feature Aggregation (MLA): This decoder aggregates features from several layers of the transformer encoder (e.g., evenly spaced across its depth). Unlike a traditional feature pyramid network, the aggregated features all share the same resolution, since the transformer never downsamples; they differ only in the depth at which they were computed.
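The two simpler decoders can be sketched as follows; the layer widths, number of upsampling stages, and the ADE20K-style class count of 150 are illustrative assumptions rather than the authors' exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class NaiveDecoder(nn.Module):
    """Naive: project features to class scores with a 1x1 convolution,
    then bilinearly upsample to the full image resolution."""
    def __init__(self, dim=1024, num_classes=150):
        super().__init__()
        self.classify = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, feat, out_size):       # feat: (B, C, H/16, W/16)
        logits = self.classify(feat)
        return F.interpolate(logits, size=out_size,
                             mode="bilinear", align_corners=False)

class ProgressiveUpsampleDecoder(nn.Module):
    """PUP-style: alternate a conv block with 2x bilinear upsampling four
    times to recover the 16x downsampling of the patch embedding."""
    def __init__(self, dim=1024, hidden=256, num_classes=150):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, hidden, 3, padding=1),
                          nn.BatchNorm2d(hidden), nn.ReLU(inplace=True))
            for c in (dim, hidden, hidden, hidden)
        )
        self.classify = nn.Conv2d(hidden, num_classes, kernel_size=1)

    def forward(self, feat, out_size):       # feat: (B, C, H/16, W/16)
        for block in self.blocks:            # four stages of conv + 2x upsample
            feat = F.interpolate(block(feat), scale_factor=2,
                                 mode="bilinear", align_corners=False)
        # Resize to the exact target size in case of rounding along the way.
        return F.interpolate(self.classify(feat), size=out_size,
                             mode="bilinear", align_corners=False)
```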
Experimental Results
The authors conduct extensive experiments on several benchmark datasets including ADE20K, Pascal Context, and Cityscapes. The results demonstrate that SETR establishes new state-of-the-art performance on ADE20K and Pascal Context, and competitive results on Cityscapes. Notably, the SETR model with MLA decoder achieves 50.28% mIoU on ADE20K with multi-scale inference, a substantial improvement over previous state-of-the-art results. These findings suggest that SETR's ability to model global context at each layer provides a significant performance advantage over traditional FCN-based models.
Implications and Future Directions
The theoretical and practical implications of this research are profound. By successfully applying transformers to image-based tasks, this work bridges the gap between NLP and computer vision, suggesting that self-attention can effectively replace convolution in certain contexts. This paradigm shift opens up new avenues for rethinking other vision tasks currently dominated by convolutional architectures.
Furthermore, the ability to model long-range dependencies without progressively reducing spatial resolution can inspire novel architectures beyond segmentation tasks. Future research could delve into optimizing transformer models for better computational efficiency, or explore hybrid models that combine the strengths of convolutional and transformer-based approaches.
Overall, the findings underscore the transformative potential of sequence-to-sequence models like transformers in the field of computer vision, paving the way for continued innovation in learning complex visual representations.