- The paper introduces SegFormer, a Transformer framework that employs a positional-encoding-free hierarchical encoder for enhanced multi-scale feature extraction.
- The paper proposes an all-MLP decoder that efficiently aggregates local and global context while reducing computational complexity.
- The paper demonstrates state-of-the-art accuracy and robustness on benchmarks like ADE20K and Cityscapes with fewer parameters.
An Expert Overview of "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers"
The paper "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers" introduces an innovative framework for semantic segmentation that combines Transformers with lightweight multilayer perceptron (MLP) decoders. This framework, named SegFormer, focuses on a simplistic yet powerful methodology to bridge the gap between high accuracy and efficiency in semantic segmentation tasks.
Key Features and Contributions
SegFormer distinguishes itself through two primary features: a novel Transformer encoder architecture and a streamlined MLP decoder. Here are the main contributions:
- Hierarchical Transformer Encoder:
- Positional-Encoding-Free: The encoder does not require positional encoding, which Transformers traditionally use to retain spatial information. Instead, a 3×3 convolution inside each feed-forward block (Mix-FFN) supplies implicit positional cues, so SegFormer generalizes better to test resolutions that differ from training and avoids the performance drop caused by interpolating positional embeddings (see the Mix-FFN sketch after this list).
- Multiscale Feature Extraction: The encoder employs a hierarchical structure to generate features at different scales, accommodating both fine and coarse image details. This multiscale approach addresses the single-scale limitation of prior models such as ViT and SETR.
- All-MLP Decoder:
- Lightweight Design: SegFormer's decoder is composed solely of MLP (linear) layers. The key insight is that the hierarchical encoder already produces rich, multiscale contextual features, so the decoder can remain small while staying effective, significantly reducing computational complexity (a sketch follows this list).
- Local and Global Context Aggregation: The decoder fuses features from multiple encoder stages, combining the more local attention of lower layers with the more global attention of higher layers to render powerful representations.
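To make the positional-encoding-free design concrete, here is a minimal PyTorch sketch of the Mix-FFN idea: a 3×3 depth-wise convolution inside the feed-forward block supplies implicit positional information. The class name, expansion ratio, and the omission of block-level normalization are illustrative simplifications, not the official implementation:

```python
import torch.nn as nn

class MixFFN(nn.Module):
    """Sketch of a Mix-FFN block: the depth-wise 3x3 conv leaks enough
    positional information that explicit positional encodings can be dropped."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        # Depth-wise 3x3 conv provides implicit positional information.
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence from one encoder stage, with N = H * W.
        residual = x
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)  # tokens -> 2D feature map
        x = self.dwconv(x)
        x = x.flatten(2).transpose(1, 2)           # back to token sequence
        x = self.fc2(self.act(x))
        return x + residual

# Usage: blk = MixFFN(dim=64); y = blk(tokens, H=32, W=32)  # tokens: (B, 1024, 64)
```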
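Similarly, a minimal sketch of the all-MLP decoder's aggregation step, assuming MiT-B0-like stage channel sizes (32, 64, 160, 256) and an embedding width of 256; the class name and dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoder(nn.Module):
    """Sketch of an all-MLP decode head: unify channels per stage with a
    linear layer, upsample everything to 1/4 resolution, concatenate,
    fuse, and predict per-pixel classes."""
    def __init__(self, in_dims=(32, 64, 160, 256), embed_dim=256, num_classes=150):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, embed_dim) for d in in_dims)
        self.fuse = nn.Linear(4 * embed_dim, embed_dim)
        self.classify = nn.Linear(embed_dim, num_classes)

    def forward(self, feats):
        # feats: stage outputs at strides 4, 8, 16, 32, each (B, Ci, Hi, Wi).
        target = feats[0].shape[2:]                 # 1/4 resolution
        ups = []
        for f, proj in zip(feats, self.proj):
            B, C, H, W = f.shape
            f = proj(f.flatten(2).transpose(1, 2))  # (B, Hi*Wi, embed_dim)
            f = f.transpose(1, 2).reshape(B, -1, H, W)
            f = F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            ups.append(f)
        x = torch.cat(ups, dim=1)                   # (B, 4*embed_dim, H/4, W/4)
        x = self.fuse(x.flatten(2).transpose(1, 2))
        x = self.classify(x)                        # (B, N, num_classes)
        B, N, K = x.shape
        return x.transpose(1, 2).reshape(B, K, *target)  # per-pixel logits
```

Because every operation here is a linear layer plus parameter-free upsampling, the decode head stays cheap regardless of how large the encoder grows.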
Numerical Results and Performance
The paper showcases SegFormer across several benchmarks, including ADE20K, Cityscapes, and COCO-Stuff. Notably, SegFormer demonstrates robust performance and efficiency improvements over state-of-the-art models.
- ADE20K: SegFormer scales from the lightweight B0 up to the large B5. SegFormer-B4 achieves 50.3% mIoU with 64M parameters, being 5× smaller and 2.2% better than the previous best method, and SegFormer-B5 sets a new state of the art at 51.8% mIoU.
- Cityscapes: SegFormer-B5 reaches 84.0% mIoU on the validation set, 1.8% mIoU better than prior methods such as SETR while being far more computationally efficient.
Implications and Future Directions
Practical Implications: The design of SegFormer points towards significant improvements in real-time semantic segmentation applications, particularly where computational resources are limited. The ability to maintain high performance with reduced model size and computational demand makes it apt for deployment in edge devices and real-time systems such as autonomous driving and augmented reality.
Theoretical Implications: The introduction of positional-encoding-free encoders marks a significant deviation from traditional Transformer architectures. This could spur further research into more adaptable and scalable Transformer designs that maintain robustness across varying input resolutions.
Robustness Analysis
The paper also provides a robustness evaluation on Cityscapes-C, a version of the Cityscapes validation set expanded with algorithmically generated corruptions. SegFormer shows markedly higher resilience to these perturbations than convolution-based baselines, supporting its suitability for safety-critical applications.
Conclusion and Future Work
SegFormer represents a step forward in Transformer-based architectures for dense prediction tasks. By simplifying the decoder and structuring the encoder hierarchically, it strikes a strong balance between speed and accuracy, setting new benchmarks in semantic segmentation. Future research could pursue further optimization for edge devices and extend the hierarchical design principles to other computer vision tasks; integration with advanced post-processing techniques or hardware acceleration such as TensorRT could push the efficiency boundary even further.
In summary, SegFormer offers a compelling and efficient approach to semantic segmentation, promising to serve as a solid baseline for future innovations in the field.