Analysis of "SeMask: Semantically Masked Transformers for Semantic Segmentation"
This paper introduces the SeMask framework, which augments transformer-based architectures for semantic segmentation by injecting semantic context during the encoding phase. Conventional approaches fine-tune backbones pre-trained on image-classification datasets such as ImageNet, which can be sub-optimal for dense-prediction tasks like semantic segmentation. SeMask addresses this gap by adding semantic awareness to hierarchical transformer encoders, thereby improving segmentation performance.
The proposed SeMask framework modifies the standard hierarchical transformer by inserting a Semantic Layer after each Transformer Layer. Each Semantic Layer consists of SeMask Blocks that apply a semantic attention operation: the block extracts semantic (class-level) information from the feature map and uses it to guide the refinement of those same features, yielding contextually richer representations and more precise semantic masks for segmentation.
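The sketch below illustrates the general idea of such a semantically gated refinement in PyTorch. It is a minimal illustration under stated assumptions (the class name SeMaskBlock, the sigmoid gating, and the per-pixel projection to num_classes logits are choices made for this sketch), not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SeMaskBlock(nn.Module):
    """Illustrative block: refine transformer tokens with a class-aware semantic gate."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.to_semantic = nn.Linear(dim, num_classes)    # per-token class logits (semantic map)
        self.from_semantic = nn.Linear(num_classes, dim)  # project semantic context back to feature dim

    def forward(self, x: torch.Tensor):
        # x: (B, N, C) tokens from one encoder stage
        sem_logits = self.to_semantic(self.norm(x))            # (B, N, num_classes)
        gate = torch.sigmoid(self.from_semantic(sem_logits))   # (B, N, C) semantic gating weights
        x = x + x * gate                                       # semantically refined features
        return x, sem_logits                                   # logits can also be decoded/supervised

# Example usage with hypothetical stage dimensions:
block = SeMaskBlock(dim=96, num_classes=150)
tokens = torch.randn(2, 56 * 56, 96)      # e.g. stage-1 tokens of a 224x224 image
refined, sem_logits = block(tokens)
```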
The empirical results presented in the paper support the efficacy of SeMask. Integrating it into the Swin Transformer and the Mix Transformer, the authors demonstrate considerable improvements over baseline models on multiple datasets, including ADE20K and Cityscapes. Notably, with the Swin Transformer serving as the backbone, SeMask achieves state-of-the-art performance, reaching an mIoU of 58.25% on ADE20K and improvements exceeding 3% on Cityscapes. These gains come with only a modest increase in computational cost (FLOPs), indicating a favorable accuracy-efficiency trade-off.
Several experiments show that the benefits of SeMask hold across model sizes (Tiny through Large Swin Transformer variants) and pre-training configurations. Adding semantic context consistently improves segmentation under varied conditions, including different pre-training datasets and input resolutions.
The paper also covers implementation details relevant to replication and future extension. For instance, using Semantic-FPN as the decoder shows that the semantically enhanced encoder integrates with existing architectures in a straightforward way, as sketched below, and the authors emphasize that SeMask Blocks are adaptable and thus potentially applicable to other hierarchical vision transformers.
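To make the decoder-side integration concrete, here is a minimal sketch of an FPN-style segmentation head consuming multi-scale encoder features. The class name TinyFPNHead, the channel sizes, and the simple lateral-conv-plus-sum fusion are assumptions for illustration; Semantic-FPN itself is more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPNHead(nn.Module):
    """Illustrative FPN-style head: fuse multi-scale features, predict per-pixel class logits."""
    def __init__(self, in_dims, embed_dim, num_classes):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in in_dims)
        self.classify = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) stage outputs, finest resolution first
        target = feats[0].shape[-2:]
        fused = sum(
            F.interpolate(lat(f), size=target, mode="bilinear", align_corners=False)
            for lat, f in zip(self.lateral, feats)
        )
        return self.classify(fused)

# Example usage with hypothetical Swin-Tiny-like stage dimensions:
head = TinyFPNHead(in_dims=[96, 192, 384, 768], embed_dim=256, num_classes=150)
feats = [torch.randn(2, c, s, s) for c, s in zip([96, 192, 384, 768], [56, 28, 14, 7])]
logits = head(feats)  # (2, 150, 56, 56); upsampled to the input resolution afterwards
```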
In terms of theoretical implications, SeMask challenges the assumption that backbones pre-trained for classification, however strong their feature extraction, suffice for pixel-level tasks without further refinement for semantic understanding. This opens avenues for exploring similar interventions in other downstream tasks that could benefit from integrating domain-specific priors.
A natural next step would be to apply semantically enriched models in real-time settings, where the trade-off between processing speed and accuracy is critical. Extending the framework to contrastive or adversarial training could also inform generative models and their applications in areas such as video segmentation.
In conclusion, the SeMask framework significantly improves semantic segmentation by embedding semantic priors within transformer backbones, and its broader implication is that foundational transformer models can and should be adapted to domain-specific requirements in computer vision, paving the way for further exploration and refinement.