- The paper presents MaskFormer, a unified framework that leverages mask classification to seamlessly integrate semantic and instance segmentation.
- It employs a lightweight FPN-based pixel decoder and a Transformer decoder to generate per-segment embeddings, achieving state-of-the-art results on benchmarks such as ADE20K (mIoU) and COCO panoptic (PQ).
- Its approach reduces computational costs while scaling efficiently to large category vocabularies, simplifying both training and inference in complex real-world applications.
MaskFormer: Unifying Semantic and Instance Segmentation with Mask Classification
Semantic segmentation has traditionally been approached as a per-pixel classification problem: a classification loss is applied at every pixel, and the set of outputs is fixed in advance. While effective, this paradigm struggles with large category vocabularies and does not naturally extend to instance-level segmentation. Cheng et al. propose MaskFormer, which instead uses mask classification to unify semantic and instance-level segmentation within a single framework, removing the discrepancy between the two tasks and improving efficiency and performance, especially on complex, category-rich datasets. The contrast between the two paradigms is clearest in their training losses, compared below.
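Following the paper's formulation, per-pixel classification minimizes a cross-entropy loss at every pixel,

$$\mathcal{L}_{\text{pixel-cls}} = \sum_{h,w} -\log p_{h,w}\!\left(y^{\text{gt}}_{h,w}\right),$$

whereas mask classification matches $N$ predicted (mask, label) pairs to ground-truth segments through a bipartite matching $\sigma$ and supervises each matched pair:

$$\mathcal{L}_{\text{mask-cls}} = \sum_{j=1}^{N} \left[-\log p_{\sigma(j)}\!\left(c^{\text{gt}}_{j}\right) + \mathbb{1}_{c^{\text{gt}}_{j} \neq \varnothing}\, \mathcal{L}_{\text{mask}}\!\left(m_{\sigma(j)}, m^{\text{gt}}_{j}\right)\right].$$

Here $\varnothing$ is the "no object" label and $\mathcal{L}_{\text{mask}}$ is a binary mask loss (the paper uses a combination of focal and dice losses). Because the number of predictions $N$ is decoupled from the number of classes, the same loss serves both semantic and instance-level tasks.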
MaskFormer Model
MaskFormer adopts the mask classification paradigm, predicting a set of binary masks, each paired with a single class label. This enables a unified treatment of semantic and instance segmentation, simplifying both model design and training. The model consists of three parts:
- Pixel-Level Module: Extracts per-pixel embeddings with a lightweight FPN-based pixel decoder on top of a backbone.
- Transformer Module: Utilizes six Transformer decoder layers to produce per-segment embeddings from image features and learnable query embeddings.
- Segmentation Module: Generates classification scores and mask embeddings from per-segment embeddings. The mask embeddings are then used to create binary masks via a dot product with per-pixel embeddings.
This architecture underscores the flexibility of mask classification, allowing a dynamic number of predictions and surpassing per-pixel classification models in both efficiency and accuracy. A minimal sketch of the segmentation module follows.
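For concreteness, here is a minimal PyTorch sketch of the segmentation module. Names, layer sizes, and the MLP depth are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class SegmentationModule(nn.Module):
    """Sketch of MaskFormer's segmentation module (shapes are assumptions)."""

    def __init__(self, embed_dim: int = 256, num_classes: int = 150):
        super().__init__()
        # Linear classifier over per-segment embeddings; +1 for "no object".
        self.classifier = nn.Linear(embed_dim, num_classes + 1)
        # Small MLP mapping per-segment embeddings to mask embeddings.
        self.mask_mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, per_segment: torch.Tensor, per_pixel: torch.Tensor):
        # per_segment: (B, N, C) from the Transformer decoder.
        # per_pixel:   (B, C, H, W) from the pixel decoder.
        class_logits = self.classifier(per_segment)            # (B, N, K+1)
        mask_embed = self.mask_mlp(per_segment)                # (B, N, C)
        # Binary mask logits: dot product between each mask embedding
        # and the per-pixel embedding at every spatial location.
        mask_logits = torch.einsum("bnc,bchw->bnhw", mask_embed, per_pixel)
        return class_logits, mask_logits

# Example shapes: 100 queries, 256-d embeddings, a 64x64 feature map.
module = SegmentationModule()
cls, masks = module(torch.randn(2, 100, 256), torch.randn(2, 256, 64, 64))
assert cls.shape == (2, 100, 151) and masks.shape == (2, 100, 64, 64)
```

Because the mask head is just a dot product, adding classes only widens the classifier's output, which is part of why the approach scales to large vocabularies.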
Numerical Performance
MaskFormer's evaluation across multiple datasets shows its robustness. On ADE20K, it sets a new state of the art with 55.6 mIoU, outperforming the best per-pixel classification approaches such as Swin-UperNet. Compared with per-pixel models, it also uses fewer parameters and less compute, and its semantic inference reduces to a simple matrix multiplication, sketched below.
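The sketch below follows the paper's general semantic inference rule, which scores each class at each pixel as $\sum_i p_i(c)\, m_i[h,w]$; the function name and shapes are my own:

```python
import torch

def semantic_inference(class_logits: torch.Tensor,
                       mask_logits: torch.Tensor) -> torch.Tensor:
    # class_logits: (B, N, K+1); mask_logits: (B, N, H, W).
    probs = class_logits.softmax(dim=-1)[..., :-1]  # drop "no object": (B, N, K)
    masks = mask_logits.sigmoid()                   # (B, N, H, W)
    # Per-pixel class scores: sum_i p_i(c) * m_i[h, w].
    scores = torch.einsum("bnk,bnhw->bkhw", probs, masks)
    return scores.argmax(dim=1)                     # (B, H, W) label map
```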
Semantic Segmentation:
- On ADE20K, MaskFormer achieves 46.7 mIoU with ResNet-50 and 55.6 mIoU with Swin-L, marking a notable advancement over previous state-of-the-art methods.
- On ADE20K-Full, which includes 847 classes, MaskFormer outperforms its per-pixel counterparts by 3.5 mIoU, demonstrating superior handling of large vocabularies.
- Performance on Cityscapes (which has fewer categories) is on par with leading per-pixel models; a PQ-style analysis shows higher Recognition Quality (RQ), meaning MaskFormer recognizes regions better, while its per-pixel Segmentation Quality (SQ) lags slightly.
Panoptic Segmentation:
- On COCO panoptic, MaskFormer achieves 52.7 PQ with Swin-L, surpassing prior approaches such as DETR and Max-DeepLab and demonstrating its applicability to panoptic segmentation.
Implications of the Research
MaskFormer's mask classification method presents several significant implications:
- Unified Framework: It simplifies the model design by unifying semantic and instance-level segmentation, reducing the complexity of maintaining separate models.
- Scalability: The model handles datasets with a large number of categories more efficiently, making it suitable for real-world applications where class diversity is substantial.
- Efficient Training and Inference: MaskFormer reduces the number of parameters and FLOPs, leading to faster training and inference cycles, which is crucial in production environments.
Future Developments in AI
MaskFormer's success opens several avenues for future research and application in AI:
- Extended Applications: Beyond traditional image segmentation tasks, MaskFormer's principles can be extended to medical imaging, autonomous driving, and other domains requiring fine-grained image analysis.
- Improved Architectures: Continued refinement of Transformer-based architectures and mask embedding modules could further enhance performance and efficiency.
- Hybrid Models: Combining mask classification with other techniques, such as graph-based segmentation or multi-scale feature aggregation, may provide additional performance gains.
- Robustness and Adaptability: Future models can focus on enhancing robustness to various image conditions and adaptability to different segmentation tasks without retraining.
Conclusion
MaskFormer by Cheng et al. represents a significant shift in image segmentation paradigms, offering a unified and efficient approach to both semantic and instance-level segmentation. Its robust performance across diverse datasets, coupled with efficiency in training and inference, illustrates the potential of mask classification to replace per-pixel classification in various real-world applications. As AI continues to evolve, MaskFormer sets a new benchmark for segmentation tasks, encouraging further research and development in this promising direction.