Accurate classification of generated masks under ambiguous boundaries and class imbalance

Determine how to achieve accurate classification of generated masks produced by mask-based image segmentation models when object boundaries are ambiguous and the class distribution is imbalanced.

Background

Mask-based segmentation methods (e.g., MaskFormer, Mask2Former, InternImage, OneFormer) generate high-quality masks by leveraging global context, but their classification heads often underperform, especially when objects have fuzzy or overlapping boundaries and when datasets exhibit severe class imbalance. This limits overall segmentation quality across semantic, instance, and panoptic tasks.

The paper proposes ViT-P, a decoupled, point-based classification module built on Vision Transformers that classifies each mask via its central point, and reports improvements on ADE20K, Cityscapes, and COCO. Despite these gains, the authors explicitly identify robust mask classification under ambiguity and imbalance as an open challenge that motivates continued research.
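
To make "classifying a mask via its central point" concrete, the sketch below picks one representative point for a binary mask. It is an illustration only: the function name `mask_central_point`, the list-of-rows mask format, and the rule of snapping the centroid to the nearest foreground pixel are all assumptions made here, not the point-selection procedure defined in the paper.

```python
def mask_central_point(mask):
    """Return (row, col) of the foreground pixel closest to the mask centroid.

    `mask` is a list of rows holding 0/1 values (hypothetical input format).
    Returns None when the mask has no foreground pixels.
    """
    # Collect the coordinates of every foreground pixel.
    pixels = [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value
    ]
    if not pixels:
        return None

    # Centroid of the foreground pixels (may fall outside a non-convex mask).
    centroid_r = sum(r for r, _ in pixels) / len(pixels)
    centroid_c = sum(c for _, c in pixels) / len(pixels)

    # Assumption: snap to the nearest foreground pixel so the chosen point
    # always lies on the mask, even for ring-shaped or fragmented regions.
    return min(
        pixels,
        key=lambda p: (p[0] - centroid_r) ** 2 + (p[1] - centroid_c) ** 2,
    )
```

For a ring-shaped mask such as `[[1, 1, 1], [1, 0, 1], [1, 1, 1]]`, the centroid falls in the hole, so the function returns a neighbouring foreground pixel instead. A point chosen this way could then be passed, together with the image, to a point-conditioned classifier; how that classifier behaves under ambiguous boundaries and class imbalance is exactly the open question stated above.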

References

However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge.

The Missing Point in Vision Transformers for Universal Image Segmentation (2505.19795 - Shahabodini et al., 26 May 2025) in Abstract (first paragraph)