- The paper introduces the Center Collapse Regularizer (CeCo) to leverage neural collapse theory for mitigating class imbalance in semantic segmentation.
- It employs a dual-branch framework that aligns within-class feature centers to a simplex ETF structure, enhancing discrimination of underrepresented classes.
- Empirical results on datasets like ScanNet200 and ADE20K show significant performance improvements and compatibility with various segmentation architectures.
This paper, "Understanding Imbalanced Semantic Segmentation Through Neural Collapse" (2301.01100), explores the phenomenon of neural collapse in the context of semantic segmentation, particularly focusing on the challenges posed by imbalanced class distributions inherent in such tasks. Neural collapse, previously observed in image classification, describes the convergence of last-layer features and classifier weights to a highly symmetric structure known as a simplex equiangular tight frame (ETF) during training on balanced datasets. This structure provides maximal angular separation between class representatives, which is beneficial for discrimination.
The authors observe that this elegant neural collapse structure does not fully emerge in semantic segmentation models trained on typical datasets like ScanNet200, ADE20K, and COCO-Stuff164K. They identify two key reasons: the contextual correlation between classes in dense prediction tasks (neighboring pixels/points are often semantically related) and the significant class imbalance where some categories occupy vastly more area or points than others. The failure to achieve the symmetric ETF structure, particularly for feature centers, is shown to negatively impact the performance of minor classes, whose features and classifier vectors may end up too close to major classes.
To address this, the paper proposes a practical method called Center Collapse Regularizer (CeCo). The core idea is to encourage the within-class feature centers to converge towards a simplex ETF structure during training, thereby preserving the desirable equiangular and maximally separated properties that benefit discriminative learning, especially for underrepresented classes.
Implementation Details and Architecture:
CeCo is designed as an auxiliary training mechanism that can be easily integrated into existing semantic segmentation architectures. The overall framework consists of two branches:
- Point/Pixel Recognition Branch: This is the standard segmentation model pipeline (e.g., an FCN, UperNet, DeepLabV3+, or MinkowskiNet) that takes the input image or point cloud and outputs per-pixel/point features. This branch is trained with a standard semantic segmentation loss, typically Cross-Entropy ($L_{PR}$).
- Center Regularization Branch: This new branch operates on the output features from the main backbone.
- For each training sample, the features $z_i$ of all pixels/points are extracted.
- Based on the ground-truth labels $y_i$, the within-class mean feature vector (feature center) $\bar{z}_k$ for each class $k$ is computed by averaging the features $z_i$ belonging to that class.
- These feature centers $\bar{z}_k$ are then fed into a classifier layer in the Center Regularization Branch. Crucially, this classifier is fixed to a simplex ETF structure, constructed directly from the number of classes $K$. The fixed structure acts as a target geometry for the feature centers.
- A separate Cross-Entropy loss ($L_{CR}$) is computed between the fixed ETF classifier's predictions on the feature centers $\bar{z}_k$ and the corresponding class labels $k$.
The total training loss is a weighted sum of the two branch losses, $L_{total} = L_{PR} + \lambda L_{CR}$, where $\lambda$ is a hyperparameter balancing the two terms; a minimal sketch of this branch follows.
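Here is a minimal PyTorch sketch of the Center Regularization Branch. The helper names (`build_etf_classifier`, `center_regularization_loss`) and the batch-wise center averaging are illustrative assumptions, not the authors' reference implementation:

```python
import torch
import torch.nn.functional as F

def build_etf_classifier(num_classes: int, feat_dim: int) -> torch.Tensor:
    """Build a fixed simplex-ETF weight matrix of shape (K, d).

    Hypothetical helper using the standard neural-collapse construction;
    requires feat_dim >= num_classes so U can have orthonormal columns.
    """
    u, _ = torch.linalg.qr(torch.randn(feat_dim, num_classes))  # U: (d, K)
    p = torch.eye(num_classes) - torch.ones(num_classes, num_classes) / num_classes
    etf = (num_classes / (num_classes - 1)) ** 0.5 * (u @ p)    # (d, K)
    return etf.t()  # (K, d); kept frozen throughout training

def center_regularization_loss(feats: torch.Tensor, labels: torch.Tensor,
                               etf_weight: torch.Tensor, ignore_index: int = 255):
    """L_CR: cross-entropy of the fixed ETF classifier on within-class centers.

    feats:  (N, d) per-pixel/point features, flattened over the batch
    labels: (N,)   ground-truth class ids
    """
    valid = labels != ignore_index
    feats, labels = feats[valid], labels[valid]
    present = labels.unique()  # classes that appear in this batch
    # Within-class mean features \bar{z}_k for the present classes.
    centers = torch.stack([feats[labels == k].mean(dim=0) for k in present])
    logits = centers @ etf_weight.t()  # (P, K) logits from the fixed classifier
    return F.cross_entropy(logits, present)

# Per-step total loss, with lambda_cr playing the role of λ:
#   loss_total = loss_pr + lambda_cr * center_regularization_loss(feats, labels, etf_w)
```

Because `etf_weight` is never updated, gradients from $L_{CR}$ flow only into the backbone, pulling the within-class centers toward the fixed target geometry rather than adapting the target to the features.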
During inference, the entire Center Regularization Branch is discarded. Only the standard Point/Pixel Recognition Branch with its learned classifier is used for prediction. This ensures that CeCo adds no computational overhead during deployment.
Practical Benefits and Empirical Evidence:
The paper provides empirical evidence for the effectiveness and mechanics of CeCo:
- Reduced Imbalance for Centers: By operating on class centers rather than individual pixels/points, the effective imbalance factor is dramatically reduced (e.g., from 37,256 to 597 on ScanNet200), making the problem far more tractable for standard optimization (see the first sketch after this list).
- Improved Feature Geometry: Experiments show that models trained with CeCo exhibit feature centers that are significantly closer to the desired equiangular and maximally separated ETF structure than those of baseline models. This is quantified by the reduced standard deviation of pairwise cosines between feature centers and by their average shifting toward the ETF target (see the second sketch after this list).
- Enhanced Performance for Minor Classes: CeCo consistently improves performance, particularly on the "Common" and "Tail" categories in imbalanced datasets like ScanNet200 and ADE20K. This aligns with the theoretical motivation that better-separated feature centers improve discrimination for less frequent classes.
- Orthogonality with Other Losses: CeCo acts as a feature-level regularization and is shown to be orthogonal to and compatible with commonly used segmentation losses like Dice and Lovász losses, leading to further performance gains when combined.
- Flexibility: CeCo is demonstrated to work effectively across various backbone architectures (CNNs such as ResNet and HRNet, and Transformers such as Swin and BEiT) and segmentation heads (UperNet, OCRNet, DeepLabV3+), on both 2D image and 3D point cloud semantic segmentation tasks.
- State-of-the-Art Results: The method achieves state-of-the-art results on benchmarks like ScanNet200, significantly improving the mean IoU, especially on the tail classes.
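To make the center-level imbalance reduction concrete, here is a small Python sketch of the two counting schemes; the function names and the per-scene counting convention are illustrative assumptions, not the paper's exact measurement protocol:

```python
from collections import Counter

def imbalance_factor(counts: Counter) -> float:
    """Ratio of the most frequent to the least frequent class count."""
    return max(counts.values()) / min(counts.values())

def pixel_vs_center_imbalance(scenes):
    """scenes: iterable of dicts {class_id: pixel_or_point_count}, one per image/scene."""
    pixel_counts, center_counts = Counter(), Counter()
    for scene in scenes:
        for cls, n in scene.items():
            pixel_counts[cls] += n    # pixel level: every pixel/point contributes
            center_counts[cls] += 1   # center level: one center per class per scene
    return imbalance_factor(pixel_counts), imbalance_factor(center_counts)
```

A large "wall" class adds millions of points to `pixel_counts` but only one entry per scene to `center_counts`, which is why the center-level factor collapses by orders of magnitude.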
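The feature-geometry comparison can likewise be checked in a few lines of PyTorch; `center_cosine_stats` is a hypothetical helper, assuming the $K$ class-center vectors have already been collected:

```python
import torch

def center_cosine_stats(centers: torch.Tensor):
    """Mean and std of pairwise cosines between K class centers of shape (K, d).

    For a perfect simplex ETF, the mean approaches -1/(K-1) and the std approaches 0.
    """
    c = torch.nn.functional.normalize(centers, dim=1)
    cos = c @ c.t()
    k = centers.shape[0]
    off_diag = cos[~torch.eye(k, dtype=torch.bool)]  # drop self-similarities
    return off_diag.mean().item(), off_diag.std().item()
```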
Implementation Considerations:
- Training Overhead: The addition of the Center Regularization Branch and the computation of feature centers within each training batch increases the training time. The paper reports an increase of about 10-20% in training time per batch compared to baselines.
- Hyperparameter Tuning: The weight λ for the center collapse loss needs to be tuned, although experiments suggest that performance is consistently improved over a relatively wide range of λ values.
- Applicability: While effective for significantly imbalanced datasets with many classes, the paper notes that the benefits might be less pronounced for datasets with fewer classes or lower imbalance ratios (e.g., Cityscapes, ScanNet v2 20 classes).
The authors have made their code available at https://github.com/dvlab-research/Imbalanced-Learning, allowing practitioners to integrate this method into their own semantic segmentation pipelines. This work translates theoretical insights from neural collapse into a practical, effective regularization technique for the challenging problem of imbalanced semantic segmentation.