Class Center Block (CCB) in Semantic Segmentation
- Class Center Block (CCB) is a module that extracts categorical context by aggregating feature activations weighted by class probability distributions.
- It integrates into ACFNet by producing per-class feature aggregates and redistributing them with a class attention mechanism to refine pixelwise predictions.
- Empirical results on datasets like Cityscapes demonstrate that incorporating the CCB significantly improves mIoU, yielding state-of-the-art segmentation performance.
The Class Center Block (CCB) is a module designed to extract categorical context for semantic segmentation by representing image-wide, class-level features as weighted averages of feature activations. Proposed as a core component of the Attentional Class Feature (ACF) module in ACFNet, the CCB introduces global, class-driven contextual information into pixelwise prediction pipelines. In contrast to prior context modules, which are primarily oriented toward spatial relations, the CCB uses the probability distributions over class assignments (produced by a coarse prediction head) to compute per-class feature aggregates, which then enhance semantic discrimination in subsequent network stages (Zhang et al., 2019).
1. Mathematical Definition of the Class Center Block
Given a high-level feature map $F \in \mathbb{R}^{C \times H \times W}$ and a coarse segmentation probability map $P \in \mathbb{R}^{N \times H \times W}$ for $N$ semantic classes, the CCB proceeds as follows:
- Reduce channel dimensionality: $F' = \mathrm{Conv}_{1 \times 1}(F) \in \mathbb{R}^{C' \times H \times W}$, with $C' < C$.
- Flatten spatial dimensions: reshape $F'$ to $\mathbb{R}^{C' \times HW}$ and $P$ to $\mathbb{R}^{N \times HW}$.
- Compute per-class feature sums: $S = P F'^{\top} \in \mathbb{R}^{N \times C'}$, i.e. $S_k = \sum_{j=1}^{HW} P_{kj}\, F'_{:,j}$.
- Compute normalization factors: $n_k = \sum_{j=1}^{HW} P_{kj}$, for $k = 1, \dots, N$.
- Obtain class centers by normalization:
$$F_c[k] = \frac{S_k}{n_k},$$
where each row $F_c[k] \in \mathbb{R}^{C'}$ is the mean feature vector for class $k$, weighted by the likelihood assigned by $P$.
This design enables efficient computation via tensor reshaping, two batched matrix multiplications, and per-class normalization (Zhang et al., 2019).
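Under these definitions, the CCB reduces to two reshapes, one matrix product, and a row-wise normalization. A minimal NumPy sketch follows; the function name, shapes, and epsilon guard are illustrative choices, not the authors' implementation:

```python
import numpy as np

def class_center_block(F_red, P, eps=1e-6):
    """Compute class centers F_c from reduced features and coarse probabilities.

    F_red : (C', H, W) channel-reduced feature map
    P     : (N, H, W)  softmax class probability map
    Returns F_c : (N, C'), row k = probability-weighted mean feature of class k.
    """
    Cp, H, W = F_red.shape
    N = P.shape[0]
    F_flat = F_red.reshape(Cp, H * W)        # (C', HW) flattened features
    P_flat = P.reshape(N, H * W)             # (N, HW)  flattened probabilities
    S = P_flat @ F_flat.T                    # (N, C')  per-class weighted feature sums
    n = P_flat.sum(axis=1, keepdims=True)    # (N, 1)   normalization factors n_k
    return S / (n + eps)                     # (N, C')  class centers F_c

# Tiny example: 2 classes, 2 channels, 2x2 image
rng = np.random.default_rng(0)
F_red = rng.standard_normal((2, 2, 2))
logits = rng.standard_normal((2, 2, 2))
P = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
F_c = class_center_block(F_red, P)
print(F_c.shape)  # (2, 2)
```

With a hard (one-hot) probability map, each class center degenerates to the plain mean of that class's pixel features, which matches the weighted-average interpretation above.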
2. Architectural Role within the Attentional Class Feature Module
The CCB is one of two principal subcomponents in the Attentional Class Feature (ACF) module, the other being the Class Attention Block (CAB):
- Class Center Block (CCB): Accepts $F' \in \mathbb{R}^{C' \times H \times W}$ and $P \in \mathbb{R}^{N \times H \times W}$, outputs $F_c \in \mathbb{R}^{N \times C'}$ following the previously defined procedure.
- Class Attention Block (CAB): Redistributes class center features at the pixel level using $P$ as attention weights. For each pixel $i$, the attended feature vector is computed as
$$F_a(i) = \sum_{k=1}^{N} P_{ki}\, F_c[k].$$
- Fusion and Output: The attended feature map $F_a$ is concatenated with the original feature map $F$ (or a projection thereof) along the channel axis, forming $[F_a; F]$, which is processed by a final $1 \times 1$ convolution and fed to a fine segmentation classifier (Zhang et al., 2019).
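The CAB step is itself a single matrix product: the same probability map that produced the centers redistributes them back to pixels. A hedged NumPy sketch (function names are illustrative; the final $1 \times 1$ convolution and classifier are omitted):

```python
import numpy as np

def class_attention_block(F_c, P):
    """Redistribute class centers to pixels, using coarse probabilities as attention.

    F_c : (N, C') class centers
    P   : (N, H, W) softmax class probability map
    Returns F_a : (C', H, W), where F_a[:, i] = sum_k P[k, i] * F_c[k].
    """
    N, H, W = P.shape
    P_flat = P.reshape(N, H * W)             # (N, HW)
    F_a = F_c.T @ P_flat                     # (C', HW) attended features
    return F_a.reshape(-1, H, W)

def fuse_concat(F_a, F):
    """Fusion by channel-wise concatenation, as described in the text."""
    return np.concatenate([F_a, F], axis=0)  # (C' + C, H, W)
```

If a pixel is assigned one class with probability 1, its attended feature is exactly that class's center; soft assignments interpolate between centers.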
3. Integration within Coarse-to-Fine Semantic Segmentation Networks
CCB is architecturally situated in a coarse-to-fine segmentation pipeline, exemplified by ACFNet:
- Extract feature map $F$ using any backbone (e.g., ResNet-101 with dilated convolutions).
- Generate coarse segmentation logits via $1 \times 1$ convolution and produce $P$ using softmax.
- CCB and CAB compute class centers and combine them with per-pixel attention to generate refined features.
- Enriched features are concatenated and transformed, producing refined logits for final segmentation.
A representative forward pass is summarized as:
- $F = \mathrm{Backbone}(I)$
- $P = \mathrm{softmax}(\mathrm{Conv}_{1 \times 1}(F))$
- $F' = \mathrm{Conv}_{1 \times 1}(F)$
- $F_c = \mathrm{CCB}(F', P)$
- $F_a = \mathrm{CAB}(F_c, P)$
- $Y_{\mathrm{fine}} = \mathrm{Classifier}([F_a; F])$ (Zhang et al., 2019).
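The whole coarse-to-fine pass can be sketched end-to-end in NumPy, modeling each $1 \times 1$ convolution as a per-pixel linear map (a weight matrix applied to flattened features). All names and shapes here are illustrative assumptions, not the ACFNet codebase:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def acf_forward(F, W_coarse, W_red, W_fine, eps=1e-6):
    """One ACF-style coarse-to-fine pass (1x1 convs modeled as matrices).

    F        : (C, H, W)       backbone feature map
    W_coarse : (N, C)          coarse classifier
    W_red    : (C', C)         channel reduction
    W_fine   : (N, C' + C)     fine classifier over fused features
    Returns (fine logits (N, H, W), coarse probabilities (N, H, W)).
    """
    C, H, W = F.shape
    F2 = F.reshape(C, H * W)
    P = softmax(W_coarse @ F2, axis=0)             # (N, HW)  coarse probabilities
    Fr = W_red @ F2                                # (C', HW) reduced features
    S = P @ Fr.T                                   # (N, C')  weighted feature sums
    Fc = S / (P.sum(axis=1, keepdims=True) + eps)  # (N, C')  class centers (CCB)
    Fa = Fc.T @ P                                  # (C', HW) attended features (CAB)
    fused = np.concatenate([Fa, F2], axis=0)       # (C'+C, HW) fusion by concat
    logits_fine = W_fine @ fused                   # (N, HW)  fine logits
    return logits_fine.reshape(-1, H, W), P.reshape(-1, H, W)
```

Note that the concatenation variant is shown here; the summation variant reported as the default in the ablation below would instead add the (projected) attended features to the original ones.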
4. Training Objectives and Optimization Strategy
Training employs up to three cross-entropy losses:
- $\mathcal{L}_{\mathrm{aux}}$: Auxiliary loss on an intermediate feature map, following the PSPNet convention.
- $\mathcal{L}_{\mathrm{coarse}}$: Cross-entropy on the coarse segmentation logits.
- $\mathcal{L}_{\mathrm{fine}}$: Cross-entropy on the fine segmentation logits.
The total loss is $\mathcal{L} = \mathcal{L}_{\mathrm{fine}} + \lambda_{c}\,\mathcal{L}_{\mathrm{coarse}} + \lambda_{a}\,\mathcal{L}_{\mathrm{aux}}$, with scalar coefficients $\lambda_{c}, \lambda_{a}$ weighting the auxiliary terms. Training uses SGD (base lr 0.01, momentum 0.9, weight decay $5 \times 10^{-4}$), "poly" learning rate decay, batch size 8 over 4 GPUs, 40k iterations, augmentation including random flip, scaling, and cropping, as well as InPlaceABN-Sync for batch normalization (Zhang et al., 2019).
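The three-term objective can be illustrated concretely. In the sketch below, the 0.4 auxiliary weight follows the PSPNet convention the text refers to, while the coarse-loss weight of 1.0 is an assumed placeholder, not a value taken from the paper:

```python
import numpy as np

def pixel_cross_entropy(logits, labels, eps=1e-12):
    """Mean per-pixel cross-entropy. logits: (N, H, W); labels: (H, W) class ids."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    probs = e / e.sum(axis=0, keepdims=True)
    H, W = labels.shape
    # Pick the predicted probability of each pixel's ground-truth class.
    picked = probs[labels, np.arange(H)[:, None], np.arange(W)[None, :]]
    return float(-np.log(picked + eps).mean())

def total_loss(fine, coarse, aux, labels, lam_coarse=1.0, lam_aux=0.4):
    """L = L_fine + lam_coarse * L_coarse + lam_aux * L_aux.

    lam_aux = 0.4 follows the PSPNet auxiliary-loss convention;
    lam_coarse = 1.0 is an assumed placeholder coefficient.
    """
    return (pixel_cross_entropy(fine, labels)
            + lam_coarse * pixel_cross_entropy(coarse, labels)
            + lam_aux * pixel_cross_entropy(aux, labels))
```

In a real training loop the three logit maps come from the fine head, the coarse head, and the intermediate auxiliary head, respectively, all supervised by the same label map.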
5. Empirical Effect and Benchmark Results of CCB
Ablation studies on Cityscapes with a ResNet-101 dilated baseline (75.85% mIoU) demonstrate that:
- Incorporating CCB ("class center" only, no CAB): 77.94% mIoU (+2.09)
- Full ACF (concatenation): 79.17% mIoU (+3.32)
- Full ACF (summation, default): 79.32% mIoU (+3.47)
On a ResNet-101 + ASPP baseline (78.42% mIoU):
- ACF module: 80.08% (+1.66)
- ACF + online bootstrapping: 80.99%
- ACF + multi-scale/flip test: 81.46%
ACFNet achieves 81.85% mIoU on the Cityscapes test set (train+val only, fine annotations), setting a new state of the art at publication (Zhang et al., 2019). This suggests that incorporating categorical (global class) context via the CCB delivers consistent improvements over purely spatial-context approaches in high-resolution urban scene parsing.
6. Significance, Implementation, and Extension
The CCB provides an efficient, plug-and-play mechanism for injecting non-spatial, class-wise contextual information into semantic segmentation frameworks. Its design supports straightforward batched implementation and is readily adaptable to any off-the-shelf backbone, decoupled from the spatial structure of context modules in prior works. The splitting of the context estimation into a categorical block (CCB) and an attention block (CAB) enables principled coarse-to-fine refinement. A plausible implication is the potential applicability of CCB outside urban scene parsing, wherever class-discriminative global context can be leveraged to complement local feature learning (Zhang et al., 2019).