CrispFormer: Lightweight Transformer for WSSS
- The paper introduces a transformer-based framework that integrates boundary supervision, uncertainty-guided refinement, and dynamic multi-scale fusion to boost segmentation mask fidelity.
- It employs a MiT encoder paired with a modified decoder that improves fine structure resolution and small-object recall without significant extra computation.
- Empirical results demonstrate a notable improvement (~5.9 mIoU increase) and enhanced boundary and small-object recall compared to the standard SegFormer baseline.
CrispFormer is a lightweight transformer-based neural network framework for weakly supervised semantic segmentation (WSSS). It is designed to learn accurate, dense segmentation masks from under-specified, noisy cues such as image-level labels or weak pseudo-masks. CrispFormer builds upon the SegFormer architecture by introducing three critical, synergistic modifications to the decoder: a boundary supervision branch, an uncertainty-guided refinement mechanism, and a dynamic multi-scale fusion (DMF) module. These changes yield significant improvements in mask fidelity, notably on fine structures and uncertain regions, while maintaining minimal additional computational cost and no changes to the backbone. CrispFormer integrates seamlessly into standard WSSS pipelines, offering reproducible gains over SegFormer baselines with no requirement for heavy post-processing or CRF-based refinement (Torabi et al., 24 Nov 2025).
1. Architectural Structure
The CrispFormer architecture comprises a MiT (Mix Transformer) encoder and an enhanced decoder operating at 1/4 spatial resolution.
- MiT Encoder: The encoder is unchanged from SegFormer: the input image is passed through a pretrained MiT-B5 backbone, producing a pyramid of four feature maps at strides 4, 8, 16, and 32:
- $F_1 \in \mathbb{R}^{C_1 \times H/4 \times W/4}$,
- $F_2 \in \mathbb{R}^{C_2 \times H/8 \times W/8}$,
- $F_3 \in \mathbb{R}^{C_3 \times H/16 \times W/16}$,
- $F_4 \in \mathbb{R}^{C_4 \times H/32 \times W/32}$.
- CrispFormer Decoder: The decoder comprises three interconnected branches:
- Dynamic Multi-Scale Fusion (DMF): Projects each encoder feature to a common channel width, upsamples it to the shared $H/4 \times W/4$ grid, and fuses the scales via a per-pixel softmax: $F(x) = \sum_{i=1}^{4} \alpha_i(x)\,\tilde{F}_i(x)$, where $\tilde{F}_i$ is the projected, upsampled feature at scale $i$ and the weights $\alpha_i(x)$ are normalized across scales.
- Uncertainty-Guided Refiner (UGR): Predicts per-pixel aleatoric variance, aggregates it, and uses these uncertainty maps to gate residual corrections on the class logits.
- Boundary Head: Supervises thin object boundaries by regressing a single-channel boundary map with a lightweight head.
The entire procedure, including segmentation, variance, refinement, and boundary modules, operates in a single forward pass (Torabi et al., 24 Nov 2025).
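The stride pattern of the feature pyramid can be checked with a small helper. This is an illustrative sketch, not code from the paper; the channel widths are the published MiT-B5 values from SegFormer, which the paper does not restate.

```python
# Sketch: spatial sizes of the MiT feature pyramid for a given input,
# following the stride-4/8/16/32 pattern described above.
# Channel widths are the published MiT-B5 values (an assumption here).
def mit_pyramid_shapes(H, W, channels=(64, 128, 320, 512)):
    """Return (C_i, H_i, W_i) for encoder stages i = 1..4."""
    return [(c, H // s, W // s) for c, s in zip(channels, (4, 8, 16, 32))]

print(mit_pyramid_shapes(512, 512))
```

For a 512x512 crop (the training crop size used in Stage B), the decoder thus operates on a 128x128 grid.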
2. Boundary Supervision Branch
The boundary branch is designed to enhance CrispFormer's ability to resolve fine contours and thin structures, which are commonly eroded by weak supervision or standard decoders.
- Head: The branch taps the fused feature map $F$ and applies a convolution followed by a projection to produce a single-channel boundary map $\hat{B}$, which is upsampled to the full $H \times W$ resolution.
- Supervision: Uses a class-agnostic, 2-pixel-wide boundary pseudo-ground-truth band $B^{*}$ derived from the seed mask.
- Loss: Combines binary cross-entropy (BCE) and Dice loss: $\mathcal{L}_{\text{bnd}} = \mathrm{BCE}(\hat{B}, B^{*}) + \mathrm{Dice}(\hat{B}, B^{*})$.
Optionally, a signed distance function (SDF) penalty on the segmentation output can be included: $\mathcal{L}_{\text{SDF}} = \frac{1}{|\Omega|} \sum_{x \in \Omega} p(x)\,\phi(x)$, where $p$ is the softmax segmentation output and $\phi$ denotes the signed distance transform of the pseudo-mask.
This explicit supervision enforces crisp, thin contours and mitigates boundary blurring (Torabi et al., 24 Nov 2025).
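The boundary target and loss can be sketched in NumPy. This is a minimal illustration under stated assumptions: the band is approximated by detecting label changes and dilating them to the requested width (using `np.roll`, which wraps at image borders, acceptable for a sketch), and the BCE and Dice terms are weighted equally.

```python
import numpy as np

def boundary_band(seed_mask, width=2):
    """Class-agnostic boundary band: pixels within `width` of a label change.
    A simple shift-based approximation of the 2-px band described above."""
    edges = np.zeros_like(seed_mask, dtype=bool)
    for axis in (0, 1):
        diff = np.diff(seed_mask, axis=axis) != 0
        pad = [(0, 0), (0, 0)]
        pad[axis] = (0, 1)
        edges |= np.pad(diff, pad)   # pixel before the change
        pad[axis] = (1, 0)
        edges |= np.pad(diff, pad)   # pixel after the change
    band = edges
    for _ in range(width - 1):       # dilate to the requested band width
        shifted = [np.roll(band, s, axis=a) for a in (0, 1) for s in (-1, 1)]
        band = band | np.logical_or.reduce(shifted)
    return band.astype(np.float32)

def bce_dice(pred, target, eps=1e-6):
    """Combined BCE + Dice loss on a predicted boundary probability map."""
    bce = -np.mean(target * np.log(pred + eps)
                   + (1 - target) * np.log(1 - pred + eps))
    inter = (pred * target).sum()
    dice = 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    return bce + dice
```

In the actual pipeline the band is computed once from the seed mask, and $\hat{B}$ comes from the boundary head rather than from the mask itself.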
3. Uncertainty-Guided Refinement Mechanism
The uncertainty-guided refiner directly models pixel-level uncertainty, allowing the network to adaptively correct low-confidence predictions and reweight loss contributions based on both aleatoric and entropy-derived uncertainty.
- Aleatoric Uncertainty: The variance head predicts a per-class log-variance $s_c(x)$, which is exponentiated to a per-class variance $\sigma_c^2(x)$ and aggregated across classes: $U_{\text{ale}}(x) = \frac{1}{C}\sum_{c=1}^{C}\sigma_c^2(x)$.
- Residual Correction: Gated residuals are computed from the fused features $F$, the softmax probabilities $p$, and the uncertainty $U_{\text{ale}}$, and added to the class logits: $\tilde{z}(x) = z(x) + g(x)\,\Delta z(x)$, where the gate $g(x)$ admits corrections mainly in high-uncertainty regions.
- Uncertainty-Weighted Loss: The final segmentation loss is weighted per-pixel by the normalized combined (aleatoric plus entropy-derived) uncertainty, down-weighting unreliable pixels: $\mathcal{L}_{\text{seg}} = \frac{1}{|\Omega|}\sum_{x \in \Omega} w(x)\,\mathrm{CE}(\tilde{z}(x), \hat{y}(x))$.
The auxiliary heteroscedastic loss on the pre-refinement logits follows the learned-attenuation form: $\mathcal{L}_{\text{het}} = \frac{1}{|\Omega|}\sum_{x \in \Omega}\big[\exp(-s(x))\,\mathrm{CE}(z(x), \hat{y}(x)) + \tfrac{1}{2}s(x)\big]$.
Combined as $\mathcal{L}_{\text{unc}} = \mathcal{L}_{\text{seg}} + \lambda_{\text{het}}\,\mathcal{L}_{\text{het}}$.
This mechanism enables CrispFormer to handle ambiguous and noisy supervision in WSSS more robustly (Torabi et al., 24 Nov 2025).
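The refinement step can be illustrated with a minimal NumPy sketch. The exact gate, uncertainty normalization, and weighting scheme below are illustrative choices (softplus for positivity, a bounded combination of the two uncertainty signals), not the paper's specific parameterization.

```python
import numpy as np

def softmax(z, axis=0):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refine(logits, log_var, residual, beta=0.5):
    """Uncertainty-gated residual correction (illustrative form).
    logits, log_var, residual: arrays of shape (C, H, W)."""
    var = np.log1p(np.exp(log_var))          # softplus -> per-class variance
    u_ale = var.mean(axis=0)                 # aggregate over classes
    p = softmax(logits, axis=0)
    u_ent = -(p * np.log(p + 1e-8)).sum(axis=0) / np.log(p.shape[0])
    # combined uncertainty in [0, 1]: squashed aleatoric + normalized entropy
    u = beta * (u_ale / (1 + u_ale)) + (1 - beta) * u_ent
    refined = logits + u[None] * residual    # gate the residual by uncertainty
    weights = 1.0 - 0.5 * u                  # down-weight uncertain pixels
    return refined, u, weights
```

The key property is that corrections and loss reweighting are both driven by the same per-pixel uncertainty map, so confident regions are left essentially untouched.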
4. Dynamic Multi-Scale Fusion
The DMF module enables spatially adaptive selection of feature scales, replacing static concatenation in standard decoders. At each spatial location, the module computes a softmax over learned per-scale score maps $s_i$ and fuses the projected, upsampled encoder features $\tilde{F}_i$ as: $F(x) = \sum_{i} \alpha_i(x)\,\tilde{F}_i(x)$, with $\alpha_i(x) = \frac{\exp(s_i(x))}{\sum_j \exp(s_j(x))}$.
Optional modulation subtracts scaled, downsampled uncertainty from the per-scale scores, $s_i(x) \leftarrow s_i(x) - \gamma\,U{\downarrow}(x)$, discouraging reliance on scales that are unreliable at that location.
This adaptive fusion increases responsiveness to small objects by emphasizing finer scales near detailed content, boosting small-object recall (Torabi et al., 24 Nov 2025).
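The fusion itself is a per-pixel softmax over scales. A minimal NumPy sketch, assuming the features have already been projected to a common channel width and upsampled to the shared 1/4-resolution grid:

```python
import numpy as np

def dmf_fuse(feats, scores):
    """Dynamic multi-scale fusion sketch.
    feats:  list of S feature maps, each (C, H, W), already projected
            to a common width and upsampled to the shared grid.
    scores: list of S scalar score maps, each (H, W)."""
    s = np.stack(scores)                      # (S, H, W)
    w = np.exp(s - s.max(axis=0))
    w = w / w.sum(axis=0)                     # softmax across scales, per pixel
    f = np.stack(feats)                       # (S, C, H, W)
    return (w[:, None] * f).sum(axis=0)       # (C, H, W)
```

Because the weights are computed per pixel, the module can favor the fine-scale feature in one region (e.g., around a small object) while favoring coarser context elsewhere.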
5. Training Protocol and Integration in WSSS Pipelines
CrispFormer is integrated into a standard three-stage WSSS pipeline:
- Stage A—Seed Generation: Pseudo masks are produced via a separate CAM-based model (e.g., MiT-CAM). High-uncertainty pixels are excluded via an ignore mask. Optionally, seeds are post-processed with a one-time CRF or affinity smoothing.
- Stage B—Student Training: Images are randomly scaled (448–768 pixels) and cropped to 512×512. The combined loss sums the uncertainty-weighted segmentation loss, the heteroscedastic auxiliary loss, the boundary loss, and the optional SDF penalty: $\mathcal{L} = \mathcal{L}_{\text{seg}} + \lambda_{\text{het}}\mathcal{L}_{\text{het}} + \lambda_{\text{bnd}}\mathcal{L}_{\text{bnd}} + \lambda_{\text{SDF}}\mathcal{L}_{\text{SDF}}$, with the default loss weights reported in the paper. Training uses the AdamW optimizer with weight decay, a reduced backbone learning rate, a cosine decay schedule, a 1-epoch warm-up, and 40–60 epochs.
- Stage C—EMA Teacher and Pseudo-Label Refresh: An exponential moving average (EMA) teacher is maintained, and pseudo labels are refreshed every $T$ epochs by thresholding the teacher's uncertainty at $\tau$; pixels with $U(x) \le \tau$ are retained. No test-time CRF or post-processing is applied; inference is a single forward pass (Torabi et al., 24 Nov 2025).
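The EMA update and the refresh rule are straightforward to sketch. The decay value and ignore index below are common defaults, not values taken from the paper:

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    """EMA teacher update: theta_t <- decay * theta_t + (1 - decay) * theta_s.
    Both arguments are dicts mapping parameter names to arrays."""
    return {k: decay * teacher[k] + (1 - decay) * student[k] for k in teacher}

def refresh_pseudo_labels(teacher_probs, uncertainty, tau, ignore_index=255):
    """Keep the teacher's argmax label where uncertainty <= tau;
    mark all other pixels as ignored for the next training round."""
    labels = teacher_probs.argmax(axis=0)    # (H, W) from (C, H, W) probs
    labels[uncertainty > tau] = ignore_index
    return labels
```

In a real pipeline the EMA update runs after every optimizer step, while the label refresh runs only at the scheduled epoch boundaries.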
6. Empirical Results
Empirical evaluation on the PASCAL VOC 2012 validation set (single-scale) demonstrates that CrispFormer consistently outperforms the SegFormer-B5 baseline across key segmentation metrics:
| Metric | SegFormer-B5 | CrispFormer | Δ |
|---|---|---|---|
| mIoU (SS) | 72.3 ± 0.30 | 78.2 ± 0.22 | +5.9 |
| Boundary-F1 (2 px) | 73.2 | 76.4 | +3.2 |
| Small-Obj Recall | ~48.5% | ~56.7% | +8.2% |
- Small-Object Recall: DMF adaptively allocates more weight to high-resolution features near small objects, yielding an ~8–10 percentage-point increase in recall for small objects.
- Supervised Performance Gap: CrispFormer achieves 78.2 mIoU, retaining approximately 92.4% of fully supervised SegFormer-B5 (84.6 mIoU) performance using only image-level supervision.
- Ablations: Progressive addition of DMF, variance, UGR, boundary (Bnd), uncertainty modulation (U-DMF), and EMA components yields consistent mIoU improvements, with the complete model A6 attaining the highest scores.
- Computational Overhead: CrispFormer's parameter and inference costs are only marginally above the baseline: 82.8M parameters, 86.1G FLOPs, and 17.7 ms latency, versus 82.1M, 84.6G, and 17.4 ms for SegFormer-B5.
CrispFormer achieves these gains without reliance on heavy post-processing or test-time CRF, establishing a reproducible standard for high-fidelity WSSS masks (Torabi et al., 24 Nov 2025).
7. Significance and Applications
The decoder-centric formulation of CrispFormer demonstrates that targeted architectural changes—spatially adaptive feature fusion, fine-grained boundary supervision, and pixel-level uncertainty modeling—can dramatically improve segmentation quality under weak supervision. The method is compatible with most SegFormer variants and can be easily adopted in WSSS pipelines. By improving boundary accuracy, small-object recall, and robustness to label noise, CrispFormer addresses persistent challenges in image-level supervised segmentation while maintaining operational efficiency (Torabi et al., 24 Nov 2025).