
CrispFormer: Lightweight Transformer for WSSS

Updated 26 November 2025
  • The paper introduces a transformer-based framework that integrates boundary supervision, uncertainty-guided refinement, and dynamic multi-scale fusion to boost segmentation mask fidelity.
  • It employs a MiT encoder paired with a modified decoder that improves fine structure resolution and small-object recall without significant extra computation.
  • Empirical results demonstrate a +5.9 mIoU gain and improved boundary quality and small-object recall over the SegFormer-B5 baseline.

CrispFormer is a lightweight transformer-based neural network framework for weakly supervised semantic segmentation (WSSS). It is designed to learn accurate, dense segmentation masks from under-specified, noisy cues such as image-level labels or weak pseudo-masks. CrispFormer builds upon the SegFormer architecture by introducing three critical, synergistic modifications to the decoder: a boundary supervision branch, an uncertainty-guided refinement mechanism, and a dynamic multi-scale fusion (DMF) module. These changes yield significant improvements in mask fidelity, notably on fine structures and uncertain regions, while maintaining minimal additional computational cost and no changes to the backbone. CrispFormer integrates seamlessly into standard WSSS pipelines, offering reproducible gains over SegFormer baselines with no requirement for heavy post-processing or CRF-based refinement (Torabi et al., 24 Nov 2025).

1. Architectural Structure

The CrispFormer architecture comprises a MiT (Mix Transformer) encoder and an enhanced decoder operating at 1/4 spatial resolution.

  • MiT Encoder: The encoder is unchanged from SegFormer: the input $x \in \mathbb{R}^{3 \times H \times W}$ is passed through a pretrained MiT-B5 backbone, producing a pyramid of four feature maps:
    • $C_1 \in \mathbb{R}^{E_1 \times H/4 \times W/4}$,
    • $C_2 \in \mathbb{R}^{E_2 \times H/8 \times W/8}$,
    • $C_3 \in \mathbb{R}^{E_3 \times H/16 \times W/16}$,
    • $C_4 \in \mathbb{R}^{E_4 \times H/32 \times W/32}$.
  • CrispFormer Decoder: The decoder comprises three interconnected branches:

    1. Dynamic Multi-Scale Fusion (DMF): Projects and upsamples each encoder feature to $E_i \in \mathbb{R}^{E \times H/4 \times W/4}$, then fuses via a spatial softmax over scales:

$$s_i = \mathrm{Conv}_{1 \times 1}(E_i), \qquad w_i(u, v) = \frac{\exp(s_i(u, v))}{\sum_{j=1}^{4} \exp(s_j(u, v))}, \qquad F(u, v) = \sum_{i=1}^{4} w_i(u, v)\, E_i(u, v)$$

    2. Uncertainty-Guided Refiner (UGR): Predicts per-pixel aleatoric variance, aggregates it across classes, and uses the resulting uncertainty maps to gate residual corrections on the class logits.
    3. Boundary Head: Supervises thin object boundaries by regressing a single-channel boundary map with a lightweight head.

The entire procedure, including segmentation, variance, refinement, and boundary modules, operates in a single forward pass (Torabi et al., 24 Nov 2025).
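To make this flow concrete, here is a minimal PyTorch sketch of the decoder in a single forward pass. The channel widths, the single-convolution residual and gate blocks, and the two-layer boundary head are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrispFormerDecoderSketch(nn.Module):
    """Minimal sketch of the decoder flow at 1/4 resolution (simplified heads)."""
    def __init__(self, enc_dims=(64, 128, 320, 512), E=256, K=21):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(d, E, 1) for d in enc_dims)   # E_i projections
        self.score = nn.ModuleList(nn.Conv2d(E, 1, 1) for _ in enc_dims)  # s_i (Sec. 4)
        self.seg = nn.Conv2d(E, K, 1)                        # pre-refine logits Z
        self.logvar = nn.Conv2d(E, K, 1)                     # per-class log-variance
        self.delta = nn.Conv2d(E + K + 1, K, 3, padding=1)   # residual Delta (Sec. 3)
        self.gate = nn.Conv2d(E + 1, K, 3, padding=1)        # gate psi (Sec. 3)
        self.bnd = nn.Sequential(nn.Conv2d(E, E, 3, padding=1), nn.ReLU(inplace=True),
                                 nn.Conv2d(E, 1, 1))         # boundary head (Sec. 2)

    def forward(self, feats):                  # feats = (C1, C2, C3, C4)
        h, w = feats[0].shape[-2:]             # 1/4-resolution grid
        Es = [F.interpolate(p(c), (h, w), mode="bilinear", align_corners=False)
              for p, c in zip(self.proj, feats)]
        scores = torch.cat([s(e) for s, e in zip(self.score, Es)], dim=1)
        W = scores.softmax(dim=1)              # softmax over the 4 scales per pixel
        fused = sum(W[:, i:i + 1] * Es[i] for i in range(4))      # F(u, v)
        Z = self.seg(fused)
        sigma2 = F.softplus(self.logvar(fused)) + 1e-6            # sigma^2
        u_ale = sigma2.mean(dim=1, keepdim=True)                  # U_ale
        P = Z.softmax(dim=1)
        delta = self.delta(torch.cat([fused, P, u_ale], dim=1))   # Delta
        G = torch.sigmoid(self.gate(torch.cat([fused, u_ale], dim=1)))
        Z_star = Z + G * delta                                    # refined logits Z*
        boundary = self.bnd(fused)
        return Z_star, Z, sigma2, u_ale, boundary
```

Given MiT-B5 features $C_1, \dots, C_4$ with the channel dimensions above, the module returns refined logits $Z^*$, pre-refine logits $Z$, $\sigma^2$, $U_{\mathrm{ale}}$, and boundary logits, all at 1/4 resolution.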

2. Boundary Supervision Branch

The boundary branch is designed to enhance CrispFormer's ability to resolve fine contours and thin structures, which are commonly eroded under weak supervision and by standard decoders.

  • Head: The branch taps the fused feature map $F$ and applies a $3 \times 3$ convolution followed by a $1 \times 1$ projection to produce $E_{\log} \in \mathbb{R}^{1 \times H/4 \times W/4}$, which is upsampled to $E_{\log}\uparrow \in \mathbb{R}^{1 \times H \times W}$.
  • Supervision: Uses a class-agnostic, 2-pixel-wide boundary pseudo-ground-truth band $B$ derived from the seed mask $\hat Y$.
  • Loss: Combines binary cross-entropy (BCE) and Dice loss:

$$\mathcal{L}_{\mathrm{bnd}} = \mathrm{BCE}(E_{\log}\uparrow, B) + \mathrm{Dice}(\sigma(E_{\log}\uparrow), B)$$

Optionally, a signed distance function (SDF) penalty on the segmentation output can be included:

$$\mathcal{L}_{\mathrm{sdf}} = \frac{1}{HW} \sum_{u,v} \|\nabla P^*(u,v)\|_1 \, |\phi(\hat Y)(u,v)|$$

where $P^* = \mathrm{softmax}(Z^*)$ and $\phi(\hat Y)$ denotes the signed distance transform of the pseudo-mask.

This explicit supervision enforces crisp, thin contours and mitigates boundary blurring (Torabi et al., 24 Nov 2025).
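As a concrete reference for this loss, here is a small PyTorch sketch combining BCE-with-logits and a soft Dice term; the exact Dice formulation and the smoothing constant are assumptions.

```python
import torch
import torch.nn.functional as F

def boundary_loss(e_logits_up, B, eps=1e-6):
    """BCE + Dice on the upsampled single-channel boundary logits.
    e_logits_up: (N, 1, H, W) boundary logits at full resolution;
    B: (N, 1, H, W) float binary 2-px boundary band from the seed mask."""
    bce = F.binary_cross_entropy_with_logits(e_logits_up, B)
    p = torch.sigmoid(e_logits_up)
    inter = (p * B).sum(dim=(1, 2, 3))
    dice = 1 - (2 * inter + eps) / (p.sum(dim=(1, 2, 3)) + B.sum(dim=(1, 2, 3)) + eps)
    return bce + dice.mean()
```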

3. Uncertainty-Guided Refinement Mechanism

The uncertainty-guided refiner directly models pixel-level uncertainty, allowing the network to adaptively correct low-confidence predictions and reweight loss contributions based on both aleatoric and entropy-derived uncertainty.

  • Aleatoric Uncertainty: The variance head predicts per-class log-variance, which is processed as:

$$\sigma^2 = \mathrm{softplus}(\log \sigma^2) + \varepsilon$$

Aggregated across classes:

$$U_{\mathrm{ale}} = \frac{1}{K} \sum_{c=1}^{K} \sigma^2_c$$

  • Residual Correction: Gated residuals are computed from the fused features $F$, the softmax probabilities $P$, and the uncertainty $U_{\mathrm{ale}}$:

$$R = \mathrm{concat}(F, P, U_{\mathrm{ale}})$$

$$\Delta = \mathrm{ResBlock}_{3 \times 3}(R)$$

$$G = \sigma(\psi(\mathrm{concat}(F, U_{\mathrm{ale}})))$$

$$Z^* = Z + G \odot \Delta$$

  • Uncertainty-Weighted Loss: The final segmentation loss is weighted per-pixel by normalized combined uncertainty:

$$U = \alpha\,\widetilde{U}_{\mathrm{ale}} + (1-\alpha)\,\widetilde{U}_{\mathrm{ent}}$$

$$w = \exp(-\beta U)$$

$$\mathcal{L}_{\mathrm{seg}} = \left[\mathrm{CE}(Z^*\uparrow, \hat Y) + \lambda_{\mathrm{dice}}\,\mathrm{Dice}(Z^*\uparrow, \hat Y)\right] \cdot M \cdot w$$

The auxiliary heteroscedastic loss for the pre-refine logits $Z$ is:

$$\mathcal{L}_{\mathrm{het}} = \left[ \frac{\mathrm{CE}(Z, \hat Y)}{2 \sigma^2} + \tfrac{1}{2} \log \sigma^2 \right] \cdot M$$

Combined as:

$$\mathcal{L}_{\mathrm{UGR}} = \mathcal{L}_{\mathrm{seg}} + \lambda_{\mathrm{het}} \mathcal{L}_{\mathrm{het}}$$

This mechanism enables CrispFormer to handle ambiguous and noisy supervision in WSSS more robustly (Torabi et al., 24 Nov 2025).
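The following sketch assembles $\mathcal{L}_{\mathrm{seg}}$ and $\mathcal{L}_{\mathrm{het}}$ in PyTorch under stated assumptions: uncertainties are min-max normalized per batch, the Dice term is computed globally over valid pixels rather than per-pixel weighted, and the heteroscedastic term uses the variance at the pseudo-label class. The source does not pin down these details.

```python
import torch
import torch.nn.functional as F

def ugr_losses(z_star_up, z_up, log_var_up, y_hat, m,
               alpha=0.5, beta=2.0, lam_dice=1.0, lam_het=0.5, eps=1e-6):
    """Uncertainty-weighted segmentation loss plus the heteroscedastic auxiliary term.
    z_star_up, z_up: refined / pre-refine logits at full resolution, (N, K, H, W);
    log_var_up: per-class log-variance head output, (N, K, H, W);
    y_hat: pseudo-labels, (N, H, W) int64; m: valid-pixel mask, (N, H, W) float."""
    K = z_star_up.shape[1]
    y_safe = torch.where(m.bool(), y_hat, torch.zeros_like(y_hat))  # ignored -> class 0

    def nrm(u):  # min-max normalize to [0, 1]
        return (u - u.amin()) / (u.amax() - u.amin() + eps)

    # Aleatoric and entropy uncertainties, normalized and combined into weights w.
    sigma2 = F.softplus(log_var_up) + eps
    u_ale = sigma2.mean(dim=1)
    p = z_star_up.softmax(dim=1)
    u_ent = -(p * p.clamp_min(eps).log()).sum(dim=1)
    w = torch.exp(-beta * (alpha * nrm(u_ale) + (1 - alpha) * nrm(u_ent)))

    # Per-pixel CE weighted by the ignore mask and the uncertainty weights.
    ce = F.cross_entropy(z_star_up, y_safe, reduction="none")
    l_ce = (ce * m * w).sum() / (m * w).sum().clamp_min(eps)

    # Soft multi-class Dice over valid pixels (simplification: not per-pixel weighted).
    y1 = F.one_hot(y_safe, K).permute(0, 3, 1, 2).float() * m.unsqueeze(1)
    pm = p * m.unsqueeze(1)
    inter = (pm * y1).sum(dim=(0, 2, 3))
    l_dice = 1 - ((2 * inter + eps) /
                  (pm.sum(dim=(0, 2, 3)) + y1.sum(dim=(0, 2, 3)) + eps)).mean()

    # Heteroscedastic term on the pre-refine logits, sigma^2 taken at the label class.
    s2 = sigma2.gather(1, y_safe.unsqueeze(1)).squeeze(1)
    ce_pre = F.cross_entropy(z_up, y_safe, reduction="none")
    l_het = ((ce_pre / (2 * s2) + 0.5 * s2.log()) * m).sum() / m.sum().clamp_min(eps)

    return l_ce + lam_dice * l_dice + lam_het * l_het
```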

4. Dynamic Multi-Scale Fusion

The DMF module enables spatially adaptive selection of feature scales, replacing static concatenation in standard decoders. At each spatial location, the module computes a spatial softmax over multi-resolution features, fusing encoder outputs as:

$$s_i = \mathrm{Conv}_{1 \times 1}(E_i), \qquad w_i(u, v) = \frac{\exp(s_i(u, v))}{\sum_{j=1}^{4} \exp(s_j(u, v))}, \qquad F(u, v) = \sum_{i=1}^{4} w_i(u, v)\, E_i(u, v)$$

Optionally, the scores can be modulated by subtracting the scaled, downsampled uncertainty $U_\downarrow$ from $s_i$:

$$s_i \leftarrow s_i - \alpha U_\downarrow$$

This adaptive fusion increases responsiveness to small objects by emphasizing finer scales near detailed content, boosting small-object recall (Torabi et al., 24 Nov 2025).
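Isolating the fusion rule, a minimal sketch with the optional uncertainty modulation; the tensor shapes and stacking convention are assumptions.

```python
import torch

def dmf_fuse(e_scales, scores, u_down=None, alpha=0.5):
    """Per-pixel softmax fusion over scales, with optional uncertainty modulation.
    e_scales: list of 4 tensors (N, E, H/4, W/4), the projected/upsampled E_i;
    scores: (N, 4, H/4, W/4), stacked 1x1-conv scores s_i;
    u_down: optional (N, 1, H/4, W/4) downsampled uncertainty U_down."""
    if u_down is not None:
        scores = scores - alpha * u_down       # s_i <- s_i - alpha * U_down
    w = scores.softmax(dim=1)                  # w_i(u, v): softmax over the 4 scales
    return sum(w[:, i:i + 1] * e_scales[i] for i in range(len(e_scales)))  # F(u, v)
```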

5. Training Protocol and Integration in WSSS Pipelines

CrispFormer is integrated into a standard three-stage WSSS pipeline:

  • Stage A (Seed Generation): Pseudo masks $\hat Y$ are produced via a separate CAM-based model (e.g., MiT-CAM). High-uncertainty pixels are excluded via an ignore mask $M$. Optionally, seeds are post-processed with a one-time CRF or affinity smoothing.
  • Stage B (Student Training): Images are randomly scaled (448–768 pixels) and cropped to $512 \times 512$. The combined loss is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{seg}} + \lambda_{\mathrm{het}}\mathcal{L}_{\mathrm{het}} + \lambda_{\mathrm{bnd}}\mathcal{L}_{\mathrm{bnd}} + \lambda_{\mathrm{sdf}}\mathcal{L}_{\mathrm{sdf}}$$

with default hyperparameters $\lambda_{\mathrm{dice}} = 1$, $\lambda_{\mathrm{het}} = 0.5$, $\lambda_{\mathrm{bnd}} = 0.5$, $\lambda_{\mathrm{sdf}} = 0.1$, $\alpha = 0.5$, $\beta = 2$. Optimization uses AdamW with weight decay $1 \times 10^{-4}$, initial learning rate $6 \times 10^{-5}$ (backbone LR at $0.1\times$), a cosine decay schedule, a 1-epoch warm-up, and 40–60 epochs; a sketch of this setup follows the list below.

  • Stage C (EMA Teacher and Pseudo-Label Refresh): An exponential moving average (EMA) teacher is maintained, and pseudo labels are refreshed every $R$ epochs by thresholding the teacher uncertainty $U_T$: pixels with $U_T < \tau_u$ are retained. No test-time CRF or post-processing is applied; inference is a single forward pass (Torabi et al., 24 Nov 2025).
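A sketch of the Stage B optimizer schedule and the Stage C teacher update, under stated assumptions: backbone parameters are registered under a "backbone" prefix, the EMA decay is 0.999, and the ignore index is 255; none of these specifics are given in the source.

```python
import math
import torch

def build_optimizer(model, base_lr=6e-5, wd=1e-4, epochs=50, warmup=1):
    """AdamW with the backbone at 0.1x LR, 1-epoch warm-up, cosine decay."""
    bb = [p for n, p in model.named_parameters() if n.startswith("backbone")]
    rest = [p for n, p in model.named_parameters() if not n.startswith("backbone")]
    opt = torch.optim.AdamW([{"params": bb, "lr": 0.1 * base_lr},
                             {"params": rest, "lr": base_lr}], weight_decay=wd)
    def sched(epoch):  # per-epoch LR multiplier: linear warm-up, then cosine decay
        if epoch < warmup:
            return (epoch + 1) / warmup
        t = (epoch - warmup) / max(1, epochs - warmup)
        return 0.5 * (1 + math.cos(math.pi * t))
    return opt, torch.optim.lr_scheduler.LambdaLR(opt, sched)

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """EMA weight update for the Stage C teacher (buffers omitted for brevity)."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(decay).add_(ps, alpha=1.0 - decay)

@torch.no_grad()
def refresh_pseudo_labels(teacher_logits, u_t, tau_u, ignore_index=255):
    """Keep teacher predictions where U_T < tau_u; mark the rest as ignored."""
    labels = teacher_logits.argmax(dim=1)           # (N, H, W)
    labels[u_t.squeeze(1) >= tau_u] = ignore_index  # mask M excludes these pixels
    return labels
```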

6. Empirical Results

Empirical evaluation on the PASCAL VOC 2012 validation set (single-scale) demonstrates that CrispFormer consistently outperforms the SegFormer-B5 baseline across key segmentation metrics:

| Metric | SegFormer-B5 | CrispFormer | Δ |
|---|---|---|---|
| mIoU (SS) | 72.3 ± 0.30 | 78.2 ± 0.22 | +5.9 |
| Boundary-F1 (2 px) | 73.2 | 76.4 | +3.2 |
| Small-object recall | ~48.5% | ~56.7% | +8.2 pp |

  • Small-Object Recall: DMF adaptively allocates more attention to high-resolution features near small objects, yielding an ~8–10 percentage point increase in recall for objects smaller than $32 \times 32$ pixels.
  • Supervised Performance Gap: CrispFormer achieves 78.2 mIoU, retaining approximately 92.4% of fully supervised SegFormer-B5 (84.6 mIoU) performance using only image-level supervision.
  • Ablations: Progressive addition of DMF, variance, UGR, boundary (Bnd), uncertainty modulation (U-DMF), and EMA components yields consistent mIoU improvements, with the complete model A6 attaining the highest scores.
  • Computational Overhead: The parameter and inference costs of CrispFormer are marginally above baseline: 82.8M parameters, 86.1G FLOPs, and 17.7 ms latency on $512^2$ inputs, versus 82.1M, 84.6G, and 17.4 ms for SegFormer-B5.

CrispFormer achieves these gains without reliance on heavy post-processing or test-time CRF, establishing a reproducible standard for high-fidelity WSSS masks (Torabi et al., 24 Nov 2025).

7. Significance and Applications

The decoder-centric formulation of CrispFormer demonstrates that targeted architectural changes (spatially adaptive feature fusion, fine-grained boundary supervision, and pixel-level uncertainty modeling) can dramatically improve segmentation quality under weak supervision. The method is compatible with most SegFormer variants and can be easily adopted in WSSS pipelines. By improving boundary accuracy, small-object recall, and robustness to label noise, CrispFormer addresses persistent challenges in image-level supervised segmentation while maintaining operational efficiency (Torabi et al., 24 Nov 2025).

References

  • Torabi et al., 24 Nov 2025.