PP-LiteSeg: Real-Time Semantic Segmentation
- The paper introduces PP-LiteSeg, a semantic segmentation framework that employs a hierarchical backbone and novel modules (FLD, UAFM, SPPM) for efficient feature fusion and balanced computation.
- Its Flexible and Lightweight Decoder staggers channel widths across resolutions to optimize computational load, while UAFM and SPPM enhance context aggregation with minimal overhead.
- Evaluated on Cityscapes and CamVid, PP-LiteSeg demonstrates state-of-the-art speed–accuracy tradeoffs, achieving up to 77.5% mIoU with high FPS performance.
PP-LiteSeg is a real-time semantic segmentation framework developed for high-throughput and low-latency scenarios, following an encoder–aggregation–decoder design paradigm. It introduces three key architectural innovations: the Flexible and Lightweight Decoder (FLD), the Unified Attention Fusion Module (UAFM), and the Simple Pyramid Pooling Module (SPPM). These components collectively facilitate efficient feature fusion, balanced computational load across resolution scales, and context aggregation with minimal overhead, attaining state-of-the-art speed–accuracy tradeoffs on benchmarks such as Cityscapes and CamVid (Peng et al., 2022).
1. Network Architecture and Model Variants
PP-LiteSeg employs a hierarchical backbone, STDCNet, which generates feature maps at four spatial strides: 1/4, 1/8, 1/16, and 1/32 of the input resolution. Two primary variants are defined:
- PP-LiteSeg-T: Utilizes STDC1 with decoder channel widths {32, 64, 128}.
- PP-LiteSeg-B: Utilizes STDC2 with decoder channel widths {64, 96, 128}.
The deepest backbone feature (1/32) initially passes through SPPM for global context enrichment. The processed features are progressively up-sampled and fused with skip-connection outputs at 1/16 and 1/8 scales via UAFM blocks within the decoder. Feature channels are reduced at each fusion stage via 1×1 convolutions before mapping to class logits and restoring full resolution through final up-sampling.
| Variant | Backbone | Decoder Channels | mIoU (%, Cityscapes) | FPS |
|---|---|---|---|---|
| PP-LiteSeg-T | STDC1 | 32, 64, 128 | 72.0 | 273.6 |
| PP-LiteSeg-B | STDC2 | 64, 96, 128 | 73.9 | 195.3 |
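The end-to-end data flow can be summarized in a short sketch. The following PyTorch-style pseudocode is illustrative only (the official implementation is in PaddlePaddle/PaddleSeg); the names `stdc_backbone`, `sppm`, `uafm_16`, `uafm_8`, and `seg_head` are hypothetical placeholders for the modules described above:

```python
# Sketch of the PP-LiteSeg forward pass (not the official PaddleSeg code).
import torch
import torch.nn.functional as F

def pp_liteseg_forward(x, stdc_backbone, sppm, uafm_16, uafm_8, seg_head):
    # Encoder: hierarchical features at 1/8, 1/16, and 1/32 of input resolution.
    feat8, feat16, feat32 = stdc_backbone(x)

    # Context aggregation on the deepest (1/32) feature via SPPM.
    ctx = sppm(feat32)

    # Decoder (FLD): progressively upsample and fuse with encoder skips via UAFM.
    d16 = uafm_16(high=ctx, low=feat16)   # fuse at 1/16 scale
    d8 = uafm_8(high=d16, low=feat8)      # fuse at 1/8 scale

    # Segmentation head (1x1 Conv-BN-ReLU) maps to class logits,
    # then 8x up-sampling restores full resolution from the 1/8-scale output.
    logits = seg_head(d8)
    return F.interpolate(logits, scale_factor=8, mode="bilinear",
                         align_corners=False)
```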
2. Unified Attention Fusion Module (UAFM)
UAFM performs weighted fusion of a high-level (up-sampled) feature tensor and a low-level feature tensor, both of shape $C \times H \times W$ after up-sampling. Let $F_{high}$ denote the output of the deeper module and $F_{low}$ the skip-connection feature from the encoder. Its mathematical formulation is:
- Upsample high-level features: $F_{up} = \mathrm{Upsample}(F_{high})$
- Compute the attention weight: $\alpha = \mathrm{Attention}(F_{up}, F_{low})$, where the attention module terminates in a sigmoid activation so that $\alpha \in [0, 1]$
- Fuse features: $F_{out} = F_{up} \cdot \alpha + F_{low} \cdot (1 - \alpha)$
Spatial Attention: For each pixel, the channel-wise mean and max of both inputs are concatenated into a $4 \times H \times W$ tensor, processed by a convolution, and passed through a sigmoid:

$\alpha = \mathrm{Sigmoid}\big(\mathrm{Conv}(\mathrm{Cat}(\mathrm{Mean}(F_{up}), \mathrm{Max}(F_{up}), \mathrm{Mean}(F_{low}), \mathrm{Max}(F_{low})))\big), \quad \alpha \in \mathbb{R}^{1 \times H \times W}$

Channel Attention: Global average and max pooling over the spatial dimensions of both tensors are concatenated into a $4C \times 1 \times 1$ tensor, followed by a convolution and a sigmoid, yielding $\alpha \in \mathbb{R}^{C \times 1 \times 1}$.
PP-LiteSeg predominantly applies the spatial-attention variant to minimize overhead.
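A minimal sketch of UAFM with spatial attention, following the equations above. It assumes both inputs already share the channel count $C$ (handled upstream by 1×1 convolutions) and that the attention convolution uses a 3×3 kernel, which the paper does not specify:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialUAFM(nn.Module):
    """Sketch of UAFM with spatial attention (illustrative, not official)."""
    def __init__(self):
        super().__init__()
        # 4 input maps (mean/max of F_up and F_low) -> 1 attention map.
        self.conv = nn.Conv2d(4, 1, kernel_size=3, padding=1)

    def forward(self, high, low):
        # F_up = Upsample(F_high), matched to the spatial size of F_low.
        up = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                           align_corners=False)
        # Channel-wise mean and max of each input: four (N, 1, H, W) maps.
        feats = torch.cat([
            up.mean(dim=1, keepdim=True), up.max(dim=1, keepdim=True).values,
            low.mean(dim=1, keepdim=True), low.max(dim=1, keepdim=True).values,
        ], dim=1)
        # alpha = Sigmoid(Conv(...)): one fusion weight per pixel.
        alpha = torch.sigmoid(self.conv(feats))
        # F_out = F_up * alpha + F_low * (1 - alpha)
        return up * alpha + low * (1.0 - alpha)
```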
3. Flexible and Lightweight Decoder (FLD)
FLD addresses load imbalance observed in conventional decoders where fixed channel counts result in shallow/high-resolution layers dominating computational cost. Instead, FLD staggers channels to mirror encoder scaling, reducing channel width at higher resolutions:
- Decoder progression (PP-LiteSeg-T): channel width decreases 128→64→32 as spatial scale increases 1/32→1/16→1/8.
- Each step: Bilinear up-sampling, UAFM fusion with corresponding encoder output, and a 1×1 Conv–BN–ReLU for channel adjustment.
This design produces balanced FLOPs across decoder stages. The final segmentation head comprises a 1×1 Conv–BN–ReLU mapping to the number of semantic classes; since the decoder output is at 1/8 scale, full image resolution is restored via 8× up-sampling.
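A sketch of the channel-adjustment step with the PP-LiteSeg-T widths from above; the exact placement of the 1×1 Conv–BN–ReLU within each stage is an assumption of this sketch, and the UAFM fusion modules are omitted:

```python
import torch.nn as nn

class ConvBNReLU(nn.Module):
    """1x1 Conv-BN-ReLU used to adjust channel width between decoder stages."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# FLD channel progression for PP-LiteSeg-T: 128 channels at 1/32 scale,
# reduced to 64 at 1/16 and 32 at 1/8 (hypothetical wiring).
reduce_to_16 = ConvBNReLU(128, 64)  # applied around the 1/16-scale fusion
reduce_to_8 = ConvBNReLU(64, 32)    # applied around the 1/8-scale fusion
```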
4. Simple Pyramid Pooling Module (SPPM)
SPPM aggregates global contextual features from the 1/32 backbone output. It applies three parallel average-pooling operations with bin sizes $1\times1$, $2\times2$, and $4\times4$. Each pooled feature passes through a $1\times1$ convolution and is up-sampled to the input feature size; the results are summed element-wise and refined by a final $3\times3$ convolution:

$F_{out} = \mathrm{Conv}_{3\times3}\Big(\textstyle\sum_{i} \mathrm{Upsample}\big(\mathrm{Conv}_{1\times1}(\mathrm{Pool}_{i}(F_{in}))\big)\Big)$

Intermediate channel widths are halved relative to the standard pyramid pooling module to restrict SPPM-induced overhead below 1 ms on an NVIDIA 1080Ti.
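A compact sketch of SPPM following the equation above; the channel widths and the bilinear up-sampling mode are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPPM(nn.Module):
    """Sketch of the Simple Pyramid Pooling Module: three average-pooling
    branches (1x1, 2x2, 4x4 bins), a per-branch 1x1 conv, up-sampling,
    element-wise summation, and a final 3x3 conv."""
    def __init__(self, in_ch, mid_ch, out_ch, bins=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, mid_ch, 1, bias=False))
            for b in bins
        ])
        self.out_conv = nn.Conv2d(mid_ch, out_ch, 3, padding=1, bias=False)

    def forward(self, x):
        size = x.shape[2:]
        # Up-sample each branch to the input size and sum element-wise.
        fused = sum(
            F.interpolate(branch(x), size=size, mode="bilinear",
                          align_corners=False)
            for branch in self.branches
        )
        return self.out_conv(fused)
```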
5. Training and Inference Protocols
PP-LiteSeg's training regimen employs the Cityscapes (19 classes) and CamVid (11 classes) datasets:
- Input crops: 1024×512 (Cityscapes), 960×720 (CamVid)
- Augmentation: Random scaling ([0.125, 1.5] for Cityscapes; [0.5, 2.5] for CamVid), random flip, color jitter
- Optimization: SGD (momentum=0.9, weight-decay=5e-4), “poly” learning rate schedule with warm-up; initial LR=0.005 for 160k iterations (Cityscapes), LR=0.01 for 1k iterations (CamVid)
- Loss: Pixel-wise cross-entropy with Online Hard Example Mining (OHEM)
- Inference: Models are exported to ONNX for TensorRT acceleration (CUDA 10.2, cuDNN 7.6, NVIDIA 1080Ti). Cityscapes inference first down-scales the 1024×2048 input to one of two scales (512×1024 or 768×1536), then up-samples predictions back to full resolution; CamVid employs native resolution. Metrics reported include mean Intersection over Union (mIoU) and frames per second (FPS), accounting for all pre/post-processing.
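As an illustration of the OHEM loss referenced above, here is a hedged sketch; `thresh` and `min_kept` are common defaults from other implementations, not values reported in the paper:

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, labels, thresh=0.7, min_kept=10000,
                       ignore_index=255):
    """Sketch of pixel-wise cross-entropy with Online Hard Example Mining:
    keep pixels whose loss exceeds -log(thresh), retaining at least
    `min_kept` of the hardest pixels."""
    # Per-pixel losses; ignored pixels contribute zero loss and sort last.
    losses = F.cross_entropy(logits, labels, ignore_index=ignore_index,
                             reduction="none").flatten()
    sorted_losses, _ = losses.sort(descending=True)
    loss_thresh = -torch.log(torch.tensor(thresh))
    keep = max(min_kept, int((sorted_losses > loss_thresh).sum()))
    keep = min(keep, sorted_losses.numel())
    return sorted_losses[:keep].mean()
```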
6. Evaluation and Comparative Results
PP-LiteSeg attains leading speed–accuracy tradeoffs on Cityscapes:
| Model | Input Size | mIoU (%) | FPS |
|---|---|---|---|
| PP-LiteSeg-T | 512×1024 | 72.0 | 273.6 |
| PP-LiteSeg-B | 512×1024 | 73.9 | 195.3 |
| PP-LiteSeg-T | 768×1536 | 74.9 | 143.6 |
| PP-LiteSeg-B | 768×1536 | 77.5 | 102.6 |
| BiSeNet V2 | 512×1024 | 72.6 | 156 |
| STDC2-Seg75 | 768×1536 | 76.8 | 97 |
This demonstrates superior throughput relative to prior methods at equivalent or higher accuracy.
7. Ablation Analysis and Impact of Innovations
Ablation studies on the Cityscapes validation set (PP-LiteSeg-B2, 768×1536 input) quantify module contributions; the baseline without the proposed modules reaches 77.50% mIoU at 110.9 FPS:
- + FLD: 77.67% mIoU, 109.7 FPS
- + FLD + SPPM: 77.76% mIoU, 106.3 FPS
- + FLD + UAFM: 77.98% mIoU, 105.5 FPS
- + FLD + SPPM + UAFM (full model): 78.21% mIoU (+0.71 over baseline), 102.6 FPS
Qualitative examples indicate enhanced boundary delineation and artifact reduction upon sequential addition of each proposed component. Collectively, the FLD, SPPM, and UAFM modules yield state-of-the-art real-time semantic segmentation results on Cityscapes and CamVid, supporting PP-LiteSeg's efficacy in latency-sensitive environments (Peng et al., 2022).