PP-LiteSeg: Real-Time Semantic Segmentation
- The paper introduces PP-LiteSeg, a semantic segmentation framework that employs a hierarchical backbone and novel modules (FLD, UAFM, SPPM) for efficient feature fusion and balanced computation.
- Its Flexible and Lightweight Decoder staggers channel widths across resolutions to optimize computational load, while UAFM and SPPM enhance context aggregation with minimal overhead.
- Evaluated on Cityscapes and CamVid, PP-LiteSeg demonstrates state-of-the-art speed–accuracy tradeoffs, achieving up to 77.5% mIoU with high FPS performance.
PP-LiteSeg is a real-time semantic segmentation framework developed for high-throughput and low-latency scenarios, following an encoder–aggregation–decoder design paradigm. It introduces three key architectural innovations: the Flexible and Lightweight Decoder (FLD), the Unified Attention Fusion Module (UAFM), and the Simple Pyramid Pooling Module (SPPM). These components collectively facilitate efficient feature fusion, balanced computational load across resolution scales, and context aggregation with minimal overhead, attaining state-of-the-art speed–accuracy tradeoffs on benchmarks such as Cityscapes and CamVid (Peng et al., 2022).
1. Network Architecture and Model Variants
PP-LiteSeg employs a hierarchical backbone, STDCNet, which generates feature maps at four spatial strides: 1/4, 1/8, 1/16, and 1/32 of the input resolution. Two primary variants are defined:
- PP-LiteSeg-T: Utilizes STDC1 with decoder channel widths {32, 64, 128}.
- PP-LiteSeg-B: Utilizes STDC2 with decoder channel widths {64, 96, 128}.
The deepest backbone feature (1/32) initially passes through SPPM for global context enrichment. The processed features are progressively up-sampled and fused with skip-connection outputs at 1/16 and 1/8 scales via UAFM blocks within the decoder. Feature channels are reduced at each fusion stage via 1×1 convolutions before mapping to class logits and restoring full resolution through final up-sampling.
| Variant | Backbone | Decoder Channels | mIoU (%, Cityscapes) | FPS |
|---|---|---|---|---|
| PP-LiteSeg-T | STDC1 | 32, 64, 128 | 72.0 | 273.6 |
| PP-LiteSeg-B | STDC2 | 64, 96, 128 | 73.9 | 195.3 |
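The end-to-end data flow can be summarized in a short sketch. The following PyTorch-style pseudocode is illustrative only (the official implementation is in PaddlePaddle/PaddleSeg); the names `stdc_backbone`, `sppm`, `uafm_16`, `uafm_8`, and `seg_head` are hypothetical placeholders for the modules described above:

```python
# Sketch of the PP-LiteSeg forward pass (not the official PaddleSeg code).
import torch
import torch.nn.functional as F

def pp_liteseg_forward(x, stdc_backbone, sppm, uafm_16, uafm_8, seg_head):
    # Encoder: hierarchical features at 1/8, 1/16, and 1/32 of input resolution.
    feat8, feat16, feat32 = stdc_backbone(x)

    # Context aggregation on the deepest (1/32) feature via SPPM.
    ctx = sppm(feat32)

    # Decoder (FLD): progressively upsample and fuse with encoder skips via UAFM.
    d16 = uafm_16(high=ctx, low=feat16)   # fuse at 1/16 scale
    d8 = uafm_8(high=d16, low=feat8)      # fuse at 1/8 scale

    # Segmentation head (1x1 Conv-BN-ReLU) maps to class logits,
    # then 8x up-sampling restores full resolution from the 1/8-scale output.
    logits = seg_head(d8)
    return F.interpolate(logits, scale_factor=8, mode="bilinear",
                         align_corners=False)
```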
2. Unified Attention Fusion Module (UAFM)
UAFM performs weighted fusion of a high-level (up-sampled) feature tensor and a low-level feature tensor, both of shape $C \times H \times W$ after up-sampling. Let $F_{high}$ denote the output of the deeper module and $F_{low}$ the skip-connection feature from the encoder. Its mathematical formulation is:
- Upsample high-level features: $F_{up} = \mathrm{Upsample}(F_{high})$
- Compute the attention weight: $\alpha = \mathrm{Attention}(F_{up}, F_{low})$, where the attention module terminates in a sigmoid activation so that $\alpha \in [0, 1]$
- Fuse features: $F_{out} = F_{up} \cdot \alpha + F_{low} \cdot (1 - \alpha)$
Spatial Attention: For each pixel, the channel-wise mean and max of both inputs are concatenated into a $4 \times H \times W$ tensor, processed by a convolution, and passed through a sigmoid:

$\alpha = \mathrm{Sigmoid}\big(\mathrm{Conv}(\mathrm{Cat}(\mathrm{Mean}(F_{up}), \mathrm{Max}(F_{up}), \mathrm{Mean}(F_{low}), \mathrm{Max}(F_{low})))\big), \quad \alpha \in \mathbb{R}^{1 \times H \times W}$

Channel Attention: Global average and max pooling over the spatial dimensions of both tensors are concatenated into a $4C \times 1 \times 1$ tensor, followed by a convolution and a sigmoid, yielding $\alpha \in \mathbb{R}^{C \times 1 \times 1}$.
PP-LiteSeg predominantly applies the spatial-attention variant to minimize overhead.
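A minimal sketch of UAFM with spatial attention, following the equations above. It assumes both inputs already share the channel count $C$ (handled upstream by 1×1 convolutions) and that the attention convolution uses a 3×3 kernel, which the paper does not specify:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialUAFM(nn.Module):
    """Sketch of UAFM with spatial attention (illustrative, not official)."""
    def __init__(self):
        super().__init__()
        # 4 input maps (mean/max of F_up and F_low) -> 1 attention map.
        self.conv = nn.Conv2d(4, 1, kernel_size=3, padding=1)

    def forward(self, high, low):
        # F_up = Upsample(F_high), matched to the spatial size of F_low.
        up = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                           align_corners=False)
        # Channel-wise mean and max of each input: four (N, 1, H, W) maps.
        feats = torch.cat([
            up.mean(dim=1, keepdim=True), up.max(dim=1, keepdim=True).values,
            low.mean(dim=1, keepdim=True), low.max(dim=1, keepdim=True).values,
        ], dim=1)
        # alpha = Sigmoid(Conv(...)): one fusion weight per pixel.
        alpha = torch.sigmoid(self.conv(feats))
        # F_out = F_up * alpha + F_low * (1 - alpha)
        return up * alpha + low * (1.0 - alpha)
```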
3. Flexible and Lightweight Decoder (FLD)
FLD addresses load imbalance observed in conventional decoders where fixed channel counts result in shallow/high-resolution layers dominating computational cost. Instead, FLD staggers channels to mirror encoder scaling, reducing channel width at higher resolutions:
- Decoder progression (PP-LiteSeg-T): channel width decreases 128→64→32 as spatial scale increases 1/32→1/16→1/8.
- Each step: Bilinear up-sampling, UAFM fusion with corresponding encoder output, and a 1×1 Conv–BN–ReLU for channel adjustment.
This design produces balanced FLOPs across decoder stages. The final segmentation head comprises a 1×1 Conv–BN–ReLU mapping to the number of semantic classes; since the decoder output is at 1/8 scale, full image resolution is restored via 8× up-sampling.
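A sketch of the channel-adjustment step with the PP-LiteSeg-T widths from above; the exact placement of the 1×1 Conv–BN–ReLU within each stage is an assumption of this sketch, and the UAFM fusion modules are omitted:

```python
import torch.nn as nn

class ConvBNReLU(nn.Module):
    """1x1 Conv-BN-ReLU used to adjust channel width between decoder stages."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# FLD channel progression for PP-LiteSeg-T: 128 channels at 1/32 scale,
# reduced to 64 at 1/16 and 32 at 1/8 (hypothetical wiring).
reduce_to_16 = ConvBNReLU(128, 64)  # applied around the 1/16-scale fusion
reduce_to_8 = ConvBNReLU(64, 32)    # applied around the 1/8-scale fusion
```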
4. Simple Pyramid Pooling Module (SPPM)
SPPM aggregates global contextual features from the 1/32 backbone output. It applies three parallel average-pooling operations with bin sizes $1\times1$, $2\times2$, and $4\times4$. Each pooled feature passes through a $1\times1$ convolution and is up-sampled to the input feature size; the results are summed element-wise and refined by a final $3\times3$ convolution:

$F_{out} = \mathrm{Conv}_{3\times3}\Big(\textstyle\sum_{i} \mathrm{Upsample}\big(\mathrm{Conv}_{1\times1}(\mathrm{Pool}_{i}(F_{in}))\big)\Big)$

Intermediate channel widths are halved relative to the standard pyramid pooling module to restrict SPPM-induced overhead below 1 ms on an NVIDIA 1080Ti.
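A compact sketch of SPPM following the equation above; the channel widths and the bilinear up-sampling mode are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPPM(nn.Module):
    """Sketch of the Simple Pyramid Pooling Module: three average-pooling
    branches (1x1, 2x2, 4x4 bins), a per-branch 1x1 conv, up-sampling,
    element-wise summation, and a final 3x3 conv."""
    def __init__(self, in_ch, mid_ch, out_ch, bins=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, mid_ch, 1, bias=False))
            for b in bins
        ])
        self.out_conv = nn.Conv2d(mid_ch, out_ch, 3, padding=1, bias=False)

    def forward(self, x):
        size = x.shape[2:]
        # Up-sample each branch to the input size and sum element-wise.
        fused = sum(
            F.interpolate(branch(x), size=size, mode="bilinear",
                          align_corners=False)
            for branch in self.branches
        )
        return self.out_conv(fused)
```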
5. Training and Inference Protocols
PP-LiteSeg's training regimen employs the Cityscapes (19 classes) and CamVid (11 classes) datasets:
- Input crops: 1024×512 (Cityscapes), 960×720 (CamVid)
- Augmentation: Random scaling ([0.125, 1.5] for Cityscapes; [0.5, 2.5] for CamVid), random flip, color jitter
- Optimization: SGD (momentum=0.9, weight-decay=5e-4), “poly” learning rate schedule with warm-up; initial LR=0.005 for 160k iterations (Cityscapes), LR=0.01 for 1k iterations (CamVid)
- Loss: Pixel-wise cross-entropy with Online Hard Example Mining (OHEM)
- Inference: Models are exported to ONNX for TensorRT acceleration (CUDA 10.2, cuDNN 7.6, NVIDIA 1080Ti). Cityscapes inference first down-scales the 1024×2048 input to one of two scales (512×1024 or 768×1536), then up-samples predictions back to full resolution; CamVid employs native resolution. Metrics reported include mean Intersection over Union (mIoU) and frames per second (FPS), accounting for all pre/post-processing.
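As an illustration of the OHEM loss referenced above, here is a hedged sketch; `thresh` and `min_kept` are common defaults from other implementations, not values reported in the paper:

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, labels, thresh=0.7, min_kept=10000,
                       ignore_index=255):
    """Sketch of pixel-wise cross-entropy with Online Hard Example Mining:
    keep pixels whose loss exceeds -log(thresh), retaining at least
    `min_kept` of the hardest pixels."""
    # Per-pixel losses; ignored pixels contribute zero loss and sort last.
    losses = F.cross_entropy(logits, labels, ignore_index=ignore_index,
                             reduction="none").flatten()
    sorted_losses, _ = losses.sort(descending=True)
    loss_thresh = -torch.log(torch.tensor(thresh))
    keep = max(min_kept, int((sorted_losses > loss_thresh).sum()))
    keep = min(keep, sorted_losses.numel())
    return sorted_losses[:keep].mean()
```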
6. Evaluation and Comparative Results
PP-LiteSeg attains leading speed–accuracy tradeoffs on Cityscapes:
| Model | Input Size | mIoU (%) | FPS |
|---|---|---|---|
| PP-LiteSeg-T | 512×1024 | 72.0 | 273.6 |
| PP-LiteSeg-B | 512×1024 | 73.9 | 195.3 |
| PP-LiteSeg-T | 768×1536 | 74.9 | 143.6 |
| PP-LiteSeg-B | 768×1536 | 77.5 | 102.6 |
| BiSeNet V2 | 512×1024 | 72.6 | 156 |
| STDC2-Seg75 | 768×1536 | 76.8 | 97 |
This demonstrates superior throughput relative to prior methods at equivalent or higher accuracy.
7. Ablation Analysis and Impact of Innovations
Ablation studies on the Cityscapes validation set (PP-LiteSeg-B2, 768×1536 input) quantify module contributions; the baseline without the proposed modules reaches 77.50% mIoU at 110.9 FPS:
- + FLD: 77.67% mIoU, 109.7 FPS
- + FLD + SPPM: 77.76% mIoU, 106.3 FPS
- + FLD + UAFM: 77.98% mIoU, 105.5 FPS
- + FLD + SPPM + UAFM (full model): 78.21% mIoU (+0.71 over baseline), 102.6 FPS
Qualitative examples indicate enhanced boundary delineation and artifact reduction upon sequential addition of each proposed component. Collectively, the FLD, SPPM, and UAFM modules yield state-of-the-art real-time semantic segmentation results on Cityscapes and CamVid, supporting PP-LiteSeg's efficacy in latency-sensitive environments (Peng et al., 2022).