PP-LiteSeg: Real-Time Semantic Segmentation

Updated 21 November 2025
  • The paper introduces PP-LiteSeg, a semantic segmentation framework that employs a hierarchical backbone and novel modules (FLD, UAFM, SPPM) for efficient feature fusion and balanced computation.
  • Its Flexible and Lightweight Decoder staggers channel widths across resolutions to optimize computational load, while UAFM and SPPM enhance context aggregation with minimal overhead.
  • Evaluated on Cityscapes and CamVid, PP-LiteSeg demonstrates state-of-the-art speed–accuracy tradeoffs, achieving up to 77.5% mIoU with high FPS performance.

PP-LiteSeg is a real-time semantic segmentation framework developed for high-throughput and low-latency scenarios, following an encoder–aggregation–decoder design paradigm. It introduces three key architectural innovations: the Flexible and Lightweight Decoder (FLD), the Unified Attention Fusion Module (UAFM), and the Simple Pyramid Pooling Module (SPPM). These components collectively facilitate efficient feature fusion, balanced computational load across resolution scales, and context aggregation with minimal overhead, attaining state-of-the-art speed–accuracy tradeoffs on benchmarks such as Cityscapes and CamVid (Peng et al., 2022).

1. Network Architecture and Model Variants

PP-LiteSeg employs a hierarchical backbone, STDCNet, which generates feature maps at four spatial strides: 1/4, 1/8, 1/16, and 1/32 of the input resolution. Two primary variants are defined:

  • PP-LiteSeg-T: Utilizes STDC1 with decoder channel widths {32, 64, 128}.
  • PP-LiteSeg-B: Utilizes STDC2 with decoder channel widths {64, 96, 128}.

The deepest backbone feature (at 1/32 scale) first passes through SPPM for global context enrichment. The result is then progressively up-sampled and fused with the skip-connection outputs at the 1/16 and 1/8 scales via UAFM blocks in the decoder. Channel widths are reduced at each fusion stage via 1×1 convolutions before a segmentation head maps the features to class logits and final up-sampling restores full resolution.

| Variant | Backbone | Decoder Channels | mIoU (%) | FPS |
|---|---|---|---|---|
| PP-LiteSeg-T | STDC1 | 32, 64, 128 | 72.0 | 273.6 |
| PP-LiteSeg-B | STDC2 | 64, 96, 128 | 73.9 | 195.3 |
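
A minimal sketch of this encoder–aggregation–decoder flow is shown below, written in PyTorch for illustration (the official implementation is in PaddlePaddle). The `ConvBNReLU` helper, the `FusionStub` stand-in for UAFM, the placeholder context block standing in for SPPM, the encoder channel widths, and the `backbone` interface returning 1/8, 1/16, and 1/32 features are all assumptions of this sketch, not details from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F


class ConvBNReLU(nn.Sequential):
    """k x k convolution + BatchNorm + ReLU, used throughout the sketch."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )


class FusionStub(nn.Module):
    """Placeholder for UAFM (Section 2): project, up-sample, and add."""
    def __init__(self, high_ch, low_ch, out_ch):
        super().__init__()
        self.proj_high = ConvBNReLU(high_ch, out_ch, k=1)
        self.proj_low = ConvBNReLU(low_ch, out_ch, k=1)

    def forward(self, f_high, f_low):
        f_up = F.interpolate(self.proj_high(f_high), size=f_low.shape[2:],
                             mode='bilinear', align_corners=False)
        return f_up + self.proj_low(f_low)


class PPLiteSegFlow(nn.Module):
    """Illustrative encoder -> SPPM -> decoder flow (decoder widths of the T variant)."""
    def __init__(self, backbone, num_classes,
                 enc_chs=(256, 512, 1024), dec_chs=(32, 64, 128)):
        super().__init__()
        self.backbone = backbone                                  # returns 1/8, 1/16, 1/32 features
        self.context = ConvBNReLU(enc_chs[2], dec_chs[2], k=1)    # placeholder for SPPM (Section 4)
        self.fuse16 = FusionStub(dec_chs[2], enc_chs[1], dec_chs[1])
        self.fuse8 = FusionStub(dec_chs[1], enc_chs[0], dec_chs[0])
        self.head = nn.Conv2d(dec_chs[0], num_classes, kernel_size=1)

    def forward(self, x):
        f8, f16, f32 = self.backbone(x)   # skip connections at 1/8, 1/16, 1/32
        y = self.context(f32)             # global context enrichment at 1/32
        y = self.fuse16(y, f16)           # up-sample and fuse at 1/16
        y = self.fuse8(y, f8)             # up-sample and fuse at 1/8
        logits = self.head(y)
        return F.interpolate(logits, size=x.shape[2:],
                             mode='bilinear', align_corners=False)
```

With an STDC-style backbone plugged in, the forward pass mirrors the sequence described above: SPPM on the 1/32 feature, two UAFM fusions, then the segmentation head and final up-sampling.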

2. Unified Attention Fusion Module (UAFM)

UAFM performs weighted fusion of up-sampled high-level features and low-level features of shape $C \times H \times W$. Its mathematical formulation is:

Let $F_{\text{high}} \in \mathbb{R}^{C \times H' \times W'}$ and $F_{\text{low}} \in \mathbb{R}^{C \times H \times W}$.

  1. Upsample the high-level features:

$$F_{\text{up}} = \mathrm{Upsample}(F_{\text{high}}), \quad F_{\text{up}} \in \mathbb{R}^{C \times H \times W}$$

  2. Compute the attention weight $W$ via

$$W = \alpha = \sigma\big(f(F_{\text{up}}, F_{\text{low}})\big),$$

where $\sigma$ denotes the sigmoid activation and $f$ produces the attention map (spatial or channel, as defined below).

  3. Fuse the features:

$$F_{\text{out}} = W \odot F_{\text{up}} + (1-W) \odot F_{\text{low}}$$

Spatial Attention: For each pixel, the channel-wise mean and max of both inputs are concatenated, processed by a $7 \times 7$ convolution, and passed through a sigmoid:

$$
\begin{aligned}
M_1 &= \mathrm{Mean}_{\mathrm{ch}}(F_{\text{up}}), \quad M_2 = \mathrm{Max}_{\mathrm{ch}}(F_{\text{up}}), \\
M_3 &= \mathrm{Mean}_{\mathrm{ch}}(F_{\text{low}}), \quad M_4 = \mathrm{Max}_{\mathrm{ch}}(F_{\text{low}}), \\
F_{\text{cat}} &= \mathrm{Concat}[M_1, M_2, M_3, M_4] \in \mathbb{R}^{4 \times H \times W}, \\
A_s &= \sigma\big(\mathrm{Conv}_{7 \times 7}(F_{\text{cat}})\big) \in \mathbb{R}^{1 \times H \times W}
\end{aligned}
$$

Channel Attention: Global average and max pooling of both tensors are concatenated, followed by a $1 \times 1$ convolution and a sigmoid:

$$
\begin{aligned}
C_1 &= \mathrm{GAP}(F_{\text{up}}), \quad C_2 = \mathrm{GMP}(F_{\text{up}}), \\
C_3 &= \mathrm{GAP}(F_{\text{low}}), \quad C_4 = \mathrm{GMP}(F_{\text{low}}), \\
F_{\text{cat}} &= \mathrm{Concat}[C_1, C_2, C_3, C_4] \in \mathbb{R}^{4C \times 1 \times 1}, \\
A_c &= \sigma\big(\mathrm{Conv}_{1 \times 1}(F_{\text{cat}})\big) \in \mathbb{R}^{C \times 1 \times 1}
\end{aligned}
$$

PP-LiteSeg predominantly applies the spatial variant ($W = A_s$) to minimize overhead.
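
A self-contained sketch of the spatial-attention variant is given below, using PyTorch for illustration (the reference implementation is in PaddlePaddle). The class name `SpatialAttentionUAFM` and the bilinear up-sampling choice are assumptions of this sketch; the mean/max statistics, $7 \times 7$ convolution, sigmoid, and weighted fusion follow the equations above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialAttentionUAFM(nn.Module):
    """UAFM with spatial attention (sketch): F_out = W*F_up + (1 - W)*F_low."""

    def __init__(self):
        super().__init__()
        # 4 input maps (mean/max over channels of F_up and F_low) -> 1 weight map.
        self.conv = nn.Conv2d(4, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, f_high, f_low):
        # 1. Up-sample the high-level feature to the low-level spatial size.
        f_up = F.interpolate(f_high, size=f_low.shape[2:],
                             mode='bilinear', align_corners=False)

        # 2. Per-pixel channel-wise mean and max statistics of both inputs.
        stats = torch.cat([
            f_up.mean(dim=1, keepdim=True),
            f_up.max(dim=1, keepdim=True).values,
            f_low.mean(dim=1, keepdim=True),
            f_low.max(dim=1, keepdim=True).values,
        ], dim=1)                                      # shape (N, 4, H, W)

        # 3. Attention weight in [0, 1] and element-wise weighted fusion.
        w = torch.sigmoid(self.conv(stats))            # shape (N, 1, H, W)
        return w * f_up + (1.0 - w) * f_low
```

The weight map broadcasts over the channel dimension, so both inputs must share the channel width $C$, matching the formulation above.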

3. Flexible and Lightweight Decoder (FLD)

FLD addresses load imbalance observed in conventional decoders where fixed channel counts result in shallow/high-resolution layers dominating computational cost. Instead, FLD staggers channels to mirror encoder scaling, reducing channel width at higher resolutions:

  • Decoder channel widths (PP-LiteSeg-T): 32, 64, and 128 channels at the 1/8, 1/16, and 1/32 scales respectively, so width shrinks as the decoder moves toward higher resolution.
  • Each step: Bilinear up-sampling, UAFM fusion with corresponding encoder output, and a 1×1 Conv–BN–ReLU for channel adjustment.

This design produces balanced FLOPs across decoder stages. The final segmentation head comprises a 1×1 Conv–BN–ReLU mapping to the number of semantic classes; full image resolution is restored from the 1/8-scale output via 8× up-sampling.
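
To make the load-balancing argument concrete, the back-of-the-envelope comparison below contrasts the per-stage cost of a single 3×3 convolution in a conventional fixed-width decoder (128 channels at every scale) with the staggered FLD widths of PP-LiteSeg-T. The one-convolution cost model and the feature sizes for an assumed 512×1024 input are simplifications for illustration, not figures reported in the paper.

```python
# Rough MAC count of one 3x3 convolution with C input and C output channels
# on an H x W feature map: MACs ~= H * W * C * C * 9 (illustrative cost model).
def conv3x3_macs(h, w, c):
    return h * w * c * c * 9

# Feature-map sizes for an assumed 512 x 1024 input at strides 1/8, 1/16, 1/32.
stages = [("1/8", 64, 128), ("1/16", 32, 64), ("1/32", 16, 32)]

fixed_width = 128                                   # conventional decoder width
fld_widths = {"1/8": 32, "1/16": 64, "1/32": 128}   # PP-LiteSeg-T staggering

for name, h, w in stages:
    conventional = conv3x3_macs(h, w, fixed_width)
    staggered = conv3x3_macs(h, w, fld_widths[name])
    print(f"{name:>5}: fixed-width {conventional / 1e9:.2f} GMACs, "
          f"FLD {staggered / 1e9:.2f} GMACs")

# Under this cost model the fixed-width decoder is dominated by the 1/8 stage
# (~1.21 GMACs vs ~0.08 at 1/32), while the FLD widths give ~0.08 GMACs per stage.
```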

4. Simple Pyramid Pooling Module (SPPM)

SPPM aggregates global contextual features from the 1/32 backbone output. It applies three parallel pooling operations with bin sizes $\{1 \times 1, 2 \times 2, 4 \times 4\}$:

$$X_i = \mathrm{Upsample}\big(\mathrm{Conv}_{1 \times 1}(\mathrm{Pool}_{b_i}(F_{1/32}))\big), \quad i = 1, 2, 3$$

The pooled features are summed and processed through a $1 \times 1$ convolution:

$$F_{\text{sppm}} = \mathrm{Conv}_{1 \times 1}\!\left(\sum_{i=1}^{3} X_i\right)$$

Intermediate channel widths are halved to keep the SPPM-induced overhead below 1 ms on an NVIDIA 1080Ti.
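
A minimal PyTorch sketch of SPPM under this description is shown below (the reference implementation is in PaddlePaddle). Average pooling for the bins, the exact placement of batch normalization and ReLU, and halving the intermediate width relative to the output width are assumptions of this sketch.

```python
import torch.nn as nn
import torch.nn.functional as F


class SPPM(nn.Module):
    """Simple Pyramid Pooling Module (sketch): pool to {1, 2, 4} bins, sum, fuse."""

    def __init__(self, in_ch, out_ch, bins=(1, 2, 4)):
        super().__init__()
        mid_ch = out_ch // 2   # intermediate width halved to limit overhead
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                    # pool to b x b bins
                nn.Conv2d(in_ch, mid_ch, 1, bias=False),
                nn.BatchNorm2d(mid_ch),
                nn.ReLU(inplace=True),
            )
            for b in bins
        ])
        self.out_conv = nn.Sequential(                      # final 1x1 fusion conv
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        size = x.shape[2:]
        fused = None
        for branch in self.branches:
            y = F.interpolate(branch(x), size=size,          # X_i, up-sampled to 1/32 size
                              mode='bilinear', align_corners=False)
            fused = y if fused is None else fused + y        # sum over the three branches
        return self.out_conv(fused)
```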

5. Training and Inference Protocols

PP-LiteSeg's training regimen employs the Cityscapes (19 classes) and CamVid (11 classes) datasets:

  • Input crops: $1024 \times 512$ (Cityscapes), $960 \times 720$ (CamVid)
  • Augmentation: Random scaling ([0.125, 1.5] for Cityscapes; [0.5, 2.5] for CamVid), random flip, color jitter
  • Optimization: SGD (momentum 0.9, weight decay $5 \times 10^{-4}$) with a “poly” learning rate schedule and warm-up (a schedule sketch follows this list); initial LR 0.005 for 160k iterations on Cityscapes, LR 0.01 for 1k iterations on CamVid
  • Loss: Pixel-wise cross-entropy with Online Hard Example Mining (OHEM)
  • Inference: Models are exported to ONNX for TensorRT acceleration (CUDA 10.2, cuDNN 7.6, NVIDIA 1080Ti). Cityscapes evaluation uses two input scales, $1024 \times 512$ and $1536 \times 768$, with predictions up-sampled to full resolution; CamVid uses its native resolution. Reported metrics are mean Intersection over Union (mIoU) and frames per second (FPS), accounting for all pre- and post-processing.
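
The warm-up plus “poly” learning rate schedule mentioned in the optimization bullet can be expressed as a small helper, sketched below. The warm-up length, the warm-up starting LR, and the poly power of 0.9 are common defaults assumed here; this summary does not state their exact values.

```python
def poly_lr_with_warmup(step, total_steps, base_lr,
                        warmup_steps=1000, warmup_start_lr=1e-5, power=0.9):
    """Learning rate at a given iteration: linear warm-up, then poly decay (sketch)."""
    if step < warmup_steps:
        # Linear ramp from warmup_start_lr up to base_lr.
        frac = step / max(1, warmup_steps)
        return warmup_start_lr + (base_lr - warmup_start_lr) * frac
    # Poly decay over the remaining iterations: lr = base_lr * (1 - t)^power.
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * (1.0 - t) ** power


# Example with the Cityscapes settings quoted above (base LR 0.005, 160k iterations).
for s in (0, 500, 1000, 80_000, 160_000):
    print(s, round(poly_lr_with_warmup(s, total_steps=160_000, base_lr=0.005), 6))
```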

6. Evaluation and Comparative Results

PP-LiteSeg attains leading speed–accuracy tradeoffs on Cityscapes:

| Model | Input Size | mIoU (%) | FPS |
|---|---|---|---|
| PP-LiteSeg-T | 512×1024 | 72.0 | 273.6 |
| PP-LiteSeg-B | 512×1024 | 73.9 | 195.3 |
| PP-LiteSeg-T | 768×1536 | 74.9 | 143.6 |
| PP-LiteSeg-B | 768×1536 | 77.5 | 102.6 |
| BiSeNet V2 | 512×1024 | 72.6 | 156 |
| STDC2-Seg75 | 768×1536 | 76.8 | 97 |

This demonstrates superior throughput relative to prior methods at equivalent or higher accuracy.

7. Ablation Analysis and Impact of Innovations

Ablation studies on Cityscapes-val (PP-LiteSeg-B2, 768×1536) quantify module contributions:

  • Addition of FLD: +0.17% mIoU (77.50→77.67), FPS drops from 110.9 to 109.7
  • Addition of SPPM: +0.09% mIoU (77.67→77.76), FPS from 109.7 to 106.3
  • Addition of UAFM: +0.22% mIoU (77.76→77.98), FPS from 106.3 to 105.5
  • All modules: +0.71% mIoU (77.50→78.21), FPS from 110.9 to 102.6

Qualitative examples indicate enhanced boundary delineation and artifact reduction upon sequential addition of each proposed component. Collectively, the FLD, SPPM, and UAFM modules yield state-of-the-art real-time semantic segmentation results on Cityscapes and CamVid, supporting PP-LiteSeg's efficacy in latency-sensitive environments (Peng et al., 2022).

References

Peng, J., Liu, Y., Tang, S., et al. (2022). PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model. arXiv:2204.02681.

Follow Topic

Get notified by email when new papers are published related to PP-LiteSeg.