
Pixel Aggregation Network for Scene Text Detection

Updated 25 February 2026
  • Pixel Aggregation Network (PAN) is a CNN architecture that efficiently detects arbitrarily shaped text by combining segmentation and learnable pixel aggregation.
  • It leverages a lightweight segmentation head based on ResNet-18 and enhanced feature pyramid modules (FPEM and FFM) to extract multi-scale features.
  • The network employs a novel pixel aggregation module that clusters text pixels using learned embedding vectors to generate precise text instance masks.

The Pixel Aggregation Network (PAN) is a convolutional neural network architecture designed for efficient and accurate detection of arbitrarily shaped text in images, with specific innovations for real-time scene text detection and spotting. PAN addresses two major challenges in the field: the trade-off between detection accuracy and inference speed, and the robust modeling of text with complex, non-rectangular shapes. The framework is founded upon a segmentation-plus-aggregation pipeline, comprising a lightweight segmentation head that predicts pixel-level text information, and a learnable post-processing module that clusters text pixels into instances using embedding vectors. Its core contributions include the Feature Pyramid Enhancement Module (FPEM), the Feature Fusion Module (FFM), and the Pixel Aggregation (PA) step. The architecture has also been extended to an end-to-end text spotting system (PAN++), incorporating recognition in addition to detection (Wang et al., 2019; Wang et al., 2021).

1. Architectural Principles and Pipeline

PAN is structured as a two-stage architecture:

  1. Segmentation Head: Built atop a ResNet-18 backbone, the head produces three outputs for each pixel at 1/4 of the input resolution (stride 4):
    • A text-region probability map $P_\text{tex}$
    • A text-kernel probability map $P_\text{ker}$ (corresponding to a shrunken core of each text instance)
    • A $D$-dimensional similarity vector field $\mathcal{F}(p) \in \mathbb{R}^D$ ($D = 4$ in experiments)
  2. Pixel Aggregation Module: A learnable post-processing step that clusters pixels into text instances by their embedding proximity in the learned feature space, starting from kernel seeds and expanding outward (Wang et al., 2019, Wang et al., 2021).

The architecture leverages a thin feature pyramid extracted from ResNet-18 (conv2–conv5, reduced to 128 channels each), which is enhanced via stacked FPEMs and fused by the FFM. The overall design supports real-time performance while maintaining segmentation fidelity for arbitrarily shaped and densely packed texts.
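The interface of the segmentation head described above can be summarized as tensor shapes. The following is a minimal NumPy sketch (variable names are illustrative, not taken from the authors' code); random arrays stand in for the network's actual predictions:

```python
import numpy as np

H, W, D = 640, 640, 4      # input size; embedding dimension D = 4 as in the paper
h, w = H // 4, W // 4      # all head outputs are at 1/4 of the input resolution

# Stand-ins for the three per-pixel predictions of the segmentation head.
P_tex = np.random.rand(h, w)        # text-region probability map
P_ker = np.random.rand(h, w)        # text-kernel probability map
F_emb = np.random.randn(h, w, D)    # similarity (embedding) vector field
```

Every downstream step — the losses in Section 3 and the aggregation in Section 4 — consumes exactly these three maps.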

2. Feature Pyramid Enhancement and Fusion

The feature pyramid and its enhancement are central to PAN's efficiency:

  • FPEM: Each Feature Pyramid Enhancement Module processes thin features $\{f_4, f_3, f_2, f_1\}$ through a two-phase U-shaped path:
    • Top-down (up-scale): for levels $s = 4 \to 1$, upsample and integrate higher-level features via depthwise separable convolutions and bilinear upsampling.
    • Bottom-up (down-scale): for $s = 1 \to 4$, perform analogous downsampling.
    • The output is an enhanced set $\{f''_4, f''_3, f''_2, f''_1\}$.
    • Cascading $n_c = 2$ FPEMs yields improved multi-scale feature extraction at minimal computational cost (~20% of standard FPN FLOPs per FPEM) (Wang et al., 2019).
  • FFM: The Feature Fusion Module aggregates outputs from all FPEM cascades:

$S_s = \sum_{i=1}^{n_c} F_s^{(i)}$

Features for all scales are upsampled to stride 4, concatenated channel-wise, and fed to the segmentation head, resulting in $F_f \in \mathbb{R}^{H/4 \times W/4 \times 512}$.

This multi-path, lightweight enhancement enables aggregation of multi-level semantic details necessary for delineating curve and non-standard text shapes at high speed.
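The FFM fusion above can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: nearest-neighbour upsampling replaces bilinear interpolation for brevity, and toy constant arrays stand in for FPEM outputs. Per the formula, each scale's features are summed across the $n_c$ cascades, upsampled to stride 4, and concatenated into a 512-channel map:

```python
import numpy as np

def nearest_upsample(x, factor):
    """Nearest-neighbour upsampling of an (H, W, C) array by an integer factor."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def ffm(fpem_outputs):
    """Fuse the outputs of n_c FPEM cascades.

    fpem_outputs: list over cascades; each entry is a list of 4 feature maps
    at strides 4, 8, 16, 32 with 128 channels each.
    Returns the stride-4 fused map with 4 * 128 = 512 channels.
    """
    n_scales = len(fpem_outputs[0])
    # S_s = sum over cascades i of F_s^(i)  (element-wise sum per scale)
    summed = [sum(c[s] for c in fpem_outputs) for s in range(n_scales)]
    # Upsample every scale to stride 4 and concatenate channel-wise.
    base_h = summed[0].shape[0]
    ups = [nearest_upsample(f, base_h // f.shape[0]) for f in summed]
    return np.concatenate(ups, axis=-1)

# Two cascades (n_c = 2) of toy features for a 64x64 input (stride-4 map is 16x16).
cascades = [
    [np.ones((16 // 2**s, 16 // 2**s, 128)) for s in range(4)]
    for _ in range(2)
]
fused = ffm(cascades)
```

The resulting `fused` array has shape (16, 16, 512), matching $F_f \in \mathbb{R}^{H/4 \times W/4 \times 512}$.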

3. Segmentation Outputs and Loss Function Design

The segmentation head computes three outputs per pixel:

  • $P_\text{tex}$: probability map for the overall text region
  • $P_\text{ker}$: probability map for the "kernel," a shrunken region ensuring separation between closely packed instances
  • $\mathcal{F}(p)$: a learned similarity vector

Losses are devised to address class imbalance and instance separation:

  • Dice Loss for pixel-wise text and kernel prediction:

$L_\text{tex} = 1 - \dfrac{2 \sum_i P_\text{tex}(i)\, G_\text{tex}(i)}{\sum_i P_\text{tex}(i)^2 + \sum_i G_\text{tex}(i)^2}$

$L_\text{ker} = 1 - \dfrac{2 \sum_i P_\text{ker}(i)\, G_\text{ker}(i)}{\sum_i P_\text{ker}(i)^2 + \sum_i G_\text{ker}(i)^2}$

  • Aggregation loss ($L_\text{agg}$) pulls the embeddings of each text instance towards its kernel:

$L_\text{agg} = \dfrac{1}{N} \sum_{i=1}^N \dfrac{1}{|T_i|} \sum_{p \in T_i} \left[\max\left(0, \|\mathcal{F}(p) - \mathcal{G}_i\|_2 - \delta_\text{agg}\right)\right]^2$

with kernel centroid $\mathcal{G}_i = \frac{1}{|K_i|} \sum_{q \in K_i} \mathcal{F}(q)$ and margin $\delta_\text{agg} = 0.5$.

  • Discrimination loss ($L_\text{dis}$) enforces separation between kernels:

$L_\text{dis} = \dfrac{1}{N(N-1)} \sum_{i \neq j} \left[\max\left(0, \delta_\text{dis} - \|\mathcal{G}_i - \mathcal{G}_j\|_2\right)\right]^2$

with $\delta_\text{dis} = 3$.

The full loss is $L = L_\text{tex} + \alpha L_\text{ker} + \beta (L_\text{agg} + L_\text{dis})$ with $\alpha = 0.5$ and $\beta = 0.25$. These losses collectively enable accurate segmentation, robust instance separation, and efficient learning.
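The three loss terms can be written out directly from the formulas above. The following NumPy sketch (function and argument names are illustrative) takes probability maps, boolean instance masks, and the embedding field, and computes each term; `T` is the full-text mask and `K` the kernel mask of an instance:

```python
import numpy as np

def dice_loss(P, G):
    """Dice loss, as used for L_tex and L_ker."""
    return 1.0 - 2.0 * np.sum(P * G) / (np.sum(P**2) + np.sum(G**2))

def aggregation_loss(F, text_masks, kernel_masks, delta_agg=0.5):
    """L_agg: pull each instance's pixel embeddings towards its kernel centroid."""
    losses = []
    for T, K in zip(text_masks, kernel_masks):
        G_i = F[K].mean(axis=0)                   # kernel centroid G_i
        d = np.linalg.norm(F[T] - G_i, axis=1)    # ||F(p) - G_i||_2 for p in T_i
        losses.append(np.mean(np.maximum(0.0, d - delta_agg) ** 2))
    return float(np.mean(losses))

def discrimination_loss(F, kernel_masks, delta_dis=3.0):
    """L_dis: push kernel centroids of different instances apart."""
    G = [F[K].mean(axis=0) for K in kernel_masks]
    N = len(G)
    if N < 2:
        return 0.0
    total = sum(
        np.maximum(0.0, delta_dis - np.linalg.norm(G[i] - G[j])) ** 2
        for i in range(N) for j in range(N) if i != j
    )
    return float(total / (N * (N - 1)))
```

The full objective is then `dice_tex + 0.5 * dice_ker + 0.25 * (agg + dis)`, matching $\alpha = 0.5$ and $\beta = 0.25$.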

4. Pixel Aggregation Post-Processing

Traditional methods based on morphological connected components or NMS are limited in handling complex and adjacent instances. PAN replaces non-learnable post-processing by a differentiable pixel aggregation module:

  • Kernel seeds are identified by thresholding PkerP_\text{ker} and extracting connected components.
  • A region-growing algorithm assigns peripheral text-region pixels to the kernel whose embedding centroid is closest in the vector space, subject to a distance threshold (e.g., $d \approx 6$).
  • The process iterates until convergence, yielding final instance masks.

This aggregation mechanism facilitates precise detection of arbitrarily shaped, dense, and curved text, while preserving real-time throughput (Wang et al., 2019, Wang et al., 2021).
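The three steps above can be sketched as follows. This is a simplified, unoptimized reference implementation under stated assumptions (4-connectivity, a fixed probability threshold of 0.5, and a single pass of BFS growth rather than iteration to convergence); names are illustrative:

```python
import numpy as np
from collections import deque

def pixel_aggregation(P_tex, P_ker, F, d=6.0, thr=0.5):
    """Cluster text pixels into instances by growing outward from kernel seeds."""
    h, w = P_tex.shape
    text = P_tex > thr
    kernel = P_ker > thr

    # Step 1: label kernel connected components (4-connectivity flood fill).
    labels = np.zeros((h, w), dtype=int)
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if kernel[sy, sx] and labels[sy, sx] == 0:
                next_label += 1
                stack = [(sy, sx)]
                labels[sy, sx] = next_label
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w and kernel[ny, nx] \
                                and labels[ny, nx] == 0:
                            labels[ny, nx] = next_label
                            stack.append((ny, nx))

    # Per-kernel embedding centroids.
    centroids = {i: F[labels == i].mean(axis=0) for i in range(1, next_label + 1)}

    # Step 2: region growing — BFS outward from every labelled kernel pixel,
    # admitting text pixels whose embedding is within d of the kernel centroid.
    queue = deque(zip(*np.nonzero(labels)))
    while queue:
        y, x = queue.popleft()
        lab = labels[y, x]
        for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
            if 0 <= ny < h and 0 <= nx < w and text[ny, nx] and labels[ny, nx] == 0:
                if np.linalg.norm(F[ny, nx] - centroids[lab]) <= d:
                    labels[ny, nx] = lab
                    queue.append((ny, nx))
    return labels  # 0 = background, 1..N = instance ids
```

On a toy 1x5 strip with two kernels at the ends and well-separated embeddings, the middle text pixels are absorbed by whichever kernel they reach first within the embedding margin, yielding two distinct instance masks.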

5. Kernel Representation and PAN++

PAN++ extends the formulation to end-to-end text spotting:

  • The "kernel representation" decomposes each annotation into a central kernel $K$ and a peripheral region $T \setminus K$.
  • The margin $m$ for shrinking kernels is

$m = \dfrac{\text{Area}(b_o)\,\bigl(1 - r^2\bigr)}{\text{Perimeter}(b_o)}$

where $b_o$ is the original polygon and $r$ the shrink ratio.

  • All outputs are predicted via a single, fully convolutional network.
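The shrink margin above is straightforward to compute from polygon vertices via the shoelace formula. A minimal sketch (the shrink ratio value 0.5 below is illustrative, not prescribed by the source):

```python
import math

def shrink_margin(polygon, r=0.5):
    """m = Area(b_o) * (1 - r^2) / Perimeter(b_o) for a polygon
    given as a list of (x, y) vertices; r is the shrink ratio."""
    n = len(polygon)
    area = 0.0
    perimeter = 0.0
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        area += x1 * y2 - x2 * y1                 # shoelace term
        perimeter += math.hypot(x2 - x1, y2 - y1)
    area = abs(area) / 2.0
    return area * (1.0 - r * r) / perimeter

# A 100x20 axis-aligned box: area 2000, perimeter 240,
# so m = 2000 * (1 - 0.25) / 240 = 6.25.
box = [(0, 0), (100, 0), (100, 20), (0, 20)]
m = shrink_margin(box, r=0.5)
```

Offsetting the polygon boundary inward by this $m$ produces the kernel region $K$ used by the kernel representation.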

For text recognition, PAN++ incorporates:

  • A recognition head based on a Masked RoI mechanism: for each detected instance, the corresponding region in $F_f$ is cropped, masked, and resized.
  • An attention-based sequence decoder: multi-head attention and LSTM predict the text sequence.

The total loss comprises detection and recognition components, $\mathcal{L} = \mathcal{L}_\text{det} + \mathcal{L}_\text{rec}$, where

$\mathcal{L}_\text{rec} = \dfrac{1}{|w|} \sum_i \text{CrossEntropy}(y_i, w_i)$

PAN++ delivers state-of-the-art end-to-end performance for arbitrarily shaped text lines while maintaining real-time inference (Wang et al., 2021).

6. Experimental Evaluation and Benchmarks

PAN achieves a strong trade-off between accuracy and speed across key benchmarks:

Dataset           Precision   Recall   F-Measure   FPS
CTW1500@320       82.7%       77.4%    79.9%       84.2
CTW1500@640       86.4%       81.2%    83.7%       39.8
Total-Text@640    89.3%       81.0%    85.0%       39.6
IC15@736          82.9%       77.8%    80.3%       26.1
MSRA-TD500@736    80.7%       77.3%    78.9%       30.2

PAN++ achieves an end-to-end text spotting F-measure of 64.9 at 29.2 FPS on Total-Text, surpassing previous methods (Wang et al., 2021).

Strengths:

  • Robustness to curved, multi-oriented, variably sized, and tightly packed text
  • Real-time inference (up to 84 FPS at 320 px input) with a lightweight ResNet-18 backbone plus 2 stacked FPEMs and an FFM
  • Superior or matched accuracy compared to much heavier segmentation models

Failure cases include extremely large inter-character spacing, uncommon symbols, and challenging, cluttered backgrounds.

7. Limitations and Research Outlook

While PAN demonstrates high efficiency and segmentation accuracy, limitations remain:

  • Detection may fail with large character gaps or uncommon symbols not in training data.
  • Improved grouping strategies and extended embedding dimensions could enhance instance separation.
  • Integration of LLMs or joint detection/recognition may address failure modes related to complex scripts or symbols.
  • Further optimization of GPU-parallel clustering in the aggregation step can improve inference speed.
  • Application to domains beyond scene text, where instance-level segmentation with embedded similarity cues is relevant, is a plausible area for future research.

PAN and PAN++ introduce a principled, efficient, and extensible framework for scene text detection, with kernel-based segmentation and learnable pixel aggregation at their core (Wang et al., 2019, Wang et al., 2021).
