Pixel Aggregation Network for Scene Text Detection
- Pixel Aggregation Network (PAN) is a CNN architecture that efficiently detects arbitrarily shaped text by combining segmentation and learnable pixel aggregation.
- It pairs a ResNet-18 backbone with enhanced feature pyramid modules (FPEM and FFM) and a lightweight segmentation head to extract multi-scale features.
- The network employs a novel pixel aggregation module that clusters text pixels using learned embedding vectors to generate precise text instance masks.
The Pixel Aggregation Network (PAN) is a convolutional neural network architecture designed for efficient and accurate detection of arbitrarily shaped text in images, with specific innovations for real-time scene text detection and spotting. PAN addresses two major challenges in the field: the trade-off between detection accuracy and inference speed, and the robust modeling of text with complex, non-rectangular shapes. The framework is founded upon a segmentation-plus-aggregation pipeline, comprising a lightweight segmentation head that predicts pixel-level text information, and a learnable post-processing module that clusters text pixels into instances using embedding vectors. Its core contributions include the Feature Pyramid Enhancement Module (FPEM), the Feature Fusion Module (FFM), and the Pixel Aggregation (PA) step. The architecture has also been extended to an end-to-end text spotting system (PAN++), incorporating recognition in addition to detection (Wang et al., 2019; Wang et al., 2021).
1. Architectural Principles and Pipeline
PAN is structured as a two-stage architecture:
- Segmentation Head: Built atop a ResNet-18 backbone, the head produces three outputs for each pixel at 1/4 of the input resolution (stride 4):
- A text-region probability map
- A text-kernel probability map (corresponding to a shrunken core of each text instance)
- A $d$-dimensional similarity vector field ($d = 4$ in experiments)
- Pixel Aggregation Module: A learnable, post-processing step that clusters pixels into text instances by their embedding proximity in the learned feature space, starting from kernel seeds and expanding outward (Wang et al., 2019, Wang et al., 2021).
The architecture leverages a thin feature pyramid extracted from ResNet-18 (conv2–conv5, reduced to 128 channels each), which is enhanced via stacked FPEMs and fused by the FFM. The overall design supports real-time performance while maintaining segmentation fidelity for arbitrarily shaped and densely packed texts.
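The head's output layout can be sketched as follows. The 6-channel prediction (1 text channel + 1 kernel channel + 4 similarity dimensions) follows the description above for a 640×640 input; `segmentation_head` and the zero tensors are hypothetical stand-ins for the real convolutional head, shown only to illustrate tensor shapes.

```python
import numpy as np

# Hypothetical shapes for a 640x640 input; the 6 output channels
# (1 text + 1 kernel + 4 similarity dims) follow the paper's setup,
# but this sketch only illustrates the tensor layout, not a real network.
H, W = 640, 640
fused = np.zeros((512, H // 4, W // 4))  # F_f from the FFM: 4 scales x 128 channels


def segmentation_head(features):
    """Toy stand-in for the conv head: split a 6-channel prediction
    into text map, kernel map, and a 4-d similarity vector field."""
    c, h, w = features.shape
    pred = np.zeros((6, h, w))           # a real head computes this with convs
    text_map = pred[0]                   # P_tex: text-region probability
    kernel_map = pred[1]                 # P_ker: kernel probability
    similarity = pred[2:6]               # F: 4-d embedding per pixel
    return text_map, kernel_map, similarity


text_map, kernel_map, similarity = segmentation_head(fused)
print(text_map.shape, kernel_map.shape, similarity.shape)
# (160, 160) (160, 160) (4, 160, 160)
```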
2. Feature Pyramid Enhancement and Fusion
The feature pyramid and its enhancement are central to PAN's efficiency:
- FPEM: Each Feature Pyramid Enhancement Module processes thin features through a two-phase U-shaped path:
- Top-down (up-scale): For each level $i$ (from the deepest to the shallowest), $F_i$ is merged with the $2\times$ bilinearly upsampled $F_{i+1}$ and refined via a depthwise separable convolution ($3\times3$ depthwise convolution followed by $1\times1$ convolution, batch normalization, and ReLU).
- Bottom-up (down-scale): Starting from the shallowest level, analogous merging is performed with stride-2 downsampling.
- The output is an enhanced pyramid $\{F'_2, F'_3, F'_4, F'_5\}$.
- Cascading FPEMs yields improved multi-scale feature extraction at minimal computational cost (~20% FLOPs of standard FPN per FPEM) (Wang et al., 2019).
- FFM: The Feature Fusion Module aggregates outputs from all FPEM cascades:
Corresponding-scale features from all cascades are combined by element-wise addition, upsampled to stride 4, and concatenated channel-wise into a 512-channel fused map $F_f$, which is fed to the segmentation head.
This multi-path, lightweight enhancement enables aggregation of multi-level semantic details necessary for delineating curve and non-standard text shapes at high speed.
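The shape bookkeeping of the FPEM/FFM stage can be sketched with numpy. Only the up-scale (top-down) phase of the FPEM is shown (the down-scale phase is analogous), nearest-neighbour upsampling stands in for bilinear interpolation, and identity maps stand in for the depthwise separable convolutions; the pyramid sizes are a hypothetical example.

```python
import numpy as np

# Sketch of one FPEM up-scale pass and the FFM on random 128-channel
# pyramids. Nearest-neighbour upsampling replaces bilinear, and the
# depthwise separable convs are omitted (identity) for brevity.
def upsample2x(x):
    return x.repeat(2, axis=1).repeat(2, axis=2)      # nearest-neighbour 2x


def fpem_up_scale(pyramid):
    """pyramid: [F2, F3, F4, F5], shallow -> deep, each (128, h, w)."""
    out = [pyramid[-1]]
    for f in reversed(pyramid[:-1]):                  # deep -> shallow
        merged = f + upsample2x(out[0])               # add 2x-upsampled deeper level
        out.insert(0, merged)                         # a real FPEM applies dwconv+1x1 here
    return out


def ffm(pyramids):
    """Element-wise add cascaded pyramids, upsample all to 1/4, concat."""
    summed = [sum(level) for level in zip(*pyramids)]
    h, w = summed[0].shape[1:]
    aligned = [summed[0]] + [
        f.repeat(h // f.shape[1], axis=1).repeat(w // f.shape[2], axis=2)
        for f in summed[1:]
    ]
    return np.concatenate(aligned, axis=0)            # (512, h, w)


sizes = [160, 80, 40, 20]                             # strides 4, 8, 16, 32 of a 640px input
pyr = [np.random.rand(128, s, s) for s in sizes]
enhanced = fpem_up_scale(pyr)
fused = ffm([enhanced, fpem_up_scale(enhanced)])      # two cascaded FPEMs
print(fused.shape)                                    # (512, 160, 160)
```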
3. Segmentation Outputs and Loss Function Design
The segmentation head computes three outputs per pixel:
- $P_{\text{tex}}$: probability map for the overall text region
- $P_{\text{ker}}$: probability map for the "kernel," a shrunken region ensuring separation between closely packed instances
- $F$: a learned similarity (embedding) vector per pixel
Losses are devised to address class imbalance and instance separation:
- Dice Loss for pixel-wise text and kernel prediction:
$$L_{\text{tex}} = 1 - \frac{2\sum_i P_{\text{tex}}(i)\,G_{\text{tex}}(i)}{\sum_i P_{\text{tex}}(i)^2 + \sum_i G_{\text{tex}}(i)^2}$$
(and analogously $L_{\text{ker}}$ on the kernel maps).
- Aggregation loss ($L_{\text{agg}}$) pulls embeddings of a text instance towards their kernel:
$$L_{\text{agg}} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|T_i|}\sum_{p \in T_i}\ln\big(\mathcal{D}(p, K_i) + 1\big), \quad \mathcal{D}(p, K_i) = \max\big(\lVert F(p) - \mathcal{G}(K_i)\rVert - \delta_{\text{agg}},\, 0\big)^2$$
with kernel centroid $\mathcal{G}(K_i) = \frac{1}{|K_i|}\sum_{q \in K_i} F(q)$ and margin $\delta_{\text{agg}} = 0.5$.
- Discrimination loss ($L_{\text{dis}}$) enforces separation of kernels:
$$L_{\text{dis}} = \frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j \neq i}\ln\big(\mathcal{D}(K_i, K_j) + 1\big), \quad \mathcal{D}(K_i, K_j) = \max\big(\delta_{\text{dis}} - \lVert \mathcal{G}(K_i) - \mathcal{G}(K_j)\rVert,\, 0\big)^2$$
($\delta_{\text{dis}} = 3$).
The full loss is $L = L_{\text{tex}} + \alpha L_{\text{ker}} + \beta(L_{\text{agg}} + L_{\text{dis}})$, with $\alpha = 0.5$ and $\beta = 0.25$ in the paper. These losses collectively enable accurate segmentation, robust instance separation, and efficient learning.
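The three losses can be sketched in numpy as below. The margins $\delta_{\text{agg}} = 0.5$ and $\delta_{\text{dis}} = 3$ follow the paper; the tensors and masks are toy data, and batching/OHEM details are omitted.

```python
import numpy as np

# Numpy sketches of the Dice, aggregation, and discrimination losses.
def dice_loss(pred, gt, eps=1e-6):
    inter = 2.0 * (pred * gt).sum()
    return 1.0 - inter / ((pred ** 2).sum() + (gt ** 2).sum() + eps)


def agg_loss(emb, text_masks, kernel_masks, delta_agg=0.5):
    """emb: (d, H, W); masks: one boolean (H, W) array per instance."""
    total = 0.0
    for t_mask, k_mask in zip(text_masks, kernel_masks):
        g = emb[:, k_mask].mean(axis=1)                 # kernel centroid G(K_i)
        dist = np.linalg.norm(emb[:, t_mask] - g[:, None], axis=0)
        d = np.maximum(dist - delta_agg, 0.0) ** 2      # penalize only beyond the margin
        total += np.log(d + 1.0).mean()
    return total / len(text_masks)


def dis_loss(emb, kernel_masks, delta_dis=3.0):
    cents = [emb[:, m].mean(axis=1) for m in kernel_masks]
    n, total = len(cents), 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = np.maximum(delta_dis - np.linalg.norm(cents[i] - cents[j]), 0.0) ** 2
            total += np.log(d + 1.0)
    return total / (n * (n - 1)) if n > 1 else 0.0


# Perfect predictions drive the Dice loss to ~0.
gt = np.zeros((8, 8))
gt[2:6, 2:6] = 1.0
print(round(dice_loss(gt, gt), 4))   # ~0.0
```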
4. Pixel Aggregation Post-Processing
Traditional methods based on morphological connected components or NMS are limited in handling complex and adjacent instances. PAN replaces such non-learnable post-processing with a learnable pixel aggregation module, trained via the embedding losses above:
- Kernel seeds are identified by thresholding and extracting connected components.
- A region-growing algorithm assigns peripheral text-region pixels to kernels whose embedding centroid is closest in the vector space, subject to a distance threshold $d$ (e.g., $d = 6$).
- The process iterates until convergence, yielding final instance masks.
This aggregation mechanism facilitates precise detection of arbitrarily shaped, dense, and curved text, while preserving real-time throughput (Wang et al., 2019, Wang et al., 2021).
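The region-growing step can be sketched as a breadth-first search from kernel pixels. The distance threshold $d = 6$ follows the paper; the 4-connected BFS, toy maps, and single-pass growth are simplifying assumptions of this sketch.

```python
import numpy as np
from collections import deque

# BFS region-growing sketch of pixel aggregation: start from labeled
# kernel pixels and expand to 4-connected text pixels whose embedding
# lies within distance d of the kernel's embedding centroid.
def pixel_aggregation(text_mask, kernel_labels, emb, d=6.0):
    h, w = text_mask.shape
    labels = kernel_labels.copy()
    cents, queue = {}, deque()
    for lab in np.unique(kernel_labels):
        if lab == 0:
            continue
        m = kernel_labels == lab
        cents[lab] = emb[:, m].mean(axis=1)           # kernel embedding centroid
        for y, x in zip(*np.nonzero(m)):
            queue.append((y, x, lab))
    while queue:                                      # grow outward from the seeds
        y, x, lab = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and text_mask[ny, nx] and labels[ny, nx] == 0:
                if np.linalg.norm(emb[:, ny, nx] - cents[lab]) < d:
                    labels[ny, nx] = lab
                    queue.append((ny, nx, lab))
    return labels


# Toy example: one 1x5 text line whose kernel is the middle pixel.
text = np.zeros((1, 5), dtype=bool)
text[0, :] = True
kernels = np.zeros((1, 5), dtype=int)
kernels[0, 2] = 1
emb = np.zeros((4, 1, 5))
print(pixel_aggregation(text, kernels, emb))   # [[1 1 1 1 1]]
```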
5. Kernel Representation and PAN++
PAN++ extends the formulation to end-to-end text spotting:
- The "kernel representation" decomposes each text annotation into a central kernel and the surrounding peripheral text region.
- The margin $d$ for shrinking kernels is
$$d = \frac{\text{Area}(p) \times (1 - r^2)}{\text{Perimeter}(p)}$$
where $p$ is the original polygon and $r$ the shrink ratio.
- All outputs are predicted via a single, fully convolutional network.
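The shrink margin formula above is a one-liner; evaluating it for a hypothetical 100×20 axis-aligned rectangle makes the scale concrete.

```python
# Shrink margin d = Area(p) * (1 - r^2) / Perimeter(p) from the kernel
# representation, evaluated for a hypothetical 100x20 rectangle.
def shrink_margin(area, perimeter, r):
    return area * (1.0 - r ** 2) / perimeter


w_box, h_box = 100.0, 20.0
d = shrink_margin(w_box * h_box, 2 * (w_box + h_box), r=0.5)
print(d)   # 6.25 -> the polygon is offset inward by ~6 pixels
```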
For text recognition, PAN++ incorporates:
- A recognition head based on a Masked RoI mechanism: for each detected instance, the corresponding region in the fused feature map $F_f$ is cropped, masked, and resized.
- An attention-based sequence decoder: multi-head attention and LSTM predict the text sequence.
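The Masked RoI idea can be sketched as mask-then-crop-then-resize on a feature map. Nearest-neighbour sampling stands in for interpolation, and the output size (8×32) is a hypothetical choice for this sketch, not the paper's configuration.

```python
import numpy as np

# Sketch of Masked RoI: zero out features outside the instance mask,
# crop the mask's bounding box, and resize to a fixed grid by
# nearest-neighbour index sampling (a stand-in for interpolation).
def masked_roi(features, mask, out_h=8, out_w=32):
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    crop = (features * mask)[:, y0:y1, x0:x1]      # mask, then crop bbox
    yy = (np.arange(out_h) * (y1 - y0) / out_h).astype(int)
    xx = (np.arange(out_w) * (x1 - x0) / out_w).astype(int)
    return crop[:, yy][:, :, xx]                   # (C, out_h, out_w)


feat = np.random.rand(128, 40, 40)
mask = np.zeros((40, 40), dtype=bool)
mask[10:20, 5:35] = True                           # a 10x30 instance region
roi = masked_roi(feat, mask)
print(roi.shape)   # (128, 8, 32)
```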
The total loss comprises detection and recognition components, $L = L_{\text{det}} + L_{\text{rec}}$, where $L_{\text{det}}$ is the detection loss described above and $L_{\text{rec}}$ is the sequence loss of the attention-based decoder.
PAN++ delivers state-of-the-art end-to-end performance for arbitrarily shaped text lines while maintaining real-time inference (Wang et al., 2021).
6. Experimental Evaluation and Benchmarks
PAN achieves a strong trade-off between accuracy and speed across key benchmarks:
| Dataset | Precision | Recall | F-Measure | FPS |
|---|---|---|---|---|
| CTW1500@320 | 82.7% | 77.4% | 79.9% | 84.2 |
| CTW1500@640 | 86.4% | 81.2% | 83.7% | 39.8 |
| Total-Text@640 | 89.3% | 81.0% | 85.0% | 39.6 |
| IC15@736 | 82.9% | 77.8% | 80.3% | 26.1 |
| MSRA-TD500@736 | 80.7% | 77.3% | 78.9% | 30.2 |
PAN++ achieves an end-to-end text spotting F-measure of 64.9 at 29.2 FPS on Total-Text, surpassing previous methods (Wang et al., 2021).
Strengths:
- Robustness to curved, multi-oriented, variably sized, and tightly packed text
- Real-time inference (up to 84 FPS at 320px), even with a lightweight ResNet-18 backbone plus 2×FPEM and FFM
- Superior or matched accuracy compared to much heavier segmentation models
Failure cases include: extremely large inter-character spacing, uncommon symbols, and challenging, cluttered backgrounds.
7. Limitations and Research Outlook
While PAN demonstrates high efficiency and segmentation accuracy, limitations remain:
- Detection may fail with large character gaps or uncommon symbols not in training data.
- Improved grouping strategies and extended embedding dimensions could enhance instance separation.
- Integration of LLMs or joint detection/recognition may address failure modes related to complex scripts or symbols.
- Further optimization of GPU-parallel clustering in the aggregation step can improve inference speed.
- Application to domains beyond scene text, where instance-level segmentation with embedded similarity cues is relevant, is a plausible area for future research.
PAN and PAN++ introduce a principled, efficient, and extensible framework for scene text detection, with kernel-based segmentation and learnable pixel aggregation at their core (Wang et al., 2019, Wang et al., 2021).