
Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network (1908.05900v2)

Published 16 Aug 2019 in cs.CV

Abstract: Scene text detection, an important step of scene text reading systems, has witnessed rapid development with convolutional neural networks. Nonetheless, two main challenges still exist and hamper its deployment to real-world applications. The first problem is the trade-off between speed and accuracy. The second one is to model the arbitrary-shaped text instance. Recently, some methods have been proposed to tackle arbitrary-shaped text detection, but they rarely take the speed of the entire pipeline into consideration, which may fall short in practical applications. In this paper, we propose an efficient and accurate arbitrary-shaped text detector, termed Pixel Aggregation Network (PAN), which is equipped with a low computational-cost segmentation head and a learnable post-processing. More specifically, the segmentation head is made up of Feature Pyramid Enhancement Module (FPEM) and Feature Fusion Module (FFM). FPEM is a cascadable U-shaped module, which can introduce multi-level information to guide the better segmentation. FFM can gather the features given by the FPEMs of different depths into a final feature for segmentation. The learnable post-processing is implemented by Pixel Aggregation (PA), which can precisely aggregate text pixels by predicted similarity vectors. Experiments on several standard benchmarks validate the superiority of the proposed PAN. It is worth noting that our method can achieve a competitive F-measure of 79.9% at 84.2 FPS on CTW1500.

Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network: An Overview

The domain of scene text detection is a crucial component within computer vision, facilitating a wide range of applications such as text recognition, retrieval, and visual question answering. The paper reviewed here proposes a novel approach to this task through the implementation of the Pixel Aggregation Network (PAN). PAN is designed to address two primary challenges inherent in arbitrary-shaped scene text detection: balancing computational efficiency with detection accuracy, and accurately modeling the varied shapes of text instances.

Methodology

PAN distinguishes itself by utilizing a segmentation pipeline that is both efficient and effective at detecting arbitrary-shaped text. The backbone of PAN is based on a lightweight architecture, specifically ResNet18, which is enhanced by a segmentation head composed of two modules: the Feature Pyramid Enhancement Module (FPEM) and the Feature Fusion Module (FFM).
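At a shape level, the pipeline can be mocked up as follows. This is a minimal numpy sketch, not the authors' implementation: the channel count (128), the two-FPEM cascade, and the sum-then-upsample fusion rule are illustrative assumptions about how features flow from the backbone through the FPEMs to the FFM.

```python
import numpy as np

# Mock backbone feature pyramid for a 640x640 input:
# ResNet18-style strides 4, 8, 16, 32, each level reduced to 128 channels.
H = W = 640
pyramid = [np.zeros((128, H // s, W // s)) for s in (4, 8, 16, 32)]

def fpem(feats):
    """Stand-in for one FPEM: enhances features; shapes are preserved."""
    return [f.copy() for f in feats]

def ffm(fpem_outputs):
    """Stand-in for FFM: fuse the outputs of all cascaded FPEMs level-wise,
    then upsample every level to 1/4 scale and concatenate channels."""
    fused = [sum(level) for level in zip(*fpem_outputs)]
    up = [np.repeat(np.repeat(f, s // 4, axis=1), s // 4, axis=2)
          for f, s in zip(fused, (4, 8, 16, 32))]
    return np.concatenate(up, axis=0)

outputs = []
feats = pyramid
for _ in range(2):          # two cascaded FPEMs (depth is a choice, not fixed)
    feats = fpem(feats)
    outputs.append(feats)
final = ffm(outputs)
print(final.shape)          # (512, 160, 160): one 1/4-scale map for segmentation
```

The point of the sketch is the data flow: FPEMs are shape-preserving and therefore stackable, and the FFM collapses all cascade depths into a single 1/4-resolution feature map from which the segmentation outputs are predicted.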

  1. Feature Pyramid Enhancement Module (FPEM): FPEM is structured as a cascadable U-shaped module, built with depth-wise separable convolutions. This configuration allows the integration of multi-level feature information while maintaining low computational overhead. Critically, by cascading multiple FPEMs, the network's receptive fields are expanded, thereby improving text detection capabilities.
  2. Feature Fusion Module (FFM): FFM synthesizes features across multiple network layers, focusing on harnessing both low-level and high-level semantic information. Through selective feature aggregation, it ensures that the final feature map is representative and reduces the computational load typically associated with feature concatenation prior to segmentation.
  3. Pixel Aggregation (PA): The distinguishing aspect of PAN is the Pixel Aggregation mechanism, which models the task of grouping text region pixels to their respective kernels via a similarity vector. The network learns this vector in a manner that guides neighboring text pixels to the correct kernel, enhancing segmentation precision and enabling the accurate reconstruction of text instances from arbitrary shapes.
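The grouping rule behind Pixel Aggregation can be sketched in a few lines of numpy. This is a simplified, hedged version: the function names, array layouts, and the global nearest-kernel assignment below are illustrative (the paper grows instances outward from the kernels rather than assigning every pixel in one pass), but the core idea is the same: a text pixel joins the kernel whose mean similarity vector is closest, provided the distance is under a threshold.

```python
import numpy as np

def pixel_aggregation(text_mask, kernel_labels, sim_vectors, dist_thresh=3.0):
    """Group text pixels to kernels by similarity-vector distance.

    text_mask:     (H, W) bool, predicted text region
    kernel_labels: (H, W) int, 0 = background, k > 0 = kernel id
    sim_vectors:   (H, W, D) float, predicted similarity vector per pixel
    Returns an (H, W) int instance map.
    """
    instance = kernel_labels.copy()
    kernel_ids = [k for k in np.unique(kernel_labels) if k > 0]
    # Mean similarity vector of each kernel's pixels.
    means = {k: sim_vectors[kernel_labels == k].mean(axis=0) for k in kernel_ids}
    # Assign each remaining text pixel to the nearest kernel within the threshold.
    for y, x in zip(*np.where(text_mask & (kernel_labels == 0))):
        dists = {k: np.linalg.norm(sim_vectors[y, x] - m) for k, m in means.items()}
        best = min(dists, key=dists.get)
        if dists[best] < dist_thresh:
            instance[y, x] = best
    return instance
```

Because the similarity vectors are predicted by the network and supervised so that pixels of the same instance cluster together while different kernels repel each other, this post-processing step is "learnable" rather than a fixed geometric heuristic.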

Experimental Results

PAN's performance was validated on multiple benchmarks covering different text detection scenarios, including CTW1500, Total-Text, ICDAR 2015, and MSRA-TD500. The network achieved strong F-measure scores on curved-text datasets such as CTW1500 (83.7%) and Total-Text (85.0%), while a lighter configuration trades some accuracy for speed, reaching a competitive 79.9% F-measure at 84.2 FPS on CTW1500. This accuracy-speed profile indicates pragmatic potential for real-world deployment.
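For reference, the F-measure reported throughout is the harmonic mean of detection precision and recall. The values below are made up for illustration, not the paper's per-dataset numbers:

```python
def f_measure(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative only: a detector with 80% precision and 60% recall.
print(round(f_measure(0.8, 0.6), 3))  # 0.686
```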

Implications and Future Directions

This paper makes a noteworthy contribution by presenting a network capable of real-time, high-accuracy, arbitrary-shaped text detection. The implications are vast, providing a feasible solution for real-time applications in autonomous vehicles, augmented reality, and robotics, where both computational efficiency and accuracy are pivotal.

Looking ahead, future research may explore the integration of PAN with NLP models to enable end-to-end text understanding in visual contexts. Additionally, further refinement and training with more diverse datasets could enhance robustness against varied text presentations, such as uncommon fonts, scales, and highly cluttered scenes, pushing the boundaries of what automated scene text detection can accomplish.

In conclusion, PAN represents a significant step forward in the field of scene text detection, offering a balanced solution to the complex problem of efficiently detecting detailed and arbitrary text shapes in dynamic environments. The innovative use of cascaded modules and learnable post-processing presents a template for future work in efficient feature extraction and object segmentation within computer vision.

Authors (8)
  1. Wenhai Wang
  2. Enze Xie
  3. Xiaoge Song
  4. Yuhang Zang
  5. Wenjia Wang
  6. Tong Lu
  7. Gang Yu
  8. Chunhua Shen
Citations (384)