FOTS: Fast Oriented Text Spotting with a Unified Network (1801.01671v2)

Published 5 Jan 2018 in cs.CV

Abstract: Incidental scene text spotting is considered one of the most difficult and valuable challenges in the document analysis community. Most existing methods treat text detection and recognition as separate tasks. In this work, we propose a unified end-to-end trainable Fast Oriented Text Spotting (FOTS) network for simultaneous detection and recognition, sharing computation and visual information among the two complementary tasks. Specially, RoIRotate is introduced to share convolutional features between detection and recognition. Benefiting from convolution sharing strategy, our FOTS has little computation overhead compared to baseline text detection network, and the joint training method learns more generic features to make our method perform better than these two-stage methods. Experiments on ICDAR 2015, ICDAR 2017 MLT, and ICDAR 2013 datasets demonstrate that the proposed method outperforms state-of-the-art methods significantly, which further allows us to develop the first real-time oriented text spotting system which surpasses all previous state-of-the-art results by more than 5% on ICDAR 2015 text spotting task while keeping 22.6 fps.

Fast Oriented Text Spotting with a Unified Network: An Expert Overview

The paper presents Fast Oriented Text Spotting (FOTS), a network for spotting text in natural scenes. This unified framework integrates text detection and recognition into a single end-to-end trainable system, addressing a drawback of conventional two-stage methods, which handle the two tasks separately. By sharing convolutional features between detection and recognition, the architecture reduces computation while improving accuracy, and it runs in real time (22.6 fps on the ICDAR 2015 text spotting task).

Core Contributions

The primary contributions of this paper are centered around a novel integration strategy:

Unified Network Design: FOTS employs a single network that simultaneously detects and recognizes text, sharing convolutional features across both tasks. Because these features are computed only once, recognition adds little computation on top of the baseline text detection network.
RoIRotate Operation: A key innovation is RoIRotate, a differentiable operation that extracts the features of oriented text regions from the shared convolutional feature maps. It connects the detection and recognition branches, and because it is differentiable, the whole network can be trained end to end.
Benchmark Results: On the ICDAR 2015, ICDAR 2017 MLT, and ICDAR 2013 benchmarks, FOTS outperforms the previous state-of-the-art methods. Most notably, it surpasses prior results on the ICDAR 2015 text spotting task by more than 5% while running at 22.6 fps, which makes it a real-time oriented text spotting system.
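
As a rough illustration of the RoIRotate idea described above, the following NumPy sketch resamples an oriented box from a feature map into an axis-aligned patch using bilinear interpolation. It is a minimal sketch, not the authors' implementation: the function names (`bilinear_sample`, `roi_rotate`), the box parameterization (center, width, height, angle in radians), and the output height of 8 are all assumptions made for this example.

```python
import numpy as np


def bilinear_sample(feat, xs, ys):
    """Sample feat (C, H, W) at fractional pixel coordinates xs, ys."""
    _, height, width = feat.shape
    xs = np.clip(xs, 0, width - 1)
    ys = np.clip(ys, 0, height - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, width - 1)
    y1 = np.minimum(y0 + 1, height - 1)
    wx = xs - x0  # horizontal interpolation weight
    wy = ys - y0  # vertical interpolation weight
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def roi_rotate(feat, cx, cy, w, h, theta, out_h=8):
    """Resample a rotated box into an axis-aligned patch (hypothetical helper).

    The output height is fixed and the width follows the box aspect ratio.
    """
    out_w = max(1, int(round(out_h * w / h)))
    # Sample positions in the box's own frame, centered on (0, 0).
    u = (np.arange(out_w) + 0.5) / out_w * w - w / 2.0
    v = (np.arange(out_h) + 0.5) / out_h * h - h / 2.0
    uu, vv = np.meshgrid(u, v)  # each of shape (out_h, out_w)
    # Rotate by theta and translate to the box center (an affine transform).
    xs = cx + uu * np.cos(theta) - vv * np.sin(theta)
    ys = cy + uu * np.sin(theta) + vv * np.cos(theta)
    return bilinear_sample(feat, xs, ys)


# Toy feature map with one channel whose value is 2*x + 3*y.
grid_y, grid_x = np.mgrid[0:32, 0:64]
feat = (2.0 * grid_x + 3.0 * grid_y)[None]
patch = roi_rotate(feat, cx=30, cy=16, w=24, h=6, theta=0.3)
print(patch.shape)  # (1, 8, 32)
```

Every step here is a weighted sum of feature values, so an equivalent operation written in an autodiff framework would pass gradients from the recognition branch back into the shared features.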

Methodology

The architecture consists of shared convolutions (utilizing ResNet-50), a text detection branch for bounding boxes, the RoIRotate operation, and a text recognition branch. This integration allows the system to leverage shared features for both tasks:

Text Detection: The detection branch is a fully convolutional network (FCN) that makes dense per-pixel predictions: for each location it outputs a text probability, the bounding box geometry, and the box orientation.
Text Recognition: After detection, RoIRotate extracts the features of each text region. These are then processed by a sequence recognition model that combines convolutional layers, an LSTM, and a CTC decoder to predict the characters.
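
To make the outputs of the two branches more concrete, the sketch below shows one plausible way to post-process them. It is an illustration under stated assumptions, not the paper's code: `decode_box` assumes the per-pixel geometry is given as distances to the top, right, bottom, and left sides of a box plus a rotation angle, and `ctc_greedy_decode` assumes class index 0 is the CTC blank. Both function names and the `ALPHABET` constant are hypothetical.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # illustrative character set


def decode_box(x, y, dists, theta):
    """Turn one pixel's prediction into four box corners, shape (4, 2).

    dists = (top, right, bottom, left) distances from the pixel to the box
    sides, measured in the box's own rotated frame (an assumed convention).
    """
    t, r, b, l = dists
    local = np.array([[-l, -t], [r, -t], [r, b], [-l, b]], dtype=float)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    # Rotate the local corners, then shift them to the pixel position.
    return local @ rot.T + np.array([x, y], dtype=float)


def ctc_greedy_decode(scores, charset, blank=0):
    """Greedy CTC decoding of per-step class scores, shape (T, K).

    Takes the best class at each step, collapses repeats, drops blanks.
    """
    best = np.argmax(scores, axis=1)
    chars = []
    prev = blank
    for k in best:
        if k != prev and k != blank:
            chars.append(charset[k - 1])  # index 0 is reserved for blank
        prev = k
    return "".join(chars)


corners = decode_box(10, 5, (2, 3, 1, 4), theta=0.0)
path = [0, 6, 6, 0, 15, 15, 0, 20, 19, 19]  # blank, f, f, blank, o, o, ...
print(ctc_greedy_decode(np.eye(27)[path], ALPHABET))  # fots
```

Greedy decoding is the simplest CTC decoder. A beam search or a lexicon-constrained decoder could replace it without changing the rest of the pipeline.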

Experimental Results

The paper reports detailed evaluations on standard datasets in support of the FOTS approach. The comparisons highlight the following points:

Efficiency: The system balances speed and accuracy. Because the convolutional features are computed once and shared, FOTS runs nearly twice as fast as a comparable two-stage pipeline.
Accuracy Enhancement: Jointly training detection and recognition yields features that reduce errors common in standalone detectors, such as missed detections, false positives, and misaligned bounding boxes.
Robustness Across Datasets: FOTS's ability to handle varying orientations and complex backgrounds is validated across datasets, including ICDAR 2015 and the multilingual ICDAR 2017 MLT, highlighting its adaptability to real-world conditions.
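
The joint training mentioned above can be summarized as a single objective. One common way to write such an objective, offered here as a hedged sketch rather than a quotation from the paper, is a weighted sum of the detection and recognition losses. The symbol names and the weights \lambda are notation chosen for this illustration:

```latex
L = L_{\mathrm{detect}} + \lambda_{\mathrm{recog}}\, L_{\mathrm{recog}},
\qquad
L_{\mathrm{detect}} = L_{\mathrm{cls}} + \lambda_{\mathrm{reg}}\, L_{\mathrm{reg}}
```

In this reading, L_cls would score the per-pixel text probability, L_reg the box geometry and orientation, and L_recog the CTC loss on the recognized character sequence. Because RoIRotate is differentiable, the gradient of L_recog also reaches the shared convolutional features, which is how recognition supervision can influence detection.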

Implications and Future Directions

The implications of this work are significant for applications requiring efficient and robust text spotting, including autonomous navigation and real-time document analysis. The introduction of RoIRotate provides a template for future research in unified network designs, potentially extending beyond text spotting to other multi-task learning applications.

Looking forward, research might explore further architectural optimizations, smaller models that preserve accuracy, and support for additional languages and character sets. More general models that combine text spotting with other contextual recognition tasks are another possible direction for AI-driven document analysis systems.

Authors

Xuebo Liu
Ding Liang
Shi Yan
Dagui Chen
Yu Qiao
Junjie Yan
