TextBoxes++: A Single-Shot Oriented Scene Text Detector

Published 9 Jan 2018 in cs.CV (arXiv:1801.02765v3)

Abstract: Scene text detection is an important step in scene text recognition systems and a challenging problem in its own right. Unlike general object detection, the main challenges of scene text detection lie in the arbitrary orientations, small sizes, and widely varying aspect ratios of text in natural images. In this paper, we present an end-to-end trainable fast scene text detector, named TextBoxes++, which detects arbitrary-oriented scene text with both high accuracy and efficiency in a single network forward pass. No post-processing other than an efficient non-maximum suppression is involved. We have evaluated the proposed TextBoxes++ on four public datasets. In all experiments, TextBoxes++ outperforms competing methods in terms of text localization accuracy and runtime. More specifically, TextBoxes++ achieves an f-measure of 0.817 at 11.6 fps on 1024×1024 ICDAR 2015 Incidental text images, and an f-measure of 0.5591 at 19.8 fps on 768×768 COCO-Text images. Furthermore, combined with a text recognizer, TextBoxes++ significantly outperforms the state-of-the-art approaches for word spotting and end-to-end text recognition tasks on popular benchmarks. Code is available at: https://github.com/MhLiao/TextBoxes_plusplus

Summary

  • The paper introduces TextBoxes++, a single-shot detector that directly predicts oriented text bounding boxes in natural images.
  • Its end-to-end CNN architecture leverages inception-style filters and cascaded NMS, achieving an F-measure of 0.817 on ICDAR 2015.
  • The approach offers efficient, robust detection of multi-oriented text, outperforming prior methods such as EAST and DMPNet.

The paper presents an advancement in scene text detection through the development of TextBoxes++, a single-shot oriented scene text detector. The emphasis is on providing a technique that achieves high accuracy and efficiency when detecting arbitrary-oriented text within natural images, addressing key challenges such as varying orientations, sizes, and aspect ratios of text.

Technical Summary

TextBoxes++ is inspired by developments in general object detection, particularly the SSD (Single Shot Multibox Detector), and introduces specific adaptations for text. The core of the approach is a fully convolutional neural network that bypasses the complexities associated with traditional text detection methods like character-level analysis or extensive post-processing. Key elements include:

  • Oriented Detection: The detector directly predicts word bounding boxes using quadrilateral or oriented rectangle representations.
  • End-to-End Training: The detector is trained end to end as a single network; combined with a separate text recognizer, it further supports word spotting and end-to-end recognition.
  • Unique Network Architecture: Incorporates inception-style filters and dense default boxes to handle multi-scale and arbitrary-oriented text effectively. Vertical offsets are employed for dense coverage, enhancing performance on closely spaced text.
  • Improved Data Augmentation: Introduces a random cropping strategy tailored to the small sizes typical of text in images, improving training effectiveness.
  • Non-Maximum Suppression (NMS): A cascaded NMS approach efficiently merges multi-scale detection outputs, improving speed without sacrificing accuracy.
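The cascaded NMS can be pictured with the sketch below. This is an illustrative re-implementation of the idea, not the authors' released code; the thresholds (a relaxed 0.5 for the coarse stage, a stricter 0.2 for the exact stage) and the convex, counter-clockwise corner ordering are assumptions.

```python
# Sketch of cascaded NMS in the spirit of TextBoxes++: a cheap NMS on
# axis-aligned bounding rectangles removes most candidates, then an exact
# but slower polygon-IoU NMS runs on the few survivors.
# Quadrilaterals are lists of four (x, y) corners, assumed convex and in
# counter-clockwise order.

def poly_area(pts):
    """Shoelace area of a polygon."""
    s = 0.0
    for i in range(len(pts)):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % len(pts)]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def _clip(poly, a, b):
    """Keep the part of `poly` on the left of the directed edge a -> b."""
    out = []
    for i in range(len(poly)):
        p, q = poly[i], poly[(i + 1) % len(poly)]
        sp = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
        sq = (b[0] - a[0]) * (q[1] - a[1]) - (b[1] - a[1]) * (q[0] - a[0])
        if sp >= 0:
            out.append(p)
        if sp * sq < 0:  # the edge p -> q crosses the clip line
            t = sp / (sp - sq)
            out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out

def poly_iou(p1, p2):
    """Exact IoU of two convex polygons via Sutherland-Hodgman clipping."""
    inter = list(p1)
    for i in range(len(p2)):
        if not inter:
            break
        inter = _clip(inter, p2[i], p2[(i + 1) % len(p2)])
    ia = poly_area(inter) if len(inter) >= 3 else 0.0
    union = poly_area(p1) + poly_area(p2) - ia
    return ia / union if union > 0 else 0.0

def _aabb(quad):
    xs, ys = [p[0] for p in quad], [p[1] for p in quad]
    return min(xs), min(ys), max(xs), max(ys)

def _rect_iou(r, s):
    iw = max(0.0, min(r[2], s[2]) - max(r[0], s[0]))
    ih = max(0.0, min(r[3], s[3]) - max(r[1], s[1]))
    inter = iw * ih
    union = ((r[2] - r[0]) * (r[3] - r[1])
             + (s[2] - s[0]) * (s[3] - s[1]) - inter)
    return inter / union if union > 0 else 0.0

def cascaded_nms(quads, scores, coarse_thr=0.5, fine_thr=0.2):
    """Two-stage NMS: cheap rectangle IoU first, exact polygon IoU second."""
    order = sorted(range(len(quads)), key=lambda i: scores[i], reverse=True)
    rects = [_aabb(q) for q in quads]
    stage1 = []
    for i in order:
        if all(_rect_iou(rects[i], rects[j]) <= coarse_thr for j in stage1):
            stage1.append(i)
    keep = []
    for i in stage1:
        if all(poly_iou(quads[i], quads[j]) <= fine_thr for j in keep):
            keep.append(i)
    return keep
```

Running the exact polygon test only on the survivors of the cheap rectangle stage is what keeps the suppression fast despite the costly polygon intersection.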

Performance and Evaluation

The paper provides extensive evaluations on datasets with diverse text orientations, notably ICDAR 2015 Incidental Text and COCO-Text, as well as horizontal datasets like ICDAR 2013 and SVT. TextBoxes++ shows superior performance:

  • Achieves an F-measure of 0.817 at 11.6 fps on ICDAR 2015, improving on previous methods in both accuracy and speed.
  • Demonstrates versatility across datasets, proving effective for both horizontal and multi-oriented texts.
  • In contrast to competing methods, exhibits a favorable balance between runtime efficiency and detection performance.
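The F-measures reported above are the harmonic mean of detection precision and recall. A minimal sketch follows; the precision/recall values in the usage line are illustrative, not figures from the paper.

```python
def f_measure(precision, recall):
    """F-measure (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values only (not the paper's reported precision/recall):
print(f_measure(0.9, 0.75))  # 0.8181...
```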

Comparative Analysis

TextBoxes++ is contrasted with methods such as EAST and DMPNet. It is simpler, relying on horizontal default boxes from which quadrilateral offsets are regressed, together with a straightforward, efficient network architecture. This simplicity does not compromise performance: TextBoxes++ surpasses these state-of-the-art counterparts on key benchmarks in both accuracy and runtime.
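One way to picture the horizontal-default-box design: the default box only anchors the prediction, and the network regresses per-corner offsets that deform it into a quadrilateral. The decoding below is a hypothetical sketch of that idea; the paper's exact parametrization may differ.

```python
def decode_quad(default_box, offsets):
    """Turn a horizontal default box plus 8 predicted corner offsets into a
    quadrilateral. `default_box` is (cx, cy, w, h); `offsets` is
    (dx1, dy1, ..., dx4, dy4), normalized by the box width/height.
    Corners are taken clockwise from the top-left of the default box.
    (Hypothetical parametrization for illustration.)"""
    cx, cy, w, h = default_box
    corners = [(cx - w / 2, cy - h / 2), (cx + w / 2, cy - h / 2),
               (cx + w / 2, cy + h / 2), (cx - w / 2, cy + h / 2)]
    return [(x + w * dx, y + h * dy)
            for (x, y), (dx, dy) in zip(corners,
                                        zip(offsets[0::2], offsets[1::2]))]

# Zero offsets recover the default box itself; a nonzero offset moves
# the corresponding corner, so arbitrary orientations need no rotated priors.
print(decode_quad((1.0, 1.0, 2.0, 2.0), (0.0,) * 8))
```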

Implications and Future Directions

The proposed method holds key implications for real-time applications requiring efficient text detection in dynamic environments, such as augmented reality and autonomous navigation. TextBoxes++ paves the way toward integrating text recognition more profoundly with detection tasks, enhancing overall system robustness.

Future research can extend this work by addressing remaining limitations, such as irregular character spacing and occlusion. Tighter integration with text recognition could further improve detection accuracy, advancing scene text understanding in AI systems.

This paper contributes to the advancement of scene text detection through a comprehensive approach that balances accuracy and efficiency, supporting future explorations in robust, real-time text detection systems.
