Video text tracking for dense and small text based on pp-yoloe-r and sort algorithm (2304.00018v1)

Published 31 Mar 2023 in cs.CV

Abstract: Although end-to-end video text spotting methods based on Transformers can model long-range dependencies and simplify the training process, their computation cost grows rapidly with the frame size of the input video. Given that the resolution of ICDAR 2023 DSText is 1080 * 1920, and that slicing each video frame into several areas would destroy the spatial correlation of text, we divide small and dense text spotting into two tasks: text detection and tracking. For text detection, we adopt PP-YOLOE-R, which is proven effective in small object detection, as our detection model. For text tracking, we use the SORT algorithm for its high inference speed. Experiments on the DSText dataset demonstrate that our method is competitive on small and dense text spotting.

Video Text Tracking for Dense and Small Text Based on PP-YOLOE-R and Sort Algorithm

The paper "Video Text Tracking for Dense and Small Text Based on PP-YOLOE-R and Sort Algorithm" presents a robust approach to addressing the challenges associated with detecting and tracking dense and small text in high-resolution videos. This research is central to the domain of automated text spotting in video content, an area that necessitates precision due to the text being small, densely packed, and often obscured by various image artifacts.

A significant challenge identified in this work is the computational burden imposed by end-to-end video text spotting models, especially those reliant on Transformers, which are known for their long-range dependency modeling. These models become computationally prohibitive as the resolution of input video frames increases. The authors propose a two-stage pipeline that sidesteps heavy computation while preserving text spatial correlation.

The methodology is structured into two primary tasks: text detection and text tracking. For text detection, the paper employs PP-YOLOE-R, an efficient anchor-free rotated object detector that is effective on small objects. This model garnered attention for achieving a mean Average Precision (mAP) of 78.14 on the DOTA 1.0 dataset, a widely recognized benchmark for oriented object detection in aerial images.
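One practical detail this pairing implies (not spelled out in the paper) is that PP-YOLOE-R emits rotated boxes, while SORT operates on axis-aligned ones, so detections must be converted before tracking. A minimal sketch, assuming rotated boxes are given as (cx, cy, w, h, angle in degrees):

```python
import math

def rbox_to_aabb(cx, cy, w, h, angle_deg):
    """Return the axis-aligned box (x0, y0, x1, y1) that encloses a
    rotated box given by center, size, and rotation angle in degrees."""
    theta = math.radians(angle_deg)
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    # Half-extents of the enclosing axis-aligned box.
    half_w = (w * c + h * s) / 2.0
    half_h = (w * s + h * c) / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)

# Example: a 40x10 box rotated 90 degrees encloses a 10x40 axis-aligned
# region, so the result is close to (45, 30, 55, 70).
aabb = rbox_to_aabb(50.0, 50.0, 40.0, 10.0, 90.0)
```

The conversion loses the orientation, which is acceptable here because orientation is only needed for detection quality, not for frame-to-frame association.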

For the tracking component, the authors utilize the SORT algorithm, noted for its simplicity and rapid inference speed in multi-object tracking scenarios. The combination of PP-YOLOE-R for detection and SORT for tracking forms a pipeline that reportedly balances accuracy and speed, as tested on the ICDAR 2023 DSText dataset. This dataset encapsulates a diverse range of scenarios, providing a comprehensive benchmark for the proposed method.
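The core of SORT-style tracking is associating detections in the current frame with existing tracks by bounding-box overlap. The sketch below illustrates the idea with greedy IoU matching in plain Python; real SORT additionally predicts each track forward with a Kalman filter and solves the assignment optimally with the Hungarian algorithm, so this is a simplified stand-in, not the paper's implementation:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0.0 else 0.0

def associate(tracks, detections, iou_threshold=0.3):
    """Greedily match each detection to the best unmatched track by IoU;
    detections with no match above the threshold start new track IDs.
    `tracks` maps track_id -> last known box; returns track_id -> new box."""
    matched, used = {}, set()
    next_id = max(tracks, default=0) + 1
    for det in detections:
        best_id, best_iou = None, iou_threshold
        for tid, tbox in tracks.items():
            if tid in used:
                continue
            score = iou(tbox, det)
            if score > best_iou:
                best_id, best_iou = tid, score
        if best_id is None:           # no sufficient overlap: new track
            best_id = next_id
            next_id += 1
        used.add(best_id)
        matched[best_id] = det
    return matched
```

Usage: a detection overlapping a known track inherits its ID, while a detection appearing in empty space gets a fresh one, which is how text instances keep consistent identities across consecutive frames.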

The experimental setup used Tesla V100 GPUs and the Paddle deep learning platform to train the PP-YOLOE-R model. Particular attention was given to data augmentation techniques such as random image flips and random rotations, which are critical for making the model robust to the varied text orientations and perspectives encountered in video frames.

The empirical results, supplemented by visualizations, underscore the method's efficacy across different scenarios, including gaming, driving, and street views. These tailored visualizations illustrate the consistent detection and tracking performance, marked by clear trace identification across consecutive frames.

In conclusion, this research demonstrates the viability of decomposing the dense and small text detection problem into manageable sub-tasks, allowing focused optimization for small object detection without the semantic modeling that proves computationally expensive and less effective in dense text scenarios. This approach not only streamlines text tracking in high-resolution videos but also opens avenues for further work on optimizing the component algorithms for text spotting in diverse video contexts. Future developments may explore the integration of more sophisticated tracking algorithms or the application of this model to other domains involving small object dynamics over time.

Authors (1)
  1. Hongen Liu (3 papers)