Video Text Tracking for Dense and Small Text Based on PP-YOLOE-R and Sort Algorithm
The paper "Video Text Tracking for Dense and Small Text Based on PP-YOLOE-R and Sort Algorithm" presents a robust approach to addressing the challenges associated with detecting and tracking dense and small text in high-resolution videos. This research is central to the domain of automated text spotting in video content, an area that necessitates precision due to the text being small, densely packed, and often obscured by various image artifacts.
A central challenge identified in this work is the computational burden of end-to-end video text spotting models, especially Transformer-based ones, whose long-range dependency modeling becomes prohibitively expensive as the resolution of input frames grows. The authors instead propose a two-stage pipeline that sidesteps this cost while preserving the spatial correlation among text instances.
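To make the decomposition concrete, the minimal sketch below shows how such a two-stage pipeline could be wired together. It is illustrative only: `detect_text` stands in for PP-YOLOE-R inference and `tracker` for the SORT tracker, and none of the names or interfaces are taken from the authors' code.

```python
import cv2


def track_video(video_path, detect_text, tracker):
    """Two-stage pipeline: run the detector on every frame, then hand
    the detections to the tracker for cross-frame association.

    `detect_text(frame)` is assumed to return per-frame boxes as
    (x1, y1, x2, y2, score); `tracker.update(detections)` is assumed
    to return boxes with track ids attached.
    """
    cap = cv2.VideoCapture(video_path)
    per_frame_tracks = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        detections = detect_text(frame)        # stage 1: detection only
        tracks = tracker.update(detections)    # stage 2: association only
        per_frame_tracks.append(tracks)
    cap.release()
    return per_frame_tracks
```

Because the two stages communicate only through box coordinates, either component can be swapped or tuned independently, which is precisely the flexibility the decomposition is meant to buy.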
The methodology is structured into two primary tasks: text detection and text tracking. For detection, the paper employs PP-YOLOE-R, an efficient anchor-free rotated object detector that is well suited to small, arbitrarily oriented targets. The model achieves a mean Average Precision (mAP) of 78.14 on DOTA 1.0, a widely used benchmark for oriented object detection in aerial images, where many targets are small.
For the tracking component, the authors adopt the SORT algorithm, valued for its simplicity and fast inference in multiple object tracking. Pairing PP-YOLOE-R for detection with SORT for tracking yields a pipeline that reportedly performs well in both accuracy and speed on the ICDAR2023 DSText dataset, which spans a diverse range of scenarios and thus provides a demanding benchmark for the proposed method.
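SORT itself predicts each existing track's next position with a per-track Kalman filter and then matches predictions to new detections by solving a bipartite assignment on IoU overlap. The snippet below sketches only that association step, on axis-aligned boxes with SciPy's Hungarian solver; it is a simplified illustration of the algorithm, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(a, b):
    """IoU between two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def associate(detections, predictions, iou_threshold=0.3):
    """SORT-style data association: minimize (1 - IoU) cost with the
    Hungarian algorithm, then discard pairs whose overlap falls below
    the IoU threshold. Returns matched (detection, prediction) index
    pairs; unmatched detections start new tracks, and tracks that stay
    unmatched for too long are terminated.
    """
    if not detections or not predictions:
        return []
    cost = np.array([[1.0 - iou(d, p) for p in predictions]
                     for d in detections])
    det_idx, pred_idx = linear_sum_assignment(cost)
    return [(d, p) for d, p in zip(det_idx, pred_idx)
            if cost[d, p] <= 1.0 - iou_threshold]
```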
The experiments were run on NVIDIA Tesla V100 GPUs using the PaddlePaddle deep learning platform, on which the PP-YOLOE-R model was trained. Particular attention was paid to data augmentation, such as random image flips and rotations, which is critical for making the model robust to the varied text orientations and perspectives encountered in video frames.
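As a rough illustration of what flip-and-rotate augmentation for rotated text boxes can look like, the sketch below applies a random horizontal flip and a random rotation. The rotation range, box format, and angle convention are all assumptions; the paper's exact augmentation parameters are not reproduced here.

```python
import random

import cv2
import numpy as np


def augment(image, rboxes):
    """Illustrative flip/rotate augmentation for rotated text boxes in
    (cx, cy, w, h, angle_deg) format. Parameter ranges and the angle
    convention are assumptions, not the authors' settings.
    """
    h, w = image.shape[:2]
    if random.random() < 0.5:                        # random horizontal flip
        image = cv2.flip(image, 1)
        rboxes = [(w - cx, cy, bw, bh, -ang)         # mirror centers, negate angle
                  for cx, cy, bw, bh, ang in rboxes]
    angle = random.uniform(-30.0, 30.0)              # assumed rotation range
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    image = cv2.warpAffine(image, m, (w, h))
    rboxes = [(*(m @ np.array([cx, cy, 1.0])),       # rotate box centers
               bw, bh, ang + angle)                  # shift box angles
              for cx, cy, bw, bh, ang in rboxes]
    return image, rboxes
```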
The empirical results, supplemented by visualizations, demonstrate the method's effectiveness across different scenarios, including gaming, driving, and street views. The visualizations show consistent detection and tracking, with track identities maintained cleanly across consecutive frames.
In conclusion, this research demonstrates the viability of decomposing dense and small text spotting into manageable sub-tasks, allowing focused optimization of small object detection without semantic understanding, which is computationally expensive and less effective in dense text scenarios. The approach not only streamlines text tracking in high-resolution video but also opens avenues for further work on optimizing the component algorithms for text spotting in diverse video contexts. Future developments may integrate more sophisticated tracking algorithms or apply this model to other domains involving small objects that move over time.