GloTSFormer: Global Video Text Spotting Transformer (2401.03694v1)

Published 8 Jan 2024 in cs.CV and cs.AI

Abstract: Video Text Spotting (VTS) is a fundamental visual task that aims to predict the trajectories and content of texts in a video. Previous works usually conduct local associations and apply IoU-based distances and complex post-processing procedures to boost performance, ignoring the abundant temporal information and the morphological characteristics in VTS. In this paper, we propose GloTSFormer, a novel Global Video Text Spotting Transformer that models tracking as a global association problem and uses the Gaussian Wasserstein distance to guide morphological correlation between frames. Our main contributions are threefold: 1) we propose a Transformer-based global tracking method, GloTSFormer, for VTS that associates multiple frames simultaneously; 2) we introduce a Wasserstein distance-based method to conduct positional associations between frames; 3) we conduct extensive experiments on public datasets. On the ICDAR2015 video dataset, GloTSFormer achieves 56.0 MOTA, a 4.6-point absolute improvement over the previous SOTA method, and outperforms the previous Transformer-based method by a significant 8.3 MOTA.
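The positional association in the abstract builds on the idea (from the Gaussian Wasserstein distance loss literature for rotated object detection) of modeling an oriented box as a 2-D Gaussian and comparing boxes by the 2-Wasserstein distance, which stays smooth even when boxes do not overlap. The sketch below is an illustrative reconstruction of that distance, not the paper's actual implementation; the box format `(cx, cy, w, h, theta)` and function names are assumptions.

```python
import numpy as np

def box_to_gaussian(box):
    """Convert a rotated box (cx, cy, w, h, theta) to a 2-D Gaussian (mean, cov).

    The box center becomes the mean; the covariance is the rotated
    diagonal matrix diag((w/2)^2, (h/2)^2).
    """
    cx, cy, w, h, theta = box
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = np.diag([w / 2.0, h / 2.0])
    cov = R @ S @ S @ R.T
    return np.array([cx, cy]), cov

def _sqrtm_spd(A):
    """Matrix square root of a symmetric positive semi-definite matrix
    via eigendecomposition (clipping tiny negative eigenvalues)."""
    vals, vecs = np.linalg.eigh(A)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def gaussian_wasserstein_distance(box_a, box_b):
    """2-Wasserstein distance between the Gaussians of two rotated boxes:

    W2^2 = ||m1 - m2||^2 + Tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2})
    """
    m1, s1 = box_to_gaussian(box_a)
    m2, s2 = box_to_gaussian(box_b)
    s1_half = _sqrtm_spd(s1)
    cross = _sqrtm_spd(s1_half @ s2 @ s1_half)
    d2 = np.sum((m1 - m2) ** 2) + np.trace(s1 + s2 - 2.0 * cross)
    return float(np.sqrt(max(d2, 0.0)))
```

For identical boxes the distance is zero, and for two axis-aligned boxes of equal shape it reduces to the center distance, e.g. a (3, 4) center shift gives a distance of 5. In a tracker, such pairwise distances could fill a cross-frame cost matrix in place of IoU.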

Authors (4)
  1. Han Wang
  2. Yanjie Wang
  3. Yang Li
  4. Can Huang