
GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching (2401.07080v2)

Published 13 Jan 2024 in cs.CV

Abstract: Beyond the text detection and recognition tasks in image text spotting, video text spotting presents an augmented challenge with the inclusion of tracking. While advanced end-to-end trainable methods have shown commendable performance, the pursuit of multi-task optimization may pose the risk of producing sub-optimal outcomes for individual tasks. In this paper, we identify a main bottleneck in the state-of-the-art video text spotter: the limited recognition capability. In response to this issue, we propose to efficiently turn an off-the-shelf query-based image text spotter into a specialist on video and present a simple baseline termed GoMatching, which focuses the training efforts on tracking while maintaining strong recognition performance. To adapt the image text spotter to video datasets, we add a rescoring head to rescore each detected instance's confidence via efficient tuning, leading to a better tracking candidate pool. Additionally, we design a long-short term matching module, termed LST-Matcher, to enhance the spotter's tracking capability by integrating both long- and short-term matching results via Transformer. Based on the above simple designs, GoMatching delivers new records on ICDAR15-video, DSText, BOVText, and our proposed novel test with arbitrary-shaped text termed ArTVideo, which demonstrates GoMatching's capability to accommodate general, dense, small, arbitrary-shaped, Chinese and English text scenarios while saving considerable training budgets.
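The long- and short-term matching idea can be illustrated with a minimal sketch. This is not the authors' LST-Matcher: the abstract says the matching results are integrated via a Transformer, whereas the sketch below swaps in plain cosine similarity and greedy one-to-one assignment. The function names (`cosine`, `greedy_match`, `associate`), the per-track embedding dictionaries, and the `thresh` parameter are all hypothetical. The sketch first tries to link each detection to a track seen in the previous frame (short term), then tries to link the leftover detections to older tracks kept in a history (long term).

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def greedy_match(dets, refs, thresh):
    """Greedy one-to-one matching (an assumption, not the paper's matcher).

    dets: {detection_id: embedding}, refs: {track_id: embedding}.
    Returns {detection_id: track_id} for pairs with similarity >= thresh.
    """
    pairs = sorted(
        ((cosine(d, r), di, ti) for di, d in dets.items() for ti, r in refs.items()),
        reverse=True,
    )
    matched, used_tracks = {}, set()
    for score, di, ti in pairs:
        if score < thresh:
            break  # remaining pairs are even less similar
        if di in matched or ti in used_tracks:
            continue
        matched[di] = ti
        used_tracks.add(ti)
    return matched


def associate(dets, prev_frame, history, thresh=0.5):
    """Short-term matching first, then long-term matching for what is left."""
    # Short term: compare against tracks visible in the previous frame.
    short = greedy_match(dets, prev_frame, thresh)
    # Long term: compare leftover detections against tracks not yet claimed.
    leftover = {i: e for i, e in dets.items() if i not in short}
    claimed = set(short.values())
    free_tracks = {t: e for t, e in history.items() if t not in claimed}
    long_term = greedy_match(leftover, free_tracks, thresh)
    return {**short, **long_term}


# Toy example: track 1 was visible in the previous frame, track 2 only in the history.
prev_frame = {1: [1.0, 0.0]}
history = {1: [1.0, 0.0], 2: [0.0, 1.0]}
dets = {0: [0.9, 0.1], 1: [0.1, 0.9]}
result = associate(dets, prev_frame, history)
print(result)
```

In this toy example, detection 0 is linked to track 1 by the short-term step, and detection 1, which has no counterpart in the previous frame, is recovered as track 2 by the long-term step. Detections that remain unmatched after both steps would typically start new tracks.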

Authors
  1. Haibin He
  2. Maoyuan Ye
  3. Jing Zhang
  4. Juhua Liu
  5. Dacheng Tao
  6. Bo Du