LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model (2405.19194v2)
Abstract: Video text spotting (VTS) aims to simultaneously localize, recognize, and track text instances in videos. To address the limited recognition capability of end-to-end methods, recent approaches directly track the zero-shot results of state-of-the-art image text spotters and achieve impressive performance. However, owing to the domain gap between datasets, these methods often obtain limited tracking trajectories on challenging datasets. Fine-tuning transformer-based text spotters on specific datasets can yield performance gains, albeit at the expense of considerable training resources. In this paper, we propose the Language Collaboration and Glyph Perception Model, termed LOGO, an innovative framework designed to enhance the performance of conventional text spotters. To this end, we design a language synergy classifier (LSC) that explicitly discerns text instances from background noise in the recognition stage. Specifically, the LSC outputs either text content or a background code depending on the legibility of a text region, from which a language score is computed. A fusion score is then obtained by averaging the detection score and the language score, and is used to re-score the detection results before tracking. Through this re-scoring mechanism, the proposed LSC facilitates the detection of low-resolution text instances while filtering out text-like regions. Moreover, glyph supervision is introduced to improve recognition accuracy on noisy text regions. In addition, we propose a visual position mixture module that efficiently merges position information and visual features to obtain more discriminative tracking features. Extensive experiments on public benchmarks validate the effectiveness of the proposed method.
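To make the re-scoring step concrete, the snippet below is a minimal sketch, not the authors' implementation: it averages a detector confidence with the language score assigned by the LSC to each detected region, promoting legible low-resolution text and suppressing text-like background before tracking. The function name, the toy score values, and the NumPy dependency are illustrative assumptions.

```python
import numpy as np

def rescore_detections(detection_scores: np.ndarray,
                       language_scores: np.ndarray) -> np.ndarray:
    """Fuse detector confidences with LSC language scores by simple averaging."""
    assert detection_scores.shape == language_scores.shape
    return (detection_scores + language_scores) / 2.0

# Example: a low-resolution but legible text instance (low detection score,
# high language score) is promoted, while a text-like background region
# (high detection score, low language score) is suppressed before tracking.
det = np.array([0.35, 0.80])   # detector confidences
lang = np.array([0.90, 0.20])  # language scores from the LSC
print(rescore_detections(det, lang))  # -> [0.625 0.5]
```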
Authors: Hongen Liu, Yi Liu, Di Sun, Jiahao Wang, Gang Pan