VKIE: The Application of Key Information Extraction on Video Text (2310.11650v2)
Abstract: Extracting structured information from videos is critical for numerous downstream industrial applications. In this paper, we define the task of extracting hierarchical key information from visual text in videos. To address this task, we decouple it into four subtasks and introduce two solutions, PipVKIE and UniVKIE. PipVKIE completes the four subtasks sequentially in consecutive stages, while UniVKIE improves on this by unifying all the subtasks into one backbone. Both PipVKIE and UniVKIE leverage multimodal information from vision, text, and coordinates for feature representation. Extensive experiments on one well-defined dataset demonstrate that our solutions achieve remarkable performance and efficient inference speed.