RTQ: Rethinking Video-language Understanding Based on Image-text Model (2312.00347v2)
Abstract: Recent advances in video-language understanding build on image-text models and have produced promising results thanks to the knowledge shared between images and videos. However, video-language understanding poses unique challenges: videos carry highly complex semantic details, which give rise to information redundancy, temporal dependency, and scene complexity. Existing techniques address these issues only partially, and our quantitative analysis indicates that some of these methods are complementary. In light of this, we propose a novel framework called RTQ (Refine, Temporal model, and Query) that tackles these challenges jointly. The approach refines redundant information within frames, models temporal relations among frames, and queries task-specific information from the videos. Remarkably, our model achieves outstanding performance even without video-language pre-training, with results comparable to or better than those of state-of-the-art pre-training methods. Code is available at https://github.com/SCZwangxiao/RTQ-MM2023.
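To make the three stages named in the abstract concrete, below is a minimal, hypothetical sketch of how a Refine / Temporal model / Query pipeline could be wired up over ViT-style patch tokens. All module names (`RTQSketch`, `token_scorer`, etc.) and design details (top-k token selection, frame-level self-attention, learnable cross-attention queries) are illustrative assumptions for exposition, not the released implementation in the linked repository.

```python
# Hypothetical sketch of a Refine -> Temporal -> Query pipeline (not the official RTQ code).
import torch
import torch.nn as nn

class RTQSketch(nn.Module):
    def __init__(self, dim=768, num_queries=32, keep_ratio=0.5):
        super().__init__()
        # (R) Refine: score patch tokens and keep the most informative ones per frame.
        self.token_scorer = nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio
        # (T) Temporal: self-attention over the frame axis to model inter-frame relations.
        self.temporal = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        # (Q) Query: learnable queries cross-attend to video features for task-specific info.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_tokens):                  # patch_tokens: (B, T, N, D)
        B, T, N, D = patch_tokens.shape
        # Refine: keep the top-k patches per frame by a learned relevance score.
        scores = self.token_scorer(patch_tokens).squeeze(-1)           # (B, T, N)
        k = max(1, int(N * self.keep_ratio))
        idx = scores.topk(k, dim=-1).indices.unsqueeze(-1).expand(-1, -1, -1, D)
        refined = patch_tokens.gather(2, idx)                          # (B, T, k, D)
        # Temporal: attend across frames using frame-level (mean-pooled) features.
        frame_feats = self.temporal(refined.mean(dim=2))               # (B, T, D)
        # Query: task-specific queries attend to the temporally modeled video features.
        q = self.queries.unsqueeze(0).expand(B, -1, -1)                # (B, Q, D)
        out, _ = self.cross_attn(q, frame_feats, frame_feats)          # (B, Q, D)
        return out

# Usage: 8 frames of 196 patch tokens each, batch of 2.
video_tokens = torch.randn(2, 8, 196, 768)
task_features = RTQSketch()(video_tokens)            # (2, 32, 768)
```

The point of the sketch is only to show how the three concerns are separated: redundancy reduction happens per frame, temporal modeling happens across frames, and task-specific extraction happens last through queries that can be paired with a text encoder or decoder downstream.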