Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning (2401.00701v1)
Abstract: In recent years, text-to-video retrieval methods based on CLIP have developed rapidly. The primary direction of evolution is to exploit a wider gamut of visual and textual cues for alignment. Concretely, methods with impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, despite its prohibitive computational complexity. As a result, these approaches are suboptimal in both feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, so that during training the model captures visual content at levels ranging from abstract to detailed. To better exploit these multi-granularity features, we devise a two-stage retrieval architecture that balances coarse and fine retrieval granularity and strikes a favorable trade-off between retrieval effectiveness and efficiency. Specifically, during training we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and add a Pearson constraint to optimize cross-modal representation learning. At retrieval time, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked with fine-grained video representations. Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness of our method. Notably, it achieves performance comparable to the current state of the art while being nearly 50 times faster.
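The abstract names a Pearson constraint for cross-modal representation learning but does not spell out its form. Below is a minimal PyTorch sketch of one plausible reading, assuming the constraint pushes each matched text-video embedding pair toward a Pearson correlation of 1 across feature dimensions; the function name `pearson_constraint` and this pairing scheme are illustrative assumptions, not the paper's verified formulation.

```python
import torch

def pearson_constraint(text_emb: torch.Tensor, video_emb: torch.Tensor) -> torch.Tensor:
    """Hypothetical Pearson constraint (assumed form, not the paper's verified one):
    encourage each matched text-video embedding pair to be linearly correlated
    across feature dimensions.

    text_emb, video_emb: (batch, dim) embeddings of matched text-video pairs.
    Returns a loss that is 0 when every pair has Pearson correlation 1.
    """
    # Center each embedding across the feature dimension.
    t = text_emb - text_emb.mean(dim=-1, keepdim=True)
    v = video_emb - video_emb.mean(dim=-1, keepdim=True)
    # Pearson correlation per pair, with a small epsilon for stability.
    r = (t * v).sum(-1) / (t.norm(dim=-1) * v.norm(dim=-1) + 1e-8)
    return (1.0 - r).mean()
```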
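The two-stage retrieval recipe itself is concrete in the abstract: recall top-k candidates with cheap video-level dot products, then rerank them with fine-grained frame features. The sketch below follows that recipe; the softmax-weighted, parameter-free text gating used to pool frames is an assumed instantiation of the TIB (the abstract only states the block is parameter-free and text-gated), and the function `retrieve`, the temperature `tau`, and the tensor layout are illustrative.

```python
import torch
import torch.nn.functional as F

def retrieve(text_emb, video_emb_coarse, frame_embs, k=50, tau=0.01):
    """Two-stage retrieval sketch following the abstract's recipe:
    (1) coarse recall of top-k candidates with one video-level vector per video,
    (2) rerank candidates with fine-grained, text-conditioned frame features.

    text_emb:         (dim,)                          one query embedding
    video_emb_coarse: (num_videos, dim)               video-level embeddings
    frame_embs:       (num_videos, num_frames, dim)   frame-level embeddings
    All embeddings are assumed L2-normalized.
    """
    # Stage 1: fast coarse recall, a single dot product per video.
    coarse_scores = video_emb_coarse @ text_emb           # (num_videos,)
    _, topk_idx = coarse_scores.topk(k)

    # Stage 2: rerank only the k candidates with fine-grained features.
    cand_frames = frame_embs[topk_idx]                    # (k, num_frames, dim)
    frame_sims = cand_frames @ text_emb                   # (k, num_frames)
    # Assumed parameter-free text gating: frames more similar to the query
    # receive larger pooling weights (softmax with temperature tau).
    gates = F.softmax(frame_sims / tau, dim=-1)           # (k, num_frames)
    fine_video_emb = (gates.unsqueeze(-1) * cand_frames).sum(1)
    fine_video_emb = F.normalize(fine_video_emb, dim=-1)
    fine_scores = fine_video_emb @ text_emb               # (k,)

    order = fine_scores.argsort(descending=True)
    return topk_idx[order], fine_scores[order]
```

Because the coarse stage costs one dot product per video and the fine stage touches only k candidates, the per-query cost is dominated by a single matrix-vector product over the corpus, which is consistent with the roughly 50x speedup the abstract reports over heavy sentence-video fusion blocks.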
Authors: Kaibin Tian, Yanhua Cheng, Yi Liu, Xinglin Hou, Quan Chen, Han Li