Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval (2312.00414v2)
Abstract: In this paper, we propose an efficient, high-performance method for partially relevant video retrieval, which aims to retrieve long videos that contain at least one moment relevant to the input text query. The challenge lies in encoding dense frames with visual backbones: models must process many frames per video, incurring significant computation costs for long videos. To mitigate these costs, previous studies use lightweight visual backbones, yielding sub-optimal retrieval performance due to their limited capabilities. However, simply replacing the backbones with high-performance large vision-and-language models (VLMs) is undesirable because of their low efficiency. To address this dilemma, instead of dense frames, we focus on super images, which are created by rearranging the video frames into an $N \times N$ grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ and mitigates the low efficiency of large VLMs. Based on this idea, we make two contributions. First, we explore whether VLMs generalize to super images in a zero-shot setting. To this end, we propose a method called query-attentive super image retrieval (QASIR), which attends to partial moments relevant to the input query. The zero-shot QASIR yields two discoveries: (1) it enables VLMs to generalize to super images, and (2) the grid size $N$, image resolution, and VLM size are key trade-off parameters between performance and computation costs. Second, we introduce fine-tuning QASIR and hybrid QASIR, which combines high- and low-efficiency models to strike a balance between performance and computation costs. This reveals two findings: (1) fine-tuning QASIR enables VLMs to learn super images effectively, and (2) hybrid QASIR minimizes the performance drop of large VLMs while reducing the computation costs.
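The core construction described above, rearranging sampled video frames into $N \times N$ grid "super images", can be sketched as follows. This is a minimal illustrative implementation, not the authors' code; the function name `make_super_images` and the choice to zero-pad the final grid when the frame count is not a multiple of $N^2$ are assumptions.

```python
import numpy as np


def make_super_images(frames: np.ndarray, n: int) -> np.ndarray:
    """Tile video frames into N x N grid "super images".

    frames: array of shape (T, H, W, C) holding T sampled frames.
    Returns an array of shape (ceil(T / n^2), n*H, n*W, C); each output
    image packs n^2 consecutive frames in row-major order, so the number
    of images passed to the visual encoder drops by a factor of n^2.
    The last grid is zero-padded when T is not a multiple of n^2
    (an illustrative choice, not necessarily the paper's).
    """
    t, h, w, c = frames.shape
    per_grid = n * n
    num_grids = -(-t // per_grid)  # ceiling division
    pad = num_grids * per_grid - t
    if pad:
        padding = np.zeros((pad, h, w, c), dtype=frames.dtype)
        frames = np.concatenate([frames, padding], axis=0)
    # (G, n, n, H, W, C) -> (G, n, H, n, W, C) -> (G, n*H, n*W, C)
    grids = frames.reshape(num_grids, n, n, h, w, c)
    grids = grids.transpose(0, 1, 3, 2, 4, 5)
    return grids.reshape(num_grids, n * h, n * w, c)
```

For example, 9 frames with $N = 3$ collapse into a single super image, so a VLM encodes one image instead of nine; the grid size $N$ then trades off how much each frame is shrunk (after resizing to the encoder's input resolution) against how many encoder calls are saved.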