ViLA: Efficient Video-Language Alignment for Video Question Answering (2312.08367v4)
Abstract: In this work, we propose an efficient Video-Language Alignment (ViLA) network that addresses both efficient frame sampling and effective cross-modal alignment in a unified way. ViLA pairs a new learnable, text-guided Frame-Prompter with a new cross-modal distillation module (QFormer-Distiller). Pre-trained large image-LLMs have shown promising results on problems such as visual question answering (VQA), but how to efficiently and effectively sample video frames when adapting a pre-trained large image-LLM to video-language alignment remains the major challenge. Compared with prior work, our ViLA model selects key frames with critical content, improving video-language alignment accuracy while reducing inference latency (+3.3% on NExT-QA Temporal with a 3.0x speed-up). Overall, ViLA outperforms state-of-the-art methods on video question-answering benchmarks: +4.6% on STAR Interaction and +2.2% on STAR average with a 3.0x speed-up, and our 2-frame model outperforms 4-frame SeViLA on the VLEP dataset with a 4.2x speed-up. The code will be available at https://github.com/xijun-cs/ViLA.
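As a rough illustration of how the two components fit together, here is a minimal PyTorch sketch that pairs a text-guided frame selector (differentiable top-k selection via Gumbel-softmax) with a simple distillation loss matching a sparse-frame student QFormer to a dense-frame teacher. The module names, feature dimensions, independent per-pick sampling, and MSE objective are illustrative assumptions, not the exact design released with the paper.

```python
# Minimal sketch of a text-guided Frame-Prompter plus a QFormer-style
# distillation loss. All names and design details here are assumptions
# made for illustration; see the official repo for the actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FramePrompter(nn.Module):
    """Scores each frame against the question text and samples k frames."""

    def __init__(self, dim: int, k: int, tau: float = 1.0):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # joint frame-text scorer
        self.k, self.tau = k, tau

    def forward(self, frame_feats: torch.Tensor, text_feat: torch.Tensor):
        # frame_feats: (B, T, D) per-frame features; text_feat: (B, D)
        T = frame_feats.shape[1]
        text = text_feat.unsqueeze(1).expand(-1, T, -1)
        logits = self.score(torch.cat([frame_feats, text], dim=-1)).squeeze(-1)
        # Draw k (soft) one-hot selections; hard=True uses the
        # straight-through estimator, so selection stays differentiable.
        picks = [F.gumbel_softmax(logits, tau=self.tau, hard=True)
                 for _ in range(self.k)]
        mask = torch.stack(picks, dim=1)  # (B, k, T)
        return mask @ frame_feats         # (B, k, D) selected frame features


def qformer_distill_loss(student_q: torch.Tensor,
                         teacher_q: torch.Tensor) -> torch.Tensor:
    # Match the student's query outputs (computed from the k sampled frames)
    # to a frozen teacher's outputs (computed from dense frames); MSE is one
    # simple choice of matching objective.
    return F.mse_loss(student_q, teacher_q.detach())


if __name__ == "__main__":
    prompter = FramePrompter(dim=256, k=2)
    frames = torch.randn(4, 32, 256)  # 32 candidate frames per video
    question = torch.randn(4, 256)    # pooled question embedding
    picked = prompter(frames, question)
    print(picked.shape)               # torch.Size([4, 2, 256])
```

Note that sampling the k picks independently can, in principle, select the same frame twice; a practical selector would typically sample without replacement or mask out previously chosen frames.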
- Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- Noise estimation using density estimation for self-supervised multimodal learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6644–6652, 2021.
- Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021.
- Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems, 35:32897–32912, 2022.
- Is space-time attention all you need for video understanding? In ICML, 2021.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
- Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
- Prompt switch: Efficient clip adaptation for text-video retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15648–15658, October 2023.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097, 2021.
- Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
- Mist: Multi-modal iterative spatial-temporal transformer for long-form video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14773–14783, 2023.
- Cross modal distillation for supervision transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2827–2836, 2016.
- Video owl-vit: Temporally-consistent open-world localization in video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13802–13811, October 2023.
- Distilling the knowledge in a neural network. Advances in Neural Information Processing Systems (NIPS), 2015.
- Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
- Divide and conquer: Question-guided spatio-temporal contextual attention for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11101–11108, 2020.
- Real-time video inference on edge devices via adaptive model streaming. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4572–4582, 2021.
- Tvqa: Localized, compositional video question answering. arXiv preprint arXiv:1809.01696, 2018.
- What is more likely to happen next? video-and-language future event prediction. arXiv preprint arXiv:2010.07999, 2020.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34:9694–9705, 2021.
- Uniformerv2: Unlocking the potential of image vits for video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1632–1643, October 2023.
- Hero: Hierarchical encoder for video+language omni-representation pre-training. arXiv preprint arXiv:2005.00200, 2020.
- Smaug: Sparse masked autoencoder for efficient video-language pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2459–2469, 2023.
- Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
- Ts2-net: Token shift and selection transformer for text-video retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.
- Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing, 508:293–304, 2022.
- Scott MacFarland. If a picture is worth a thousand words, what is a video worth? 2014.
- Online model distillation for efficient video inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3573–3582, 2019.
- Spatio-temporal graph for video captioning with knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10870–10879, 2020.
- St-adapter: Parameter-efficient image-to-video transfer learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 26462–26477. Curran Associates, Inc., 2022.
- Egovlpv2: Egocentric video-language pre-training with fusion in the backbone. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5285–5297, 2023.
- Disentangling spatial and temporal learning for efficient image-to-video transfer learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13934–13944, October 2023.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- A video question answering model based on knowledge distillation. Information, 14(6):328, 2023.
- Auxiliary modality learning with generalized curriculum distillation. 2023.
- Small-shot multi-modal distillation for vision-based autonomous steering. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7763–7770. IEEE, 2023.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- All in one: Exploring unified video-language pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6598–6608, 2023.
- Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6312–6322, 2023.
- Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022.
- Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
- Triplet knowledge distillation. arXiv preprint arXiv:2305.15975, 2023.
- Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022.
- Unified coarse-to-fine alignment for video-text retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2816–2827, October 2023.
- Star: A benchmark for situated reasoning in real-world videos. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
- Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023.
- Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9777–9786, 2021.
- Clip-vip: Adapting pre-trained image-text model to video-language representation alignment, 2023.
- Just ask: Learning to answer questions from millions of narrated videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1686–1697, 2021.
- Zero-shot video question answering via frozen bidirectional language models. Advances in Neural Information Processing Systems, 35:124–141, 2022.
- Learning trajectory-word alignments for video-language tasks. arXiv preprint arXiv:2301.01953, 2023.
- Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.
- Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021.
- Hitea: Hierarchical temporal-aware video-language pre-training. arXiv preprint arXiv:2212.14546, 2022.
- Video question answering via attribute-augmented attention network learning. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 829–832, 2017.
- Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
- Self-chained image-language model for video localization and question answering. arXiv preprint arXiv:2305.06988, 2023.
- Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16375–16387, 2022.
- Merlot: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 34:23634–23651, 2021.
- Deep mutual learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4320–4328, 2018.
Authors: Xijun Wang, Junbang Liang, Chun-Kai Wang, Kenan Deng, Yu Lou, Ming Lin, Shan Yang