End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling (2407.15047v2)
Abstract: Video Question Answering (VideoQA) has emerged as a challenging frontier in multimedia processing, requiring intricate interactions between visual and textual modalities. Uniformly sampling frames or indiscriminately aggregating frame-level visual features often fails to capture the nuanced, question-relevant context of a video, limiting VideoQA performance. To mitigate these issues, we propose VidF4, a novel VideoQA framework equipped with a tailored frame selection strategy for effective and efficient VideoQA. We propose three frame-scoring mechanisms that consider both question relevance and inter-frame similarity to evaluate the importance of each frame for a given question about the video. Furthermore, we design a differentiable adaptive frame sampling mechanism that enables end-to-end training of the frame selector and answer generator. Experimental results across three widely adopted benchmarks demonstrate that our model consistently outperforms existing VideoQA methods, establishing a new SOTA on NExT-QA (+0.3%), STAR (+0.9%), and TVQA (+1.0%). Furthermore, through both quantitative and qualitative analyses, we validate the effectiveness of each design choice.
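The two ingredients described above, scoring frames by question relevance while penalizing inter-frame redundancy, and sampling frames in a way that admits gradients, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature extractors, the `alpha` trade-off weight, and the Gumbel-top-k relaxation used for differentiable subset sampling are all assumptions for the sake of the example.

```python
import numpy as np

def score_frames(frame_feats, question_feat, alpha=0.5):
    """Score each frame by question relevance minus inter-frame redundancy.

    frame_feats: (num_frames, dim) array of frame embeddings (hypothetical).
    question_feat: (dim,) question embedding (hypothetical).
    alpha: assumed trade-off weight between relevance and diversity.
    """
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = question_feat / np.linalg.norm(question_feat)
    relevance = f @ q                       # cosine similarity to the question
    sim = f @ f.T                           # pairwise frame similarity
    # mean similarity to the *other* frames; high values mean redundant frames
    redundancy = (sim.sum(axis=1) - 1.0) / (len(f) - 1)
    return alpha * relevance - (1.0 - alpha) * redundancy

def gumbel_topk_sample(scores, k, tau=1.0, rng=None):
    """Sample k distinct frame indices via the Gumbel-top-k trick.

    Perturbing scores with Gumbel noise and taking the top-k is equivalent
    to sampling without replacement proportionally to softmax(scores / tau),
    which is one standard way to make subset selection trainable.
    """
    rng = rng or np.random.default_rng(0)
    g = rng.gumbel(size=scores.shape)
    return np.sort(np.argsort(-(scores / tau + g))[:k])

# Usage sketch: pick 4 of 8 frames for one (video, question) pair.
rng = np.random.default_rng(1)
frames = rng.normal(size=(8, 4))
question = frames[3] + 0.01 * rng.normal(size=4)  # question "about" frame 3
selected = gumbel_topk_sample(score_frames(frames, question), k=4)
```

In a full model the hard `argsort` would be replaced by a relaxed (soft) top-k during training so gradients flow into the frame scorer, with hard selection used at inference.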
- Jianxin Liang
- Xiaojun Meng
- Yueqian Wang
- Chang Liu
- Qun Liu
- Dongyan Zhao