End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling (2407.15047v2)

Published 21 Jul 2024 in cs.CV and cs.CL

Abstract: Video Question Answering (VideoQA) has emerged as a challenging frontier in multimedia processing, requiring intricate interaction between the visual and textual modalities. Uniformly sampling frames or indiscriminately aggregating frame-level visual features often fails to capture the nuanced, question-relevant context of a video. To mitigate these issues, we propose VidF4, a novel VideoQA framework equipped with a tailored frame-selection strategy for effective and efficient VideoQA. We introduce three frame-scoring mechanisms that weigh both question relevance and inter-frame similarity to evaluate the importance of each frame for a given question about the video. We further design a differentiable adaptive frame-sampling mechanism that enables end-to-end training of the frame selector and the answer generator. Experimental results on three widely adopted benchmarks demonstrate that our model consistently outperforms existing VideoQA methods, establishing a new SOTA on NExT-QA (+0.3%), STAR (+0.9%), and TVQA (+1.0%). In addition, both quantitative and qualitative analyses validate the effectiveness of each design choice.
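The abstract only names the two key mechanisms; the exact formulations are in the paper itself. As a rough, hedged illustration of the idea, the sketch below scores frames by question relevance minus an inter-frame redundancy penalty, then selects k frames with a Gumbel-based relaxed top-k so that gradients flow from the answer loss back into the frame selector. All names here (score_frames, relaxed_topk, diversity_weight, tau) are illustrative assumptions, not VidF4's actual API, and the relaxation follows the generic continuous-relaxation recipe for differentiable subset sampling rather than the paper's specific variant.

```python
import torch
import torch.nn.functional as F

def score_frames(frame_feats: torch.Tensor, question_feat: torch.Tensor,
                 diversity_weight: float = 0.5) -> torch.Tensor:
    """Toy frame scorer: question relevance minus inter-frame redundancy.

    frame_feats:   (T, D) per-frame visual embeddings
    question_feat: (D,)   question embedding projected into the same space
    Returns (T,) importance scores.
    """
    frames = F.normalize(frame_feats, dim=-1)
    q = F.normalize(question_feat, dim=-1)
    relevance = frames @ q                    # cosine similarity to the question, (T,)
    sim = frames @ frames.T                   # pairwise inter-frame similarity, (T, T)
    sim.fill_diagonal_(0.0)
    redundancy = sim.mean(dim=-1)             # frames that duplicate many others score lower
    return relevance - diversity_weight * redundancy

def relaxed_topk(scores: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    """Differentiable top-k selection via a Gumbel continuous relaxation.

    Returns (k, T) soft one-hot rows; multiplying them into the frame
    features yields k "selected" frames while keeping gradients
    with respect to the scores.
    """
    u = torch.rand_like(scores).clamp_min(1e-9)
    keys = scores - torch.log(-torch.log(u))  # Gumbel-perturbed scores
    picks, taken = [], torch.zeros_like(scores)
    for _ in range(k):
        # Down-weight already-selected frames so successive draws differ.
        logits = keys + torch.log1p(-taken.clamp(max=1.0 - 1e-6))
        soft = F.softmax(logits / tau, dim=-1)
        picks.append(soft)
        taken = taken + soft
    return torch.stack(picks)

# Usage: pick 8 of 64 frames, then hand them to an answer generator.
frame_feats = torch.randn(64, 256)
question_feat = torch.randn(256)
weights = relaxed_topk(score_frames(frame_feats, question_feat), k=8)
selected = weights @ frame_feats              # (8, 256), differentiable end to end
```

At low temperature tau the soft selections approach hard one-hot picks, so a common training recipe is to anneal tau downward; this is a generic property of Gumbel-softmax relaxations, not a claim about VidF4's training schedule.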

Authors (6)
  1. Jianxin Liang (7 papers)
  2. Xiaojun Meng (23 papers)
  3. Yueqian Wang (11 papers)
  4. Chang Liu (863 papers)
  5. Qun Liu (230 papers)
  6. Dongyan Zhao (144 papers)