Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering (2401.10711v4)

Published 19 Jan 2024 in cs.CV, cs.AI, and cs.CL

Abstract: Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos. Despite the recent success of Large Multimodal Models (LMMs) in image-language understanding and reasoning, they handle VideoQA insufficiently: they simply take uniformly sampled frames as visual inputs, ignoring question-relevant visual clues. Moreover, existing VideoQA datasets provide no human annotations for question-critical timestamps. In light of this, we propose a novel weakly supervised framework that forces LMMs to reason out answers with question-critical moments as visual inputs. Specifically, we first fuse question and answer pairs into event descriptions and, using the visual-language alignment capability of CLIP models, find multiple keyframes as target moments and pseudo-labels. With these pseudo-labeled keyframes as additional weak supervision, we devise a lightweight Gaussian-based Contrastive Grounding (GCG) module. GCG learns multiple Gaussian functions to characterize the temporal structure of the video and samples question-critical frames as positive moments to serve as the visual inputs of LMMs. Extensive experiments on several benchmarks verify the effectiveness of our framework, and we achieve substantial improvements over previous state-of-the-art methods.
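The abstract describes two mechanisms: pseudo-labeling keyframes via CLIP similarity between frames and the fused question-answer description, and a lightweight Gaussian-based module that weights frames over time so question-critical moments can be sampled as the LMM's visual inputs. Below is a minimal sketch of what these two steps could look like; all function names, tensor shapes, and hyperparameters (e.g. `top_k`, `num_gaussians`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the paper's code) of:
#  (1) pseudo-labeling keyframes by CLIP similarity to the fused Q+A text, and
#  (2) a Gaussian-based weighting over frame positions for sampling
#      question-critical moments.
import torch
import torch.nn as nn
import torch.nn.functional as F

def pseudo_label_keyframes(frame_feats, text_feat, top_k=4):
    """frame_feats: (T, D) CLIP image features of the video frames.
    text_feat: (D,) CLIP text feature of the fused question+answer description.
    Returns indices of the top-k most similar frames, used as weak supervision."""
    sims = F.normalize(frame_feats, dim=-1) @ F.normalize(text_feat, dim=-1)
    return sims.topk(top_k).indices

class GaussianGrounding(nn.Module):
    """Predicts K Gaussians over the T frame positions from a pooled question
    feature; their mixture gives per-frame relevance weights."""
    def __init__(self, dim, num_gaussians=4):
        super().__init__()
        self.num_gaussians = num_gaussians
        self.head = nn.Linear(dim, num_gaussians * 2)  # per-Gaussian (mu, sigma)

    def forward(self, question_feat, num_frames):
        params = self.head(question_feat).view(self.num_gaussians, 2)
        mu = torch.sigmoid(params[:, 0])            # centers in [0, 1]
        sigma = F.softplus(params[:, 1]) + 1e-3     # positive widths
        t = torch.linspace(0.0, 1.0, num_frames)    # normalized frame positions
        # Mixture of Gaussians over time -> per-frame relevance weights.
        g = torch.exp(-0.5 * ((t[None, :] - mu[:, None]) / sigma[:, None]) ** 2)
        weights = g.sum(dim=0)
        return weights / weights.sum()              # normalized distribution over frames
```

In such a setup, frames with the highest predicted weights would be taken as positive moments (the LMM's visual inputs), and a contrastive objective against the CLIP pseudo-labeled keyframes could supervise the predicted distribution, matching the weakly supervised training described in the abstract.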

Authors (4)
  1. Haibo Wang (50 papers)
  2. Chenghang Lai (3 papers)
  3. Yixuan Sun (25 papers)
  4. Weifeng Ge (29 papers)
Citations (3)