MoReVQA: Exploring Modular Reasoning Models for Video Question Answering (2404.06511v1)
Abstract: This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
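The abstract describes a three-stage, training-free pipeline (event parser, grounding, reasoning) coordinated through an external memory. Below is a minimal illustrative sketch of that control flow, assuming hypothetical helper names (`prompt_llm`, `event_parser`, `grounding`, `reasoning`, `ExternalMemory`); it is not the authors' implementation and the prompts and interfaces are placeholders.

```python
# Sketch of a multi-stage, training-free videoQA pipeline in the spirit of MoReVQA:
# event parsing -> grounding -> reasoning, with a shared external memory.
# All helper names and prompts here are illustrative assumptions, not the paper's code.
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ExternalMemory:
    """Shared state written by each stage and read by later stages."""
    question: str
    parsed_events: List[str] = field(default_factory=list)    # from the event parser
    grounded_frames: List[int] = field(default_factory=list)  # frame indices from grounding
    evidence: Dict[int, Any] = field(default_factory=dict)    # per-frame observations


def prompt_llm(prompt: str) -> str:
    """Placeholder for a few-shot prompted LLM call (model and API are assumptions)."""
    raise NotImplementedError("plug in an LLM client here")


def event_parser(memory: ExternalMemory) -> None:
    """Stage 1: decompose the question into events to look for, via few-shot prompting."""
    out = prompt_llm(f"List, separated by ';', the events implied by: {memory.question}")
    memory.parsed_events = [e.strip() for e in out.split(";") if e.strip()]


def grounding(memory: ExternalMemory, num_frames: int) -> None:
    """Stage 2: select the frames (temporal window) relevant to the parsed events."""
    out = prompt_llm(
        f"Given events {memory.parsed_events} and a video with {num_frames} frames, "
        "return comma-separated indices of frames to inspect."
    )
    memory.grounded_frames = [int(i) for i in out.split(",") if i.strip().isdigit()]


def reasoning(memory: ExternalMemory) -> str:
    """Stage 3: answer the question from the grounded evidence accumulated in memory."""
    return prompt_llm(
        f"Question: {memory.question}\nEvidence: {memory.evidence}\nAnswer concisely:"
    )


def answer_video_question(question: str, frame_captions: List[str]) -> str:
    """Run the three stages; captions stand in for outputs of vision tools."""
    memory = ExternalMemory(question=question)
    event_parser(memory)
    grounding(memory, num_frames=len(frame_captions))
    memory.evidence = {
        i: frame_captions[i]
        for i in memory.grounded_frames
        if 0 <= i < len(frame_captions)
    }
    return reasoning(memory)
```

Each stage writes interpretable intermediate outputs into the memory object, which is what makes the decomposition inspectable compared with a single ungrounded planning step.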
- Evaluating CLIP: towards characterization of broader capabilities and downstream implications. arXiv preprint arXiv:2108.02818, 2021.
- Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- Neural module networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 39–48, 2016.
- ViViT: A Video Vision Transformer. In ICCV, 2021.
- Test of time: Instilling video-language models with a sense of time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2503–2516, 2023.
- JAX: composable transformations of Python+NumPy programs, 2018.
- RT-2: Vision-language-action models transfer web knowledge to robotic control, 2023.
- Revisiting the “video” in video-language understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2917–2927, 2022.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
- PaLI-X: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023a.
- PaLI-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023b.
- Visual programming for text-to-image generation and evaluation. arXiv preprint arXiv:2305.15328, 2023.
- PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
- Scenic: A JAX library for computer vision research and beyond. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21393–21398, 2022.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- EVA: Exploring the limits of masked visual representation learning at scale. In CVPR, 2023.
- VIOLET: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681, 2021.
- Env-QA: A video question answering benchmark for comprehensive understanding of dynamic environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1675–1685, 2021.
- MIST: Multi-modal iterative spatial-temporal transformer for long-form video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14773–14783, 2023.
- ImageBind: One embedding space to bind them all. In CVPR, 2023.
- PaLM 2 technical report, 2023.
- Ego4D: Around the world in 3,000 hours of egocentric video, 2022.
- AGQA: A benchmark for compositional spatio-temporal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11287–11297, 2021.
- Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023.
- Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
- AVIS: Autonomous visual information seeking with large language model agent. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Finding “it”: Weakly-supervised, reference-aware visual grounding in instructional videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Perceiver: General perception with iterative attention. In International conference on machine learning, pages 4651–4664. PMLR, 2021.
- Reasoning with heterogeneous graph alignment for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11109–11116, 2020.
- Inferring and executing programs for visual reasoning. In Proceedings of the IEEE international conference on computer vision, pages 2989–2998, 2017.
- Analyzing modular approaches for visual question decomposition. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
- Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pages 706–715, 2017.
- Revealing single frame bias for video-and-language learning. arXiv preprint arXiv:2206.03428, 2022.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- VideoChat: Chat-centric video understanding, 2024.
- Invariant grounding for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2928–2937, 2022.
- Towards fast adaptation of pretrained contrastive models for multi-channel video-language retrieval. arXiv preprint arXiv:2206.02082, 2023.
- Video-ChatGPT: Towards detailed video understanding via large vision and language models, 2023.
- EgoSchema: A diagnostic benchmark for very long-form video language understanding. arXiv preprint arXiv:2308.09126, 2023.
- HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In ICCV, 2019.
- Simple open-vocabulary object detection with vision transformers. European Conference on Computer Vision (ECCV), 2022.
- Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, pages 220–229, 2019.
- Verbs in action: Improving verb understanding in video-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15579–15591, 2023.
- GPT-4 technical report, 2024.
- A simple recipe for contrastively pre-training video-first encoders beyond 16 frames, 2023.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- End-to-end generative pretraining for multimodal video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17959–17968, 2022.
- Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
- Modular visual question answering via code generation. arXiv preprint arXiv:2306.05392, 2023.
- ViperGPT: Visual inference via Python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023.
- Gemini: A family of highly capable multimodal models, 2023.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- All in one: Exploring unified video-language pre-training, 2022a.
- GIT: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022b.
- End-to-end dense video captioning with parallel decoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6847–6857, 2021.
- InternVideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022c.
- Language models with image descriptors are strong few-shot video-language learners. Advances in Neural Information Processing Systems, 35:8483–8497, 2022d.
- STAR: A benchmark for situated reasoning in real-world videos. In Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS), 2021.
- NExT-QA: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021.
- Video graph transformer for video question answering. In European Conference on Computer Vision, pages 39–58. Springer, 2022.
- Can I trust your answer? Visually grounded video question answering. arXiv preprint arXiv:2309.01327, 2023.
- Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia, 2017.
- VideoCoCa: Video-text modeling with zero-shot transfer from contrastive captioners. arXiv preprint arXiv:2212.04979, 2022.
- Just ask: Learning to answer questions from millions of narrated videos. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1686–1697, 2021.
- Zero-shot video question answering via frozen bidirectional language models. Advances in Neural Information Processing Systems, 35:124–141, 2022.
- Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning. In CVPR, 2023a.
- The dawn of LMMs: Preliminary explorations with GPT-4V(ision), 2023b.
- HiTeA: Hierarchical temporal-aware video-language pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15405–15416, 2023a.
- mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023b.
- Keunwoo Peter Yu. VideoBLIP, 2023.
- Self-chained image-language model for video localization and question answering. In NeurIPS, 2023.
- ActivityNet-QA: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9127–9134, 2019.
- Socratic models: Composing zero-shot multimodal reasoning with language. ICLR, 2023.
- Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276, 2021.
- Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In EMNLP 2023 Demo, 2023.
- LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. ICLR, 2024.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- Video question answering: Datasets, algorithms and challenges. arXiv preprint arXiv:2203.01225, 2022.
- End-to-end dense video captioning with masked transformer. In CVPR, 2018.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.