TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering (2404.01476v2)
Abstract: Recently, image-based Large Multimodal Models (LMMs) have made significant progress in video question-answering (VideoQA) using a frame-wise approach by leveraging large-scale pretraining in a zero-shot manner. Nevertheless, these models need to be capable of finding relevant information, extracting it, and answering the question simultaneously. Currently, existing methods perform all of these steps in a single pass without being able to adapt if insufficient or incorrect information is collected. To overcome this, we introduce a modular multi-LMM agent framework based on several agents with different roles, instructed by a Planner agent that updates its instructions using shared feedback from the other agents. Specifically, we propose TraveLER, a method that can create a plan to "Traverse" through the video, ask questions about individual frames to "Locate" and store key information, and then "Evaluate" if there is enough information to answer the question. Finally, if there is not enough information, our method is able to "Replan" based on its collected knowledge. Through extensive experiments, we find that the proposed TraveLER approach improves performance on several VideoQA benchmarks without the need to fine-tune on specific datasets. Our code is available at https://github.com/traveler-framework/TraveLER.
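To make the described agent loop concrete, below is a minimal, runnable Python sketch of the Traverse-Locate-Evaluate-Replan cycle from the abstract. All names here (Memory, plan_traversal, ask_frame, evaluate_memory, traveler) and the placeholder heuristics are illustrative assumptions, not the released TraveLER implementation; in the actual framework each role would be backed by an LLM or image-LMM call rather than the stubs shown.

```python
# Hypothetical sketch of the Traverse-Locate-Evaluate-Replan loop.
# Function names and heuristics are assumptions for illustration only.

from dataclasses import dataclass, field


@dataclass
class Memory:
    """Shared store of frame-level observations collected so far."""
    notes: dict[int, list[str]] = field(default_factory=dict)

    def add(self, frame_idx: int, answer: str) -> None:
        self.notes.setdefault(frame_idx, []).append(answer)


def plan_traversal(question: str, memory: Memory, num_frames: int) -> list[int]:
    """Planner role: choose which frames to inspect next given current memory.
    Placeholder: spread picks uniformly over frames not yet visited."""
    unvisited = [i for i in range(num_frames) if i not in memory.notes]
    step = max(1, len(unvisited) // 4)
    return unvisited[::step][:4]


def ask_frame(frame_idx: int, question: str) -> str:
    """Extractor role: ask an image LMM about a single frame.
    Placeholder that returns a dummy observation string."""
    return f"observation about frame {frame_idx} relevant to: {question}"


def evaluate_memory(question: str, memory: Memory) -> tuple[bool, str]:
    """Evaluator role: decide whether the collected evidence suffices.
    Placeholder heuristic: answer once notes exist for at least 8 frames."""
    if len(memory.notes) >= 8:
        return True, "best-guess answer composed from collected notes"
    return False, ""


def traveler(question: str, num_frames: int, max_rounds: int = 5) -> str:
    memory = Memory()
    for _ in range(max_rounds):
        # Traverse: the Planner proposes frames to visit given current memory.
        frames = plan_traversal(question, memory, num_frames)
        # Locate: query each selected frame and store the key information.
        for idx in frames:
            memory.add(idx, ask_frame(idx, question))
        # Evaluate: check whether the memory is sufficient to answer.
        done, answer = evaluate_memory(question, memory)
        if done:
            return answer
        # Replan: the loop continues and the Planner sees the updated memory.
    return "insufficient information after max_rounds"


if __name__ == "__main__":
    print(traveler("What does the person pick up after opening the fridge?", num_frames=64))
```

The key design point the sketch tries to capture is the shared memory: because every agent writes its findings to one store, the Planner can revise its traversal plan on the next round instead of committing to a single pass over the video.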
Authors: Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, Roei Herzig