Look, Remember and Reason: Grounded reasoning in videos with language models (2306.17778v3)
Abstract: Multi-modal language models (LMs) have recently shown promising performance on high-level reasoning tasks in videos. However, existing methods still fall short on tasks such as causal or compositional spatiotemporal reasoning over actions, where model predictions must be grounded in fine-grained low-level details, such as object motions and object interactions. In this work, we propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, and tracking, to endow the model with the required low-level visual capabilities. We show that a two-stream video encoder with spatiotemporal attention is effective at capturing the required static and motion-based cues in the video. By leveraging the LM's ability to perform the low-level surrogate tasks, we can cast reasoning in videos as the three-step process of Look, Remember, Reason, wherein visual information is extracted step by step using low-level visual skills and then integrated to arrive at a final answer. We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else, and STAR datasets. Our approach is trainable end-to-end and surpasses state-of-the-art task-specific methods across these tasks by a large margin.
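To make the described architecture and three-step process concrete, here is a minimal sketch assuming PyTorch. The names (`TwoStreamEncoder`, `lrr_generate`), the divided spatial/temporal attention layout, and the textual surrogate/answer prompt format are illustrative assumptions, not the authors' released implementation: the spatial stream attends over patch tokens within each frame (static cues), the temporal stream attends across frames at each patch location (motion cues), and decoding proceeds as Look (encode the video), Remember (emit surrogate-task outputs such as detections and track IDs), Reason (produce the final answer).

```python
import torch
import torch.nn as nn


class TwoStreamEncoder(nn.Module):
    """Hypothetical two-stream video encoder: a spatial stream attends over
    patches within each frame (static cues) and a temporal stream attends
    across frames at each patch location (motion cues); both streams are
    added residually to the input tokens."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, patches, dim) patch tokens, e.g. from a ViT backbone.
        b, t, p, d = x.shape
        s = x.reshape(b * t, p, d)                       # attend within each frame
        s, _ = self.spatial(s, s, s)
        m = x.permute(0, 2, 1, 3).reshape(b * p, t, d)   # attend across frames
        m, _ = self.temporal(m, m, m)
        m = m.reshape(b, p, t, d).permute(0, 2, 1, 3)
        return self.norm(x + s.reshape(b, t, p, d) + m)


def lrr_generate(lm, encoder: TwoStreamEncoder,
                 video_patches: torch.Tensor, question: str) -> str:
    """Look-Remember-Reason decoding loop (hypothetical interface):
    lm.generate(visual_tokens, prompt) stands in for any autoregressive
    multi-modal LM that conditions text decoding on visual tokens."""
    visual_tokens = encoder(video_patches)                    # Look
    trace = lm.generate(visual_tokens,                        # Remember:
                        f"Question: {question}\nSurrogate: ")  # detections, track IDs
    return lm.generate(visual_tokens, trace + "\nAnswer: ")   # Reason: final answer
```

Under this reading, end-to-end training would supervise the "Remember" trace with detection, re-identification, and tracking targets, while question-answer pairs supervise the final "Reason" step.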
- Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
- Openflamingo: An open-source framework for training large autoregressive vision-language models. CoRR, abs/2308.01390, 2023.
- Qwen-vl: A frontier large vision-language model with versatile abilities. CoRR, abs/2308.12966, 2023.
- Is space-time attention all you need for video understanding? In ICML, 2021.
- Temporally grounding natural sentence in video. In EMNLP, 2018.
- Localizing natural language in videos. In AAAI, 2019.
- A unified sequence interface for vision tasks. In NeurIPS, 2022.
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021.
- Randaugment: Practical automated data augmentation with a reduced search space. In CVPR Workshops, 2020.
- Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.
- Transvg: End-to-end visual grounding with transformers. In ICCV, 2021.
- Attention over learned object embeddings enables complex visual reasoning. In NeurIPS, 2021.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- Palm-e: An embodied multimodal language model. CoRR, abs/2303.03378, 2023.
- The pile: An 800gb dataset of diverse text for language modeling. CoRR, abs/2101.00027, 2021.
- Large language models are not abstract reasoners. CoRR, abs/2305.19555, 2023.
- CATER: A diagnostic dataset for compositional actions & temporal reasoning. In ICLR, 2020.
- The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017.
- Visual programming: Compositional visual reasoning without training. CoRR, abs/2211.11559, 2022.
- Read, watch, and move: Reinforcement learning for temporally grounding natural language descriptions in videos. In AAAI, 2019.
- Learning to reason: End-to-end module networks for visual question answering. In ICCV, 2017.
- Finding "it": Weakly-supervised reference-aware visual grounding in instructional videos. In CVPR, 2018.
- Compositional attention networks for machine reasoning. In ICLR, 2018.
- Knowing where to focus: Event-aware transformer for video grounding. CoRR, abs/2308.06947, 2023.
- Embracing consistency: A one-stage approach for spatio-temporal video grounding. In NeurIPS, 2022.
- MDETR - modulated detection for end-to-end multi-modal understanding. In ICCV, 2021.
- The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
- Grounding language models to images for multimodal generation. CoRR, abs/2301.13823, 2023.
- BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
- Videochat: Chat-centric video understanding. CoRR, abs/2305.06355, 2023.
- WINNER: weakly-supervised hierarchical decomposition and alignment for spatio-temporal video grounding. In CVPR, 2023.
- Calibrating concepts and operations: Towards symbolic reasoning on real images. In ICCV, 2021.
- Decoupled weight decay regularization. In ICLR, 2019.
- Chameleon: Plug-and-play compositional reasoning with large language models. CoRR, abs/2304.09842, 2023.
- Multi-task collaborative network for joint referring expression comprehension and segmentation. In CVPR, 2020.
- Valley: Video assistant with large language model enhanced ability. CoRR, abs/2306.07207, 2023.
- Video-chatgpt: Towards detailed video understanding via large vision and language models. CoRR, abs/2306.05424, 2023.
- Diverse image captioning with context-object split latent spaces. In NeurIPS, 2020.
- Something-else: Compositional action recognition with spatial-temporal interaction networks. In CVPR, 2020.
- Learning to reason over visual objects. CoRR, abs/2303.02260, 2023.
- Moments in time dataset: One million videos for event understanding. IEEE Trans. Pattern Anal. Mach. Intell., 42(2):502–508, 2020.
- GPT-4 technical report. CoRR, abs/2303.08774, 2023.
- Kosmos-2: Grounding multimodal large language models to the world. CoRR, abs/2306.14824, 2023.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Dynamic inference with neural interpreters. In NeurIPS, 2021.
- Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 39(6):1137–1149, 2017.
- A simple neural network module for relational reasoning. In NeurIPS, 2017.
- BLOOM: A 176b-parameter open-access multilingual language model. CoRR, abs/2211.05100, 2022.
- Learning object permanence from video. In ECCV, 2020.
- Not all frames are equal: Weakly-supervised video grounding with contextual similarity and visual clustering losses. In CVPR, 2019.
- Two-stream convolutional networks for action recognition in videos. In NeurIPS, 2014.
- Visual reasoning with multi-hop feature modulation. In ECCV, 2018.
- Stvgbert: A visual-linguistic transformer based framework for spatio-temporal video grounding. In ICCV, 2021.
- Does visual pretraining help end-to-end reasoning? CoRR, abs/2307.08506, 2023.
- Vipergpt: Visual inference via python execution for reasoning. CoRR, abs/2303.08128, 2023.
- Multiple people tracking by lifted multicut and person re-identification. In CVPR, 2017.
- Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
- Learning what and where - unsupervised disentangling location and identity tracking. In ICLR, 2023.
- Object referring in videos with language and human gaze. In CVPR, 2018.
- Attention is all you need. In NeurIPS, 2017.
- Internvideo: General video foundation models via generative and discriminative learning. CoRR, abs/2212.03191, 2022.
- Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022.
- STAR: A benchmark for situated reasoning in real-world videos. In NeurIPS, 2021.
- Can I trust your answer? Visually grounded video question answering. CoRR, abs/2309.01327, 2023.
- Tubedetr: Spatio-temporal video grounding with transformers. In CVPR, 2022.
- A fast and accurate one-stage approach to visual grounding. In ICCV, 2019.
- Improving one-stage visual grounding by recursive sub-query construction. In ECCV, 2020.
- Cascaded mutual modulation for visual reasoning. In EMNLP, 2018.
- Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell., 44(6):2872–2893, 2022.
- Self-chained image-language model for video localization and question answering. CoRR, abs/2305.06988, 2023.
- Dense regression network for video grounding. In CVPR, 2020.
- ACRE: abstract causal reasoning beyond covariation. In CVPR, 2021.
- Learning algebraic representation for systematic generalization in abstract reasoning. In ECCV, 2022.
- MAN: moment alignment network for natural language moment retrieval via iterative graph adjustment. In CVPR, 2019.
- Span-based localizing network for natural language video localization. In ACL, 2020.
- Llama-adapter: Efficient fine-tuning of language models with zero-init attention. CoRR, abs/2303.16199, 2023.
- Tfcnet: Temporal fully connected networks for static unbiased temporal reasoning. CoRR, 2022.
- OPT: open pre-trained transformer language models. CoRR, abs/2205.01068, 2022.
- Multimodal chain-of-thought reasoning in language models. CoRR, abs/2302.00923, 2023.
- Hopper: Multi-hop transformer for spatiotemporal reasoning. In ICLR, 2021.