Look, Remember and Reason: Grounded reasoning in videos with language models (2306.17778v3)

Published 30 Jun 2023 in cs.CV and cs.LG

Abstract: Multi-modal LLMs (LM) have recently shown promising performance in high-level reasoning tasks on videos. However, existing methods still fall short in tasks like causal or compositional spatiotemporal reasoning over actions, in which model predictions need to be grounded in fine-grained low-level details, such as object motions and object interactions. In this work, we propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, and tracking, to endow the model with the required low-level visual capabilities. We show that a two-stream video encoder with spatiotemporal attention is effective at capturing the required static and motion-based cues in the video. By leveraging the LM's ability to perform the low-level surrogate tasks, we can cast reasoning in videos as the three-step process of Look, Remember, Reason wherein visual information is extracted using low-level visual skills step-by-step and then integrated to arrive at a final answer. We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets. Our approach is trainable end-to-end and surpasses state-of-the-art task-specific methods across these tasks by a large margin.

Summary

  • The paper proposes the Look, Remember, Reason (LRR) framework, which trains a language model end-to-end on low-level surrogate tasks (object detection, re-identification, and tracking) to ground its video reasoning in fine-grained visual details.
  • It employs a two-stream video encoder and cross-attention layers to effectively capture both static scene details and dynamic object motions.
  • Empirical results show significant improvements on benchmarks like ACRE, CATER, and Something-Else, setting new state-of-the-art performance levels.

Overview of "Look, Remember and Reason: Grounded reasoning in videos with language models"

The paper "Look, Remember and Reason: Grounded reasoning in videos with LLMs" presents an innovative methodology for enhancing the reasoning capabilities of multi-modal LLMs (LMs) when dealing with video inputs. It addresses the challenges of causal and spatiotemporal reasoning by proposing an approach that is fundamentally grounded in low-level visual detail extraction, making it a significant contribution to the area of machine reasoning with heterogeneous sensory inputs.

Key Methodological Advances

The central contribution of the paper is the Look, Remember, Reason (LRR) framework, which casts visual reasoning in videos as a three-step process: looking at the visual scene to extract relevant low-level information, remembering by keeping these details in the model's working memory, and reasoning by integrating them into a final answer. Each of these steps is supported by dedicated components of the LRR architecture:

  • Low-level Surrogate Tasks: The authors train the LM end-to-end on low-level surrogate tasks such as object detection, re-identification, and tracking. These tasks endow the model with the ability to ground its reasoning in fine-grained visual cues, which is crucial for understanding object motion and interactions in video data.
  • Two-Stream Video Encoder: Utilizing spatiotemporal attention mechanisms, this component effectively captures both static and dynamic features of the video frames. It enables the model to discern scene structure and object motion, addressing the density and complexity inherent in video data.
  • Cross-Attention Layers in the LM: Cross-attention layers are interleaved with the LM's self-attention layers, letting the language model's global, top-down context guide the extraction of low-level visual information from the video encoder. This top-down cross-attention integrates grounded visual evidence into the reasoning process (a minimal sketch of how these pieces might fit together follows this list).
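
To make the interplay of these components concrete, the following is a minimal, illustrative PyTorch sketch rather than the authors' implementation: a toy two-stream encoder that applies spatial attention within frames and temporal attention across frames to patch embeddings, plus a decoder block that interleaves cross-attention over the resulting visual tokens with the LM's self-attention. All module and variable names, dimensions, the fusion by concatenation, and the omission of causal masking, positional encodings, and surrogate-task heads are simplifying assumptions.

```python
import torch
import torch.nn as nn


class TwoStreamVideoEncoder(nn.Module):
    """Toy two-stream encoder: spatial attention within each frame (static
    scene cues) and temporal attention across frames (motion cues)."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x):
        # x: (batch, time, patches, dim) patch embeddings of the video frames
        b, t, p, d = x.shape
        frames = x.reshape(b * t, p, d)                         # attend within each frame
        static, _ = self.spatial_attn(frames, frames, frames)
        tubes = x.permute(0, 2, 1, 3).reshape(b * p, t, d)      # attend across time per patch
        motion, _ = self.temporal_attn(tubes, tubes, tubes)
        static = static.reshape(b, t, p, d)
        motion = motion.reshape(b, p, t, d).permute(0, 2, 1, 3)
        fused = self.proj(torch.cat([static, motion], dim=-1))  # fuse the two streams
        return fused.reshape(b, t * p, d)                       # visual tokens for the LM


class GroundedDecoderBlock(nn.Module):
    """Toy LM decoder block with cross-attention over the visual tokens
    inserted between the self-attention and feed-forward sublayers."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, tokens, visual_tokens):
        h = self.norm1(tokens)                                  # causal mask omitted for brevity
        tokens = tokens + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(tokens)
        # Top-down grounding: text queries attend over the video tokens.
        tokens = tokens + self.cross_attn(h, visual_tokens, visual_tokens, need_weights=False)[0]
        return tokens + self.ffn(self.norm3(tokens))


# Toy usage: 2 videos, 8 frames of 16 patches each, and 32 text tokens.
video_patches = torch.randn(2, 8, 16, 256)
text_tokens = torch.randn(2, 32, 256)
visual_tokens = TwoStreamVideoEncoder()(video_patches)
out = GroundedDecoderBlock()(text_tokens, visual_tokens)
print(out.shape)  # torch.Size([2, 32, 256])
```

In the paper's setting these components are trained end-to-end, with the low-level surrogate tasks supervising the visual skills that the cross-attention layers draw on.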

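The Look, Remember, Reason decomposition can also be pictured at the level of the token sequence the LM produces. The snippet below is a purely hypothetical illustration of the idea that low-level surrogate outputs (per-frame detections with re-identification track IDs) are emitted step by step and kept in context before the final answer is generated; the tag names, coordinate format, and example content are invented for exposition and do not reflect the paper's actual target format.

```python
# Hypothetical illustration only: tags, box format, and content are invented.
frames = [
    {"objects": [("cup", 1, (12, 40, 30, 58)), ("ball", 2, (60, 44, 70, 54))]},
    {"objects": [("cup", 1, (14, 40, 32, 58)), ("ball", 2, (20, 45, 30, 55))]},
]
question = "Where is the ball at the end of the video?"

steps = []
for t, frame in enumerate(frames):
    # "Look": detect and re-identify objects in the current frame.
    detections = ", ".join(f"{name}#{track_id}@{box}" for name, track_id, box in frame["objects"])
    steps.append(f"<look t={t}> {detections}")

# "Remember": the emitted detections stay in the LM's context as working memory.
# "Reason": the answer is generated last, conditioned on that grounded evidence.
target = "\n".join(steps) + "\n<answer> the ball (#2) ends up next to the cup (#1)"
print(question)
print(target)
```
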
Strong Numerical Results

The paper demonstrates impressive numerical results across various benchmarks, illustrating the effectiveness of the LRR framework. Specifically, the model significantly outperforms state-of-the-art approaches on datasets like ACRE, CATER, Something-Else, and STAR. On the ACRE dataset, the LRR model achieved an accuracy of 98.2% on the compositional split and 99.2% on the systematic split, surpassing other methods by a wide margin. Similarly, it exceeds previous results on the challenging compositional split of the Something-Else dataset and is competitive in object tracking tasks required by the CATER and STAR datasets. These results attest to the model's flexibility and its robust reasoning capability in diverse scenarios.

Implications and Future Directions

The practical implications of this research are considerable, given the increasing demand for AI systems that can interpret and reason about complex video data in fields such as autonomous vehicles, surveillance, and interactive AI systems. The capability to ground reasoning in low-level visual information while leveraging high-level LLM insights extends the functional reach of AI, potentially leading to more intuitive and context-aware digital assistants.

Theoretically, the approach outlined in this paper suggests new avenues for advancing multi-modal LMs, particularly by enhancing their capacity to process spatiotemporal information. Future work could explore the scalability of this framework with larger models or its application to other types of multi-modal data beyond video. Additionally, investigations into optimizing surrogate task selection and the incorporation of additional modalities could further refine model performance.

Overall, "Look, Remember and Reason: Grounded reasoning in videos with LLMs" enriches the field of multi-modal learning and lays the groundwork for more sophisticated AI capabilities in understanding dynamic video environments.
