
A Simple LLM Framework for Long-Range Video Question-Answering (2312.17235v3)

Published 28 Dec 2023 in cs.CV

Abstract: We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized long-range video modeling design (e.g., memory queues, state-space layers, etc.), our approach uses a frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) coupled with an LLM (GPT-3.5, GPT-4), leading to a simple yet surprisingly effective LVQA framework. Specifically, we decompose short and long-range modeling aspects of LVQA into two stages. First, we use a short-term visual captioner to generate textual descriptions of short video clips (0.5-8s in length) densely sampled from a long input video. Afterward, an LLM aggregates the densely extracted short-term captions to perform long-range temporal reasoning needed to understand the whole video and answer a question. To analyze what makes our simple framework so effective, we thoroughly evaluate various components of our system. Our empirical analysis reveals that the choice of the visual captioner and LLM is critical for good LVQA performance. Furthermore, we show that a specialized prompt that asks the LLM first to summarize the noisy short-term visual captions and then answer a given input question leads to a significant LVQA performance boost. On EgoSchema, which is best known as a very long-form video question-answering benchmark, our method achieves 50.3% accuracy, outperforming the previous best-performing approach by 18.1% (absolute gain). In addition, our approach outperforms the previous state-of-the-art by 4.1% and 3.1% on NeXT-QA and IntentQA. We also extend LLoVi to grounded LVQA and show that it outperforms all prior methods on the NeXT-GQA dataset. We will release our code at https://github.com/CeeZh/LLoVi.

Introduction to Long-Range Video Question-Answering

The field of AI has made considerable advancements in understanding short video clips, typically ranging from a few seconds to a minute. However, comprehending longer video sequences – which can span several minutes or even hours – introduces complex challenges. To handle long-range video understanding, especially for question-answering tasks, researchers have generally relied on models equipped with specialized temporal reasoning machinery. These traditional methods invest heavily in dedicated long-range video modeling designs, such as long-range feature banks, memory queues, state-space layers, and space-time graphs, which are costly and intricate.

A Simplified Framework Using LLMs

The paper introduces a novel, language-based framework that simplifies long-range video question-answering (LVQA). Known as LLoVi, this framework combines a short-term visual captioner with an LLM such as GPT-3.5 or GPT-4, leveraging the LLM's strong capacity for long-range reasoning. Instead of incorporating complex video-specific techniques, LLoVi operates in two stages: first, it segments a long video into short clips, each described textually by a visual captioner; then, an LLM integrates these descriptions to perform comprehensive video reasoning and answer questions about the video content.
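To make the two-stage decomposition concrete, the following is a minimal sketch of how such a pipeline could be wired together. It is an illustration rather than the authors' implementation: `llovi_style_answer`, `caption_clip`, and `query_llm` are hypothetical names standing in for a short-term visual captioner (e.g., LaViLa or BLIP-2) and an LLM API call (e.g., GPT-3.5 or GPT-4), and the prompt wording is assumed.

```python
from typing import Any, Callable, List


def llovi_style_answer(
    clips: List[Any],                    # short clips (roughly 0.5-8 s) densely sampled from the long video
    question: str,
    caption_clip: Callable[[Any], str],  # hypothetical wrapper around a visual captioner (e.g., LaViLa, BLIP-2)
    query_llm: Callable[[str], str],     # hypothetical wrapper around an LLM API (e.g., GPT-3.5, GPT-4)
) -> str:
    # Stage 1: short-term visual captioning of every sampled clip.
    captions = [caption_clip(clip) for clip in clips]

    # Stage 2: the LLM aggregates the ordered captions and performs the
    # long-range temporal reasoning needed to answer the question.
    prompt = (
        "Below are captions of short clips from one long video, in temporal order:\n"
        + "\n".join(f"{i}: {c}" for i, c in enumerate(captions))
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return query_llm(prompt)
```

Because captioning and reasoning are decoupled, either component can be swapped independently, which is consistent with the paper's finding that the choice of captioner and LLM strongly affects accuracy.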

Crucial Factors and Methodology Insights

An extensive empirical study within the paper highlights several components that are critical for effective LVQA performance. The choice of both the visual captioner and the LLM proved significant. The authors further discovered that a specialized LLM prompt structure substantially elevates performance: the prompt instructs the LLM to first produce a consolidated summary of the video captions, which simplifies the task of accurately answering questions based on this synthesized narrative. With this strategy, the framework reaches 50.3% accuracy on the EgoSchema benchmark, outperforming the previous best-performing approach by 18.1% (absolute).
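As a rough illustration of this summarize-then-answer prompting strategy, the sketch below issues two LLM calls: one to condense the noisy clip captions into a coherent summary, and one to answer the question from that summary. The prompt wording is an assumption and will differ from the exact prompts used in the paper; `query_llm` is again a hypothetical LLM wrapper.

```python
from typing import Callable, List


def summarize_then_answer(
    captions: List[str],
    question: str,
    query_llm: Callable[[str], str],  # hypothetical wrapper around an LLM API
) -> str:
    # Step 1: consolidate the noisy short-term captions into one summary.
    summary = query_llm(
        "The following are captions of short clips from a single long video, "
        "listed in temporal order. Write a concise summary of what happens "
        "in the video:\n" + "\n".join(captions)
    )
    # Step 2: answer the question from the consolidated summary.
    return query_llm(
        f"Video summary: {summary}\n\n"
        f"Question: {question}\n"
        "Answer the question based on the summary above."
    )
```

Separating summarization from answering mirrors the paper's observation that the consolidated summary gives the LLM a cleaner narrative to reason over than the raw, noisy captions.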

Generalization and Grounded Question-Answering

This streamlined framework proved robust across a variety of datasets, outperforming the previous state of the art by 4.1% on NeXT-QA and 3.1% on IntentQA. Moreover, the researchers extended the framework to grounded LVQA, where the model must also identify the specific video segment that supports its answer. This extension outperformed all prior methods on the NeXT-GQA benchmark.
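One plausible way to extend the same caption-based setup to grounded LVQA is to present the LLM with timestamped captions and ask it to report the interval(s) supporting its answer. The sketch below illustrates only that general idea; it is not the paper's exact grounding procedure, and all names and prompt text are assumptions.

```python
from typing import Callable, List, Tuple


def grounded_answer(
    timestamped_captions: List[Tuple[float, float, str]],  # (start_s, end_s, caption)
    question: str,
    query_llm: Callable[[str], str],                        # hypothetical wrapper around an LLM API
) -> str:
    # Present each caption with its time interval so the LLM can point back
    # to the video segment(s) that justify its answer.
    lines = [f"[{s:.1f}s-{e:.1f}s] {c}" for s, e, c in timestamped_captions]
    prompt = (
        "Captions of short clips from one long video, with time intervals:\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}\n"
        "Answer the question and report the time interval(s) that support your answer."
    )
    return query_llm(prompt)
```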

Conclusion

The simplicity and zero-shot ability of LLoVi make it a promising direction for future development in video understanding. The code for the framework is openly available at https://github.com/CeeZh/LLoVi, which benefits the research community. By avoiding complicated video-specific mechanisms, LLoVi lets LLMs apply their innate long-range reasoning capabilities, and it does so with noteworthy efficiency and efficacy.

Authors (7)
  1. Ce Zhang (215 papers)
  2. Taixi Lu (3 papers)
  3. Md Mohaiminul Islam (13 papers)
  4. Ziyang Wang (59 papers)
  5. Shoubin Yu (15 papers)
  6. Mohit Bansal (304 papers)
  7. Gedas Bertasius (55 papers)
Citations (47)