
MovieChat+: Question-aware Sparse Memory for Long Video Question Answering (2404.17176v1)

Published 26 Apr 2024 in cs.CV

Abstract: Integrating video foundation models and LLMs to build video understanding systems can overcome the limitations of specific pre-defined vision tasks. Yet existing methods either employ complex spatial-temporal modules or rely heavily on additional perception models to extract temporal features, and they only perform well on short videos. For long videos, the computational complexity and memory costs associated with long-term temporal connections increase significantly, posing additional challenges. Taking advantage of the Atkinson-Shiffrin memory model, with tokens in Transformers employed as the carriers of memory in combination with our specially designed memory mechanism, we propose MovieChat to overcome these challenges. We lift pre-trained multi-modal LLMs to long video understanding without incorporating additional trainable temporal modules, employing a zero-shot approach. MovieChat achieves state-of-the-art performance in long video understanding, and we release the MovieChat-1K benchmark with 1K long videos, 2K temporal grounding labels, and 14K manual annotations to validate the effectiveness of our method. The code and dataset are available at https://github.com/rese1f/MovieChat.
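
The short-term/long-term memory idea described in the abstract can be illustrated with a minimal sketch: a fixed-size short-term buffer of per-frame features that, once full, is consolidated into long-term memory by repeatedly merging the most similar adjacent pair of frame features. This is not the authors' released implementation; the class name, buffer sizes, feature dimension, and the greedy cosine-similarity merging rule below are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a MovieChat-style memory:
# frames stream into a short-term buffer; when it is full, the buffer is
# consolidated into a few long-term memory tokens by merging similar
# adjacent features. All sizes here are illustrative.
import torch
import torch.nn.functional as F


class SparseVideoMemory:
    def __init__(self, short_capacity: int = 18, merged_size: int = 2):
        self.short_capacity = short_capacity   # frames held before consolidation
        self.merged_size = merged_size         # tokens kept after consolidation
        self.short_term: list[torch.Tensor] = []   # per-frame feature vectors
        self.long_term: list[torch.Tensor] = []    # consolidated memory tokens

    def add_frame(self, feat: torch.Tensor) -> None:
        """feat: (dim,) visual feature for one sampled frame."""
        self.short_term.append(feat)
        if len(self.short_term) >= self.short_capacity:
            self._consolidate()

    def _consolidate(self) -> None:
        feats = list(self.short_term)
        # Greedily merge the most similar adjacent pair until only
        # `merged_size` tokens remain, then push them into long-term memory.
        while len(feats) > self.merged_size:
            sims = [
                F.cosine_similarity(feats[i], feats[i + 1], dim=0)
                for i in range(len(feats) - 1)
            ]
            i = int(torch.stack(sims).argmax())
            merged = (feats[i] + feats[i + 1]) / 2
            feats = feats[:i] + [merged] + feats[i + 2:]
        self.long_term.extend(feats)
        self.short_term.clear()

    def memory_tokens(self) -> torch.Tensor:
        """Tokens handed to the LLM: long-term memory followed by the
        (possibly partial) short-term buffer."""
        return torch.stack(self.long_term + self.short_term)


# Usage: stream 64 random "frame features" through the memory.
mem = SparseVideoMemory()
for _ in range(64):
    mem.add_frame(torch.randn(768))
print(mem.memory_tokens().shape)  # far fewer tokens than frames
```

Merging adjacent rather than arbitrary pairs keeps the consolidated memory roughly chronological, so the token sequence passed to the LLM still reflects temporal order while its length stays bounded.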

Authors (6)
  1. Enxin Song (6 papers)
  2. Wenhao Chai (50 papers)
  3. Tian Ye (65 papers)
  4. Jenq-Neng Hwang (103 papers)
  5. Xi Li (197 papers)
  6. Gaoang Wang (68 papers)
Citations (17)