Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge (2402.16050v2)

Published 25 Feb 2024 in cs.CV and cs.CL

Abstract: Despite progress in multimodal LLMs (MLLMs), the challenge of interpreting long-form videos in response to linguistic queries persists, largely due to the inefficiency in temporal grounding and limited pre-trained context window size. In this work, we introduce Temporal Grounding Bridge (TGB), a novel framework that bootstraps MLLMs with advanced temporal grounding capabilities and broadens their contextual scope. Our framework significantly enhances the temporal capabilities of current MLLMs through three key innovations: an efficient multi-span temporal grounding algorithm applied to low-dimension temporal features projected from flow; a multimodal length extrapolation training paradigm that utilizes low-dimension temporal features to extend the training context window size; and a bootstrapping framework that bridges our model with pluggable MLLMs without requiring annotation. We validate TGB across seven video benchmarks and demonstrate substantial performance improvements compared with prior MLLMs. Notably, our model, initially trained on sequences of four frames, effectively handles sequences up to 16× longer without sacrificing performance, highlighting its scalability and effectiveness in real-world applications. Our code is publicly available at https://github.com/bigai-nlco/VideoTGB
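The abstract describes the three components only at a high level. As a rough illustration of the first idea, the sketch below shows how a cheap multi-span selection over low-dimensional, flow-derived per-frame features could pick a few keyframe windows to forward to a pluggable MLLM. This is a minimal sketch under assumed conventions: the greedy window scoring, the function select_spans, and all dimensions are hypothetical and are not taken from the paper's implementation.

"""
Illustrative sketch (not the authors' implementation) of multi-span temporal
grounding: score frames with low-dimensional temporal features projected from
optical flow, greedily pick a few high-scoring non-overlapping spans, and keep
only those frames for the downstream MLLM. All names here are hypothetical.
"""
import torch

def select_spans(frame_scores: torch.Tensor, num_spans: int = 3, span_len: int = 4):
    """Greedily pick `num_spans` non-overlapping windows of `span_len` frames
    with the highest summed relevance scores."""
    T = frame_scores.shape[0]
    # Summed score of every window [t, t + span_len)
    window_scores = frame_scores.unfold(0, span_len, 1).sum(dim=-1)  # (T - span_len + 1,)
    taken = torch.zeros(T, dtype=torch.bool)
    spans = []
    for _ in range(num_spans):
        order = torch.argsort(window_scores, descending=True)
        for start in order.tolist():
            if not taken[start:start + span_len].any():
                spans.append((start, start + span_len))
                taken[start:start + span_len] = True
                break
    return sorted(spans)

# Toy usage: 64 frames, 8-dim temporal features projected from flow, scored
# against a query embedding (both random here, purely for illustration).
temporal_feats = torch.randn(64, 8)      # low-dimensional per-frame features
query_proj = torch.randn(8)              # query projected into the same space
scores = temporal_feats @ query_proj     # per-frame relevance scores
spans = select_spans(scores, num_spans=3, span_len=4)
keyframe_ids = [t for s, e in spans for t in range(s, e)]
print(spans)  # e.g. [(5, 9), (23, 27), (48, 52)]
# `keyframe_ids` would index the frames actually passed to the pluggable MLLM.

Because the selection runs on low-dimensional features rather than full frame embeddings, a sketch like this stays cheap even for long videos, which is the efficiency argument the abstract makes.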
