
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning (2410.19702v2)

Published 25 Oct 2024 in cs.CV, cs.AI, and cs.MM

Abstract: Multimodal LLMs (MLLMs) have demonstrated impressive performance in short video understanding. However, understanding long-form videos remains challenging for MLLMs. This paper proposes TimeSuite, a collection of new designs to adapt existing short-form video MLLMs for long video understanding, including a simple yet efficient framework to process long video sequences, a high-quality video dataset for grounded tuning of MLLMs, and a carefully designed instruction tuning task to explicitly incorporate grounding supervision into the traditional QA format. Specifically, based on VideoChat, we propose our long-video MLLM, coined VideoChat-T, by implementing token shuffling to compress long video tokens and introducing Temporal Adaptive Position Encoding (TAPE) to enhance the temporal awareness of visual representations. Meanwhile, we introduce TimePro, a comprehensive grounding-centric instruction tuning dataset composed of 9 tasks and 349k high-quality grounded annotations. Notably, we design a new instruction tuning task type, called Temporal Grounded Caption, which performs detailed video description together with prediction of the corresponding timestamps. This explicit temporal location prediction guides the MLLM to correctly attend to the visual content when generating descriptions, and thus reduces the hallucination risk caused by the LLM. Experimental results demonstrate that TimeSuite provides a successful solution to enhance the long video understanding capability of short-form video MLLMs, achieving improvements of 5.6% and 6.8% on the Egoschema and VideoMME benchmarks, respectively. In addition, VideoChat-T exhibits robust zero-shot temporal grounding capabilities, significantly outperforming existing state-of-the-art MLLMs. After fine-tuning, it performs on par with traditional supervised expert models.
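The abstract names two architectural additions, token shuffling to compress long video token sequences and Temporal Adaptive Position Encoding (TAPE), without specifying their exact form. The PyTorch sketch below illustrates one plausible reading: folding groups of adjacent tokens into the channel dimension followed by a linear projection, and a zero-initialized depthwise temporal convolution added residually, in the spirit of conditional positional encodings (reference 5 below). The class names `TokenShuffle` and `TAPE`, the compression ratio of 4, and the kernel size of 3 are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class TokenShuffle(nn.Module):
    """Hypothetical sketch: compress a long token sequence by folding
    `ratio` adjacent tokens into the channel dimension, then projecting
    back to the original width. The paper's exact operator may differ."""

    def __init__(self, dim: int, ratio: int = 4):
        super().__init__()
        self.ratio = ratio
        self.proj = nn.Linear(dim * ratio, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim); num_tokens must be divisible by ratio
        b, n, d = x.shape
        x = x.reshape(b, n // self.ratio, self.ratio * d)  # fold neighbours
        return self.proj(x)  # (batch, num_tokens // ratio, dim)


class TAPE(nn.Module):
    """Hypothetical TAPE sketch: a zero-initialized depthwise 1D
    convolution over the temporal axis, added as a residual so the
    module is an identity map at initialization."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.dw_conv = nn.Conv1d(dim, dim, kernel_size,
                                 padding=kernel_size // 2, groups=dim)
        nn.init.zeros_(self.dw_conv.weight)
        nn.init.zeros_(self.dw_conv.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); Conv1d expects (batch, dim, time)
        return x + self.dw_conv(x.transpose(1, 2)).transpose(1, 2)


# Usage: compress 1024 frame tokens to 256 before feeding the LLM.
tokens = torch.randn(2, 1024, 768)
compressed = TokenShuffle(768, ratio=4)(TAPE(768)(tokens))
print(compressed.shape)  # torch.Size([2, 256, 768])
```

Zero-initializing the depthwise convolution makes TAPE a no-op at the start of training, so the pretrained short-video MLLM's representations are left undisturbed when grounded tuning begins; this is a common design choice for such adapters, assumed here rather than confirmed by the paper.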

References (65)
  1. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pp.  5803–5812, 2017.
  2. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  3. Sharegpt4video: Improving video understanding and generation with better captions. arXiv preprint arXiv:2406.04325, 2024.
  4. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.
  5. Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882, 2021.
  6. Videoagent: A memory-augmented multimodal agent for video understanding. arXiv preprint arXiv:2403.11481, 2024.
  7. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024.
  8. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, pp.  5267–5275, 2017.
  9. xval: A continuous number encoding for large language models. arXiv preprint arXiv:2310.02989, 2023.
  10. Creating summaries from user videos. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VII 13, pp.  505–520. Springer, 2014.
  11. Vtimellm: Empower llm to grasp video moments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  14271–14280, 2024a.
  12. Lita: Language instructed temporal-localization assistant. arXiv preprint arXiv:2403.19046, 2024b.
  13. Multimodal pretraining for dense video captioning. arXiv preprint arXiv:2011.11760, 2020.
  14. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023.
  15. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
  16. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  17. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  13700–13710, 2024.
  18. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pp.  706–715, 2017.
  19. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems, 34:11846–11858, 2021a.
  20. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems, 34:11846–11858, 2021b.
  21. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pp.  19730–19742. PMLR, 2023a.
  22. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023b.
  23. Unmasked teacher: Towards training-efficient video foundation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  19948–19960, 2023c.
  24. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  22195–22206, 2024a.
  25. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023d.
  26. Groundinggpt: Language enhanced multi-modal grounding model. CoRR, 2024b.
  27. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023a.
  28. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  2794–2804, 2023b.
  29. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024a.
  30. St-llm: Large language models are effective temporal learners. arXiv preprint arXiv:2404.00308, 2024b.
  31. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
  32. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36, 2023.
  33. Correlation-guided query-dependency calibration in video representation learning for temporal grounding. arXiv preprint arXiv:2311.08835, 2023a.
  34. Query-dependent video representation for moment retrieval and highlight detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  23023–23033, 2023b.
  35. Queryd: A video dataset with high-quality text and audio narrations. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  2265–2269. IEEE, 2021.
  36. Chatvtg: Video temporal grounding via chat with video dialogue large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  1847–1856, 2024.
  37. Timechat: A time-sensitive multimodal large language model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  14313–14323, 2024.
  38. Sharegemini: Scaling up video caption data for multimodal large language models, June 2024. URL https://github.com/Share14/ShareGemini.
  39. Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  18221–18232, 2024a.
  40. Moviechat+: Question-aware sparse memory for long video question answering. arXiv preprint arXiv:2404.17176, 2024b.
  41. Tvsum: Summarizing web videos using titles. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  5179–5187, 2015.
  42. Coin: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  1207–1216, 2019.
  43. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  44. Three types of incremental learning. Nature Machine Intelligence, 4(12):1185–1197, 2022.
  45. Temporal segment networks for action recognition in videos. IEEE Trans. Pattern Anal. Mach. Intell., 41(11):2740–2755, 2019.
  46. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. Advances in Neural Information Processing Systems, 36, 2024a.
  47. Videoagent: Long-form video understanding with large language model as agent. arXiv preprint arXiv:2403.10517, 2024b.
  48. Longllava: Scaling multi-modal llms to 1000 images efficiently via hybrid architecture. arXiv preprint arXiv:2409.02889, 2024c.
  49. Internvideo2: Scaling video foundation models for multimodal video understanding. arXiv preprint arXiv:2403.15377, 2024d.
  50. Hawkeye: Training video-text llms for grounding text in videos. arXiv preprint arXiv:2403.10228, 2024e.
  51. Videotree: Adaptive tree-based video representation for llm reasoning on long videos. arXiv preprint arXiv:2405.19209, 2024f.
  52. Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks. arXiv preprint arXiv:2406.08394, 2024.
  53. Can i trust your answer? visually grounded video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  13204–13214, 2024.
  54. Longvila: Scaling long-context visual language models for long videos. arXiv preprint arXiv:2408.10188, 2024.
  55. Unloc: A unified framework for video localization tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  13623–13633, 2023.
  56. Merlin: Empowering multimodal llms with foresight minds. arXiv preprint arXiv:2312.00589, 2023.
  57. Hierarchical video-moment retrieval and step-captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  23056–23065, 2023.
  58. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  16375–16387, 2022.
  59. Adaptive edge-aware semantic interaction network for salient object detection in optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 2023.
  60. Unimd: Towards unifying moment retrieval and temporal action detection. arXiv preprint arXiv:2404.04933, 2024.
  61. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
  62. Flash-vstream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024a.
  63. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024b.
  64. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024.
  65. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.