From Image to Video, what do we need in multimodal LLMs? (2404.11865v1)

Published 18 Apr 2024 in cs.CV

Abstract: Multimodal LLMs (MLLMs) have demonstrated profound capabilities in understanding multimodal information, ranging from Image LLMs to the more complex Video LLMs. Numerous studies have illustrated their exceptional cross-modal comprehension. Recently, integrating video foundation models with LLMs to build a comprehensive video understanding system has been proposed to overcome the limitations of specific pre-defined vision tasks. However, current advancements in Video LLMs tend to overlook the foundational contributions of Image LLMs, often opting for more complicated structures and a wide variety of multimodal data for pre-training. This approach significantly increases the associated costs. In response to these challenges, this work introduces an efficient method that strategically leverages the priors of Image LLMs, facilitating a resource-efficient transition from Image to Video LLMs. We propose RED-VILLM, a Resource-Efficient Development pipeline for Video LLMs from Image LLMs, which utilizes a temporal adaptation plug-and-play structure within the image fusion module of Image LLMs. This adaptation extends their understanding capabilities to include temporal information, enabling the development of Video LLMs that not only surpass baseline performance but also do so with minimal instruction data and training resources. Our approach highlights the potential for more cost-effective and scalable advancement in multimodal models, effectively building upon the foundational work of Image LLMs.

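The abstract only describes the architecture at a high level. As a rough illustration of what a "temporal adaptation plug-and-play" module inserted after a frozen image fusion module could look like, here is a minimal PyTorch sketch; the module name, dimensions, attention-based temporal pooling, and token layout are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch (not the paper's code) of a plug-and-play temporal adapter:
# per-frame visual tokens from a frozen image fusion module are aggregated
# across time before being passed to the LLM, so only this small module
# needs training. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class TemporalAdapter(nn.Module):
    """Aggregates per-frame visual tokens into a temporally aware sequence."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        # Lightweight self-attention across the time axis; the image encoder
        # and fusion module can stay frozen, keeping training cost low.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames, tokens_per_frame, dim)
        b, t, n, d = frame_tokens.shape
        # Pool spatial tokens per frame, then attend across frames.
        per_frame = frame_tokens.mean(dim=2)                # (b, t, d)
        attended, _ = self.temporal_attn(per_frame, per_frame, per_frame)
        temporal_tokens = self.norm(attended + per_frame)   # (b, t, d)
        # Concatenate temporal tokens with the original spatial tokens so the
        # LLM receives both appearance and motion cues.
        spatial_tokens = frame_tokens.flatten(1, 2)          # (b, t * n, d)
        return torch.cat([temporal_tokens, spatial_tokens], dim=1)


if __name__ == "__main__":
    adapter = TemporalAdapter(dim=1024)
    clip = torch.randn(2, 8, 32, 1024)  # 2 clips, 8 frames, 32 tokens/frame
    print(adapter(clip).shape)          # torch.Size([2, 264, 1024])
```

The design intent this sketch tries to capture is the one stated in the abstract: reuse an Image LLM's visual pipeline unchanged and add only a small temporal component, so that video understanding is obtained with minimal instruction data and training resources.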
Authors (5)
  1. Suyuan Huang (3 papers)
  2. Haoxin Zhang (7 papers)
  3. Yan Gao (157 papers)
  4. Yao Hu (106 papers)
  5. Zengchang Qin (29 papers)
Citations (7)