Valley: Video Assistant with Large Language model Enhanced abilitY (2306.07207v2)

Published 12 Jun 2023 in cs.CV, cs.AI, and cs.CL

Abstract: LLMs, with their remarkable conversational capabilities, have demonstrated impressive performance across various applications and have emerged as formidable AI assistants. This raises an intuitive question: Can we harness the power of LLMs to build multimodal AI assistants for visual applications? Recently, several multi-modal models have been developed for this purpose. They typically pre-train an adaptation module to align the semantics of the vision encoder and LLM, followed by fine-tuning on instruction-following data. However, despite the success of this pipeline in image and language understanding, its effectiveness in joint video and language understanding has not been widely explored. In this paper, we aim to develop a novel multi-modal foundation model capable of comprehending video, image, and language within a general framework. To achieve this goal, we introduce Valley, a Video Assistant with LLM Enhanced abilitY. Valley consists of an LLM, a temporal modeling module, a visual encoder, and a simple projection module designed to bridge the visual and textual modalities. To empower Valley with video comprehension and instruction-following capabilities, we construct a video instruction dataset and adopt a two-stage tuning procedure to train it. Specifically, we employ ChatGPT to facilitate the construction of task-oriented conversation data covering various tasks, including multi-shot captioning, long video description, action recognition, causal relationship inference, etc. Subsequently, we adopt a pre-training-then-instruction-tuning pipeline to align the visual and textual modalities and improve the instruction-following capability of Valley. Qualitative experiments demonstrate that Valley has the potential to function as a highly effective video assistant that can make complex video understanding scenarios easy.
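The abstract describes Valley as a visual encoder, a temporal modeling module, and a simple projection module that bridges into the LLM's token space. The following is a minimal, illustrative PyTorch sketch of that data flow; the mean-pooling temporal module, class names, and feature dimensions are assumptions made for illustration, not the paper's exact implementation.

```python
# Hypothetical sketch of a Valley-style pipeline as described in the abstract:
# per-frame visual features are aggregated by a temporal module and projected
# into the LLM embedding space, then prepended to the text embeddings.
# Names, dimensions, and the mean-pooling choice are illustrative assumptions.
import torch
import torch.nn as nn


class TemporalPooling(nn.Module):
    """Aggregates per-frame visual features into a fixed set of video tokens."""

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, num_patches, vis_dim)
        # Average over the temporal axis -> (batch, num_patches, vis_dim)
        return frame_feats.mean(dim=1)


class ValleyStyleAssistant(nn.Module):
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.temporal = TemporalPooling()
        # Simple linear projection bridging visual and textual modalities.
        self.projector = nn.Linear(vis_dim, llm_dim)

    def forward(self, frame_feats: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        video_tokens = self.projector(self.temporal(frame_feats))
        # Prepend projected video tokens to the text embeddings fed to the LLM.
        return torch.cat([video_tokens, text_embeds], dim=1)


# Toy usage with random tensors standing in for encoder / LLM outputs.
frames = torch.randn(1, 8, 256, 1024)   # 8 frames, 256 patches, 1024-d features
text = torch.randn(1, 32, 4096)         # 32 text token embeddings
inputs = ValleyStyleAssistant()(frames, text)
print(inputs.shape)                      # torch.Size([1, 288, 4096])
```

In this sketch, the two-stage procedure from the abstract would correspond to first training only the projection layer for modality alignment and then fine-tuning on the instruction data.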

Authors (10)
  1. Ruipu Luo (6 papers)
  2. Ziwang Zhao (4 papers)
  3. Min Yang (239 papers)
  4. Junwei Dong (2 papers)
  5. Da Li (95 papers)
  6. Pengcheng Lu (13 papers)
  7. Tao Wang (700 papers)
  8. Linmei Hu (14 papers)
  9. Minghui Qiu (58 papers)
  10. Zhongyu Wei (98 papers)
Citations (139)