Slot-VLM: SlowFast Slots for Video-Language Modeling (2402.13088v1)

Published 20 Feb 2024 in cs.CV

Abstract: Video-Language Models (VLMs), powered by advancements in LLMs, are charting new frontiers in video understanding. A pivotal challenge is the development of an efficient method to encapsulate video content into a set of representative tokens that align with LLMs. In this work, we introduce Slot-VLM, a novel framework designed to generate semantically decomposed video tokens, in terms of object-wise and event-wise visual representations, to facilitate LLM inference. In particular, we design a SlowFast Slots module, i.e., SF-Slots, that adaptively aggregates the dense video tokens from the CLIP vision encoder into a set of representative slots. To take into account both spatial object details and varied temporal dynamics, SF-Slots is built with a dual-branch structure. The Slow-Slots branch focuses on extracting object-centric slots from features at high spatial resolution but a low (slow) frame sample rate, emphasizing detailed object information. Conversely, the Fast-Slots branch is engineered to learn event-centric slots from features at a high temporal sample rate but low spatial resolution. These complementary slots are combined to form the vision context, serving as the input to the LLM for efficient question answering. Our experimental results demonstrate the effectiveness of Slot-VLM, which achieves state-of-the-art performance on video question answering.
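The abstract describes the SF-Slots module only at a high level. The sketch below illustrates one plausible reading of that dual-branch design in PyTorch: a standard slot-attention aggregator applied twice, once to sparsely sampled frames at full spatial resolution (object-centric "slow" slots) and once to all frames at reduced spatial resolution (event-centric "fast" slots), with the two slot sets concatenated as the vision context for the LLM. All class names (`SlotAttention`, `SFSlots`), slot counts, sampling strides, and pooling factors are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of a SlowFast-Slots style aggregator, based only on the abstract.
# Hyperparameters and module names are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlotAttention(nn.Module):
    """Iterative slot attention (Locatello et al., 2020) over a set of input tokens."""

    def __init__(self, num_slots: int, dim: int, iters: int = 3):
        super().__init__()
        self.iters, self.scale = iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q, self.to_k, self.to_v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, D)
        x = self.norm_in(x)
        k, v = self.to_k(x), self.to_v(x)
        slots = self.slots_mu.expand(x.size(0), -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = F.softmax(torch.einsum('bnd,bsd->bns', k, q) * self.scale, dim=-1)  # slots compete
            attn = attn / attn.sum(dim=1, keepdim=True)            # weighted mean over inputs
            updates = torch.einsum('bns,bnd->bsd', attn, v)
            slots = self.gru(updates.reshape(-1, updates.size(-1)),
                             slots.reshape(-1, updates.size(-1))).view(x.size(0), -1, updates.size(-1))
        return slots                                               # (B, num_slots, D)


class SFSlots(nn.Module):
    """Dual-branch aggregation of dense CLIP tokens into a compact vision context."""

    def __init__(self, dim: int = 768, slow_slots: int = 8, fast_slots: int = 8,
                 slow_stride: int = 16, fast_pool: int = 4):
        super().__init__()
        self.slow_stride = slow_stride      # keep 1 of every `slow_stride` frames (low frame rate)
        self.fast_pool = fast_pool          # shrink the patch grid by this factor (low resolution)
        self.slow_attn = SlotAttention(slow_slots, dim)   # object-centric slots per kept frame
        self.fast_attn = SlotAttention(fast_slots, dim)   # event-centric slots over the whole clip

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, H, W, D) dense per-frame patch features from a frozen CLIP encoder
        B, T, H, W, D = tokens.shape

        # Slow branch: few frames, full spatial grid -> object slots for each kept frame.
        slow = tokens[:, ::self.slow_stride]                       # (B, T', H, W, D)
        slow = slow.reshape(B * slow.size(1), H * W, D)
        slow_slots = self.slow_attn(slow).reshape(B, -1, D)        # (B, T' * slow_slots, D)

        # Fast branch: all frames, coarse spatial grid -> event slots over time.
        fast = tokens.permute(0, 1, 4, 2, 3).reshape(B * T, D, H, W)
        fast = F.adaptive_avg_pool2d(fast, H // self.fast_pool)    # reduce spatial resolution
        fast = fast.flatten(2).permute(0, 2, 1).reshape(B, -1, D)  # (B, T * h*w, D)
        fast_slots = self.fast_attn(fast)                          # (B, fast_slots, D)

        # Concatenate the complementary slots as the vision context fed to the LLM.
        return torch.cat([slow_slots, fast_slots], dim=1)


# Example: 32 frames of 16x16 CLIP patch tokens -> a few dozen slot tokens.
ctx = SFSlots()(torch.randn(1, 32, 16, 16, 768))
print(ctx.shape)  # torch.Size([1, 24, 768]): 2 kept frames * 8 slow slots + 8 fast slots
```

Under these assumptions, the LLM sees on the order of tens of slot tokens per clip instead of thousands of dense patch tokens, which is the efficiency argument the abstract makes for aligning video content with the LLM input.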

Authors (5)
  1. Jiaqi Xu (49 papers)
  2. Cuiling Lan (60 papers)
  3. Wenxuan Xie (22 papers)
  4. Xuejin Chen (29 papers)
  5. Yan Lu (179 papers)
Citations (5)