SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models (2407.15841v2)

Published 22 Jul 2024 in cs.CV

Abstract: We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video LLM that can jointly capture detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as much spatial detail as possible (e.g., with 12x24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on the motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for detailed video understanding. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves comparable or even better performance compared to state-of-the-art Video LLMs that are fine-tuned on video datasets. Code has been made available at: https://github.com/apple/ml-slowfast-llava.
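
To make the abstract's two-stream design concrete, here is a minimal sketch of how the Slow and Fast token streams could be built from frozen image-encoder features. This is a hedged reconstruction, not the released implementation: the function name `slowfast_aggregate`, its arguments, and the default frame strides and pool sizes are illustrative assumptions chosen to mirror the abstract's examples (light pooling toward a 12x24 grid on the Slow path, roughly 6x downsampling on the Fast path).

```python
import torch
import torch.nn.functional as F

def slowfast_aggregate(frame_features, slow_stride=4, slow_pool=(2, 1), fast_pool=6):
    """Aggregate per-frame patch features into one visual token sequence.

    frame_features: [T, H, W, C] patch features from a frozen image encoder
    (e.g., a CLIP ViT). Strides and pool sizes are illustrative defaults,
    not the paper's exact configuration.
    """
    T, H, W, C = frame_features.shape
    grid = frame_features.permute(0, 3, 1, 2)            # [T, C, H, W]

    # Slow pathway: low frame rate, light spatial pooling so most
    # spatial detail survives (e.g., a 24x24 grid becomes 12x24).
    slow = grid[::slow_stride]
    slow = F.avg_pool2d(slow, kernel_size=slow_pool)
    slow_tokens = slow.permute(0, 2, 3, 1).reshape(-1, C)

    # Fast pathway: every sampled frame, aggressive spatial pooling
    # (e.g., 6x) to keep motion cues without exceeding the token budget.
    fast = F.avg_pool2d(grid, kernel_size=fast_pool)
    fast_tokens = fast.permute(0, 2, 3, 1).reshape(-1, C)

    # Both streams are concatenated into one visual token sequence and
    # passed, together with the text prompt, to the unmodified LLM.
    return torch.cat([slow_tokens, fast_tokens], dim=0)

# Example: 48 sampled frames with 24x24 patch features of width 1024.
feats = torch.randn(48, 24, 24, 1024)
tokens = slowfast_aggregate(feats)        # [num_tokens, 1024]
```

The key trade-off the sketch illustrates is that spatial detail is spent on a few frames while temporal coverage is bought cheaply with heavily pooled tokens, so the combined sequence stays within a typical LLM context without any video-specific training.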

Authors (8)
  1. Mingze Xu (28 papers)
  2. Mingfei Gao (26 papers)
  3. Zhe Gan (135 papers)
  4. Hong-You Chen (21 papers)
  5. Zhengfeng Lai (13 papers)
  6. Haiming Gang (6 papers)
  7. Kai Kang (25 papers)
  8. Afshin Dehghan (19 papers)
Citations (16)