SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models (2407.15841v2)
Abstract: We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video LLM that can jointly capture detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as much spatial detail as possible (e.g., with 12x24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on the motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for detailed video understanding. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves comparable or even better performance compared to state-of-the-art Video LLMs that are fine-tuned on video datasets. Code has been made available at: https://github.com/apple/ml-slowfast-llava.
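To make the two-stream design concrete, below is a minimal sketch of how Slow and Fast visual tokens could be aggregated before being passed to the LLM. It assumes per-frame patch features from a frozen image encoder; the function name `slowfast_aggregate`, the default frame count, and the pooling stride are illustrative assumptions, not the exact configuration or API of the released SF-LLaVA code.

```python
# Sketch of the two-stream SlowFast token aggregation described in the abstract.
# Shapes, frame counts, and pooling parameters are illustrative assumptions.
import torch
import torch.nn.functional as F


def slowfast_aggregate(frame_features: torch.Tensor,
                       num_slow_frames: int = 8,
                       fast_pool_stride: int = 6) -> torch.Tensor:
    """Combine visual tokens from a Slow and a Fast pathway.

    frame_features: [T, H, W, C] per-frame patch features from a frozen
                    image encoder (T uniformly sampled frames, H x W patches).
    Returns a [num_tokens, C] tensor used as the visual prefix for the LLM.
    """
    T, H, W, C = frame_features.shape

    # Slow pathway: a few frames at full spatial resolution (spatial detail).
    slow_idx = torch.linspace(0, T - 1, num_slow_frames).long()
    slow_tokens = frame_features[slow_idx].reshape(-1, C)      # [T_slow*H*W, C]

    # Fast pathway: all frames, aggressive spatial pooling (motion cues).
    fast = frame_features.permute(0, 3, 1, 2)                  # [T, C, H, W]
    fast = F.avg_pool2d(fast, kernel_size=fast_pool_stride,
                        stride=fast_pool_stride)               # [T, C, H', W']
    fast_tokens = fast.permute(0, 2, 3, 1).reshape(-1, C)      # [T*H'*W', C]

    # Concatenate both streams to stay within the LLM's token budget
    # while covering both spatial and temporal context.
    return torch.cat([slow_tokens, fast_tokens], dim=0)
```

The pooling stride controls the trade-off: a larger stride on the Fast pathway keeps many frames affordable, while the Slow pathway preserves fine spatial detail on a small subset of frames.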
Authors: Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan