Language as the Medium: Multimodal Video Classification through text only (2309.10783v1)
Abstract: Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos. Going beyond existing methods that emphasize simple activities or objects, we propose a new model-agnostic approach for generating detailed textual descriptions that capture multimodal video information. Our method leverages the extensive knowledge learnt by LLMs, such as GPT-3.5 or Llama2, to reason about textual descriptions of the visual and aural modalities, obtained from BLIP-2, Whisper and ImageBind. Without needing additional finetuning of video-text models or datasets, we demonstrate that available LLMs have the ability to use these multimodal textual descriptions as proxies for "sight" or "hearing" and perform zero-shot multimodal classification of videos in-context. Our evaluations on popular action recognition benchmarks, such as UCF-101 or Kinetics, show these context-rich descriptions can be successfully used in video understanding tasks. This method points towards a promising new research direction in multimodal classification, demonstrating how an interplay between textual, visual and auditory machine learning models can enable more holistic video understanding.
- https://github.com/guillaumekln/faster-whisper.
- https://platform.openai.com/docs/guides/gpt/chat-completions-api.
- https://www.anthropic.com/index/introducing-claude.
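As a rough illustration of the pipeline the abstract describes, the following is a minimal, hedged sketch of modality-to-text conversion followed by in-context LLM classification. It is not the authors' implementation: the model checkpoints (`Salesforce/blip2-opt-2.7b`, Whisper `base`, `gpt-3.5-turbo`), the number of sampled frames, the label set, and the prompt wording are all illustrative assumptions, and the ImageBind-derived descriptions used in the paper are omitted for brevity.

```python
# Hedged sketch: caption frames with BLIP-2, transcribe audio with Whisper,
# then ask an LLM to classify the video from text alone.
# Model names, frame sampling, labels, and prompt are illustrative assumptions.
import cv2                      # pip install opencv-python
import torch
import whisper                  # pip install openai-whisper
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from openai import OpenAI

def sample_frames(video_path, num_frames=4):
    """Uniformly sample a few RGB frames from the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(total // num_frames, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames[:num_frames]

# 1) Visual modality -> text, via BLIP-2 frame captions.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16).to("cuda")
captions = []
for frame in sample_frames("clip.mp4"):
    inputs = processor(images=frame, return_tensors="pt").to("cuda", torch.float16)
    out = blip2.generate(**inputs, max_new_tokens=30)
    captions.append(processor.decode(out[0], skip_special_tokens=True).strip())

# 2) Audio modality -> text, via Whisper transcription.
transcript = whisper.load_model("base").transcribe("clip.mp4")["text"]

# 3) Zero-shot classification in-context: the LLM only ever sees text.
labels = ["playing guitar", "surfing", "typing"]   # e.g. a small UCF-101 subset
prompt = (
    "Frame captions:\n- " + "\n- ".join(captions) +
    f"\nAudio transcript: {transcript}\n"
    f"Which of these actions is shown: {', '.join(labels)}? Answer with one label."
)
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
reply = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(reply.choices[0].message.content)
```

The key design point the sketch tries to reflect is that the LLM never sees pixels or audio, only their textual proxies, so the captioner, transcriber, and classifier can each be swapped independently.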
- Alternating gradient descent and mixture-of-experts for integrated multimodal perception. arXiv:2305.06324, 2023.
- Flamingo: a visual language model for few-shot learning. NeurIPS, 2022.
- Towards language models that can see: Computer vision through the lens of natural language. arXiv:2306.16410, 2023.
- Language models are few-shot learners. NeurIPS, 2020.
- Video ChatCaptioner: Towards the enriched spatiotemporal descriptions. arXiv:2304.04227, 2023.
- ImageBind: One embedding space to bind them all. In CVPR, 2023.
- VTC: Improving video-text retrieval with user comments. In ECCV, 2022.
- Language is not all you need: Aligning perception with language models. arXiv:2302.14045, 2023.
- Perceiver IO: A general architecture for structured inputs & outputs. arXiv:2107.14795, 2021.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597, 2023.
- OpenAI. GPT-4 technical report, 2023.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Robust speech recognition via large-scale weak supervision. In ICML, 2023.
- BLOOM: A 176B-parameter open-access multilingual language model. arXiv:2211.05100, 2022.
- Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592, 2023.
- Laura Hanu
- Anita L. Verő
- James Thewlis