
Audio-visual training for improved grounding in video-text LLMs (2407.15046v1)

Published 21 Jul 2024 in cs.CV, cs.CL, and cs.MM

Abstract: Recent advances in multimodal LLMs have led to several video-text models being proposed for critical video-related tasks. However, most previous works support only visual input, essentially muting the audio signal in the video. The few models that support both audio and visual input are not explicitly trained on audio data, so the effect of audio on video understanding remains largely unexplored. To this end, we propose a model architecture that handles audio-visual inputs explicitly. We train our model with both audio and visual data from a video instruction-tuning dataset. Comparisons with vision-only baselines and other audio-visual models show that training on audio data indeed leads to improved grounding of responses. For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset with audio-aware question-answer pairs.
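The abstract describes an architecture that feeds both audio and visual signals into the language model. The sketch below is a minimal, hypothetical illustration of that general idea, not the authors' actual architecture: pre-extracted audio and visual features are projected into the LLM embedding space and concatenated with the text tokens. All module names, dimensions, and token counts here are assumptions for illustration.

```python
# Hypothetical sketch of audio-visual fusion for a video-text LLM.
# Encoder dimensions and projection layers are illustrative assumptions,
# not the architecture from the paper.
import torch
import torch.nn as nn


class AudioVisualProjector(nn.Module):
    """Projects pre-extracted audio and visual features into the LLM embedding space."""

    def __init__(self, audio_dim=768, visual_dim=1024, llm_dim=4096):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, llm_dim)
        self.visual_proj = nn.Linear(visual_dim, llm_dim)

    def forward(self, audio_feats, visual_feats, text_embeds):
        # audio_feats:  (batch, audio_tokens, audio_dim)
        # visual_feats: (batch, visual_tokens, visual_dim)
        # text_embeds:  (batch, text_tokens, llm_dim) -- embedded instruction tokens
        audio_tokens = self.audio_proj(audio_feats)
        visual_tokens = self.visual_proj(visual_feats)
        # Prepend projected audio and visual tokens to the text sequence so the
        # LLM can attend over all three modalities jointly.
        return torch.cat([audio_tokens, visual_tokens, text_embeds], dim=1)


# Dummy tensors standing in for outputs of frozen audio/visual encoders.
projector = AudioVisualProjector()
audio_feats = torch.randn(1, 32, 768)
visual_feats = torch.randn(1, 64, 1024)
text_embeds = torch.randn(1, 16, 4096)
fused = projector(audio_feats, visual_feats, text_embeds)
print(fused.shape)  # torch.Size([1, 112, 4096])
```

In such a setup, instruction tuning on videos with their audio tracks (rather than video frames alone) is what the paper argues leads to better-grounded responses.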

Authors (5)
  1. Shivprasad Sagare (4 papers)
  2. Hemachandran S (1 paper)
  3. Kinshuk Sarabhai (1 paper)
  4. Prashant Ullegaddi (1 paper)
  5. Rajeshkumar SA (1 paper)