Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time (2407.01851v2)

Published 1 Jul 2024 in cs.CV, cs.AI, cs.LG, and eess.AS

Abstract: Leveraging LLMs' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has been mostly focused on tasks that only require a coarse-grained understanding of the audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate a large dataset AVFIT that comprises 3M instruction tuning samples collected from open-source datasets, and introduce MeerkatBench that unifies five challenging audio-visual tasks. We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.

Authors (7)
  1. Sanjoy Chowdhury (11 papers)
  2. Sayan Nag (38 papers)
  3. Subhrajyoti Dasgupta (4 papers)
  4. Jun Chen (374 papers)
  5. Mohamed Elhoseiny (102 papers)
  6. Ruohan Gao (39 papers)
  7. Dinesh Manocha (366 papers)