FLAM: Frame-Wise Language-Audio Modeling (2505.05335v2)

Published 8 May 2025 in cs.SD and eess.AS

Abstract: Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions and simulation. Experimental results and case studies demonstrate that FLAM significantly improves the open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.

Tweets

https://twitter.com/justin_salamon/status/1937585988093784148

https://twitter.com/justin_salamon/status/1937587538044375504

https://twitter.com/tsirigoc/status/1937993225429451002

https://twitter.com/permutans/status/1937831811024863356