Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning (2410.10913v3)
Abstract: Retrieval-augmented generation can improve audio captioning by incorporating relevant audio-text pairs from a knowledge base. Existing methods typically rely solely on the input audio as a unimodal retrieval query. In contrast, we propose Generation-Assisted Multimodal Querying, which generates a text description of the input audio to enable multimodal querying. This approach aligns the query modality with the audio-text structure of the knowledge base, leading to more effective retrieval. Furthermore, we introduce a novel progressive learning strategy that gradually increases the number of interleaved audio-text pairs to enhance the training process. Our experiments on AudioCaps, Clotho, and Auto-ACD demonstrate that our approach achieves state-of-the-art results across these benchmarks.
Collections
Sign up for free to add this paper to one or more collections.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.