Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding (2406.13275v2)

Published 19 Jun 2024 in cs.SD, cs.CL, and eess.AS

Abstract: Automated audio captioning (AAC) is an audio-to-text task that describes audio content in natural language. Recently, advances in LLMs, together with improved training approaches for audio encoders, have opened up possibilities for improving AAC. We therefore explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED) is used to improve the effectiveness of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to the LLM and compressing the acoustic tokens; 2) we investigate the advantages of using Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and text decoder are optimized with low-rank adaptation (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a 33.0 SPIDEr-FL score, outperforming the winner of DCASE 2023 Task 6A.
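
As a concrete illustration of the bridging step, here is a minimal PyTorch sketch of a Q-Former-style module: a fixed set of learnable query tokens cross-attends to the audio encoder's acoustic tokens, compressing a variable-length sequence into a short prefix projected into the LLM's embedding space. The dimensions, query count, and single cross-attention layer are illustrative assumptions, not the paper's reported configuration.

```python
# Hedged sketch of Q-Former-style bridging (assumed dims, not the paper's).
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    def __init__(self, audio_dim=768, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        # Learnable queries: a fixed-length "summary" of the audio.
        self.queries = nn.Parameter(torch.randn(num_queries, audio_dim) * 0.02)
        # Cross-attention: queries attend to the acoustic tokens.
        self.cross_attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)
        # Project into the LLM (e.g. Llama 2 7B, hidden size 4096) embedding space.
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, acoustic_tokens):
        # acoustic_tokens: (batch, seq_len, audio_dim) from the audio encoder.
        b = acoustic_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        attended, _ = self.cross_attn(q, acoustic_tokens, acoustic_tokens)
        # (batch, num_queries, llm_dim): a compressed prefix for the decoder.
        return self.proj(self.norm(q + attended))

# Example: compress 500 acoustic tokens down to 32 LLM-space embeddings.
bridge = QFormerBridge()
prefix = bridge(torch.randn(2, 500, 768))
print(prefix.shape)  # torch.Size([2, 32, 4096])
```

The abstract also states that both the audio encoder and text decoder are optimized with LoRA. Below is a hedged sketch of applying LoRA to the Llama 2 decoder using the Hugging Face peft library; the rank, scaling factor, and target modules are assumptions, not the paper's reported hyperparameters.

```python
# Hedged sketch: LoRA adapters on the text decoder via peft (assumed settings).
from peft import LoraConfig, get_peft_model
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    r=8,                                  # low-rank dimension (assumed)
    lora_alpha=16,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small LoRA adapters are trained
```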

Authors (8)
  1. Jizhong Liu (4 papers)
  2. Gang Li (579 papers)
  3. Junbo Zhang (84 papers)
  4. Heinrich Dinkel (29 papers)
  5. Yongqing Wang (29 papers)
  6. Zhiyong Yan (16 papers)
  7. Yujun Wang (61 papers)
  8. Bin Wang (750 papers)
