Bridging Language Gaps in Audio-Text Retrieval (2406.07012v2)

Published 11 Jun 2024 in cs.SD, cs.CL, and eess.AS

Abstract: Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions poses a limitation on the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose a language enhancement (LE), using a multilingual text encoder (SONAR) to encode the text data with language-specific information. Additionally, we optimize the audio encoder through the application of consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval. Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho. Simultaneously, the approach exhibits proficiency in retrieving content in seven other languages with only 10% of additional language-enhanced training data, yielding promising results. The source code is publicly available at https://github.com/zyyan4/ml-clap.
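
As a rough illustration of the retrieval task the abstract describes, the sketch below ranks audio clips for a text query by cosine similarity in a shared embedding space and computes recall@k. It is not the authors' implementation: the SONAR and CED encoders are not used, the toy vectors stand in for encoder outputs, and the function names (`l2_normalize`, `rank_audio_for_text`, `recall_at_k`) are hypothetical. It also assumes that caption i is paired with audio clip i.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_audio_for_text(text_emb, audio_embs):
    """Return audio indices sorted by descending cosine similarity to text_emb."""
    t = l2_normalize(text_emb)
    scores = []
    for audio_emb in audio_embs:
        a = l2_normalize(audio_emb)
        scores.append(sum(x * y for x, y in zip(t, a)))
    return sorted(range(len(audio_embs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(text_embs, audio_embs, k):
    """Fraction of captions whose paired clip appears in the top k results.

    Assumption for this sketch: caption i describes audio clip i.
    """
    hits = 0
    for i, text_emb in enumerate(text_embs):
        if i in rank_audio_for_text(text_emb, audio_embs)[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy embeddings standing in for encoder outputs (hypothetical values).
audio_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.6, 0.0, 0.5]]

r_at_1 = recall_at_k(text_embs, audio_embs, k=1)
r_at_2 = recall_at_k(text_embs, audio_embs, k=2)
print(f"R@1 = {r_at_1:.3f}, R@2 = {r_at_2:.3f}")
```

In this toy setup the third caption is closer to the first clip than to its own, so it counts as a miss at k=1 and a hit at k=2. Audio-to-text retrieval would follow the same pattern with the roles of the two embedding sets swapped.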

Authors (7)
  1. Zhiyong Yan (16 papers)
  2. Heinrich Dinkel (29 papers)
  3. Yongqing Wang (29 papers)
  4. Jizhong Liu (4 papers)
  5. Junbo Zhang (84 papers)
  6. Yujun Wang (61 papers)
  7. Bin Wang (750 papers)
Citations (1)