T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining (2404.17806v1)

Published 27 Apr 2024 in cs.SD, cs.LG, eess.AS, and cs.CL

Abstract: Contrastive language-audio pretraining (CLAP) has been developed to align the representations of audio and language, achieving remarkable performance in retrieval and classification tasks. However, current CLAP struggles to capture temporal information within audio and text features, presenting substantial limitations for tasks such as audio retrieval and generation. To address this gap, we introduce T-CLAP, a temporal-enhanced CLAP model. We use large language models (LLMs) and mixed-up strategies to generate temporal-contrastive captions for audio clips from extensive audio-text datasets. Subsequently, a new temporal-focused contrastive loss is designed to fine-tune the CLAP model by incorporating these synthetic data. We conduct comprehensive experiments and analysis in multiple downstream tasks. T-CLAP shows improved capability in capturing the temporal relationship of sound events and outperforms state-of-the-art models by a significant margin.
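
The abstract does not give the form of the temporal-focused loss, so the following is only an illustrative sketch and not the authors' formulation. It assumes a generic CLAP-style symmetric contrastive loss in which each audio clip is additionally contrasted against one temporally altered caption (for example, the same events in reversed order). The function name, the `text_neg` argument, and the temperature value are all hypothetical.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss_with_temporal_negatives(audio, text, text_neg, temperature=0.07):
    """Symmetric contrastive loss with extra hard-negative captions (assumed form).

    audio, text, text_neg: arrays of shape (batch, dim).
    text[i] is the caption paired with audio[i]; text_neg[i] is a hypothetical
    temporally altered caption for the same clip, used only as a negative.
    """
    a, t, n = l2_normalize(audio), l2_normalize(text), l2_normalize(text_neg)
    idx = np.arange(len(a))

    # Audio-to-text: each clip scores every true caption and every altered one.
    logits_a2t = a @ np.concatenate([t, n]).T / temperature  # (batch, 2 * batch)
    loss_a2t = -log_softmax(logits_a2t)[idx, idx].mean()

    # Text-to-audio: the standard in-batch contrast over true captions only.
    logits_t2a = t @ a.T / temperature  # (batch, batch)
    loss_t2a = -log_softmax(logits_t2a)[idx, idx].mean()

    return 0.5 * (loss_a2t + loss_t2a)
```

Under this assumed form, an altered caption whose embedding stays close to the original raises the loss, which pushes the encoders to separate captions that differ only in event order.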

Authors (9)
  1. Yi Yuan (54 papers)
  2. Zhuo Chen (319 papers)
  3. Xubo Liu (66 papers)
  4. Haohe Liu (59 papers)
  5. Xuenan Xu (29 papers)
  6. Dongya Jia (18 papers)
  7. Yuanzhe Chen (19 papers)
  8. Mark D. Plumbley (114 papers)
  9. Wenwu Wang (148 papers)
Citations (5)