
CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations (2109.00181v1)

Published 1 Sep 2021 in cs.SD and cs.AI

Abstract: Existing audio-language task-specific predictive approaches focus on building complicated late-fusion mechanisms. However, these models face challenges of overfitting with limited labels and low model generalization ability. In this paper, we present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language through two proxy tasks on a large amount of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. After fine-tuning our pre-trained model on multiple downstream audio-and-language tasks, we observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification. On this basis, we further propose a specially-designed fusion mechanism that can be used in the fine-tuning phase, which allows our pre-trained model to achieve better performance. Lastly, we present detailed ablation studies showing that both our novel cross-modality fusion component and our audio-language pre-training methods contribute significantly to the promising results.
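
The abstract describes pre-training with two masked proxy objectives over paired audio and text: masked language modeling on the text stream and masked cross-modal acoustic modeling on the audio stream, both conditioned on a jointly contextualized representation. Below is a minimal, hypothetical PyTorch sketch of that training setup; the module names, dimensions, and the single shared encoder are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two proxy objectives: masked language modeling on text
# tokens and masked acoustic-frame reconstruction, both computed from a joint
# audio+text contextualization. Sizes and layer counts are placeholders.
import torch
import torch.nn as nn


class CrossModalEncoder(nn.Module):
    def __init__(self, vocab_size=30522, audio_dim=80, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        # A single shared Transformer over the concatenated audio+text sequence stands in
        # for the paper's intra-/inter-modality attention stack (an assumption here).
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)      # predicts masked text tokens
        self.acoustic_head = nn.Linear(d_model, audio_dim)  # reconstructs masked audio frames

    def forward(self, audio_feats, text_ids):
        a = self.audio_proj(audio_feats)             # (B, Ta, d_model)
        t = self.text_emb(text_ids)                  # (B, Tt, d_model)
        h = self.encoder(torch.cat([a, t], dim=1))   # joint contextualization of both modalities
        h_audio, h_text = h[:, :a.size(1)], h[:, a.size(1):]
        return self.mlm_head(h_text), self.acoustic_head(h_audio)


# Toy pre-training step on random data; real training would sample masks over
# paired speech features and transcripts.
model = CrossModalEncoder()
audio = torch.randn(2, 50, 80)                    # (batch, audio frames, filterbank dim)
text = torch.randint(0, 30522, (2, 20))           # (batch, text tokens)
text_mask = torch.zeros(2, 20, dtype=torch.bool)  # positions whose tokens were masked out
text_mask[:, :3] = True
frame_mask = torch.zeros(2, 50, dtype=torch.bool) # positions whose frames were masked out
frame_mask[:, :8] = True

mlm_logits, frame_recon = model(audio, text)
mlm_loss = nn.functional.cross_entropy(mlm_logits[text_mask], text[text_mask])
acoustic_loss = nn.functional.l1_loss(frame_recon[frame_mask], audio[frame_mask])
loss = mlm_loss + acoustic_loss                   # joint pre-training objective
loss.backward()
```

The two losses are combined into a single pre-training objective; after pre-training, the encoder would be fine-tuned on downstream audio-and-language tasks such as emotion classification or speaker verification.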

Authors (5)
  1. Hang Li (277 papers)
  2. Yu Kang (61 papers)
  3. Tianqiao Liu (12 papers)
  4. Wenbiao Ding (28 papers)
  5. Zitao Liu (76 papers)
Citations (16)
