
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning (2211.11275v2)

Published 21 Nov 2022 in eess.AS, cs.AI, cs.CL, cs.CV, and cs.SD

Abstract: Although speech is a simple and effective way for humans to communicate with the outside world, more realistic speech interaction contains multimodal information, e.g., vision and text. How to design a unified framework that integrates different modalities and leverages different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning has not been well explored. In this paper, we propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text pre-trained Language Model). The proposed VATLM employs a unified backbone network to model modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech, and text inputs. To integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task over unified tokens, produced by our proposed unified tokenizer. We evaluate the pre-trained VATLM on audio-visual downstream tasks, including audio-visual speech recognition (AVSR) and visual speech recognition (VSR). Results show that VATLM outperforms previous state-of-the-art models, such as the audio-visual pre-trained AV-HuBERT model, and analysis also demonstrates that VATLM is capable of aligning different modalities into the same space. To facilitate future research, we release the code and pre-trained models at https://aka.ms/vatlm.
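The core recipe in the abstract, modality-specific preprocessing into a shared token space, then masked prediction of unified tokens, can be sketched as follows. This is an illustrative toy, not the authors' released code: the quantizer, names, and masking ratio are all assumptions.

```python
import random

# Hypothetical sketch of VATLM-style unified masked prediction.
# The toy tokenizer and all names here are illustrative assumptions.

CODEBOOK_SIZE = 100
MASK_ID = -1

def tokenize_unified(features):
    """Unified tokenizer: map each frame (a feature vector) to one
    discrete token id shared across modalities. Here a toy quantizer
    stands in for the paper's learned tokenizer."""
    return [int(sum(f) * 10) % CODEBOOK_SIZE for f in features]

def apply_mask(tokens, p=0.3, seed=0):
    """Replace roughly a fraction p of tokens with MASK_ID; prediction
    targets are defined only at the masked positions, mirroring the
    unified masked-prediction objective."""
    rng = random.Random(seed)
    masked, targets = [], []
    for t in tokens:
        if rng.random() < p:
            masked.append(MASK_ID)
            targets.append(t)   # backbone must predict this token
        else:
            masked.append(t)
            targets.append(None)
    return masked, targets

# Each modality would get its own lightweight preprocessor; the shared
# backbone then consumes masked unified tokens from any modality.
audio_feats = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.5], [0.2, 0.9]]
tokens = tokenize_unified(audio_feats)
masked, targets = apply_mask(tokens, p=0.5)
```

Because visual-audio pairs, audio-text pairs, and unlabeled speech or text all map to the same codebook, the single backbone can be trained on all of them with one loss, which is what lets the heterogeneous resources listed in the abstract be pooled.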

Authors (10)
  1. Qiushi Zhu (11 papers)
  2. Long Zhou (57 papers)
  3. Ziqiang Zhang (11 papers)
  4. Shujie Liu (101 papers)
  5. Binxing Jiao (18 papers)
  6. Jie Zhang (846 papers)
  7. Lirong Dai (31 papers)
  8. Daxin Jiang (138 papers)
  9. Jinyu Li (164 papers)
  10. Furu Wei (291 papers)
Citations (30)