
Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages (2205.01086v1)

Published 2 May 2022 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task -- transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training. We experiment with automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. We set new state-of-the-art results for end-to-end spoken named entity recognition, and show consistent improvements on 20 language pairs for speech-to-text translation, even when competing methods use additional text data for training. Finally, on ASR, our approach enables encoder-decoder methods to benefit from pre-training for all parts of the network, and shows comparable performance to highly optimized recent methods.
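
To make the pseudo-language idea concrete, below is a minimal, hypothetical sketch of how discrete pseudo-subword targets could be derived from audio: frame-level features from some pre-trained speech encoder are quantized with k-means, consecutive duplicates are collapsed, and a greedy BPE-style procedure merges frequent cluster-ID pairs into pseudo subwords. The function names, cluster count, and merge count are illustrative assumptions, not the paper's exact pipeline.

```python
# Hypothetical sketch: deriving pseudo-subword targets from speech features.
# Assumes frame-level representations (T x D) from some pre-trained encoder;
# all names and hyperparameters here are placeholders, not the paper's API.
import numpy as np
from sklearn.cluster import KMeans
from collections import Counter

def quantize_frames(features: np.ndarray, n_clusters: int = 25) -> list[int]:
    """Map frame-level features (T x D) to discrete cluster IDs ("pseudo phonemes")."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
    ids = km.labels_.tolist()
    # Collapse consecutive repeats so the discrete sequence stays compact.
    return [c for i, c in enumerate(ids) if i == 0 or c != ids[i - 1]]

def learn_pseudo_subwords(sequences: list[list[int]], n_merges: int = 100):
    """Greedy BPE-style merging over cluster-ID sequences to form pseudo subwords."""
    # Each token starts as a tuple holding a single cluster ID.
    seqs = [[(c,) for c in s] for s in sequences]
    merges = []
    for _ in range(n_merges):
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        merged = best[0] + best[1]
        new_seqs = []
        for s in seqs:
            out, i = [], 0
            while i < len(s):
                if i + 1 < len(s) and (s[i], s[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    # Return the learned merges (pseudo-subword vocabulary) and tokenized targets,
    # which an encoder-decoder model could be trained to "transcribe" from audio.
    return merges, seqs
```

In this sketch, the returned pseudo-subword sequences would serve as decoder targets for the self-supervised pseudo speech recognition task described in the abstract.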

Authors (7)
  1. Felix Wu (30 papers)
  2. Kwangyoun Kim (18 papers)
  3. Shinji Watanabe (416 papers)
  4. Kyu Han (5 papers)
  5. Ryan McDonald (24 papers)
  6. Kilian Q. Weinberger (105 papers)
  7. Yoav Artzi (51 papers)
Citations (33)