Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition (1912.01679v2)

Published 3 Dec 2019 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. In our experiments, we show that systems trained on DeCoAR consistently outperform ones trained on conventional filterbank features, giving 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our approach can drastically reduce the amount of labeled data required; unsupervised training on LibriSpeech then supervision with 100 hours of labeled data achieves performance on par with training on all 960 hours directly. Pre-trained models and code will be released online.
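
To make the pretraining objective concrete, below is a minimal sketch of a DeCoAR-style representation learner, assuming a bidirectional LSTM encoder over log filterbank features, one small prediction head per slice offset, and an L1 reconstruction loss. The class name, layer sizes, and slice width are illustrative assumptions, not the authors' released configuration.

```python
# Minimal sketch of a DeCoAR-style pretraining objective (assumptions:
# bidirectional LSTM encoder, per-offset feed-forward heads, L1 loss;
# the paper's released code may differ in architecture and hyperparameters).
import torch
import torch.nn as nn


class DeCoARSketch(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, num_layers=4, slice_width=16):
        super().__init__()
        self.hidden = hidden
        self.slice_width = slice_width  # K: frames ahead of the anchor frame
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=num_layers,
                               batch_first=True, bidirectional=True)
        # One head per offset 0..K, each mapping the concatenated
        # forward/backward context to a reconstructed filterbank frame.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                          nn.Linear(hidden, feat_dim))
            for _ in range(slice_width + 1)
        ])

    def forward(self, feats):
        """feats: (batch, time, feat_dim) log filterbank features."""
        B, T, _ = feats.shape
        K = self.slice_width
        out, _ = self.encoder(feats)                      # (B, T, 2*hidden)
        fwd, bwd = out[..., :self.hidden], out[..., self.hidden:]
        # For each anchor t, reconstruct the slice feats[t .. t+K] from the
        # forward state at t (past context) and the backward state at t+K
        # (future context).
        ctx = torch.cat([fwd[:, :T - K, :], bwd[:, K:, :]], dim=-1)
        loss = feats.new_zeros(())
        for offset, head in enumerate(self.heads):
            pred = head(ctx)                              # (B, T-K, feat_dim)
            target = feats[:, offset:offset + T - K, :]
            loss = loss + (pred - target).abs().mean()    # L1 reconstruction
        return loss / len(self.heads)

    @torch.no_grad()
    def extract(self, feats):
        # After pretraining, the frozen contextual states serve as DeCoAR
        # features for the downstream acoustic model.
        out, _ = self.encoder(feats)
        return out
```

In this sketch, the reconstruction loss would be minimized over unlabeled audio (e.g. LibriSpeech); the frozen `extract` output would then replace raw filterbank features as the input to a CTC-based end-to-end model trained on the smaller labeled subset, as described in the abstract.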

Authors (4)
  1. Shaoshi Ling (8 papers)
  2. Yuzong Liu (12 papers)
  3. Julian Salazar (17 papers)
  4. Katrin Kirchhoff (36 papers)
Citations (136)
