End-to-end training of time domain audio separation and recognition (1912.08462v3)

Published 18 Dec 2019 in eess.AS, cs.CL, and cs.SD

Abstract: The rising interest in single-channel multi-speaker speech separation sparked development of End-to-End (E2E) approaches to multi-speaker speech recognition. However, up until now, state-of-the-art neural network-based time domain source separation has not yet been combined with E2E speech recognition. We here demonstrate how to combine a separation module based on a Convolutional Time domain Audio Separation Network (Conv-TasNet) with an E2E speech recognizer and how to train such a model jointly by distributing it over multiple GPUs or by approximating truncated back-propagation for the convolutional front-end. To put this work into perspective and illustrate the complexity of the design space, we provide a compact overview of single-channel multi-speaker recognition systems. Our experiments show a word error rate of 11.0% on WSJ0-2mix and indicate that our joint time domain model can yield substantial improvements over cascade DNN-HMM and monolithic E2E frequency domain systems proposed so far.
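Joint training of a separation front-end with a recognizer has to cope with the inherent speaker permutation ambiguity: the separator's output channels have no fixed assignment to reference speakers, so the training loss is usually computed under the best-matching permutation (permutation invariant training). Below is a minimal NumPy sketch of such a utterance-level PIT-style loss; all function names, shapes, and the toy signals are illustrative and not taken from the paper.

```python
import itertools

import numpy as np


def pit_loss(estimates, targets, loss_fn):
    """Utterance-level permutation-invariant loss (illustrative sketch).

    Evaluates the total per-speaker loss under every permutation of the
    K estimated sources against the K reference sources and returns the
    minimum, so the separator is free to output speakers in any order.

    estimates, targets: arrays of shape (K, T).
    """
    K = len(estimates)
    best = np.inf
    for perm in itertools.permutations(range(K)):
        total = sum(loss_fn(estimates[i], targets[p])
                    for i, p in enumerate(perm))
        best = min(best, total)
    return best


def mse(a, b):
    """Mean squared error between two 1-D signals."""
    return float(np.mean((a - b) ** 2))


# Toy two-speaker example: two sinusoids as reference sources.
t = np.linspace(0.0, 1.0, 100)
targets = np.stack([np.sin(2 * np.pi * 5 * t),
                    np.sin(2 * np.pi * 9 * t)])

# The "separator" returns the sources in swapped order with a small
# offset; the PIT loss still finds the correct pairing.
estimates = targets[::-1] + 0.01

loss = pit_loss(estimates, targets, mse)  # small: ~2e-4, not the large
                                          # mismatched-pairing loss
```

In a full system of the kind the abstract describes, `loss_fn` would be the recognizer's sequence loss (e.g. CTC/attention) on each separated stream rather than a waveform MSE, and gradients would flow through both the recognizer and the convolutional separation front-end.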

Authors (7)
  1. Thilo von Neumann (16 papers)
  2. Keisuke Kinoshita (44 papers)
  3. Lukas Drude (13 papers)
  4. Christoph Boeddeker (36 papers)
  5. Marc Delcroix (94 papers)
  6. Tomohiro Nakatani (50 papers)
  7. Reinhold Haeb-Umbach (60 papers)
Citations (34)
