End-to-end training of time domain audio separation and recognition (1912.08462v3)

Published 18 Dec 2019 in eess.AS, cs.CL, and cs.SD

Abstract: The rising interest in single-channel multi-speaker speech separation has sparked the development of End-to-End (E2E) approaches to multi-speaker speech recognition. However, state-of-the-art neural network-based time domain source separation has not yet been combined with E2E speech recognition. Here we demonstrate how to combine a separation module based on a Convolutional Time domain Audio Separation Network (Conv-TasNet) with an E2E speech recognizer, and how to train such a model jointly, either by distributing it over multiple GPUs or by approximating truncated back-propagation for the convolutional front-end. To put this work into perspective and illustrate the complexity of the design space, we provide a compact overview of single-channel multi-speaker recognition systems. Our experiments show a word error rate of 11.0% on WSJ0-2mix and indicate that our joint time domain model can yield substantial improvements over the cascade DNN-HMM and monolithic E2E frequency domain systems proposed so far.

Authors (7)

Thilo von Neumann
Keisuke Kinoshita
Lukas Drude
Christoph Boeddeker
Marc Delcroix
Tomohiro Nakatani
Reinhold Haeb-Umbach

Citations (34)
