
Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning (2002.00125v1)

Published 1 Feb 2020 in cs.SD, cs.LG, and eess.AS

Abstract: In this work, we investigated the teacher-student training paradigm to train a fully learnable multi-channel acoustic model for far-field automatic speech recognition (ASR). Using a large offline teacher model trained on beamformed audio, we trained a simpler multi-channel student acoustic model used in the speech recognition system. For the student, both multi-channel feature extraction layers and the higher classification layers were jointly trained using the logits from the teacher model. In our experiments, compared to a baseline model trained on about 600 hours of transcribed data, a relative word-error rate (WER) reduction of about 27.3% was achieved when using an additional 1800 hours of untranscribed data. We also investigated the benefit of pre-training the multi-channel front end to output the beamformed log-mel filter bank energies (LFBE) using L2 loss. We find that pre-training improves the word error rate by 10.7% when compared to a multi-channel model directly initialized with a beamformer and mel-filter bank coefficients for the front end. Finally, combining pre-training and teacher-student training produces a WER reduction of 31% compared to our baseline.
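The two training objectives described in the abstract can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the authors' implementation: `distillation_loss` is the standard teacher-student cross-entropy on temperature-scaled soft targets (the student matches the teacher's logits), and `frontend_pretrain_loss` is the L2 loss that pushes a learnable front end toward beamformed LFBE targets. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over the last axis, numerically stabilized.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # Cross-entropy between the teacher's soft targets and the student's
    # predictions: the student is trained to reproduce the teacher's logits.
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + 1e-12)
    return -np.mean((p_teacher * log_p_student).sum(axis=-1))

def frontend_pretrain_loss(predicted_lfbe, target_lfbe):
    # L2 loss driving the learnable multi-channel front end toward
    # beamformed log-mel filter-bank energies (LFBE).
    return np.mean((predicted_lfbe - target_lfbe) ** 2)

# Toy example: a student whose logits are close to the teacher's
# incurs a small but nonzero distillation loss.
teacher = np.array([[2.0, 0.5, -1.0]])
student = np.array([[1.5, 0.7, -0.5]])
loss = distillation_loss(student, teacher)
```

In the paper's pipeline, the front end would first be pre-trained with the L2 objective, after which the whole student (front end plus classification layers) is trained jointly against the teacher's logits.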

Authors (5)
  1. Sanna Wager
  2. Aparna Khare
  3. Minhua Wu
  4. Kenichi Kumatani
  5. Shiva Sundaram
Citations (1)
