Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data (2305.15971v1)
Abstract: Neural transducer (RNNT)-based target-speaker speech recognition (TS-RNNT) directly transcribes a target speaker's voice from a multi-talker mixture. It is a promising approach for streaming applications because it avoids the extra computation cost of a target speech extraction frontend, which is a critical barrier to quick response. TS-RNNT is trained end-to-end given the input speech (i.e., mixtures and enrollment speech) and reference transcriptions. The training mixtures are generally simulated by mixing single-talker signals, but conventional TS-RNNT training does not exploit those single-talker signals themselves. This paper proposes using knowledge distillation (KD) to exploit the parallel mixture/single-talker speech data. Our proposed KD scheme uses an RNNT system pretrained with the target single-talker speech input to generate pseudo labels for the TS-RNNT training. Experimental results show that TS-RNNT systems trained with the proposed KD scheme outperform a baseline TS-RNNT.
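The abstract's KD scheme pairs a teacher RNNT (fed the clean single-talker signal) with a student TS-RNNT (fed the mixture plus enrollment speech), and the student learns from the teacher's output distribution as a pseudo label. The paper does not publish code, so the following is a minimal illustrative sketch, assuming a frame-level KL-divergence loss over the transducer output lattice of shape (T, U, V); the function names and shapes are assumptions, not the authors' implementation.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kd_loss(teacher_logits, student_logits):
    """KL(teacher || student), averaged over the (T, U) output lattice.

    teacher_logits: RNNT teacher outputs on the clean single-talker
                    signal (the pseudo label), shape (T, U, V).
    student_logits: TS-RNNT student outputs on the mixture, same shape.
    """
    log_p = log_softmax(teacher_logits)   # teacher posterior (log domain)
    log_q = log_softmax(student_logits)   # student posterior (log domain)
    p = np.exp(log_p)
    # Sum KL over vocab, then average over all lattice positions.
    return float((p * (log_p - log_q)).sum(axis=-1).mean())

def total_loss(rnnt_loss_value, teacher_logits, student_logits, weight=0.5):
    """Hypothetical combined objective: interpolate the standard RNNT
    loss with the distillation term (weight is an assumed hyperparameter)."""
    return (1.0 - weight) * rnnt_loss_value + weight * kd_loss(
        teacher_logits, student_logits)
```

In this sketch the teacher is frozen during student training, so `teacher_logits` would be precomputed or generated with gradients disabled; the interpolation weight would be tuned on a development set.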
- Takafumi Moriya
- Hiroshi Sato
- Tsubasa Ochiai
- Marc Delcroix
- Takanori Ashihara
- Kohei Matsuura
- Tomohiro Tanaka
- Ryo Masumura
- Atsunori Ogawa
- Taichi Asami