Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data (2305.15971v1)

Published 25 May 2023 in eess.AS

Abstract: Neural transducer (RNNT)-based target-speaker speech recognition (TS-RNNT) directly transcribes a target speaker's voice from a multi-talker mixture. It is a promising approach for streaming applications because it does not incur the extra computation costs of a target speech extraction frontend, which is a critical barrier to quick response. TS-RNNT is trained end-to-end given the input speech (i.e., mixtures and enrollment speech) and reference transcriptions. The training mixtures are generally simulated by mixing single-talker signals, but conventional TS-RNNT training does not utilize the single-talker signals themselves. This paper proposes using knowledge distillation (KD) to exploit the parallel mixture/single-talker speech data. Our proposed KD scheme uses an RNNT system pretrained with the target single-talker speech input to generate pseudo labels for the TS-RNNT training. Experimental results show that TS-RNNT systems trained with the proposed KD scheme outperform a baseline TS-RNNT.
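The abstract compresses the KD scheme into one sentence: a teacher RNNT that sees the target speaker's clean single-talker signal supplies soft pseudo labels for a student TS-RNNT that sees only the mixture plus enrollment speech. The following is a minimal sketch of one such distillation step, assuming PyTorch; the module names (teacher_rnnt, student_tsrnnt), the alpha interpolation weight, and the lattice shapes are illustrative assumptions, not the authors' implementation.

```python
# Sketch of one distillation training step for TS-RNNT, assuming PyTorch.
# All names and shapes are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def distillation_step(teacher_rnnt, student_tsrnnt, rnnt_loss_fn,
                      mixture, enrollment, clean_target, targets,
                      in_lens, tgt_lens, alpha=0.5):
    # Teacher consumes the parallel single-talker signal and is kept frozen;
    # its lattice posteriors serve as the pseudo labels.
    with torch.no_grad():
        t_logits = teacher_rnnt(clean_target, in_lens, targets, tgt_lens)

    # Student consumes the mixture plus the enrollment utterance.
    s_logits = student_tsrnnt(mixture, enrollment, in_lens, targets, tgt_lens)

    # Hard-label term: standard RNNT loss against the reference transcripts.
    hard = rnnt_loss_fn(s_logits, targets, in_lens, tgt_lens)

    # Soft-label term: KL divergence between teacher and student posteriors
    # at every node of the (batch, T, U, vocab) output lattice.
    soft = F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.log_softmax(t_logits, dim=-1),
                    log_target=True, reduction="batchmean")

    # alpha (a hypothetical hyperparameter) interpolates the two terms.
    return (1.0 - alpha) * hard + alpha * soft
```

Because the training mixtures are simulated by summing single-talker recordings, the clean target signal is available for free at training time, which is what makes the parallel teacher input possible without any extra data collection.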

Authors (10)
  1. Takafumi Moriya (30 papers)
  2. Hiroshi Sato (40 papers)
  3. Tsubasa Ochiai (43 papers)
  4. Marc Delcroix (94 papers)
  5. Takanori Ashihara (28 papers)
  6. Kohei Matsuura (26 papers)
  7. Tomohiro Tanaka (37 papers)
  8. Ryo Masumura (28 papers)
  9. Atsunori Ogawa (15 papers)
  10. Taichi Asami (6 papers)
Citations (4)