Improving RNN Transducer With Target Speaker Extraction and Neural Uncertainty Estimation (2011.13393v2)

Published 26 Nov 2020 in cs.SD and eess.AS

Abstract: Target-speaker speech recognition aims to recognize a target speaker's speech in noisy environments with background noise and interfering speakers. This work presents a joint framework that combines time-domain target-speaker speech extraction with a Recurrent Neural Network Transducer (RNN-T). To stabilize joint training, we propose a multi-stage training strategy that pre-trains and fine-tunes each module in the system before joint training. In addition, speaker identity and speech enhancement uncertainty measures are proposed to compensate for residual noise and artifacts from the target speech extraction module. Compared to a recognizer fine-tuned with a target speech extraction model, our experiments show that adding the neural uncertainty module yields a 17% relative reduction in Character Error Rate (CER) on multi-speaker signals with background noise. The multi-condition experiments indicate that our method achieves a 9% relative performance gain in the noisy condition while maintaining performance in the clean condition.
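
For readers unfamiliar with the metric, the following is a minimal sketch of how a Character Error Rate and a relative reduction between two systems are commonly computed. It is not taken from the paper: the function names (`edit_distance`, `cer`, `relative_reduction`) and the example values are hypothetical, and the paper's own scoring setup may differ (for example in text normalization).

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to match an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical error rates, chosen only to illustrate the arithmetic.
baseline_cer = 0.100
improved_cer = 0.083
print(f"{relative_reduction(baseline_cer, improved_cer):.0%}")
```

Under these assumed values, a drop from 10.0% to 8.3% CER is a 1.7 point absolute reduction, which corresponds to a 17% relative reduction.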

Authors (6)
  1. Jiatong Shi (82 papers)
  2. Chunlei Zhang (40 papers)
  3. Chao Weng (61 papers)
  4. Shinji Watanabe (416 papers)
  5. Meng Yu (64 papers)
  6. Dong Yu (328 papers)
Citations (10)