Handling Trade-Offs in Speech Separation with Sparsely-Gated Mixture of Experts (2211.06493v2)
Abstract: Employing a monaural speech separation (SS) model as a front-end for automatic speech recognition (ASR) involves balancing two kinds of trade-offs. First, while a larger model improves the SS performance, it also incurs a higher computational cost. Second, an SS model that is more heavily optimized for handling overlapped speech is likely to introduce more processing artifacts in non-overlapped-speech regions. In this paper, we address these trade-offs with a sparsely-gated mixture-of-experts (MoE) architecture. Comprehensive evaluation results obtained using both simulated and real meeting recordings show that our proposed sparsely-gated MoE SS model achieves superior separation capability with less speech distortion, while incurring only a marginal increase in run-time cost.
- Xiaofei Wang
- Zhuo Chen
- Yu Shi
- Jian Wu
- Naoyuki Kanda
- Takuya Yoshioka
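The core idea of the abstract, routing each input frame through only a few experts selected by a learned gate, can be illustrated with a minimal sketch of a sparsely-gated MoE layer. This is not the paper's actual model: the expert structure, feature dimension, number of experts, and top-k value below are illustrative assumptions, and a real implementation would dispatch only the routed frames to each expert rather than masking dense expert outputs.

```python
# Minimal sketch of a sparsely-gated mixture-of-experts (MoE) layer in PyTorch.
# All sizes (dim, num_experts, top_k, hidden) are illustrative assumptions,
# not the configuration used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    """Routes each time frame to its top-k experts via a learned gating network."""

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2, hidden: int = 256):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_experts)  # gating network producing per-expert logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim)
        logits = self.gate(x)                                  # (B, T, num_experts)
        topk_val, topk_idx = logits.topk(self.top_k, dim=-1)   # keep the k largest gate scores
        weights = F.softmax(topk_val, dim=-1)                  # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[..., slot]                          # (B, T) chosen expert index per frame
            w = weights[..., slot].unsqueeze(-1)               # (B, T, 1) gate weight for that expert
            for e, expert in enumerate(self.experts):
                mask = (idx == e).unsqueeze(-1)                # frames routed to expert e
                if mask.any():
                    # For clarity, every expert is applied densely and then masked;
                    # an efficient implementation would process only the routed frames.
                    out = out + mask * w * expert(x)
        return out


if __name__ == "__main__":
    layer = SparseMoELayer(dim=64)
    frames = torch.randn(2, 100, 64)   # (batch, frames, features)
    print(layer(frames).shape)         # torch.Size([2, 100, 64])
```

Because only `top_k` experts are active per frame, total model capacity can grow with the number of experts while the per-frame computation stays close to that of a single expert, which is the mechanism the abstract relies on to improve separation quality at only a marginal run-time cost increase.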