Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation
The paper "Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation" by Dong Yu et al. addresses the challenging problem of separating mixed speech signals from multiple speakers, also referred to as the cocktail-party problem. The authors propose a novel training criterion, Permutation Invariant Training (PIT), which circumvents the label permutation problem, and demonstrate its superiority over previous methods.
Background and Related Work
The task of speech separation has historically been approached through various means such as Computational Auditory Scene Analysis (CASA) and Non-negative Matrix Factorization (NMF). CASA operates based on low-level features to estimate time-frequency masks which isolate components belonging to different speakers. NMF, on the other hand, estimates mixing factors using learned non-negative bases. However, these traditional methods have seen limited success.
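The time-frequency masking idea underlying these methods, and the ideal ratio mask (IRM) used as an oracle later in the paper, can be illustrated with a minimal NumPy sketch. This is not code from the paper; the function names and the simple magnitude-based mask definition are illustrative assumptions.

```python
import numpy as np

def ideal_ratio_mask(source_mag, mixture_mag, eps=1e-8):
    """Oracle time-frequency mask: per T-F bin, the fraction of the
    mixture magnitude attributed to one source. Requires the clean
    source, so it is only usable as an upper-bound reference."""
    return source_mag / (mixture_mag + eps)

def apply_mask(mask, mixture_mag):
    """Isolate one speaker by scaling each mixture bin by its mask."""
    return mask * mixture_mag

# Toy magnitude spectrogram bins for two speakers.
s1 = np.array([1.0, 2.0])
s2 = np.array([3.0, 0.5])
mix = s1 + s2
m1 = ideal_ratio_mask(s1, mix)
recovered = apply_mask(m1, mix)  # close to s1
```

Estimating such masks from the mixture alone, rather than from oracle sources, is exactly what the deep models discussed next attempt.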
Recent advancements in single-talker Automatic Speech Recognition (ASR) have amplified interest in using deep learning for speech separation. Approaches such as Deep Clustering (DPCL) and multi-class regression models have made substantial progress but still face challenges, particularly with speaker-independent scenarios due to the label permutation problem. The DPCL method maps time-frequency bins to an embedding space, where clustering is used to generate partitions. While effective, it imposes assumptions that are sub-optimal and complicates integration with other techniques.
Permutation Invariant Training (PIT)
PIT addresses the label permutation problem directly by framing speech separation as minimization of the separation error. During training, all possible assignments between the reference source streams and the network outputs are evaluated, and the assignment with the lowest total Mean Squared Error (MSE) is chosen. This allows the network to align its outputs consistently with the source streams even under varying conditions.
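The criterion can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: it enumerates all permutations (fine for two or three speakers, factorial in general) and uses plain MSE over magnitude spectra.

```python
import itertools
import numpy as np

def pit_mse_loss(estimates, references):
    """Permutation invariant MSE.

    estimates, references: arrays of shape (num_speakers, time, freq).
    Evaluates every assignment of network outputs to reference sources
    and returns (lowest total MSE, best permutation).
    """
    num_spk = estimates.shape[0]
    best_loss, best_perm = None, None
    for perm in itertools.permutations(range(num_spk)):
        # Reorder the outputs according to this candidate assignment.
        loss = np.mean((estimates[list(perm)] - references) ** 2)
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

In training, gradients are taken through the loss of the winning permutation only, so the network is never penalized for producing the sources in a different output order than the labels happen to have.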
Experimentation shows that PIT significantly reduces the MSE on both the training and validation datasets compared to conventional methods. The framework's flexibility also lets it integrate easily with other advanced techniques, making it a promising building block toward solving the cocktail-party problem.
Experimental Results
The authors evaluated PIT on the WSJ0-2mix and Danish-2mix datasets, focusing primarily on the WSJ0-2mix to allow for direct comparison with prior works. Various configurations of feed-forward Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) were tested.
The Signal-to-Distortion Ratio (SDR) improvements demonstrated that PIT outperforms CASA, NMF, and DPCL in both closed conditions (seen speakers) and open conditions (unseen speakers). Notably, PIT with a CNN architecture achieved an SDR improvement of 10.9 dB under optimal assignment, approaching the 12.5 dB achieved by the oracle ideal ratio mask (IRM).
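For readers unfamiliar with the metric, a simplified SDR can be computed as the ratio of reference-signal energy to residual-error energy, in dB. Note this is a hedged sketch: the BSS-Eval SDR reported in the paper further decomposes the error into interference and artifact components, which this version omits.

```python
import numpy as np

def sdr_db(reference, estimate):
    """Simplified signal-to-distortion ratio in dB: clean reference
    energy over the energy of the estimation error. Higher is better."""
    error = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(error ** 2) + 1e-12))
```

A perfect estimate yields a very large SDR; an estimate scaled to 90% of the reference, for example, yields exactly 20 dB, since the error energy is 1% of the reference energy.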
Furthermore, experiments indicated that PIT trained on Danish-2mix generalized well to the English WSJ0 dataset, highlighting its cross-linguistic applicability. These results suggest that PIT learns acoustic cues that are largely invariant to speaker and language differences.
Practical and Theoretical Implications
The use of PIT in multi-talker speech separation holds significant promise for applications in automatic meeting transcription, closed-captioning for recordings, and multi-party human-machine interactions. By effectively addressing the label permutation problem, PIT can enhance ASR systems' robustness in real-world scenarios where overlapping speech is common.
Theoretically, PIT introduces a more direct approach to minimizing separation error, challenging previous paradigms that framed the problem as multi-class regression or clustering. This paradigm shift could inspire further research into training criteria that align more closely with the problem's inherent structure.
Future Directions
Potential future research directions include:
- Speaker Tracing Algorithms: Enhancing PIT with sophisticated speaker tracing algorithms could close the gap between optimal and default assignments, especially in cases with frequent speaker assignment changes.
- Model Enhancements: Exploring more advanced deep learning architectures, such as bidirectional LSTMs or CNNs with deconvolution layers, could yield further performance gains.
- Complex-Domain Separation: Integrating PIT with techniques that exploit complex-domain separation could improve the reconstruction of source streams.
- Universal Models: Developing universal models trained on diverse datasets encompassing various speakers, languages, and noise conditions could establish robust, generalized speech separation systems.
- Multi-channel Integration: Extending PIT to multi-channel setups and combining it with beamforming could leverage spatial information, enhancing separation performance.
In conclusion, Permutation Invariant Training presents a compelling approach to the cocktail-party problem, combining effectiveness, simplicity, and flexibility. Its ability to generalize across speakers and languages marks a significant step forward in the field of speech separation.