- The paper introduces deep clustering, which uses discriminative spectrogram embeddings to improve acoustic source separation.
- It employs a BLSTM network that minimizes intra-cluster distances and maximizes inter-cluster differences, achieving around 6 dB improvement in two-speaker mixtures.
- The method generalizes effectively to three-speaker and varied gender mixtures, suggesting broad applicability in audio and other segmentation tasks.
Deep Clustering: Discriminative Embeddings for Segmentation and Separation
The paper "Deep Clustering: Discriminative Embeddings for Segmentation and Separation" presents an innovative approach to acoustic source separation. The authors introduce a method called "deep clustering," which uses a deep network to generate spectrogram embeddings that are discriminative for the partition labels provided in the training data. Unlike previous deep learning methods that directly estimate signals or masking functions, this approach focuses on embedding generation, making signal separation more flexible, more general, and class-independent.
Core Contributions
The core innovation resides in learning discriminative embeddings that facilitate clustering. Traditional spectral clustering methods offer flexibility in the classes and number of segments but are computationally intensive. This research combines the adaptability of spectral clustering with the learning efficiency of deep networks by training the embeddings, via an objective function, to approximate an ideal pairwise affinity matrix. The result is compactly clustered embeddings that can be separated with simple clustering methods such as k-means: the segmentation inherently encoded in the embeddings is "decoded" by clustering them.
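The affinity-matching objective can be sketched on a toy example. The paper's loss penalizes the Frobenius-norm difference between the embedding affinity matrix VV^T and the ideal affinity YY^T built from partition labels; the array names and shapes below are illustrative, not the authors' code.

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """Objective ||V V^T - Y Y^T||_F^2.

    V: (N, K) embeddings, one row per time-frequency bin.
    Y: (N, C) one-hot partition labels for C sources.
    Computed in the expanded low-rank form
    ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2,
    which avoids forming the N x N affinity matrices.
    """
    return (np.linalg.norm(V.T @ V) ** 2
            - 2 * np.linalg.norm(V.T @ Y) ** 2
            + np.linalg.norm(Y.T @ Y) ** 2)

# Toy check: embeddings that already match the partition give zero loss.
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
V = Y.copy()  # rows identical within a cluster, orthogonal across clusters
print(deep_clustering_loss(V, Y))  # 0.0 up to floating-point error
```

Here the expanded form is algebraically identical to the direct Frobenius norm, but only ever builds small K x K and K x C Gram matrices.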
Experimental Highlights
The experimental setup leverages speech mixtures based on the Wall Street Journal (WSJ0) corpus, creating a dataset that is more challenging than existing ones due to its inclusion of two-speaker and three-speaker mixtures. The authors trained a deep neural network containing two bi-directional long short-term memory (BLSTM) layers followed by a feedforward layer. The network's objective is to minimize the distance between embeddings within the same partition while maximizing the distance between embeddings from different partitions.
Key Results
- Speaker Separation: The proposed method achieved around 6 dB signal quality improvement for separating two-speaker mixed speech. It also demonstrated the capability to generalize to three-speaker mixtures despite training solely on two-speaker mixtures.
- Embedding Dimensions: The performance was robust across various embedding dimensions, with a dimension of 40 showing optimal results.
- Clustering Methods: Various clustering methods were evaluated, including k-means and spectral clustering. For evaluation, an oracle permutation of the outputs was used to resolve the source-assignment ambiguity within segments, showcasing the method's potential in practical applications.
- Gender-Based Mixtures: The method handled both same-gender and different-gender mixtures, performing better on different-gender mixtures, since speakers of the same gender have more similar spectral characteristics and are intrinsically harder to separate.
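The decoding step behind these results can be sketched as follows: embeddings for the time-frequency bins are clustered with k-means, and the cluster assignments become binary masks, one per source. The tiny k-means and the toy embeddings below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: returns a cluster label for each row of X."""
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        # Recompute centers; keep the old center if a cluster empties.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

# Illustrative embeddings for 6 time-frequency bins (K = 2 here for
# readability; the paper reports good results around K = 40).
V = np.array([[0.9, 0.1], [1.0, 0.0], [0.8, 0.2],
              [0.1, 0.9], [0.0, 1.0], [0.2, 0.8]])
labels = kmeans(V, k=2)
masks = np.eye(2)[labels]  # (bins, sources) binary masks
```

Each column of `masks` would then be applied to the mixture spectrogram to reconstruct one source.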
Implications and Future Directions
The implications of this research are significant both in theory and practice:
- Theoretical Contributions: The low-rank approximation of the ideal affinity matrix through learned embeddings offers a novel perspective on integrating deep learning with traditional clustering methods. This approach not only optimizes computational efficiency but also extends the generalization capability of the model beyond predefined classes.
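The efficiency claim can be made concrete with the standard Frobenius-norm expansion; with N time-frequency bins, K-dimensional embeddings V, and C-class labels Y, the cost never requires the N x N affinity matrices:

```latex
% Deep clustering objective and its low-rank expansion
C(V) = \left\| VV^{\top} - YY^{\top} \right\|_F^2
     = \left\| V^{\top}V \right\|_F^2
       - 2\left\| V^{\top}Y \right\|_F^2
       + \left\| Y^{\top}Y \right\|_F^2
```

The right-hand side involves only K x K, K x C, and C x C Gram matrices, so the objective and its gradient scale linearly in N rather than quadratically.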
- Practical Applications: The practical applications extend beyond speech separation to include image segmentation and other domains where the number of sources or objects is not fixed. This versatility can significantly impact fields such as auditory scene analysis, biomedical signal processing, and multi-object tracking.
- Scalability: The method's ability to generalize to scenarios with an unknown number of sources is particularly promising. This scalability makes it suitable for real-world applications where the number and types of sources can vary significantly.
Future research directions may include optimizing the integration of the clustering step within the network, which could lead to further improvements through joint training. Additionally, exploring alternative network architectures, such as deep convolutional neural networks or hierarchical recursive embedding networks, could enhance the learning and application breadth of deep clustering. Extending the training to include a broader range of audio types and testing the method’s applicability to image segmentation could further validate and expand its use cases.
Conclusion
The paper presents a compelling approach to acoustic source separation via discriminative embeddings for deep clustering. It demonstrates the method's effectiveness through substantial experimental results, highlighting its robustness and generalization capabilities. The integration of clustering into deep learning frameworks offers both efficiency and flexibility, paving the way for extensive research and practical applications in various domains.