- The paper presents a deep attractor network that forms centroids in high-dimensional embedding space to overcome permutation and variable source challenges.
- It utilizes a multi-layer Bi-LSTM architecture trained to minimize the L2 reconstruction error, achieving a 5.49 dB SDR improvement on the Wall Street Journal dataset.
- The framework enables real-time application potential, offering scalable and robust AI-driven speech separation in complex audio environments.
Deep Attractor Network for Single-Microphone Speaker Separation: An Academic Analysis
This paper introduces the deep attractor network, a framework designed to address the challenges of single-microphone speaker separation with deep learning. The authors identify two predominant issues in current methodologies: the arbitrary source permutation problem and the variable number of sources problem. The paper proposes an embedding-based approach that forms and leverages attractor points in a high-dimensional space to improve speaker separation, notably surpassing previous benchmarks.
Core Contributions
The proposed network creates attractor points in the embedding space that anchor the time-frequency bins of each acoustic source in a mixture. This attraction mechanism supports an end-to-end model that handles a variable number of sources without relying on a predefined source ordering. Key aspects of this model include:
- Attractor Point Formation: The network identifies centroids in the embedding space, serving as anchors for each source. This mechanism allows for flexibility and scalability in handling multiple sources without the permutation problem, a significant limitation in traditional models.
- End-to-End Training: By minimizing reconstruction error directly, the framework improves computational efficiency and separation performance, in contrast to staged training processes.
- Dynamic Adaptability: Because the number of attractors is not fixed in advance, the network can separate mixtures containing different numbers and combinations of sources.
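The attractor-formation step described above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the paper's code: the array shapes (`T`, `F`, embedding dimension `K`, source count `C`) and the use of a one-hot oracle membership matrix during training are assumptions for the sketch.

```python
import numpy as np

# Hypothetical shapes: T time frames, F frequency bins,
# K embedding dimensions, C sources in the mixture.
T, F, K, C = 100, 129, 20, 2
rng = np.random.default_rng(0)

# V: embeddings produced by the network for each T-F bin, flattened to (T*F, K).
V = rng.standard_normal((T * F, K))

# Y: oracle source-membership indicator used during training (one-hot per bin,
# marking which source dominates that time-frequency bin), shape (T*F, C).
dominant = rng.integers(0, C, size=T * F)
Y = np.eye(C)[dominant]

# Each attractor is the weighted centroid of the embeddings belonging to its
# source: a_c = (sum over bins of y_c * v) / (sum of y_c).
A = (Y.T @ V) / Y.sum(axis=0, keepdims=True).T   # shape (C, K)
```

Since the attractor is just a per-source mean of embeddings, no explicit source ordering is required, which is how the permutation problem is avoided.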
Methodological Framework
The deep attractor network achieves this functionality via a multi-layer Bi-directional LSTM architecture with approximately 600 hidden units per layer. The system maps each time-frequency bin to an embedding, derives soft masks from the embeddings, and applies those masks to the mixture spectrogram, optimizing source recovery by minimizing the L2 reconstruction error between the estimated and clean source signals.
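The mask-and-loss stage can be sketched as follows. Here the mask is a softmax over each embedding's similarity to every attractor, and the objective is the squared error between the masked mixture magnitude and the clean source magnitudes; all arrays and shapes are illustrative placeholders, not values from the paper.

```python
import numpy as np

# Illustrative shapes: TF flattened time-frequency bins, K embedding dims,
# C sources. V = embeddings, A = attractors, X = mixture magnitude,
# S = clean source magnitudes.
rng = np.random.default_rng(1)
TF, K, C = 64, 20, 2
V = rng.standard_normal((TF, K))
A = rng.standard_normal((C, K))
X = np.abs(rng.standard_normal(TF))
S = np.abs(rng.standard_normal((TF, C)))

# Soft mask: softmax over similarity (dot product) of each embedding
# to every attractor; each bin's mask values sum to one across sources.
logits = V @ A.T                               # (TF, C)
logits -= logits.max(axis=1, keepdims=True)    # numerical stability
M = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# L2 reconstruction error between the masked mixture and the clean
# spectrograms -- the quantity the network is trained to minimize.
loss = np.sum((X[:, None] * M - S) ** 2)
```

Because the loss is computed on the reconstructed spectrograms rather than on intermediate embedding targets, the whole pipeline can be trained end to end.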
At test time, two strategies are available for locating attractors: K-means clustering of the embeddings, or fixed attractor points computed in advance. Notably, the fixed-attractor approach shows potential for real-time implementation, a critical consideration for practical applications.
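The K-means test-time strategy can be sketched with a few Lloyd iterations over the embeddings. This assumes the number of sources `C` is known at test time, and uses synthetic well-separated embeddings purely for illustration.

```python
import numpy as np

# Synthetic embeddings forming two well-separated clusters (stand-ins for
# the network's T-F embeddings at test time); C assumed known.
rng = np.random.default_rng(2)
TF, K, C = 200, 20, 2
V = np.concatenate([rng.normal(-2.0, 1.0, (TF // 2, K)),
                    rng.normal(+2.0, 1.0, (TF // 2, K))])

# Initialize one attractor per cluster, then run a few Lloyd iterations:
# assign each bin to its nearest attractor, then recompute each attractor
# as the mean of its assigned embeddings.
A = np.stack([V[0], V[-1]])
for _ in range(10):
    labels = np.argmin(((V[:, None, :] - A[None, :, :]) ** 2).sum(-1), axis=1)
    A = np.stack([V[labels == c].mean(axis=0) for c in range(C)])
```

The clustering step adapts the attractors to the specific mixture, at the cost of an iterative procedure; the fixed-attractor alternative trades some accuracy for constant-time inference.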
Comparative Analysis and Results
Quantitative assessments, conducted on the Wall Street Journal dataset, indicate a 5.49 dB improvement in SDR over previous methodologies. Comparisons against deep clustering (DC) and permutation invariant training (PIT) models underscored the attractor network's proficiency in managing permutation ambiguities and varying source counts.
Empirical Insights:
- The network outperformed DC in a two-speaker separation task.
- A notable gain was observed in the three-speaker setting, exposing the limitations of the binary masks predominantly used in DC models when sources differ significantly in magnitude.
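The SDR figures above are measured in decibels. As a simplified illustration of the metric (the full BSS Eval definition additionally decomposes the error into interference and artifact terms, which this sketch omits), the signal-to-distortion ratio can be computed as:

```python
import numpy as np

def sdr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Simplified SDR in dB: ratio of reference energy to error energy.
    (BSS Eval's SDR further separates interference/artifact components.)"""
    noise = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

# A less-distorted estimate yields a higher SDR.
clean = np.sin(np.linspace(0.0, 8.0 * np.pi, 1000))
small_error = sdr_db(clean, clean + 0.01)
large_error = sdr_db(clean, clean + 0.1)
```

Under this definition, an SDR gain of a few dB corresponds to a substantial reduction in residual error energy, since the scale is logarithmic.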
Implications and Future Directions
This work provides a path forward for AI-driven speech separation technologies. The demonstrated benefits and efficiencies of the deep attractor network imply robust applicability in real-time and scalable sound processing systems, promising enhancements in tasks such as speech recognition in dynamic environments.
Looking ahead, adapting to more complex audio mixtures may call for hierarchical clustering approaches and further testing under varied conditions to validate attractor stability and system robustness. Investigating alternative embedding-space metrics or attention-like mechanisms could further improve adaptability and performance.
This paper represents a meaningful step in advancing single-microphone speaker separation technology, establishing a framework that other researchers in the field should consider for future explorations and practical implementations.