
Speaker-independent Speech Separation with Deep Attractor Network (1707.03634v3)

Published 12 Jul 2017 in cs.SD and cs.LG

Abstract: Despite the recent success of deep learning for many speech processing tasks, single-microphone, speaker-independent speech separation remains challenging for two main reasons. The first reason is the arbitrary order of the target and masker speakers in the mixture (permutation problem), and the second is the unknown number of speakers in the mixture (output dimension problem). We propose a novel deep learning framework for speech separation that addresses both of these issues. We use a neural network to project the time-frequency representation of the mixture signal into a high-dimensional embedding space. A reference point (attractor) is created in the embedding space to represent each speaker, which is defined as the centroid of the speaker in the embedding space. The time-frequency embeddings of each speaker are then forced to cluster around the corresponding attractor point, which is used to determine the time-frequency assignment of the speaker. We propose three methods for finding the attractors for each source in the embedding space and compare their advantages and limitations. The objective function for the network is standard signal reconstruction error, which enables end-to-end operation during both training and test phases. We evaluated our system using the Wall Street Journal dataset (WSJ0) on two and three speaker mixtures, and report comparable or better performance than other state-of-the-art deep learning methods for speech separation.

Citations (241)

Summary

  • The paper presents DANet, which maps time-frequency embeddings to a high-dimensional attractor space to address the permutation problem in speaker-independent separation.
  • The methodology employs dynamic, fixed, and anchored attractor estimation techniques to optimize clustering of speaker-specific embeddings without additional post-processing.
  • Results on WSJ0 demonstrate that DANet achieves comparable or superior SDR improvements over state-of-the-art methods, paving the way for robust audio separation.

Speaker-independent Speech Separation with Deep Attractor Network

The paper presents a novel deep learning framework known as the Deep Attractor Network (DANet) for performing speaker-independent speech separation using single-microphone audio recordings. This framework addresses two main challenges faced in the separation process: the permutation problem, which arises from the arbitrary order of target and masker speakers in a mixture, and the output dimension problem, which results from the unknown number of speakers in a mixture.

Overview

To address these challenges, the authors propose mapping the time-frequency (T-F) representation of mixture signals into a high-dimensional embedding space. Within this space, each speaker is represented by an "attractor," defined as the centroid of that speaker's embeddings. The network forces the T-F embeddings belonging to each speaker to cluster around the corresponding attractor point. This clustering determines the assignment of T-F bins to specific speakers and removes the need for a separate post-clustering step at inference, as required by Deep Clustering (DPCL), while also sidestepping the label-permutation ambiguity that Permutation Invariant Training (PIT) resolves by searching over output permutations.
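The core training-time computation can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the embeddings `V` would come from a trained network, the assignment matrix `Y` is the ideal (oracle) speaker membership available only during training, and all names and shapes here are assumptions for the sketch.

```python
import numpy as np

def danet_masks(V, Y):
    """Toy sketch of the DANet attractor and mask computation.

    V: (TF, K) embedding of each time-frequency bin (from a network).
    Y: (TF, C) one-hot ideal speaker assignments (training-time oracle).
    Returns the attractors (C, K) and soft separation masks (TF, C).
    """
    # Attractor for each speaker: centroid of that speaker's embeddings.
    A = (Y.T @ V) / (Y.sum(axis=0, keepdims=True).T + 1e-8)   # (C, K)
    # Similarity of every T-F embedding to every attractor.
    logits = V @ A.T                                           # (TF, C)
    # Softmax over speakers turns similarities into separation masks.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    masks = e / e.sum(axis=1, keepdims=True)
    return A, masks
```

In the actual system the masks are applied to the mixture spectrogram and the network is trained end-to-end on the reconstruction error of the masked signals.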

Methodology

The paper introduces several strategies for forming attractor points. During the training phase, attractors are dynamically generated using true speaker assignments, allowing the network to cluster embeddings towards the calculated centroids. For testing, three methods for attractor estimation are evaluated:

  1. Clustering-based Estimation: Utilizing unsupervised clustering methods, such as K-means, to estimate attractor positions based on the embeddings.
  2. Fixed Point Estimation: This approach pre-calculates attractors during training and assumes fixed locations for attractors during testing.
  3. Anchored Attractor Network (ADANet): Introduces the use of reference points, or anchors, in the embedding space, which are trained to allow simultaneous optimization of speaker assignment and mask generation processes. This technique resolves training-testing mismatch, simplifies post-processing, and supports flexible handling of mixtures with different numbers of speakers.
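The clustering-based estimation in method 1 can be sketched as a plain k-means loop over the embeddings; everything below (function name, the deterministic initialization, parameter names) is illustrative rather than taken from the paper.

```python
import numpy as np

def estimate_attractors_kmeans(V, num_speakers, iters=20):
    """Estimate attractors at test time by k-means over T-F embeddings
    (the true speaker assignments are unavailable at test time).

    V: (TF, K) array of embeddings.
    Returns: (num_speakers, K) estimated attractor points.
    """
    # Simple deterministic initialization from the first few embeddings;
    # a real system would use a better scheme (e.g. k-means++).
    A = V[:num_speakers].copy()
    for _ in range(iters):
        # Assign each T-F bin to its nearest attractor.
        d = ((V[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)  # (TF, C)
        labels = d.argmin(axis=1)
        # Move each attractor to the centroid of its assigned bins.
        for c in range(num_speakers):
            if np.any(labels == c):
                A[c] = V[labels == c].mean(axis=0)
    return A
```

Note that this requires knowing (or guessing) the number of speakers, which is exactly the limitation the anchored variant is designed to relax.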

Results

The framework was evaluated using the Wall Street Journal (WSJ0) dataset for both two- and three-speaker mixtures. The results show that DANet achieves comparable or superior performance compared to state-of-the-art methods such as DPCL and uPIT-BLSTM. Specifically, DANet and its variants exhibited notable performance improvements in terms of Signal-to-Distortion Ratio (SDR) and other audio quality metrics.
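For orientation, a minimal single-channel SDR computation can be sketched as follows. This is a simplification for illustration only: the paper's reported numbers use the BSS-Eval style SDR, which additionally decomposes the error into interference and artifact terms, and the function name here is an assumption.

```python
import numpy as np

def sdr_db(reference, estimate):
    """Simple signal-to-distortion ratio in dB.

    Not the full BSS-Eval SDR: here everything that differs from the
    reference counts as distortion, with no interference/artifact split.
    """
    distortion = reference - estimate
    return 10 * np.log10(
        np.sum(reference ** 2) / (np.sum(distortion ** 2) + 1e-12)
    )
```

An "SDR improvement" is then the SDR of the separated estimate minus the SDR of the unprocessed mixture with respect to the same reference source.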

The Anchored DANet exhibited improved robustness and achieved significant gains in scenarios involving variable speaker counts without prior knowledge of the number of speakers. This aspect is particularly beneficial for applications where the number of concurrent speakers is dynamic, as often encountered in real-world settings.

Implications and Future Directions

The introduction of DANet represents a significant step towards more robust and flexible speaker separation systems. The use of dynamic attractors in a high-dimensional embedding space enables speaker separation without cumbersome post-processing, setting a strong foundation for future work on source separation.

Looking ahead, applying DANet's principles to broader audio source separation problems, including music and environmental sound separation, offers promising avenues for growth. Additionally, integrating speaker information into the attractor estimation process could further enhance speaker-specific separation. As source separation evolves, the adaptability and efficiency of frameworks like DANet are likely to drive both theoretical advances and practical applications, improving audio quality in complex acoustic environments.