
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking (1810.04826v6)

Published 11 Oct 2018 in eess.AS, cs.LG, eess.SP, and stat.ML

Abstract: In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.

Citations (354)

Summary

  • The paper introduces a speaker-conditioned system that isolates a target voice from noisy multi-speaker audio using neural embeddings.
  • It leverages a speaker encoder and spectrogram masking network to reduce noisy WER from 55.9% to 23.4% while preserving single-speaker audio quality.
  • The results demonstrate that targeted voice separation enhances ASR performance, paving the way for robust speech recognition in challenging environments.

VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

The paper "VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking" introduces a system designed to isolate the voice of a specific speaker from multi-speaker audio signals. The approach leverages a reference sample from the target speaker, employing two distinct neural networks: a speaker recognition network for generating speaker-discriminative embeddings, and a spectrogram masking network that outputs a mask to separate the target voice from background interference. The proposed system demonstrates a substantial reduction in Word Error Rates (WER) for Automatic Speech Recognition (ASR) systems in noisy conditions, while maintaining minimal degradation in clean, single-speaker scenarios.

Methodology

The VoiceFilter system comprises two primary components:

  1. Speaker Encoder: This component generates speaker embeddings (d-vectors) using a 3-layer LSTM network, trained with a generalized end-to-end loss. It processes audio inputs to output fixed-dimension embeddings, which serve as the basis for speaker discrimination.
  2. Spectrogram Masking Network: The core of the VoiceFilter system, this component takes the noisy magnitude spectrogram and the d-vector as input and predicts a soft mask over the time-frequency representation; applying the mask to the noisy spectrogram extracts the target speaker's signal. The architecture consists of convolutional layers followed by a uni- or bi-directional LSTM and fully connected layers that output the mask; a simplified sketch of both components appears after this list.
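
A minimal PyTorch sketch of how these two components could fit together is shown below. The layer widths, kernel sizes, and spectrogram dimensions (n_mels, n_freq, emb_dim, hidden) are illustrative assumptions rather than the paper's exact configuration; in practice the speaker encoder is pre-trained with the generalized end-to-end loss and kept fixed while the masking network is trained.

```python
# Illustrative sketch of the VoiceFilter pipeline; sizes are assumptions, not the paper's exact values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeakerEncoder(nn.Module):
    """3-layer LSTM that maps a reference utterance's log-mel features
    to a fixed-dimensional, L2-normalized d-vector."""

    def __init__(self, n_mels=40, hidden=256, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mels):                           # mels: (batch, frames, n_mels)
        frames, _ = self.lstm(mels)
        emb = self.proj(frames[:, -1])                 # last frame summarizes the utterance
        return F.normalize(emb, dim=-1)                # unit-norm d-vector


class VoiceFilter(nn.Module):
    """CNN + LSTM + fully connected layers that predict a soft mask for the
    target speaker, conditioned on the d-vector."""

    def __init__(self, n_freq=601, emb_dim=256, hidden=400):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(64, 8, kernel_size=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(8 * n_freq + emb_dim, hidden, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq), nn.Sigmoid(),   # soft mask in [0, 1]
        )

    def forward(self, noisy_mag, dvec):
        # noisy_mag: (batch, frames, n_freq) magnitude spectrogram of the mixture
        b, t, _ = noisy_mag.shape
        x = self.conv(noisy_mag.unsqueeze(1))          # (batch, 8, frames, n_freq)
        x = x.permute(0, 2, 1, 3).reshape(b, t, -1)    # flatten channels per frame
        dvec = dvec.unsqueeze(1).expand(-1, t, -1)     # repeat the d-vector over time
        x, _ = self.lstm(torch.cat([x, dvec], dim=-1))
        mask = self.fc(x)                              # (batch, frames, n_freq)
        return mask * noisy_mag                        # enhanced (masked) spectrogram
```

During training, the masked spectrogram would be compared against the clean target spectrogram (e.g., with an L2-type loss), and the enhanced waveform would be reconstructed using the phase of the noisy input.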

Results

The VoiceFilter system's effectiveness is evaluated using two primary metrics: word error rate (WER) and source-to-distortion ratio (SDR). The results indicate:

  • For audio derived from the LibriSpeech dataset, the application of VoiceFilter reduced the noisy WER from 55.9% to 23.4%, while the clean WER remained largely unaffected.
  • Measured by SDR, the system outperforms comparable approaches based on permutation invariant loss, indicating more effective isolation of the target speaker's audio (a sketch of how SDR can be computed follows this list).
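
As a rough illustration of how an SDR figure could be computed for one example, the snippet below compares an enhanced waveform against its clean reference using the BSS Eval implementation in mir_eval; the file names are placeholders, and the paper's exact evaluation protocol may differ from this single-pair setup.

```python
# Hypothetical single-example SDR computation; paths and variable names are placeholders.
import numpy as np
import mir_eval

clean_target = np.load("clean_target.npy")       # ground-truth target speech waveform
enhanced = np.load("voicefilter_output.npy")     # waveform reconstructed from the masked spectrogram

# BSS Eval treats each row as one source; here there is a single target source.
sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
    clean_target[np.newaxis, :], enhanced[np.newaxis, :]
)
print(f"SDR: {sdr[0]:.2f} dB")
```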

Implications and Future Directions

The implications of VoiceFilter are significant for ASR in complex audio environments, where multiple speakers and background noise degrade transcription accuracy. The system not only improves WER in challenging conditions but does so without prior knowledge of the number of speakers, addressing a critical limitation of classical source separation techniques.

Future research could explore several avenues to enhance the VoiceFilter model:

  • Dataset Expansion: Incorporating larger and more diverse datasets, such as VoxCeleb 1 and 2, could refine the model’s robustness across varied acoustic settings.
  • Multi-Speaker and Noise Separation: Extending the system to handle more interfering speakers and simultaneous noise reduction could broaden its applicability.
  • Joint System Training: Training the VoiceFilter alongside ASR systems may yield further improvements in WER, fostering a more integrated approach to speech separation and recognition.

In summary, the methodology and results of the VoiceFilter system present a notable advancement in the domain of targeted voice separation, with practical applications in enhancing ASR performance amidst multi-speaker interference.
