- The paper presents an end-to-end trainable neural model that integrates time-dependent DOA embeddings within an RNN framework for extracting target speech.
- It achieves an algorithmic latency as low as 2 ms while maintaining high speech clarity in dynamic, noisy environments.
- The model outperforms traditional baselines by leveraging hierarchical DOA embeddings, ensuring robust performance even with DOA estimation errors.
All Neural Low-latency Directional Speech Extraction
The research paper by Ashutosh Pandey et al. introduces a novel all-neural model for low-latency directional speech extraction. The model leverages direction of arrival (DOA) embeddings derived from a predefined spatial grid, which are transformed and integrated into a recurrent neural network (RNN) based speech extraction framework. This design enables effective extraction of speech from a specified DOA and improves adaptability in rapidly changing acoustic environments.
Introduction
The clarity of speech signals is critical in modern applications, ranging from human-computer interaction to automatic speech recognition. Multichannel speech enhancement, which exploits multiple microphones to capture both spatial and spectro-temporal information, offers a promising avenue for improving sound quality by reducing background noise and reverberation. However, separating and enhancing the target speech in the presence of strong interference remains a challenging problem.
Traditional approaches to directional speech extraction (DSE) typically rely on hand-crafted spatial features or hybrid methods that combine neural networks with spatial filters. Related prior work includes permutation invariant training (PIT) for speaker separation and several methods for target speech extraction driven by auxiliary cues. The proposed model diverges from these techniques by employing end-to-end trainable DOA embeddings coupled with an RNN framework.
Proposed Method
Problem Formulation
The paper considers a microphone array with multiple channels in a reverberant space with background noise. The observed multichannel signal can be decomposed into an anechoic mixture of multiple speakers, room reverberation, and additional noise. The task is to extract the target speaker's signal from this mixture using the DOA information of the direct signal from the target speaker.
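As a rough formalization (the notation below is our own, not copied from the paper), the signal observed at microphone m can be written as:

```latex
% Hedged sketch of the signal model; symbol names are assumptions.
% y_m : signal observed at microphone m        n_m : additive noise
% x_s : anechoic signal of speaker s           *   : convolution
% h_{m,s} : room impulse response from speaker s to microphone m
y_m(t) = \sum_{s=1}^{S} \big(h_{m,s} * x_s\big)(t) + n_m(t)
```

Under this formulation, the task is to estimate the target speaker's direct-path component at a reference microphone, conditioned on the target's azimuth/elevation DOA.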
Model Architecture
The proposed Directional Recurrent Network (DRN) adopts frame-wise processing: the input waveform is split into short overlapping frames that undergo spatial and temporal processing by the RNN. The key innovation lies in incorporating time-dependent DOA embeddings at multiple stages within the network:
- Channel-wise DOA embeddings: These are projected from one-hot encoded azimuth and elevation vectors using separate linear layers for each channel. They are fused within the spatial processing blocks.
- Frame-wise DOA embeddings: These are similarly projected but utilize a single linear layer, with the embeddings fused after each LSTM unit.
This hierarchical fusion of DOA embeddings enhances the model's ability to adapt to dynamic real-world scenarios with moving sound sources or receivers. Additionally, operating in the time domain rather than the frequency domain significantly reduces algorithmic latency, making the model suitable for real-time applications.
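To make the two fusion stages concrete, below is a minimal PyTorch sketch. The grid resolution, layer sizes, fusion operators (concatenation for the channel-wise stage, addition for the frame-wise stage), and all names are illustrative assumptions rather than the authors' exact architecture.

```python
# Minimal sketch of channel-wise and frame-wise DOA-embedding fusion.
# Grid sizes, dimensions, and fusion choices are assumptions, not the
# paper's exact configuration.
import torch
import torch.nn as nn

N_AZ, N_EL = 36, 10      # assumed DOA grid: 36 azimuth x 10 elevation bins
EMB, HIDDEN = 64, 256    # assumed embedding and LSTM sizes
N_CH, FRAME = 4, 32      # e.g. 4 mics, 32-sample frames (2 ms at 16 kHz)

class DOAEmbedding(nn.Module):
    """Projects one-hot azimuth/elevation vectors to a dense embedding."""
    def __init__(self, dim):
        super().__init__()
        self.az = nn.Linear(N_AZ, dim, bias=False)
        self.el = nn.Linear(N_EL, dim, bias=False)

    def forward(self, az_onehot, el_onehot):
        # (batch, time, grid) -> (batch, time, dim)
        return self.az(az_onehot) + self.el(el_onehot)

class DRNSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Channel-wise: a separate embedding per microphone channel,
        # fused (here by concatenation) inside the spatial stage.
        self.ch_emb = nn.ModuleList(DOAEmbedding(EMB) for _ in range(N_CH))
        self.spatial = nn.Linear(N_CH * (FRAME + EMB), HIDDEN)
        # Frame-wise: a single shared embedding, fused after the LSTM.
        self.fr_emb = DOAEmbedding(HIDDEN)
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, FRAME)

    def forward(self, frames, az, el):
        # frames: (batch, time, N_CH, FRAME); az/el: one-hot DOA per frame
        fused = [torch.cat([frames[:, :, c], self.ch_emb[c](az, el)], dim=-1)
                 for c in range(N_CH)]
        h = torch.relu(self.spatial(torch.cat(fused, dim=-1)))
        h, _ = self.lstm(h)
        h = h + self.fr_emb(az, el)      # frame-wise fusion after the LSTM
        return self.out(h)               # estimated target-speech frames

# Smoke test with random frames and a fixed DOA (grid index 0).
frames = torch.randn(2, 50, N_CH, FRAME)
az = torch.eye(N_AZ)[torch.zeros(2, 50, dtype=torch.long)]
el = torch.eye(N_EL)[torch.zeros(2, 50, dtype=torch.long)]
print(DRNSketch()(frames, az, el).shape)  # torch.Size([2, 50, 32])
```

Because the DOA one-hot vectors are supplied per frame, a moving target simply activates different grid cells over time, which is what keeps the embeddings time-dependent.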
Experimental Results
The authors conducted extensive evaluations on a dataset generated from the Interspeech 2020 DNS Challenge corpus. Several configurations were tested, including different embedding sizes, azimuth-only versus azimuth-plus-elevation information, and comparisons against baseline models.
Key Findings
- Performance Metrics: The DRN model scored strongly on objective measures including short-time objective intelligibility (STOI), wide-band perceptual evaluation of speech quality (WB-PESQ), signal-to-noise ratio (SNR), and scale-invariant signal-to-distortion ratio (SI-SDR; see the sketch after this list). For instance, azimuth-plus-elevation embeddings consistently outperformed azimuth-only embeddings, and scores improved with larger hidden states.
- Low-latency Achievements: The model achieved an algorithmic latency as low as 2 ms, significantly lower than traditional methods; at a 16 kHz sampling rate, 2 ms corresponds to a frame of only 32 samples. This low latency is crucial for applications requiring real-time processing.
- Robustness to DOA Mismatch: The model exhibited robustness to DOA estimation errors, maintaining effective performance even with deviations up to ±10 degrees from the ground truth.
- Comparisons with Baselines: The DRN model outperformed strong baseline methods, including an oracle Multichannel Wiener Filter (MCWF) and other neural network-based approaches, both in time-domain and frequency-domain settings.
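For reference, SI-SDR, one of the metrics above, has a compact closed form. Below is a minimal NumPy sketch of the standard definition; it is generic evaluation code, not the authors' implementation.

```python
# Standard scale-invariant SDR (SI-SDR) in dB; generic evaluation code,
# not taken from the paper.
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    # Optimal scaling of the reference (projection of estimate onto it).
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference            # scaled target component
    noise = estimate - target             # residual distortion
    return 10.0 * np.log10(np.sum(target**2) / np.sum(noise**2))

# Example: a reference signal corrupted by 10% additive noise.
rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)                  # 1 s at 16 kHz
est = ref + 0.1 * rng.standard_normal(16000)
print(f"SI-SDR: {si_sdr(est, ref):.1f} dB")       # roughly 20 dB
```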
Implications and Future Work
The implications of this research are far-reaching for both theoretical and practical applications. The end-to-end trainable all-neural approach offers flexibility and adaptability, crucial for evolving technologies such as smart glasses and augmented reality. The demonstrated low-latency and robustness make it suitable for real-world deployment in dynamic and noisy environments.
Future work could explore further enhancements, particularly in training with data containing both source and receiver movements, to cater to even more dynamic real-world scenarios. Additionally, optimizing the model for lower compute requirements without sacrificing performance could make it more accessible for on-device applications with limited computational resources.
Conclusion
The paper by Pandey et al. represents a significant advancement in the field of low-latency directional speech extraction. By integrating DOA embeddings within a recurrent neural network, the model effectively isolates target speech amidst interference, demonstrating robust performance metrics and adaptability. This research opens up new pathways for enhancing speech clarity in real-time applications and sets a foundation for future developments in low-latency speech enhancement techniques.