
Deep Neural Networks for Multiple Speaker Detection and Localization (1711.11565v3)

Published 30 Nov 2017 in cs.SD, cs.AI, cs.MM, cs.RO, and eess.AS

Abstract: We propose to use neural networks for simultaneous detection and localization of multiple sound sources in human-robot interaction. In contrast to conventional signal processing techniques, neural network-based sound source localization methods require fewer strong assumptions about the environment. Previous neural network-based methods have been focusing on localizing a single sound source, which do not extend to multiple sources in terms of detection and localization. In this paper, we thus propose a likelihood-based encoding of the network output, which naturally allows the detection of an arbitrary number of sources. In addition, we investigate the use of sub-band cross-correlation information as features for better localization in sound mixtures, as well as three different network architectures based on different motivations. Experiments on real data recorded from a robot show that our proposed methods significantly outperform the popular spatial spectrum-based approaches.

Citations (175)

Summary

  • The paper proposes deep neural network (DNN) methodologies for multiple speaker detection and localization that relax strict environmental assumptions required by conventional methods.
  • Novel contributions include a likelihood-based encoding for DNN outputs that handles an arbitrary number of sources, and the use of GCC-PHAT features with MLP, CNN, and two-stage network architectures.
  • Experiments using a robot in controlled and realistic settings show the proposed DNN methods significantly improve accuracy and reduce localization error compared to traditional spatial spectrum techniques.

Deep Neural Networks for Multiple Speaker Detection and Localization: An Academic Overview

The paper "Deep Neural Networks for Multiple Speaker Detection and Localization" by Weipeng He, Petr Motlicek, and Jean-Marc Odobez addresses the complex problem of detecting and localizing multiple simultaneous sound sources in human-robot interaction (HRI). The work contributes neural network-based methodologies that relax the environmental assumptions typically required by conventional signal processing techniques.

Problem Definition and Motivation

Traditional sound source localization (SSL) methods rely on predefined assumptions about the noise environment and signal characteristics, which do not align well with the diverse and dynamic conditions encountered in real-world HRI scenarios. These scenarios are characterized by high noise levels from robot motors, varying numbers of simultaneous speakers, short and low-energy utterances, and complex acoustic obstacles. Previous approaches primarily focused on single-source localization, whereas this paper addresses the more challenging task of multiple sound source detection and localization.

Proposed Methodologies

In contrast to existing methods, the authors propose a novel likelihood-based encoding of the neural network output, which allows the detection and localization of an arbitrary number of sound sources without pooling across time frames or fixing the number of sources in advance.
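The idea can be sketched in code. The following is a minimal illustration of such a likelihood-based label encoding: each ground-truth direction-of-arrival contributes a Gaussian bump on a discretized azimuth grid, and overlapping sources are combined with a per-direction maximum. The grid resolution (360 directions) and the Gaussian width `sigma_deg` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def encode_doa_likelihood(azimuths_deg, n_dirs=360, sigma_deg=8.0):
    """Encode ground-truth DOAs as a Gaussian-shaped likelihood vector.

    azimuths_deg: source azimuths in degrees (any number, possibly none).
    Returns a length-n_dirs vector whose entry for each candidate direction
    is the maximum, over sources, of a Gaussian centred on that source
    (using circular angular distance).
    """
    grid = np.arange(n_dirs) * (360.0 / n_dirs)   # candidate directions
    out = np.zeros(n_dirs)
    for az in azimuths_deg:
        diff = np.abs(grid - az)
        diff = np.minimum(diff, 360.0 - diff)     # wrap-around distance
        out = np.maximum(out, np.exp(-diff**2 / (2.0 * sigma_deg**2)))
    return out
```

Because the target is a full likelihood map rather than a fixed-length list of angles, the same network output shape serves for zero, one, or many sources; detection reduces to finding peaks in the predicted map.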

  1. Neural Network Architectures:
    • MLP-GCC: Utilizes multilayer perceptrons with GCC-PHAT features.
    • CNN-GCCFB: Applies convolutional neural networks to GCC-PHAT computed on a mel-scale filter bank (GCCFB) as input.
    • TSNN-GCCFB: A two-stage network designed to leverage sub-band correlation data more effectively for SSL purposes.
  2. Input Features:
    • The adoption of generalized cross-correlation (GCC) with phase transform (PHAT) across sub-band filters provides a foundation for these architectures, enabling them to better manage overlapping speech signals by capturing TDOA (time difference of arrival) per frequency band.
  3. Output Encoding:
    • The introduction of a Gaussian-shaped likelihood output coding provides a robust solution for peak detection representing the direction-of-arrival of sound sources, enhancing the ability to manage multiple simultaneous sound signals.
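To make the input features concrete, here is a minimal broadband GCC-PHAT sketch for one microphone pair. The paper's GCCFB feature applies this per mel-scale sub-band; the single-band version below, and the `max_lag` window size, are simplifying assumptions for illustration.

```python
import numpy as np

def gcc_phat(x, y, n_fft=None, max_lag=25):
    """GCC-PHAT between two microphone signals.

    PHAT weighting whitens the cross-power spectrum so that only phase,
    i.e. time-difference-of-arrival information, remains. Returns the
    cross-correlation restricted to lags -max_lag..+max_lag samples,
    where a positive peak lag means y is delayed relative to x.
    """
    n = n_fft or (len(x) + len(y))
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    cross = Y * np.conj(X)
    cross /= np.abs(cross) + 1e-12                     # PHAT whitening
    cc = np.fft.irfft(cross, n)
    return np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
```

Computing this per frequency band (rather than over the full spectrum) is what lets the network attribute different TDOAs to different sources when their spectral content overlaps only partially.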

Experimental Validation

The research conducted extensive experimentation using recordings from a robot (Pepper) in both controlled (with loudspeakers) and realistic (with human subjects) sound environments. The results underscore the effectiveness of the proposed methods in conditions with one or more concurrent sound sources, exhibiting significant improvements in accuracy and mean absolute error compared to traditional spatial spectrum-based methods like SRP-PHAT and MUSIC.

  • Performance Metrics: The proposed approaches achieved higher precision and recall than the baseline methods, demonstrating both improved localization accuracy and reliable detection of varying numbers of simultaneous sources.
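A sketch of how such metrics can be computed from the likelihood-map output: detected directions are local maxima above a threshold, and predictions are greedily matched to ground truth within an angular tolerance. The threshold (0.5) and tolerance (5 degrees) are illustrative assumptions; the paper's exact decoding and matching rules may differ.

```python
import numpy as np

def detect_peaks(likelihood, threshold=0.5):
    """Decode a likelihood vector over azimuth into detected directions:
    local maxima (circular neighbourhood) above a fixed threshold."""
    n = len(likelihood)
    left, right = np.roll(likelihood, 1), np.roll(likelihood, -1)
    is_peak = (likelihood > threshold) & (likelihood >= left) & (likelihood >= right)
    return np.flatnonzero(is_peak) * (360.0 / n)

def precision_recall(pred_deg, true_deg, tol_deg=5.0):
    """Greedy one-to-one matching of predictions to ground truth
    within an angular tolerance (circular distance)."""
    unmatched_true = list(true_deg)
    matched = 0
    for p in pred_deg:
        for t in unmatched_true:
            d = abs(p - t) % 360.0
            if min(d, 360.0 - d) <= tol_deg:
                unmatched_true.remove(t)
                matched += 1
                break
    precision = matched / len(pred_deg) if len(pred_deg) else 1.0
    recall = matched / len(true_deg) if len(true_deg) else 1.0
    return precision, recall
```

A spurious peak lowers precision while a missed source lowers recall, so the two metrics jointly capture both detection and localization quality.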

Implications and Future Directions

From a practical perspective, the work equips robots with enhanced capabilities for multi-party interactions in challenging acoustic environments, broadening their applicability in social and service roles. Theoretically, this research paves the way for further development of generalized machine learning approaches in SSL, urging the consideration of reduced training datasets and increased generalization capabilities.

Future research could look into more sophisticated models that can generalize well with limited training samples, expand the noise environments they can handle, and explore integration with temporal context information for improved robustness in real-time applications.
