Speaker Distance Estimation in Enclosures from Single-Channel Audio (2403.17514v1)
Abstract: Distance estimation from audio plays a crucial role in applications such as acoustic scene analysis, sound source localization, and room modeling. Most studies adopt a classification approach, discretizing distances into distinct categories; this eases model training and yields higher accuracy, but limits the precision of the estimated source position. Motivated by this limitation, in this paper we propose a novel approach for continuous distance estimation from audio signals using a convolutional recurrent neural network with an attention module. The attention mechanism enables the model to focus on relevant temporal and spectral features, enhancing its ability to capture fine-grained distance-related information. To evaluate the effectiveness of the proposed method, we conduct extensive experiments on audio recordings in controlled environments with three levels of realism (synthetic room impulse responses, measured responses convolved with speech, and real recordings) across four datasets (our synthetic dataset, QMULTIMIT, VoiceHome-2, and STARSS23). Experimental results show that the model achieves an absolute error of 0.11 meters in a noiseless synthetic scenario and of about 1.30 meters in the hybrid scenario. In the real scenario, where unpredictable environmental factors and noise are prevalent, the algorithm yields an absolute error of approximately 0.50 meters. For reproducibility, we make the model, code, and synthetic datasets available at https://github.com/michaelneri/audio-distance-estimation.
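The abstract describes a recurrent backbone whose frame-wise outputs are summarized by an attention module before a continuous distance is regressed. The paper itself does not spell out the pooling in the abstract, so the following is a minimal NumPy sketch of one common realization — attention-weighted temporal pooling over recurrent features feeding a linear regression head — with all shapes and parameters (`T`, `D`, `w`, `v`) being illustrative assumptions, not the authors' exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy frame-wise feature sequence: T time frames, D features per frame,
# standing in for the GRU outputs of a CRNN (shapes are assumptions).
T, D = 50, 32
h = rng.standard_normal((T, D))

# Learnable attention parameters (randomly initialized for the sketch).
w = rng.standard_normal(D)

# One scalar score per frame, softmax-normalized into attention weights,
# so frames carrying more distance cues can dominate the summary.
scores = h @ w                           # shape (T,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # shape (T,), sums to 1

# Attention-pooled context vector: weighted sum over time.
context = weights @ h                    # shape (D,)

# A hypothetical linear regression head maps the context to one
# continuous distance value (in meters, after training).
v = rng.standard_normal(D)
distance = float(context @ v)
```

In a trained model, `w` and `v` would be learned jointly with the convolutional and recurrent layers, and the network would be optimized against a regression loss on ground-truth distances rather than producing class labels.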
- M. Wölfel and J. W. McDonough, Distant speech recognition, Wiley, 2009.
- “A Linear Neural Network-Based Approach to Stereophonic Acoustic Echo Cancellation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, pp. 1743–1753, 2011.
- E. Berglund and J. Sitte, “Sound source localisation through active audition,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2005, pp. 653–658.
- T. Rodemann, “A study on distance estimation in binaural sound localization,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2010.
- A. Brendel and W. Kellermann, “Distance estimation of acoustic sources using the coherent-to-diffuse power ratio based on distributed training,” in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), 2018.
- J. H. DiBiase, A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays, Ph.D. thesis, Brown University, Providence, RI, 2000.
- M. Yiwere and E. J. Rhee, “Distance Estimation and Localization of Sound Sources in Reverberant Conditions using Deep Neural Networks,” International Journal of Applied Engineering Research, 2017.
- “Joint direction and proximity classification of overlapping sound events from binaural audio,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2021.
- “Distance-Based Sound Separation,” in Interspeech, 2022.
- “Sound Source Distance Estimation in Rooms based on Statistical Properties of Binaural Signals,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 8, pp. 1727–1741, 2013.
- “Speaker distance detection using a single microphone,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 1949–1961, 2011.
- J. K. Nielsen, “Loudspeaker and listening position estimation using smart speakers,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
- “DSP-based audio processing for controlling a mobile robot using a spherical microphone array,” in IEEE 27th Convention of Electrical and Electronics Engineers in Israel, 2012.
- “2D sound source position estimation using microphone arrays and its application to a VR-based bird song analysis system,” Advanced Robotics, vol. 33, no. 7-8, pp. 403–414, 2019.
- “Position Estimation of Sound Source Using Three Optical Mach-Zehnder Acoustic Sensor Array,” Curr. Opt. Photon., vol. 1, no. 6, pp. 573–578, 2017.
- “Position estimation of binaural sound source in reverberant environments,” Egyptian Informatics Journal, vol. 18, no. 2, pp. 87–93, 2017.
- Y.C. Lu and M. Cooke, “Binaural Estimation of Sound Source Distance via the Direct-to-Reverberant Energy Ratio for Static and Moving Sources,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1793–1805, 2010.
- “On room impulse response between arbitrary points: An efficient parameterization,” in 6th International Symposium on Communications, Control and Signal Processing (ISCCSP), 2014.
- S. Vesa, “Sound Source Distance Learning Based on Binaural Signals,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2007.
- S. Vesa, “Binaural Sound Source Distance Learning in Rooms,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 8, pp. 1498–1507, 2009.
- “Supervised Learning-based Sound Source Distance Estimation Using Multivariate Features,” in IEEE Region 10 Symposium (TENSYMP), 2021.
- M. Yiwere and E. J. Rhee, “Sound source distance estimation using deep learning: An image classification approach,” Sensors, vol. 20, no. 1, p. 172, 2019.
- “Few-Shot Sound Source Distance Estimation Using Relation Networks,” arXiv:2109.10561, 2021.
- R. Venkatesan and A.B. Ganesh, “Analysis of monaural and binaural statistical properties for the estimation of distance of a target speaker,” Circuits, Systems, and Signal Processing, vol. 39, pp. 3626–3651, 2020.
- “Speaker Distance Estimation from Single Channel Audio in Reverberant Environments,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2023.
- “Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34–48, 2019.
- “Localization, Detection and Tracking of Multiple Moving Sound Sources with a Convolutional Recurrent Neural Network,” in Detection and Classification of Acoustic Scenes and Events (DCASE), 2019.
- D. Byrne, “The speech spectrum-some aspects of its significance for hearing aid selection and evaluation,” British Journal of Audiology, vol. 11, no. 2, pp. 40–46, 1977.
- “Multi-channel Environmental Sound Segmentation utilizing Sound Source Localization and Separation U-Net,” in 2021 IEEE/SICE International Symposium on System Integration (SII), 2021.
- “Drone Audition: Sound Source Localization Using On-Board Microphones,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 508–519, 2022.
- “Complex Spectral Mapping for Single- and Multi-Channel Speech Enhancement and Robust ASR,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1778–1787, 2020.
- A. Pandey and D. Wang, “Exploring deep complex networks for complex spectrogram enhancement,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6885–6889.
- S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning (ICML). PMLR, 2015.
- “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs),” in International Conference on Learning Representations (ICLR), 2016.
- “Light Gated Recurrent Units for Speech Recognition,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 92–102, 2018.
- “Binaural source localization using deep learning and head rotation information,” in 30th European Signal Processing Conference (EUSIPCO), 2022, pp. 36–40.
- “TIMIT Acoustic-Phonetic Continuous Speech Corpus,” Linguistic Data Consortium, 1993.
- A. Politis, Microphone array processing for parametric spatial audio techniques, Doctoral thesis, School of Electrical Engineering, Aalto University, 2016.
- “Sound absorption coefficient chart: JCW acoustic supplies,” https://www.acoustic-supplies.com/absorption-coefficient-chart, Accessed: 2023-06-17.
- “WHAM!: Extending Speech Separation to Noisy Environments,” in Interspeech, 2019.
- R. Stewart and M. Sandler, “Database of omnidirectional and b-format room impulse responses,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 165–168.
- “VoiceHome-2, an extended corpus for multichannel speech processing in real homes,” Speech Communication, vol. 106, pp. 68–78, 2019.
- “STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events,” in Proceedings of the 7th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), Nancy, France, November 2022.
- “STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events,” arXiv preprint arXiv:2306.09126, 2023.
- “Binaural sound source distance estimation and localization for a moving listener,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 996–1011, 2024.
- F. Jacobsen and T. Roisin, “The coherence of reverberant sound fields,” The Journal of the Acoustical Society of America, vol. 108, no. 1, pp. 204–210, 2000.
- “The importance of phase in speech enhancement,” Speech Communication, vol. 53, no. 4, pp. 465–494, 2011.
- A. Kumar and B. Raj, “Audio event and scene recognition: A unified approach using strongly and weakly labeled data,” in International Joint Conference on Neural Networks (IJCNN), 2017.
- “Sound Event Detection for Human Safety and Security in Noisy Environments,” IEEE Access, vol. 10, pp. 134230–134240, 2022.
- I. Martín-Morató and A. Mesaros, “Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 902–914, 2023.
- “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.