
Interaural time difference loss for binaural target sound extraction (2408.00344v1)

Published 1 Aug 2024 in cs.SD and eess.AS

Abstract: Binaural target sound extraction (TSE) aims to extract a desired sound from a binaural mixture of arbitrary sounds while preserving the spatial cues of the desired sound. Indeed, for many applications, the target sound signal and its spatial cues carry important information about the sound source. Binaural TSE can be realized with a neural network trained to output only the desired sound given a binaural mixture and an embedding characterizing the desired sound class as inputs. Conventional TSE systems are trained using signal-level losses, which measure the difference between the extracted and reference signals for the left and right channels. In this paper, we propose adding explicit spatial losses to better preserve the spatial cues of the target sound. In particular, we explore losses aiming at preserving the interaural level (ILD), phase (IPD), and time differences (ITD). We show experimentally that adding such spatial losses, particularly our newly proposed ITD loss, helps preserve better spatial cues while maintaining the signal-level metrics.


Summary

  • The paper introduces a novel ITD loss, based on cross-correlation coefficients, that effectively preserves critical spatial cues.
  • It employs a multi-task loss combining signal-level and spatial losses while maintaining robust SI-SNR and SNR performance.
  • Experimental results show a 26.2 μs reduction in ITD errors, demonstrating enhanced spatial fidelity across diverse acoustic conditions.

Interaural Time Difference Loss for Binaural Target Sound Extraction

The paper investigates binaural target sound extraction (TSE) with a focus on preserving spatial cues through explicit spatial losses. Its key contribution is a novel interaural time difference (ITD) loss that preserves spatial information more effectively than existing signal-level training objectives.

Background

Traditional TSE systems use neural networks to isolate a sound of interest from a mixture. Conventional systems are trained with signal-level losses that compare the extracted and reference signals, an approach that tends to overlook spatial cues such as the interaural level difference (ILD) and the interaural time difference (ITD). These cues are critical for applications like hearing aids and audio post-production, where accurate spatial localization of sound sources matters. The paper hypothesizes that explicitly incorporating spatial losses can improve the spatial fidelity of extracted sounds in a binaural context.

Proposed Methodology

The authors propose a multi-task loss combining signal-level losses with spatial losses to enhance the binaural TSE system. The key innovations include:

  1. Interaural Level Difference (ILD) Loss: penalizes the mismatch between the ILDs of the extracted and reference signals, preserving level-based spatial cues.
  2. Interaural Phase Difference (IPD) Loss: penalizes IPD mismatches while accounting for the circularity of phase measurements (both losses are sketched after this list).
  3. Novel Interaural Time Difference (ITD) Loss: penalizes errors between the cross-correlation coefficients of the estimated and reference signals, facilitating better ITD retention.
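
To make the first two losses concrete, here is a minimal PyTorch sketch of STFT-domain ILD and IPD losses. The STFT parameters and the cosine/sine comparison for the IPD are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative PyTorch sketch of STFT-domain ILD and IPD losses.
# STFT parameters and error terms are assumptions for illustration.
import torch

def stft(x, n_fft=512, hop=128):
    """Complex STFT of a mono signal x of shape (num_samples,)."""
    return torch.stft(x, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)

def ild_loss(est_l, est_r, ref_l, ref_r, eps=1e-8):
    """Mean absolute error between the ILDs (in dB) of estimate and reference."""
    ild_est = 20 * torch.log10((stft(est_l).abs() + eps) / (stft(est_r).abs() + eps))
    ild_ref = 20 * torch.log10((stft(ref_l).abs() + eps) / (stft(ref_r).abs() + eps))
    return (ild_est - ild_ref).abs().mean()

def ipd_loss(est_l, est_r, ref_l, ref_r):
    """IPD error compared on the unit circle to handle phase circularity."""
    ipd_est = torch.angle(stft(est_l)) - torch.angle(stft(est_r))
    ipd_ref = torch.angle(stft(ref_l)) - torch.angle(stft(ref_r))
    # Comparing cosine and sine of the IPD avoids 2*pi wrapping artifacts.
    return ((torch.cos(ipd_est) - torch.cos(ipd_ref)).abs().mean() +
            (torch.sin(ipd_est) - torch.sin(ipd_ref)).abs().mean())
```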

The ITD loss is particularly notable: estimating the ITD itself would require a non-differentiable argmax over time lags, whereas matching the cross-correlation coefficients of the estimated and reference signals keeps the loss differentiable and therefore suitable for neural network training.
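
A minimal sketch of such a correlation-based ITD loss might look like the following; the lag range and normalization are assumptions made for illustration.

```python
# Illustrative sketch of a cross-correlation-based ITD loss: match
# correlation coefficients rather than taking a non-differentiable argmax.
import torch
import torch.nn.functional as F

def xcorr_coeffs(left, right, max_lag):
    """Normalized interaural cross-correlation for lags in [-max_lag, max_lag]."""
    left = left - left.mean()
    right = right - right.mean()
    # conv1d computes correlation; the padding exposes lags of both signs.
    c = F.conv1d(left.view(1, 1, -1), right.view(1, 1, -1), padding=max_lag)
    return c.flatten() / (left.norm() * right.norm() + 1e-8)

def itd_loss(est_l, est_r, ref_l, ref_r, fs=16000, max_itd=1e-3):
    """L1 error between the cross-correlation coefficients of estimate and
    reference; matching the correlation shape steers the estimated ITD
    toward the reference ITD while staying differentiable."""
    max_lag = int(fs * max_itd)  # human ITDs lie within roughly +/- 1 ms
    return (xcorr_coeffs(est_l, est_r, max_lag) -
            xcorr_coeffs(ref_l, ref_r, max_lag)).abs().mean()
```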

Experimental Setup

Experiments were conducted on simulated reverberant mixtures spanning various sound classes, including environmental sounds and urban noise. Binaural recordings were produced by convolving the sound signals with head-related transfer functions (HRTFs) and room impulse responses (RIRs). The TSE model is an encoder-decoder architecture conditioned on a target sound class embedding.
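
As a rough illustration of this data generation step (with random arrays standing in for real sound clips and measured binaural impulse responses, and no claim to match the paper's exact pipeline), a binaural mixture can be built as follows:

```python
# Toy illustration of binaural mixture creation: convolve each mono source
# with a two-channel (binaural) impulse response and sum the results.
# Random arrays are placeholders for real clips and measured BRIRs/HRTFs.
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
fs = 16000
target_mono = rng.standard_normal(fs)            # 1 s placeholder target clip
noise_mono = rng.standard_normal(fs)             # 1 s placeholder interference
brir_a = rng.standard_normal((4096, 2)) * 0.01   # placeholder BRIR, position A
brir_b = rng.standard_normal((4096, 2)) * 0.01   # placeholder BRIR, position B

def spatialize(mono, brir):
    """Convolve a mono source (n,) with a binaural IR (m, 2) -> (n, 2)."""
    return np.stack([fftconvolve(mono, brir[:, ch])[: len(mono)]
                     for ch in range(2)], axis=-1)

target_binaural = spatialize(target_mono, brir_a)            # training reference
mixture = target_binaural + spatialize(noise_mono, brir_b)   # network input
```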

Results

The evaluation covered both signal-level metrics (SI-SNR and SNR) and spatial metrics (ΔILD, ΔIPD, ΔITD); a simplified sketch of the ΔITD computation follows the list below. The results demonstrated:

  1. Improvement in ILD and ITD Errors: Systems incorporating the ILD and ITD losses showed reductions in ILD and ITD errors without degrading signal-level metrics.
  2. Superiority of the ITD Loss: The ITD loss led to the most substantial reduction in ITD errors (26.2 μs), indicating an enhanced ability to preserve spatial cues across different sound environments.
  3. Minimal Impact on Signal-Level Performance: Despite the added spatial losses, the models maintained SI-SNR, SNR, and failure rates comparable to the baseline, confirming the robustness of the multi-task learning approach.
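
For intuition on the ΔITD metric, a simplified proxy estimates each signal's ITD as the peak lag of the interaural cross-correlation and reports the absolute difference; the paper's evaluation may use a more elaborate estimator.

```python
# Simplified proxy for a Delta-ITD metric: estimate each signal's ITD as the
# argmax of the interaural cross-correlation, then compare estimate vs.
# reference. Assumes equal-length left/right channels.
import numpy as np

def estimate_itd(left, right, fs, max_itd=1e-3):
    """ITD in seconds from the peak of the interaural cross-correlation."""
    max_lag = int(fs * max_itd)
    corr = np.correlate(left, right, mode="full")
    mid = len(left) - 1                    # index of the zero-lag coefficient
    lags = np.arange(-max_lag, max_lag + 1)
    return lags[np.argmax(corr[mid - max_lag: mid + max_lag + 1])] / fs

def delta_itd_us(est, ref, fs):
    """Absolute ITD error between binaural estimate and reference, in microseconds."""
    return abs(estimate_itd(est[:, 0], est[:, 1], fs) -
               estimate_itd(ref[:, 0], ref[:, 1], fs)) * 1e6
```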

Implications and Future Work

The findings highlight the importance of integrating spatial losses into binaural TSE systems to preserve spatial cues. The proposed ITD loss in particular offers a promising direction for binaural processing in applications such as binaural speech enhancement and separation. Future work could extend these techniques to other audio processing domains and further refine the ITD loss, for example by incorporating perception-based filters.

This research underscores the potential for spatially-aware loss functions to elevate the fidelity of neural network-based sound extraction systems, thereby contributing to improved user experiences in real-world applications where spatial localization of sound sources is crucial.
