Hierarchical speaker representation for target speaker extraction (2210.15849v3)
Abstract: Target speaker extraction aims to isolate a specific speaker's voice from a mixture of multiple sound sources, guided by an enrollment utterance, also called an anchor. Current methods predominantly derive a speaker embedding from the anchor and feed it into the separation network to separate the target speaker's voice. However, this speaker representation is overly simplistic, often merely a 1×1024 vector, and such densely packed information is difficult for the separation network to exploit effectively. To address this limitation, we introduce a methodology called Hierarchical Representation (HR) that fuses anchor information into both the fine-grained and the global layers of the separation network, improving the precision of target extraction. HR amplifies the efficacy of the anchor for target speaker isolation. On the Libri-2talker dataset, HR substantially outperforms state-of-the-art time-frequency domain techniques. Further demonstrating HR's capabilities, we achieved first place in the ICASSP 2023 Deep Noise Suppression Challenge. The proposed HR methodology shows promise for advancing target speaker extraction through enhanced anchor utilization.
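The core idea above can be illustrated with a minimal sketch: instead of injecting the anchor-derived speaker embedding once, a hierarchical scheme conditions every layer of the separation network on it. The function names (`project`, `hierarchical_fuse`), the additive conditioning, and the fixed random projection below are illustrative assumptions, not the paper's actual architecture.

```python
import random

def project(vec, out_dim, seed):
    # Stand-in for a learned linear projection: maps the speaker
    # embedding to the width of one network layer using a fixed,
    # seed-determined random matrix (hypothetical, for illustration).
    rng = random.Random(seed)
    return [sum(rng.uniform(-1.0, 1.0) * x for x in vec)
            for _ in range(out_dim)]

def hierarchical_fuse(layer_feats, spk_emb):
    # Fuse the anchor embedding into EVERY layer's features
    # (the hierarchical-representation idea), rather than only
    # once at the network input.
    fused = []
    for i, feat in enumerate(layer_feats):
        cond = project(spk_emb, len(feat), seed=i)   # match layer width
        fused.append([f + c for f, c in zip(feat, cond)])  # additive conditioning
    return fused
```

A usage example: with a toy 8-dimensional embedding (standing in for the 1×1024 vector) and three layers of widths 4, 6, and 5, `hierarchical_fuse` returns one conditioned feature list per layer, each keeping its original width.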