DENSE: Dynamic Embedding Causal Target Speech Extraction (2409.06136v2)
Abstract: Target speech extraction (TSE) aims to extract the speech of a specific target speaker from a mixture of signals. Existing TSE models typically rely on a static speaker embedding as the condition for extracting the target speaker's voice. However, static embeddings often fail to capture the contextual information of the extracted speech signal, which can limit the model's performance. We propose a novel dynamic embedding causal target speech extraction model to address this limitation. Our approach incorporates an autoregressive mechanism that generates context-dependent embeddings from the speech extracted so far, enabling real-time, frame-level extraction. Experimental results demonstrate that the proposed model improves short-time objective intelligibility (STOI) and signal-to-distortion ratio (SDR), offering a promising solution for target speech extraction in challenging scenarios.
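The abstract's core idea, conditioning frame-level extraction on an embedding that is updated autoregressively from the model's own past output, can be sketched as follows. This is a minimal illustration under assumed design choices, not the paper's actual architecture: the module names (`SpeakerEncoder` via a GRU, a mask-based `extractor`, a `GRUCell` update) and all dimensions are assumptions made for exposition.

```python
import torch
import torch.nn as nn

class DynamicEmbeddingTSE(nn.Module):
    """Minimal sketch of causal TSE with an autoregressively updated
    (dynamic) conditioning embedding. Module choices and sizes are
    illustrative assumptions, not the paper's design."""

    def __init__(self, feat_dim=256, emb_dim=128):
        super().__init__()
        # Static enrollment encoder: pools an enrollment utterance into
        # a single speaker embedding (the usual static TSE condition).
        self.enroll_encoder = nn.GRU(feat_dim, emb_dim, batch_first=True)
        # Causal frame extractor: maps (mixture frame, embedding) -> mask.
        self.extractor = nn.Sequential(
            nn.Linear(feat_dim + emb_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.Sigmoid(),
        )
        # Autoregressive update: refreshes the embedding from the frame
        # just extracted, so the condition tracks the speech context.
        self.update = nn.GRUCell(feat_dim, emb_dim)

    def forward(self, mixture, enrollment):
        # mixture:    (B, T, feat_dim) frame features of the mixture
        # enrollment: (B, T_e, feat_dim) frame features of enrollment audio
        _, h = self.enroll_encoder(enrollment)
        emb = h[-1]                          # (B, emb_dim) static initial embedding
        outputs = []
        for t in range(mixture.size(1)):     # strictly causal, frame by frame
            frame = mixture[:, t]            # (B, feat_dim)
            mask = self.extractor(torch.cat([frame, emb], dim=-1))
            extracted = mask * frame         # masked estimate of the target frame
            outputs.append(extracted)
            # Feed the extracted frame back to update the embedding.
            emb = self.update(extracted, emb)
        return torch.stack(outputs, dim=1)   # (B, T, feat_dim)
```

The feedback loop is what makes the embedding "dynamic": the conditioning signal evolves with the extracted speech instead of staying fixed at its enrollment value, at the cost of a sequential per-frame dependency during inference.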