CATSE: A Context-Aware Framework for Causal Target Sound Extraction (2403.14246v1)

Published 21 Mar 2024 in eess.AS and cs.AI

Abstract: Target Sound Extraction (TSE) focuses on the problem of separating sources of interest, indicated by a user's cue, from the input mixture. Most existing solutions operate in an offline fashion and are not suited to the low-latency causal processing constraints imposed by applications in live-streamed content such as augmented hearing. We introduce a family of context-aware low-latency causal TSE models suitable for real-time processing. First, we explore the utility of context by providing the TSE model with oracle information about what sound classes make up the input mixture, where the objective of the model is to extract one or more sources of interest indicated by the user. Since the practical applications of oracle models are limited due to their assumptions, we introduce a composite multi-task training objective involving separation and classification losses. Our evaluation involving single- and multi-source extraction shows the benefit of using context information in the model either by means of providing full context or via the proposed multi-task training loss without the need for full context information. Specifically, we show that our proposed model outperforms size- and latency-matched Waveformer, a state-of-the-art model for real-time TSE.
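The composite multi-task objective described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch example, not the authors' implementation: it assumes a negative scale-invariant SNR term for the separation loss and a multi-label binary cross-entropy term over the sound classes present in the mixture, and the names si_snr_loss, composite_tse_loss, and the weighting factor alpha are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between estimated and reference waveforms.
    Shapes: (batch, samples). Assumed separation loss; the paper only states
    that a separation loss is used."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference signal.
    proj = (est * ref).sum(dim=-1, keepdim=True) * ref / (
        ref.pow(2).sum(dim=-1, keepdim=True) + eps
    )
    noise = est - proj
    si_snr = 10 * torch.log10(
        (proj.pow(2).sum(dim=-1) + eps) / (noise.pow(2).sum(dim=-1) + eps)
    )
    return -si_snr.mean()

def composite_tse_loss(est_wave, ref_wave, class_logits, class_targets, alpha=0.1):
    """Separation loss plus a multi-label classification loss over the sound
    classes active in the input mixture. `alpha` is a hypothetical weight."""
    sep = si_snr_loss(est_wave, ref_wave)
    cls = F.binary_cross_entropy_with_logits(class_logits, class_targets)
    return sep + alpha * cls
```

In such a setup, class_targets would be a multi-hot vector marking which classes are present in the mixture, so the classification branch pushes the model to learn mixture context even when oracle class information is unavailable at inference time.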

References (24)
  1. T. Ochiai, M. Delcroix, Y. Koizumi, H. Ito, K. Kinoshita, and S. Araki, “Listen to what you want: Neural network-based universal sound selector,” in Interspeech, 2020.
  2. M. Delcroix, J. B. Vázquez, T. Ochiai, K. Kinoshita, Y. Ohishi, and S. Araki, “SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning,” IEEE/ACM TASLP, vol. 31, pp. 121–136, 2022.
  3. S. Baligar and S. Newsam, “CoSSD: An end-to-end framework for multi-instance source separation and detection,” in EUSIPCO, 2022, pp. 150–154.
  4. K. Kilgour, B. Gfeller, Q. Huang, A. Jansen, S. Wisdom, and M. Tagliasacchi, “Text-driven separation of arbitrary sounds,” arXiv preprint arXiv:2204.05738, 2022.
  5. R. Gao and K. Grauman, “Co-separating sounds of visual objects,” in ICCV, 2019, pp. 3879–3888.
  6. O. Slizovskaia, G. Haro, and E. Gómez, “Conditioned source separation for musical instrument performances,” IEEE/ACM TASLP, vol. 29, pp. 2083–2095, 2021.
  7. X. Yang and D. W. Grantham, “Echo suppression and discrimination suppression aspects of the precedence effect,” Perception & psychophysics, vol. 59, pp. 1108–1117, 1997.
  8. B. Veluri, J. Chan, M. Itani, T. Chen, T. Yoshioka, and S. Gollakota, “Real-time target sound extraction,” in ICASSP, 2023, pp. 1–5.
  9. C. Lea, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks: A unified approach to action segmentation,” in ECCV Workshops, 2016, pp. 47–54.
  10. B. Gfeller, D. Roblek, and M. Tagliasacchi, “One-shot conditional audio filtering of arbitrary sounds,” in ICASSP, 2021, pp. 501–505.
  11. Q. Kong, Y. Wang, X. Song, Y. Cao, W. Wang, and M. D. Plumbley, “Source separation with weakly labelled data: An approach to computational auditory scene analysis,” in ICASSP, 2020, pp. 101–105.
  12. I. Kavalerov, S. Wisdom, H. Erdogan, B. Patton, K. Wilson, J. Le Roux, and J. R. Hershey, “Universal sound separation,” in WASPAA, 2019, pp. 175–179.
  13. C. Zheng, H. Zhang, W. Liu, X. Luo, A. Li, X. Li, and B. C. Moore, “Sixty years of frequency-domain monaural speech enhancement: From traditional to deep learning methods,” Trends in Hearing, vol. 27, p. 23312165231209913, 2023.
  14. I. Fedorov, M. Stamenovic, C. Jensen, L.-C. Yang, A. Mandell, Y. Gan, M. Mattina, and P. N. Whatmough, “TinyLSTMs: Efficient neural speech enhancement for hearing aids,” arXiv preprint arXiv:2005.11138, 2020.
  15. Y. Liu and D. Wang, “Causal Deep CASA for monaural talker-independent speaker separation,” IEEE/ACM TASLP, vol. 28, pp. 2109–2118, 2020.
  16. Y. W. C. Li and Y. Qian, “Predictive SkiM: Contrastive predictive coding for low-latency online speech separation,” in ICASSP, 2023, pp. 1–5.
  17. W. Liu, A. Li, C. Zheng, and X. Li, “A separation and interaction framework for causal multi-channel speech enhancement,” Digit. Signal Process., vol. 126, p. 103519, 2022.
  18. Z.-Q. Wang, G. Wichern, S. Watanabe, and J. Le Roux, “STFT-domain neural speech enhancement with very low algorithmic latency,” IEEE/ACM TASLP, vol. 31, pp. 397–410, 2022.
  19. Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM TASLP, vol. 27, no. 8, pp. 1256–1266, 2019.
  20. J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – half-baked or well done?” in ICASSP, 2019, pp. 626–630.
  21. E. Fonseca, M. Plakal, F. Font, D. P. Ellis, X. Favory, J. Pons, and X. Serra, “General-purpose tagging of freesound audio with audioset labels: Task description, dataset, and baseline,” in DCASE, 2018, p. 69.
  22. A. Mesaros, T. Heittola, and T. Virtanen, “A multi-device dataset for urban acoustic scene classification,” in DCASE, 2018, p. 9.
  23. J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in ICASSP, 2017, pp. 776–780.
  24. J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello, “Scaper: A library for soundscape synthesis and augmentation,” in WASPAA, 2017, pp. 344–348.
