Weakly-supervised Audio Separation via Bi-modal Semantic Similarity (2404.01740v1)

Published 2 Apr 2024 in cs.SD, cs.AI, and eess.AS

Abstract: Conditional sound separation in multi-source audio mixtures without access to single-source sound data during training is a long-standing challenge. Existing mix-and-separate methods suffer a significant performance drop with multi-source training mixtures due to the lack of a supervision signal for single-source separation during training. However, in language-conditional audio separation, we do have access to a corresponding text description for each audio mixture in our training data, which can be seen as a (rough) representation of the audio sample in the language modality. To this end, we propose a generic bi-modal separation framework that enhances existing unsupervised frameworks to separate single-source signals in a target modality (i.e., audio) using the easily separable corresponding signals in the conditioning modality (i.e., language), without access to single-source samples in the target modality during training. We empirically show that this is well within reach given a pretrained joint embedding model between the two modalities (i.e., CLAP). Furthermore, we incorporate our framework into two fundamental scenarios to enhance separation performance. First, our proposed methodology significantly improves purely unsupervised baselines by reducing the distribution shift between training and test samples; in particular, it achieves a 71% boost in Signal-to-Distortion Ratio (SDR) over the baseline, reaching 97.5% of supervised learning performance. Second, we can further improve supervised learning itself by 17% by augmenting it with our proposed weakly-supervised framework, which enables a powerful semi-supervised framework for audio separation.
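The weak-supervision idea in the abstract can be sketched as follows: each predicted source is embedded in a joint audio-text space (as a CLAP-style model would provide) and scored against the embedding of its conditioning text, with the negative mean cosine similarity serving as a training signal. This is an illustrative sketch only, not the paper's actual loss; `bimodal_similarity_loss` and the precomputed embeddings are hypothetical placeholders for a real encoder.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def bimodal_similarity_loss(audio_embs: list[np.ndarray],
                            text_embs: list[np.ndarray]) -> float:
    """Negative mean cosine similarity between each predicted source's
    audio embedding and its paired text-prompt embedding.

    In the weakly-supervised setting, minimizing this loss pushes each
    separated source toward the semantics of its text description,
    without any single-source audio ground truth.
    """
    sims = [cosine_sim(a, t) for a, t in zip(audio_embs, text_embs)]
    return -float(np.mean(sims))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical 4-dim embeddings standing in for CLAP outputs.
    text = [rng.standard_normal(4) for _ in range(2)]
    # Perfectly aligned sources: loss reaches its minimum of -1.0.
    print(bimodal_similarity_loss(text, text))
    # Misaligned sources score strictly worse (higher loss).
    noise = [rng.standard_normal(4) for _ in range(2)]
    print(bimodal_similarity_loss(noise, text))
```

A real system would obtain `audio_embs` by passing each separator output through the frozen audio encoder and `text_embs` from the text encoder, back-propagating the loss only into the separation network.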
