Real-time Low-latency Music Source Separation using Hybrid Spectrogram-TasNet (2402.17701v1)
Abstract: There have been significant advances in deep learning for music demixing in recent years. However, little attention has been given to how these neural networks can be adapted for real-time low-latency applications, which could be helpful for hearing aids, remixing audio streams, and live shows. In this paper, we investigate the various challenges involved in adapting current demixing models in the literature for this use case. Subsequently, inspired by the Hybrid Demucs architecture, we propose the Hybrid Spectrogram Time-domain Audio Separation Network (HS-TasNet), which utilises the advantages of both the spectral and waveform domains. For a latency of 23 ms, HS-TasNet obtains an overall signal-to-distortion ratio (SDR) of 4.65 on the MusDB test set, rising to 5.55 with additional training data. These results demonstrate the potential of efficient demixing for real-time low-latency music applications.
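At 44.1 kHz, the quoted 23 ms latency corresponds to a frame of roughly 1024 samples (1024 / 44100 ≈ 23.2 ms); whether the paper uses exactly that frame and hop configuration is an assumption here. The sketch below illustrates the hybrid idea the abstract describes, a spectrogram-masking branch and a TasNet-style learned-filterbank waveform branch run in parallel with their source estimates summed. It is a minimal toy model, not the authors' HS-TasNet: the class name `HybridSeparatorSketch` and all layer widths, frame sizes, and the simple linear/convolutional mask networks are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a hybrid spectral + waveform separator (assumed structure,
# not the paper's exact HS-TasNet). All sizes below are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridSeparatorSketch(nn.Module):
    def __init__(self, sources=4, channels=2, n_fft=1024, hop=512, feat=256):
        super().__init__()
        self.sources, self.channels = sources, channels
        self.n_fft, self.hop = n_fft, hop
        freq = n_fft // 2 + 1
        self.register_buffer("window", torch.hann_window(n_fft))
        # Spectral branch: per-frame sigmoid masks over the magnitude spectrogram.
        self.spec_net = nn.Sequential(
            nn.Linear(channels * freq, 512), nn.ReLU(),
            nn.Linear(512, sources * channels * freq), nn.Sigmoid(),
        )
        # Waveform branch: learned encoder, mask estimator, transposed-conv decoder.
        self.encoder = nn.Conv1d(channels, feat, kernel_size=hop, stride=hop)
        self.mask_net = nn.Sequential(
            nn.Conv1d(feat, sources * feat, kernel_size=1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(feat, channels, kernel_size=hop, stride=hop)

    def forward(self, x):  # x: (batch, channels, time)
        b, c, t = x.shape
        # Spectral branch: mask the complex STFT with per-source real-valued masks.
        spec = torch.stft(x.reshape(b * c, t), self.n_fft, hop_length=self.hop,
                          window=self.window, return_complex=True)
        f, n = spec.shape[-2:]
        spec = spec.reshape(b, c, f, n)
        mask = self.spec_net(spec.abs().permute(0, 3, 1, 2).reshape(b, n, c * f))
        mask = mask.reshape(b, n, self.sources, c, f).permute(0, 2, 3, 4, 1)
        masked = spec.unsqueeze(1) * mask  # (b, sources, c, f, n), complex
        spec_out = torch.istft(masked.reshape(-1, f, n), self.n_fft,
                               hop_length=self.hop, window=self.window, length=t)
        spec_out = spec_out.reshape(b, self.sources, c, t)
        # Waveform branch: mask the learned latent, decode one waveform per source.
        z = torch.relu(self.encoder(x))  # (b, feat, frames)
        m = self.mask_net(z).reshape(b, self.sources, -1, z.shape[-1])
        wave = self.decoder((z.unsqueeze(1) * m).reshape(-1, z.shape[1], z.shape[-1]))
        wave = F.pad(wave.reshape(b, self.sources, c, -1), (0, t - wave.shape[-1]))
        # Sum the two branches' estimates, as in hybrid spectrogram/waveform models.
        return spec_out + wave  # (batch, sources, channels, time)


if __name__ == "__main__":
    model = HybridSeparatorSketch()
    mix = torch.randn(1, 2, 44100)  # one second of stereo audio at 44.1 kHz
    print(model(mix).shape)         # torch.Size([1, 4, 2, 44100])
```

For genuinely low-latency streaming, the frame-by-frame recurrent processing and look-ahead budget matter as much as the architecture; this offline sketch only shows how the two domains can be combined in one model.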
- “Real time speech enhancement in the waveform domain,” arXiv preprint arXiv:2006.12847, 2020.
- Yi Luo and Nima Mesgarani, “TasNet: time-domain audio separation network for real-time, single-channel speech separation,” in Proc. IEEE ICASSP, 2018, pp. 696–700.
- Yi Luo and Nima Mesgarani, “Real-time single-channel dereverberation and separation with time-domain audio separation network,” in Interspeech, 2018, pp. 342–346.
- “Sound event detection and separation: a benchmark on DESED synthetic soundscapes,” in Proc. IEEE ICASSP, 2021, pp. 840–844.
- “Musical source separation: An introduction,” IEEE Signal Process. Mag., vol. 36, no. 1, pp. 31–40, 2018.
- “Nonnegative tensor factorization for source separation of loops in audio,” in Proc. IEEE ICASSP, 2018, pp. 171–175.
- “An audio-visual system for object-based audio: from recording to listening,” IEEE Trans. Multimedia, vol. 20, no. 8, pp. 1919–1931, 2018.
- “Spleeter: a fast and efficient music source separation tool with pre-trained models,” J. Open Source Softw., vol. 5, no. 50, p. 2154, 2020.
- “Music source separation in the waveform domain,” arXiv preprint arXiv:1911.13254, 2019.
- “Real-time sound source separation: Azimuth discrimination and resynthesis,” in 117th Convention of the Audio Engineering Society, 2004.
- “Design and evaluation of a real-time audio source separation algorithm to remix music for cochlear implant users,” Front. Neurosci., vol. 14, p. 434, 2020.
- Yi Luo and Nima Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 8, pp. 1256–1266, 2019.
- “STFT-domain neural speech enhancement with very low algorithmic latency,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 397–410, 2022.
- “Singing voice separation with deep U-Net convolutional networks,” in Proc. ISMIR, 2017, pp. 23–27.
- “Wave-U-Net: A multi-scale neural network for end-to-end audio source separation,” arXiv preprint arXiv:1806.03185, 2018.
- “All for one and one for all: Improving music separation by bridging networks,” in Proc. IEEE ICASSP, 2021, pp. 51–55.
- “Decoupling magnitude and phase estimation with deep ResUNet for music source separation,” arXiv preprint arXiv:2109.05418, 2021.
- Yi Luo and Jianwei Yu, “Music source separation with band-split RNN,” arXiv preprint arXiv:2209.15174, 2022.
- “KUIELab-MDX-Net: A two-stream neural network for music demixing,” arXiv preprint arXiv:2111.12203, 2021.
- Alexandre Défossez, “Hybrid spectrogram and waveform source separation,” arXiv preprint arXiv:2111.03600, 2021.
- Dan Barry, “Real-time sound source separation for music applications,” Doctoral thesis, Technological University Dublin, 2019.
- “A frugal approach to music source separation,” hal-02986241, 2020.
- “Hybrid transformers for music source separation,” in Proc. IEEE ICASSP, 2023, pp. 1–5.
- “Open-Unmix - a reference implementation for music source separation,” J. Open Source Softw., vol. 4, no. 41, p. 1667, 2019.
- “Meta-learning extractors for music source separation,” in Proc. IEEE ICASSP, 2020, pp. 816–820.
- “DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement,” arXiv preprint arXiv:2008.00264, 2020.
- “ICASSP 2023 deep speech enhancement challenge,” arXiv preprint arXiv:2303.11510, 2023.
- “Norbert: multichannel Wiener filtering,” Sept. 2019.
- “Filterbank design for end-to-end speech separation,” in Proc. IEEE ICASSP, 2020, pp. 6364–6368.
- “Speaker recognition from raw waveform with SincNet,” in Proc. IEEE SLT, 2018, pp. 1021–1028.
- “MUSDB18-HQ - an uncompressed version of MUSDB18,” Aug. 2019.
- “The whole is greater than the sum of its parts: Improving DNN-based music source separation,” arXiv preprint arXiv:2305.07855, 2023.
- “webMUSHRA - a comprehensive framework for web-based listening tests,” J. Open Res. Softw., vol. 6, no. 1, p. 8, 2018.
- “BSS eval or PEASS? predicting the perception of singing-voice separation,” in Proc. IEEE ICASSP, 2018, pp. 596–600.
- Brian CJ Moore, An introduction to the psychology of hearing, Brill, 2012.
- “Perceptual loss function for neural modeling of audio systems,” in Proc. IEEE ICASSP, 2020, pp. 251–255.
- “Demystifying TasNet: A dissecting approach,” in Proc. IEEE ICASSP, 2020, pp. 6359–6363.
- Kai Li and Yi Luo, “On the design and training strategies for rnn-based online neural speech separation systems,” in Proc. IEEE ICASSP, 2023, pp. 1–5.
- “MaD TwinNet: Masker-denoiser architecture with twin networks for monaural sound source separation,” in Proc. IEEE IJCNN, 2018, pp. 1–8.