Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation (2306.10240v1)
Abstract: This paper describes an efficient unsupervised learning method for a neural source separation model that utilizes a probabilistic generative model of observed multichannel mixtures proposed for blind source separation (BSS). For this purpose, amortized variational inference (AVI) is used to directly solve the inverse problem of BSS with full-rank spatial covariance analysis (FCA). Although this unsupervised technique, called neural FCA, is in principle free from the domain-mismatch problem, it is computationally demanding because its spatial model is full-rank, a property it needs for robustness against relatively short reverberation. To reduce the model complexity without sacrificing performance, we propose neural FastFCA, which is based on a jointly diagonalizable yet full-rank spatial model. Our neural separation model introduced for AVI alternates between neural network blocks and single steps of an efficient iterative algorithm called iterative source steering (ISS). This alternating architecture enables the separation model to quickly separate the mixture spectrogram by leveraging both the deep neural network and the multichannel optimization algorithm. The AVI training objective is derived so as to maximize the marginal likelihood of the observed mixtures. An experiment with mixtures of two to four sound sources shows that neural FastFCA outperforms conventional BSS methods while reducing the computation time to about 2% of that of neural FCA.
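To make the alternating architecture concrete, below is a minimal NumPy sketch of a single ISS sweep, the multichannel optimization step that the network blocks are interleaved with. It is written in the standard determined-demixing form of ISS (rank-1 updates applied directly to the source estimates), not the paper's exact FastFCA formulation, which applies the analogous update to the joint diagonalizer of the spatial covariances. The shapes, the variable names, and the variance floor `eps` are illustrative assumptions; in neural FastFCA the time-frequency variances `R` would come from the neural source model, whose AVI objective is the usual evidence lower bound on the marginal likelihood.

```python
import numpy as np

def iss_sweep(Y, R, eps=1e-6):
    """One ISS sweep: a sequence of rank-1 demixing updates, one per source.

    Y : complex array of shape (M, F, T), current source estimates.
    R : real array of shape (M, F, T), time-frequency source variances.
    Returns updated estimates; the demixing matrices are updated implicitly,
    since each rank-1 update acts directly on Y.
    """
    M, F, T = Y.shape
    W = 1.0 / np.maximum(R, eps)  # precision weights from the source model
    for k in range(M):
        yk = Y[k].copy()  # (F, T); pre-update estimate of source k
        # Per-frequency weighted correlations of every source against source k.
        num = np.mean(W * Y * np.conj(yk)[None], axis=-1)                 # (M, F)
        den = np.maximum(np.mean(W * np.abs(yk)[None] ** 2, axis=-1), eps)  # (M, F)
        v = num / den                       # steering gains for sources m != k
        v[k] = 1.0 - 1.0 / np.sqrt(den[k])  # self-update rescales source k
        Y = Y - v[:, :, None] * yk[None]    # rank-1 update of all estimates
    return Y
```

Because each update is rank-1, a sweep costs O(M^2 F T) with no matrix inversions, which is what makes interleaving single ISS steps between neural network blocks cheap enough for a fast separation model.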