Real-time Stereo Speech Enhancement with Spatial-Cue Preservation based on Dual-Path Structure (2402.00337v1)
Abstract: We introduce a real-time, multichannel speech enhancement algorithm that preserves the spatial cues of stereo recordings containing two speech sources. Recognizing that each source carries unique spatial information, our method uses a dual-path structure and keeps the spatial cues intact during enhancement by applying a source-specific common-band gain. The method also seamlessly integrates a pretrained monaural speech enhancement model, eliminating the need for retraining on stereo inputs. Source separation from the stereo mixture is achieved via spatial beamforming, with the steering vector for each source adaptively updated from the post-enhancement output signal, ensuring accurate tracking of the spatial information. The final stereo output is obtained by merging the spatial images of the enhanced sources, and its quality does not depend heavily on the separation performance of the beamforming. The algorithm runs in real time on 10-ms frames with 40 ms of look-ahead. Evaluations demonstrate its effectiveness in enhancing speech and preserving spatial cues for both fully and sparsely overlapped mixtures.
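The following is a minimal sketch, not the authors' implementation, of the spatial-cue-preservation idea stated in the abstract: a single real-valued gain is estimated per source (here from a hypothetical monaural enhancer output versus a noisy reference) and applied identically to the left and right spectra of that source's spatial image, so inter-channel level and phase cues are left unchanged. All function names, shapes, and the toy gain estimate are assumptions for illustration only.

```python
import numpy as np

def common_band_gain(enhanced_mag, noisy_mag, floor=1e-8, g_min=0.0, g_max=1.0):
    """Real-valued gain derived from an enhanced magnitude versus the
    noisy reference magnitude (stand-in for a pretrained monaural enhancer)."""
    g = enhanced_mag / np.maximum(noisy_mag, floor)
    return np.clip(g, g_min, g_max)

def apply_common_gain(stereo_image, gain):
    """Apply the SAME gain to the left and right spectra of one source's
    spatial image; identical scaling keeps ILD/IPD cues unchanged."""
    left, right = stereo_image
    return gain * left, gain * right

# Toy usage: two sources, one STFT frame with 257 frequency bins.
rng = np.random.default_rng(0)
n_bins = 257
images = [(rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins),
           rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins))
          for _ in range(2)]

out_left = np.zeros(n_bins, dtype=complex)
out_right = np.zeros(n_bins, dtype=complex)
for left, right in images:
    ref_mag = np.abs(left + right) / 2   # hypothetical monaural reference signal
    enh_mag = 0.8 * ref_mag              # stand-in for the enhancer's output
    g = common_band_gain(enh_mag, ref_mag)
    l_hat, r_hat = apply_common_gain((left, right), g)
    out_left += l_hat                    # merge the enhanced spatial images
    out_right += r_hat
```

Because both channels of a source are scaled by the same per-band gain, the inter-channel ratios that encode spatial position are preserved regardless of how well the upstream beamforming separates the sources.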