Single-Channel Robot Ego-Speech Filtering during Human-Robot Interaction (2403.02918v1)
Abstract: In this paper, we study how well human speech can be filtered automatically when it overlaps with the voice and fan noise of a social robot, Pepper. We ultimately aim for an HRI scenario where the microphone can remain open while the robot is speaking, enabling a more natural turn-taking scheme in which the human can interrupt the robot. To respond appropriately, the robot needs to understand what the interlocutor said during the overlapping speech, which can be accomplished by target speech extraction (TSE). To investigate how well TSE works in the context of the popular social robot Pepper, we created a dataset mixing recorded speech of Pepper itself, its fan noise (which is close to the microphones), and human speech as recorded by the Pepper microphone, in rooms with low and high reverberation. Comparing a signal processing approach (with and without post-filtering) and a convolutional recurrent neural network (CRNN) approach against a state-of-the-art speaker identification-based TSE model, we found that the signal processing approach without post-filtering yielded the best Word Error Rate on overlapping speech with low reverberation, while the CRNN approach was more robust to reverberation. These results show that estimating the human voice in speech that overlaps with a robot's is feasible in real-life applications, provided that the room reverberation is low and the human speech has a high volume or high pitch.
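The abstract does not detail the signal processing approach, but a classical single-channel baseline for suppressing known ego-noise is magnitude spectral subtraction in the STFT domain. The sketch below is a minimal illustration under that assumption, not the authors' pipeline: the function name `subtract_ego_noise`, the parameters `alpha` and `floor`, and the availability of a time-aligned reference of the robot's own speech-plus-fan signal are all hypothetical.

```python
import numpy as np
from scipy.signal import stft, istft

def subtract_ego_noise(mix, ego_ref, fs=16000, nperseg=512,
                       alpha=2.0, floor=0.02):
    """Magnitude spectral subtraction of a known ego-noise reference
    (robot speech + fan) from a single-channel microphone mixture.

    `ego_ref` is assumed to be time-aligned with `mix` and scaled to the
    level at which the microphone hears it; obtaining such a reference
    (e.g. the TTS waveform filtered by a measured robot-to-mic response)
    is outside the scope of this sketch.
    """
    # Pad or trim the reference so both STFTs share the same shape.
    ref = np.zeros_like(mix)
    k = min(len(mix), len(ego_ref))
    ref[:k] = ego_ref[:k]

    _, _, X = stft(mix, fs=fs, nperseg=nperseg)
    _, _, N = stft(ref, fs=fs, nperseg=nperseg)

    # Over-subtract the ego-noise magnitude (alpha > 1), then apply a
    # spectral floor to limit the "musical noise" left by hard zeros.
    mag = np.maximum(np.abs(X) - alpha * np.abs(N), floor * np.abs(X))

    # Reuse the mixture phase, as is standard for magnitude-domain methods.
    _, est = istft(mag * np.exp(1j * np.angle(X)), fs=fs, nperseg=nperseg)
    return est[: len(mix)]
```

In the comparison described above, "without post-filtering" would correspond to stopping after this subtraction step; the exact post-filter the paper applies is not specified in the abstract.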