
Audio-visual Multi-channel Recognition of Overlapped Speech (2005.08571v2)

Published 18 May 2020 in eess.AS, cs.CL, and cs.SD

Abstract: Automatic speech recognition (ASR) of overlapped speech remains a highly challenging task to date. To this end, multi-channel microphone array data are widely used in state-of-the-art ASR systems. Motivated by the invariance of visual modality to acoustic signal corruption, this paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end. A series of audio-visual multi-channel speech separation front-end components based on TF masking, filter&sum and mask-based MVDR beamforming approaches were developed. To reduce the error cost mismatch between the separation and recognition components, they were jointly fine-tuned using the connectionist temporal classification (CTC) loss function, or a multi-task criterion interpolation with scale-invariant signal-to-noise ratio (Si-SNR) error cost. Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
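The abstract names Si-SNR as the separation error cost that is interpolated with the CTC loss. As a rough illustration only, the sketch below computes Si-SNR with its commonly used definition (zero-mean signals, projection of the estimate onto the target) in plain Python. The function names `si_snr` and `multitask_loss`, the weight `alpha`, and the additive form of the interpolation are assumptions made for this example; the abstract does not state the exact weighting used by the authors.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)
    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]
    # Project the estimate onto the target to get the "clean" component.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]
    # Whatever is left after the projection is treated as noise.
    e_noise = [e - s for e, s in zip(est, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(e * e for e in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


def multitask_loss(ctc_loss, si_snr_db, alpha=0.5):
    """Hypothetical interpolation: CTC loss plus a weighted negative Si-SNR.

    The additive form and the weight `alpha` are assumptions for illustration.
    """
    return ctc_loss + alpha * (-si_snr_db)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is the property the "scale-invariant" name refers to.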

Authors (10)
  1. Jianwei Yu (64 papers)
  2. Bo Wu (144 papers)
  3. Rongzhi Gu (28 papers)
  4. Shi-Xiong Zhang (48 papers)
  5. Lianwu Chen (14 papers)
  6. Yong Xu, Meng Yu (1 paper)
  7. Dan Su (101 papers)
  8. Dong Yu (328 papers)
  9. Xunying Liu (92 papers)
  10. Helen Meng (204 papers)
Citations (17)