
End-to-End Speaker-Attributed ASR with Transformer (2104.02128v1)

Published 5 Apr 2021 in eess.AS, cs.CL, and cs.SD

Abstract: This paper presents our recent effort on end-to-end speaker-attributed automatic speech recognition, which jointly performs speaker counting, speech recognition and speaker identification for monaural multi-talker audio. Firstly, we thoroughly update the model architecture that was previously designed based on a long short-term memory (LSTM)-based attention encoder-decoder by applying transformer architectures. Secondly, we propose a speaker deduplication mechanism to reduce speaker identification errors in highly overlapped regions. Experimental results on the LibriSpeechMix dataset show that the transformer-based architecture is especially good at counting the speakers and that the proposed model reduces the speaker-attributed word error rate by 47% over the LSTM-based baseline. Furthermore, for the LibriCSS dataset, which consists of real recordings of overlapped speech, the proposed model achieves concatenated minimum-permutation word error rates of 11.9% and 16.3% with and without target speaker profiles, respectively, both of which are the state-of-the-art results for LibriCSS with the monaural setting.
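The speaker deduplication idea in the abstract amounts to a constraint at decoding time: segments that overlap in the same region should not all be attributed to the same enrolled speaker. The following is a minimal sketch of that constraint, not the authors' implementation; it assumes each output segment already has a posterior over speaker profiles (e.g. from attention over profile d-vectors), and the function and variable names (`assign_speakers`, `segment_posteriors`) are hypothetical.

```python
# Hedged sketch of a greedy "deduplication" assignment: within one block of
# overlapped segments, never attribute two segments to the same profile.
import numpy as np

def assign_speakers(segment_posteriors: np.ndarray) -> list[int]:
    """segment_posteriors: (num_segments, num_profiles) speaker probabilities
    for the segments of one overlapping block. Returns one profile index per
    segment, skipping profiles already used in this block when possible."""
    used: set[int] = set()
    assignments: list[int] = []
    for probs in segment_posteriors:
        order = np.argsort(-probs)  # candidate profiles, most likely first
        pick = next((int(i) for i in order if int(i) not in used), int(order[0]))
        used.add(pick)
        assignments.append(pick)
    return assignments

# Toy example: two overlapped segments whose top choice is the same profile;
# the second segment falls back to its next-best profile.
post = np.array([[0.70, 0.20, 0.10],
                 [0.55, 0.40, 0.05]])
print(assign_speakers(post))  # -> [0, 1]
```

In the paper's setting the benefit appears precisely in highly overlapped regions, where an unconstrained argmax over profiles tends to pick the dominant speaker for every overlapping segment.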

Authors (7)
  1. Naoyuki Kanda (61 papers)
  2. Guoli Ye (15 papers)
  3. Yashesh Gaur (43 papers)
  4. Xiaofei Wang (138 papers)
  5. Zhong Meng (53 papers)
  6. Zhuo Chen (319 papers)
  7. Takuya Yoshioka (77 papers)
Citations (45)
