SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR (2403.02010v1)

Published 4 Mar 2024 in cs.SD and eess.AS

Abstract: Multi-talker automatic speech recognition plays a crucial role in scenarios involving multi-party interactions, such as meetings and conversations, and owing to its inherent complexity this task has been receiving increasing attention. Notably, serialized output training (SOT) stands out among various approaches because of its simple architecture and strong performance. However, the frequent speaker changes in token-level SOT (t-SOT) make it difficult for the autoregressive decoder to use context effectively when predicting output sequences. To address this issue, we introduce a masked t-SOT label, which serves as the cornerstone of an auxiliary training loss. Additionally, we use a speaker similarity matrix to refine the decoder's self-attention, strengthening contextual relationships among the same speaker's tokens while minimizing interactions between different speakers' tokens. We denote our method speaker-aware SOT (SA-SOT). Experiments on the LibriSpeech datasets demonstrate that SA-SOT obtains relative cpWER reductions ranging from 12.75% to 22.03% on the multi-talker test sets. Furthermore, with more extensive training, our method achieves an impressive cpWER of 3.41%, establishing a new state-of-the-art result on the LibriSpeechMix dataset.
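The abstract's central mechanism, refining the decoder's self-attention with a speaker similarity matrix so that same-speaker tokens interact more strongly than cross-speaker tokens, can be sketched as follows. This is a minimal illustration under assumed details (an additive score bias with a made-up scale `alpha`, and a hard 0/1 similarity matrix derived from token-level speaker labels), not the paper's exact formulation, which the abstract does not fully specify.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def speaker_aware_self_attention(q, k, v, speaker_sim, alpha=5.0):
    """Causal self-attention with a speaker-similarity bias (sketch).

    speaker_sim[i, j] is ~1 when tokens i and j belong to the same
    speaker and ~0 otherwise; adding alpha * speaker_sim to the raw
    scores boosts same-speaker attention while still permitting some
    cross-speaker interaction. `alpha` is a hypothetical knob, not a
    parameter from the paper.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)            # standard scaled dot-product
    scores = scores + alpha * speaker_sim    # speaker-aware refinement
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -1e9)  # autoregressive (causal) mask
    weights = softmax(scores, axis=-1)       # rows sum to 1
    return weights @ v
```

In this sketch the similarity matrix could come from token-level speaker labels (`np.equal.outer(spk, spk)`); in a trained model it would more plausibly be computed from learned speaker embeddings.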

Authors (5)
  1. Zhiyun Fan (8 papers)
  2. Linhao Dong (16 papers)
  3. Jun Zhang (1008 papers)
  4. Lu Lu (189 papers)
  5. Zejun Ma (78 papers)
Citations (4)