Overview of Advancing Multi-Talker ASR Performance with LLMs
The paper presents a novel approach to improving multi-talker automatic speech recognition (ASR) by combining pre-trained large language models (LLMs) with serialized output training (SOT). Recognizing overlapping speech from multiple speakers in conversational scenarios remains a persistent challenge for ASR systems. Traditional approaches, such as attention-based encoder-decoder (AED) architectures, hit performance bottlenecks because their decoders have limited long-context modeling ability. This paper proposes exploiting the inherent strengths of pre-trained LLMs to overcome these limitations.
Key Contributions
The central contribution of this work is a pipeline that couples a pre-trained speech encoder to a pre-trained LLM through a projector, followed by fine-tuning on multi-talker datasets. The method leverages the LLM's proficiency in long-context and cross-utterance modeling to produce SOT-style transcriptions, in which the transcriptions of all speakers in a mixture are concatenated into a single target sequence, ordered by utterance start time and separated by a speaker-change token (see the sketches below). The paper evaluates this approach against traditional AED methods and demonstrates superior performance in various scenarios, including the simulated LibriMix dataset and the real-world evaluation set of the AMI meeting corpus.
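To make the target format concrete, here is a minimal sketch of SOT-style target construction. The first-in-first-out ordering by start time and the speaker-change token (written `"<sc>"` here) follow the original SOT formulation; the segment field names are illustrative assumptions, not the paper's data format.

```python
def serialize_sot(segments, sc_token="<sc>"):
    """Build a single SOT target from per-speaker segments.

    segments: list of dicts like {"text": str, "start": float}, one per
    speaker utterance in the mixture (field names are illustrative).
    Segments are serialized first-in-first-out by start time, with a
    speaker-change token inserted between consecutive segments.
    """
    ordered = sorted(segments, key=lambda s: s["start"])
    return f" {sc_token} ".join(s["text"] for s in ordered)

# Example: a two-speaker LibriMix-style mixture.
print(serialize_sot([
    {"text": "how are you doing", "start": 0.0},
    {"text": "fine thanks and you", "start": 1.3},
]))
# -> "how are you doing <sc> fine thanks and you"
```

And a toy sketch of the encoder-projector-LLM data flow, using a simple linear projector and random tensors as stand-ins for the pre-trained encoder output and the LLM's prompt embeddings. The dimensions and module choice are assumptions for illustration, not the paper's exact configuration (requires PyTorch).

```python
import torch
import torch.nn as nn

class SpeechProjector(nn.Module):
    """Maps speech-encoder features into the LLM's embedding space."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)  # illustrative projector

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, enc_dim) from a pre-trained speech encoder
        return self.proj(feats)

# Toy data flow: encoder output -> projector -> prefix for the LLM.
enc_out = torch.randn(2, 100, 512)   # stand-in encoder features
prompt = torch.randn(2, 16, 4096)    # stand-in prompt embeddings
speech_embeds = SpeechProjector(512, 4096)(enc_out)
llm_inputs = torch.cat([speech_embeds, prompt], dim=1)  # fed to the LLM
```

In the paper's setup, this combined model is then fine-tuned on multi-talker data so that the LLM learns to emit SOT-style targets conditioned on the projected speech prefix.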
Experimental Results
- LibriMix Dataset: The proposed LLM-based method achieves a word error rate (WER) of 9.0% on the LibriMix test set, edging out the AED baseline's 9.2%. The improvement, albeit modest, indicates that the LLM decoder remains effective even in the simpler setting of two speakers per utterance.
- AMI Meeting Corpus: The LLM-based ASR method achieves state-of-the-art performance on the AMI evaluation set, surpassing AED models trained with 1000 times more supervised data. Specifically, it records a concatenated minimum-permutation word error rate (cpWER) of 20.4%, significantly outperforming competitors pre-trained on large-scale supervised datasets and demonstrating the robustness of LLMs in real-world multi-talker conditions (a minimal cpWER sketch follows this list).
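For reference, cpWER concatenates each speaker's words in the reference and the hypothesis, scores every speaker pairing, and keeps the permutation with the fewest word errors. Below is a minimal, dependency-free sketch assuming equal speaker counts in reference and hypothesis; real scoring tools also handle mismatched counts by padding with empty streams.

```python
from itertools import permutations

def _edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution
        prev = cur
    return prev[-1]

def cp_wer(ref_by_spk, hyp_by_spk):
    """cpWER over per-speaker word lists (one list per speaker).

    Sums word errors under each speaker permutation of the hypothesis
    and divides the minimum by the total reference word count.
    """
    n_ref_words = sum(len(r) for r in ref_by_spk)
    best = min(
        sum(_edit_distance(r, h) for r, h in zip(ref_by_spk, perm))
        for perm in permutations(hyp_by_spk)
    )
    return best / n_ref_words

# Speaker labels are permuted, but cpWER is permutation-invariant.
ref = [["hello", "there"], ["good", "morning"]]
hyp = [["good", "morning"], ["hello", "there"]]
print(cp_wer(ref, hyp))  # -> 0.0
```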
Implications and Future Directions
The results elucidate the potential of LLMs in enhancing ASR for complex, multi-talker environments. The improvement, most pronounced on real-world data, suggests that prioritizing advanced decoder architectures could substantially benefit multi-talker ASR systems, emphasizing decoder pre-training over encoder-centric approaches. Furthermore, the research implies that LLMs could serve as a foundation for more generalized speech recognition systems capable of handling varying degrees of conversational complexity.
Future research could extend LLM-based ASR to other complex audio processing tasks, such as speech enhancement and diarization, to further probe the capabilities of LLMs in diverse audio scenarios. Additionally, refined training strategies, possibly combining unsupervised and reinforcement learning techniques, could yield further performance gains and improve the scalability of LLM-based ASR frameworks.
In summary, this paper underscores the effectiveness of leveraging LLMs with SOT for multi-talker ASR, setting a new benchmark in the domain and opening pathways for further innovations centered around LLMs in speech technology.