
Advancing Multi-talker ASR Performance with Large Language Models (2408.17431v1)

Published 30 Aug 2024 in eess.AS and cs.AI

Abstract: Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR). Serialized output training (SOT) is a classic method to address multi-talker ASR, with the idea of concatenating transcriptions from multiple speakers according to the emission times of their speech for training. However, SOT-style transcriptions, derived from concatenating multiple related utterances in a conversation, depend significantly on modeling long contexts. Therefore, compared to traditional methods that primarily emphasize encoder performance in attention-based encoder-decoder (AED) architectures, a novel approach utilizing LLMs that leverages the capabilities of pre-trained decoders may be better suited for such complex and challenging scenarios. In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging a pre-trained speech encoder and LLM and fine-tuning them on multi-talker datasets using appropriate strategies. Experimental results demonstrate that our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI, outperforming the AED model trained with 1000 times more supervised data in previous works.

Overview of Advancing Multi-Talker ASR Performance with LLMs

The paper presents a novel approach to improving multi-talker automatic speech recognition (ASR) by integrating the capabilities of LLMs with serialized output training (SOT). Recognizing overlapping speech from multiple speakers in conversational scenarios remains an intricate challenge for ASR systems. Traditional approaches, such as attention-based encoder-decoder (AED) architectures, have struggled with performance bottlenecks due to their limited long-context modeling ability. This paper proposes exploiting the inherent strengths of pre-trained LLMs to overcome these limitations.
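To make the SOT format concrete, the sketch below builds a serialized training target by concatenating per-speaker transcriptions in order of their emission (start) times and separating them with a speaker-change token. The token name `<sc>` and the utterance layout are illustrative assumptions, not the paper's exact serialization.

```python
# Illustrative sketch of an SOT-style training target: utterances from different
# speakers are concatenated in order of their emission (start) times and joined
# with a speaker-change token. The "<sc>" token and the data layout are
# assumptions for illustration, not necessarily the paper's exact format.

SPEAKER_CHANGE = "<sc>"

def build_sot_target(utterances):
    """utterances: list of dicts with 'start' (seconds), 'speaker', and 'text'."""
    ordered = sorted(utterances, key=lambda u: u["start"])
    return f" {SPEAKER_CHANGE} ".join(u["text"] for u in ordered)

if __name__ == "__main__":
    mixture = [
        {"start": 0.8, "speaker": "B", "text": "sure go ahead"},
        {"start": 0.0, "speaker": "A", "text": "can i ask you something"},
    ]
    print(build_sot_target(mixture))
    # -> can i ask you something <sc> sure go ahead
```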

Key Contributions

The central contribution of this work is the integration of a pre-trained speech encoder with a projector and a pre-trained LLM, followed by fine-tuning on multi-talker datasets. The methodology leverages the LLM's proficiency in long-context and cross-utterance modeling to manage SOT-style transcriptions, which are derived by concatenating transcriptions from multiple speakers. This paper evaluates the proposed approach against traditional AED methods and demonstrates superior performance in various scenarios, including the simulated LibriMix dataset and the real-world evaluation set of the AMI meeting corpus.
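A minimal sketch of this encoder-projector-LLM pipeline is given below, assuming PyTorch, a simple linear projector, and a Hugging Face-style decoder-only LLM that accepts `inputs_embeds`. The dimensions, module names, and the choice to freeze the speech encoder are assumptions for illustration rather than the paper's exact configuration.

```python
# Sketch of the speech encoder -> projector -> LLM pipeline described above.
# Dimensions, the linear projector, and the frozen encoder are illustrative
# assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn

class SpeechLLMForSOT(nn.Module):
    def __init__(self, speech_encoder, llm, enc_dim=1024, llm_dim=4096):
        super().__init__()
        self.speech_encoder = speech_encoder          # pre-trained acoustic encoder
        self.projector = nn.Linear(enc_dim, llm_dim)  # maps acoustic frames into the LLM embedding space
        self.llm = llm                                # pre-trained decoder-only LLM

        # One common fine-tuning strategy: freeze the speech encoder and train
        # the projector (and optionally the LLM, e.g. with adapters).
        for p in self.speech_encoder.parameters():
            p.requires_grad = False

    def forward(self, speech_features, prompt_embeds):
        acoustic = self.speech_encoder(speech_features)  # (B, T, enc_dim)
        speech_embeds = self.projector(acoustic)         # (B, T, llm_dim)
        # Prepend a text prompt and let the LLM decode the SOT-style transcription.
        inputs = torch.cat([prompt_embeds, speech_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```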

Experimental Results

  1. LibriMix Dataset: The proposed LLM-based method achieves a word error rate (WER) of 9.0% on the LibriMix test set, comparing favorably to the AED baseline, which recorded a WER of 9.2%. The improvement, albeit modest, highlights the effectiveness of LLMs on two-speaker mixtures.
  2. AMI Meeting Corpus: The LLM-based ASR method achieves state-of-the-art performance on the AMI evaluation set, surpassing AED models trained with 1000 times more supervised data. Specifically, the LLM-based system records a cpWER of 20.4% (a simplified sketch of how cpWER is computed follows this list), substantially outperforming competitors pre-trained on large-scale supervised datasets and demonstrating the robustness of LLMs in real-world multi-talker conditions.
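For context, cpWER (concatenated minimum-permutation WER) concatenates each speaker's utterances into a single word stream per speaker and then scores the hypothesis-to-reference speaker assignment that yields the lowest overall WER. The sketch below is a simplified illustration using a plain word-level Levenshtein distance and assumes equal speaker counts; it is not the official scoring tool used in the paper.

```python
# Simplified cpWER sketch: concatenate each speaker's words, then take the WER
# under the hypothesis-to-reference speaker permutation with the fewest errors.
# Assumes equal speaker counts; an illustration, not the official scoring tool.
from itertools import permutations

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance with a single rolling row."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Both arguments map a speaker id to that speaker's concatenated transcript."""
    refs = [r.split() for r in ref_by_speaker.values()]
    hyps = [h.split() for h in hyp_by_speaker.values()]
    total_ref_words = sum(len(r) for r in refs)
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words

if __name__ == "__main__":
    ref = {"A": "hello there", "B": "good morning everyone"}
    hyp = {"spk1": "good morning everyone", "spk2": "hello there"}
    print(f"cpWER = {cp_wer(ref, hyp):.2%}")  # 0.00% under the best permutation
```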

Implications and Future Directions

The results highlight the potential of LLMs to improve ASR in complex, multi-talker environments. The substantial gains on real-world AMI data suggest that investing in strong, pre-trained decoders can benefit multi-talker ASR more than further encoder-centric refinements. The work also suggests that LLMs could serve as a foundation for more general speech recognition systems capable of handling varying degrees of conversational complexity.

Future research could extend LLM-based ASR to other complex audio processing tasks, such as speech enhancement and diarization, to further probe the capabilities of LLMs in diverse audio scenarios. Advances in training strategies, possibly combining unsupervised and reinforcement learning techniques, could yield additional performance gains and improve the scalability of LLM-based ASR frameworks.

In summary, this paper underscores the effectiveness of leveraging LLMs with SOT for multi-talker ASR, setting a new benchmark in the domain and opening pathways for further innovations centered around LLMs in speech technology.

Authors (9)
  1. Mohan Shi (9 papers)
  2. Zengrui Jin (30 papers)
  3. Yaoxun Xu (11 papers)
  4. Yong Xu (432 papers)
  5. Shi-Xiong Zhang (48 papers)
  6. Kun Wei (23 papers)
  7. Yiwen Shao (13 papers)
  8. Chunlei Zhang (40 papers)
  9. Dong Yu (328 papers)