
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition (2310.06434v2)

Published 10 Oct 2023 in cs.CL, cs.AI, cs.MM, cs.SD, and eess.AS

Abstract: We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts. This marks a step towards a fresh paradigm in generative error correction within the realm of n-best hypotheses. Unlike the existing ranking-based rescoring methods, our approach adeptly uses distinct initialization techniques and parameter-efficient algorithms to boost ASR performance derived from pre-trained speech and text models. Evaluating across diverse ASR datasets, we assess the stability and reproducibility of our fusion technique, demonstrating a 37.66% relative improvement in word error rate (WERR) over the n-best hypotheses. To encourage future research, we have made our code and pre-trained models open source at https://github.com/Srijith-rkr/Whispering-LLaMA.

Authors (7)
  1. Srijith Radhakrishnan (2 papers)
  2. Chao-Han Huck Yang (89 papers)
  3. Sumeer Ahmad Khan (2 papers)
  4. Rohit Kumar (80 papers)
  5. Narsis A. Kiani (24 papers)
  6. David Gomez-Cabrero (5 papers)
  7. Jesper N. Tegner (3 papers)
Citations (38)

Summary

Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition

The paper "Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition" presents a novel framework designed to enhance automatic speech recognition (ASR) systems by leveraging the capabilities of both speech and LLMs. This work proposes a cross-modal fusion technique to perform generative error correction, aiming to correct errors in the transcripts produced by ASR systems.

Overview

The approach extends beyond conventional ranking-based rescoring methods, which follow a two-pass paradigm: the first pass generates n-best hypotheses with an acoustic model, and the second pass reranks those hypotheses with an LLM. The proposed method instead integrates acoustic data with external linguistic cues, using distinct initialization techniques and a parameter-efficient algorithm to improve ASR performance.
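To make the contrast concrete, a minimal sketch of the two paradigms follows. The function names, score weighting, and prompt wording are illustrative assumptions, not the paper's exact formulation:

```python
# Illustrative contrast between second-pass rescoring and generative
# error correction. `lm_score` and `llm_generate` are hypothetical callables.

def rerank(nbest, acoustic_scores, lm_score, alpha=0.5):
    """Ranking-based rescoring: pick the best *existing* hypothesis."""
    combined = [
        (1 - alpha) * a + alpha * lm_score(h)
        for h, a in zip(nbest, acoustic_scores)
    ]
    return nbest[combined.index(max(combined))]

def generative_correction(nbest, llm_generate):
    """Generative error correction: the LLM may emit a *new* transcript
    that appears in none of the n-best hypotheses."""
    prompt = (
        "Below are candidate transcriptions of the same utterance.\n"
        + "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
        + "\nCorrected transcription:"
    )
    return llm_generate(prompt)
```

The key difference is that rescoring is bounded by the best hypothesis already in the list, while generative correction can recover words that every hypothesis got wrong.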

Methodology

The framework brings together two pre-trained models: Whisper, a transformer-based acoustic model, and LLaMA, a large-scale decoder-only LLM. The integration is realized through a cross-modal generative error correction mechanism: encoded audio features and the n-best hypotheses are fed into a prompted LLM, which generates a corrected, more accurate transcription.
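At a high level, the flow can be sketched as follows; every method name here (`encode`, `generate_nbest`, the `audio_features` keyword) is a hypothetical stand-in rather than the repository's actual API:

```python
# High-level sketch of the cross-modal correction flow (hypothetical API).

def correct_transcript(audio, whisper_model, llama_model, tokenizer, k=5):
    # Acoustic side: encoder features plus k candidate transcriptions.
    features = whisper_model.encode(audio)            # (T, d_audio)
    nbest = whisper_model.generate_nbest(audio, k=k)  # list of k strings

    # Linguistic side: pack the hypotheses into an instruction prompt.
    prompt = ("Candidate transcriptions:\n" + "\n".join(nbest)
              + "\nCorrected transcription:")
    tokens = tokenizer(prompt)

    # Fusion: the adapted LLM conditions on the audio features while
    # generating the corrected transcript.
    return llama_model.generate(tokens, audio_features=features)
```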

The paper details the architecture, including a cross-modal fusion mechanism built on learnable adapters for parameter-efficient model fusion. An initialization strategy is employed to preserve the latent structures within the pre-trained models, ensuring effective training. The authors give a comprehensive account of how acoustic features from Whisper are fused into a single LLaMA model so that the strengths of each model are leveraged.
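One plausible realization of such an adapter is a gated cross-attention bottleneck in PyTorch, with the gate initialized to zero so that, at the start of training, the fused model behaves exactly like the frozen LLM (one way to preserve the pre-trained latent structure). The layer placement, bottleneck width, and gating form below are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class CrossModalAdapter(nn.Module):
    """Parameter-efficient adapter letting frozen-LLM activations
    attend to Whisper encoder features (illustrative sketch)."""

    def __init__(self, d_llm, d_audio, d_bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_llm, d_bottleneck)           # query projection
        self.audio_proj = nn.Linear(d_audio, d_bottleneck)   # key/value projection
        self.up = nn.Linear(d_bottleneck, d_llm)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: no-op at start

    def forward(self, hidden, audio_feats):
        # hidden: (B, T, d_llm) activations of the frozen LLM layer
        # audio_feats: (B, S, d_audio) Whisper encoder features
        q = self.down(hidden)                         # (B, T, d_b)
        kv = self.audio_proj(audio_feats)             # (B, S, d_b)
        scores = q @ kv.transpose(1, 2) / kv.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)          # (B, T, S)
        fused = self.up(attn @ kv)                    # (B, T, d_llm)
        return hidden + self.gate.tanh() * fused      # gated residual
```

Because only the adapter parameters are trained while both backbones stay frozen, the scheme remains parameter-efficient, and the zero-initialized gate guarantees training starts from the unmodified LLM.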

Results and Implications

Through extensive experiments on datasets such as ATIS and GigaSpeech, the method demonstrates a substantial improvement in word error rate (WER). The approach achieves a 37.66% relative WER improvement over the n-best hypotheses baseline, a significant gain over standard two-pass systems.
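For reference, a relative WER improvement (WERR) of this kind is the drop in WER divided by the baseline WER. A small sketch using the jiwer library, with made-up toy strings:

```python
# WERR = (WER_baseline - WER_corrected) / WER_baseline  (toy numbers).
from jiwer import wer

refs      = ["turn on the kitchen lights", "play some jazz music"]
baseline  = ["turn on the kitten lights",  "play sum jazz music"]   # 2 errors
corrected = ["turn on the kitchen lights", "play sum jazz music"]   # 1 error

wer_base = wer(refs, baseline)     # 2 errors / 9 reference words
wer_corr = wer(refs, corrected)    # 1 error / 9 reference words
werr = (wer_base - wer_corr) / wer_base
print(f"WERR = {werr:.2%}")        # 50.00% relative improvement
```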

The results indicate that the model handles real-world noisy data effectively. Further experiments show that the framework remains effective even when the n-best hypotheses come from smaller or less capable acoustic models and are therefore of lower quality, demonstrating its robustness in practical scenarios where state-of-the-art ASR systems may not be feasible.

Future Directions and Limitations

Integrating LLMs into ASR error correction opens the door to handling model scaling more efficiently, especially for low-resource or endangered languages. However, the authors acknowledge the computational demands of deploying large models, underscoring the value of their parameter-efficient design, which reuses existing model components.

Future research may explore optimizing the training paradigm further to reduce computational overhead, and extending this framework to multimodal information (beyond acoustic and linguistic data) for enriched contextual understanding in ASR tasks.

Conclusion

Overall, the paper makes a significant contribution to enhancing ASR systems by effectively integrating acoustic and linguistic models in a cross-modal generative error correction framework. The approach not only demonstrates a meaningful leap in performance metrics but also sets a foundation for future innovations in utilizing LLMs for complex, multi-modal speech processing tasks. The availability of the source code further encourages expansive exploration and adoption by the research community.