MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition (2405.03152v1)
Abstract: Despite notable advancements in automatic speech recognition (ASR), performance tends to degrade under adverse conditions. Generative error correction (GER) leverages the exceptional text comprehension capabilities of large language models (LLMs), delivering impressive performance in ASR error correction, where N-best hypotheses provide valuable information for transcription prediction. However, GER faces challenges such as fixed N-best hypotheses, insufficient utilization of acoustic information, and limited specificity to multi-accent scenarios. In this paper, we explore the application of GER in multi-accent scenarios. Accents represent deviations from standard pronunciation norms, and multi-task learning for simultaneous ASR and accent recognition (AR) has proven an effective and prominent solution to multi-accent scenarios. In this work, we propose a unified ASR-AR GER model, named MMGER, that leverages multi-modal correction and multi-granularity correction. Multi-task ASR-AR learning is employed to provide dynamic 1-best hypotheses and accent embeddings. Multi-modal correction accomplishes fine-grained frame-level correction by force-aligning the acoustic features of speech with the corresponding character-level 1-best hypothesis sequence. Multi-granularity correction supplements global linguistic information by incorporating regular 1-best hypotheses atop the fine-grained multi-modal correction to achieve coarse-grained utterance-level correction. MMGER effectively mitigates the limitations of GER and tailors LLM-based ASR error correction to multi-accent scenarios. Experiments conducted on the multi-accent Mandarin KeSpeech dataset demonstrate the efficacy of MMGER, achieving a 26.72% relative improvement in AR accuracy and a 27.55% relative reduction in ASR character error rate, compared to a well-established standard baseline.
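To make the two correction granularities concrete, below is a minimal, illustrative sketch (not the authors' implementation) of how frame-level acoustic features could be force-aligned to a character-level 1-best hypothesis and then concatenated with a regular utterance-level hypothesis to form the LLM input. All module names, dimensions, and the toy alignment are assumptions for illustration only.

```python
# Minimal sketch of multi-modal (frame/character-level) and multi-granularity
# (utterance-level) fusion, assuming PyTorch. Shapes and module names are
# illustrative assumptions, not the paper's actual architecture details.
import torch
import torch.nn as nn

class MultiGranularityFusion(nn.Module):
    def __init__(self, acoustic_dim=512, llm_dim=1024, vocab_size=6000):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, llm_dim)    # stands in for the LLM token embedding
        self.acoustic_proj = nn.Linear(acoustic_dim, llm_dim)  # map speech features into the LLM space
        self.fuse = nn.Linear(2 * llm_dim, llm_dim)            # fine-grained multi-modal fusion

    def align_frames_to_chars(self, frames, frame_to_char):
        # Force-alignment step: average the acoustic frames assigned to each
        # hypothesis character (frame_to_char[i] = character index of frame i).
        num_chars = int(frame_to_char.max().item()) + 1
        pooled = torch.zeros(num_chars, frames.size(-1))
        counts = torch.zeros(num_chars, 1)
        pooled.index_add_(0, frame_to_char, frames)
        counts.index_add_(0, frame_to_char, torch.ones(frames.size(0), 1))
        return pooled / counts.clamp(min=1)

    def forward(self, frames, frame_to_char, char_hyp_ids, utt_hyp_ids):
        # Fine-grained, character-level multi-modal correction input.
        char_acoustic = self.acoustic_proj(self.align_frames_to_chars(frames, frame_to_char))
        char_text = self.char_embed(char_hyp_ids)
        fine = self.fuse(torch.cat([char_acoustic, char_text], dim=-1))
        # Coarse-grained, utterance-level input: the regular 1-best hypothesis.
        coarse = self.char_embed(utt_hyp_ids)
        # Concatenate both granularities into one prompt sequence for the LLM.
        return torch.cat([fine, coarse], dim=0)

# Toy usage: 20 acoustic frames aligned to a 5-character hypothesis.
model = MultiGranularityFusion()
frames = torch.randn(20, 512)
frame_to_char = torch.arange(20) // 4   # toy alignment: 4 frames per character
char_hyp = torch.randint(0, 6000, (5,))
utt_hyp = torch.randint(0, 6000, (5,))
prompt = model(frames, frame_to_char, char_hyp, utt_hyp)
print(prompt.shape)  # torch.Size([10, 1024])
```

In this sketch, the fused character-level sequence carries acoustic evidence at a fine granularity, while the appended utterance-level hypothesis preserves global linguistic context, mirroring the paper's multi-modal and multi-granularity correction at a high level.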
Authors: Bingshen Mu, Yangze Li, Qijie Shao, Kun Wei, Xucheng Wan, Naijun Zheng, Huan Zhou, Lei Xie