Generative Context-aware Fine-tuning of Self-supervised Speech Models (2312.09895v1)
Abstract: When performing tasks like automatic speech recognition or spoken language understanding for a given utterance, access to the preceding text or audio provides contextual information that can improve performance. Considering the recent advances in generative large language models (LLMs), we hypothesize that an LLM could generate useful context information from the preceding text. With appropriate prompts, an LLM could generate a prediction of the next sentence or abstractive text such as titles or topics. In this paper, we study the use of LLM-generated context information and propose an approach to distill the generated information during fine-tuning of self-supervised speech models, which we refer to as generative context-aware fine-tuning. This approach allows the fine-tuned model to make improved predictions without access to the true surrounding segments or to the LLM at inference time, while requiring only a very small additional context module. We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis. The results show that generative context-aware fine-tuning outperforms a context injection fine-tuning approach that has access to the ground-truth previous text, and is competitive with a generative context injection fine-tuning approach that requires the LLM at inference time.
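Read as a recipe, the abstract implies a two-part training setup: an LLM generates context text offline from the preceding transcript, a frozen text encoder turns it into an embedding, and a small context module attached to the self-supervised speech model is trained to predict (distill) that embedding, so that neither the LLM nor the surrounding text is needed at inference. The PyTorch sketch below illustrates one plausible form of such a training step for the ASR case; the module names, the additive fusion, and the cosine distillation loss are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of generative context-aware fine-tuning.
# Assumptions: `speech_encoder` is a pretrained self-supervised model (e.g. wav2vec 2.0)
# returning frame features of shape (B, T, feat_dim); `llm_ctx_emb` is a precomputed
# embedding of the LLM-generated context (e.g. a predicted next sentence or topic),
# obtained offline with a frozen text encoder. All names here are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextModule(nn.Module):
    """Small trainable module: predicts a context embedding from speech features
    and returns frame features conditioned on that predicted context."""

    def __init__(self, feat_dim: int, ctx_dim: int):
        super().__init__()
        self.to_ctx = nn.Sequential(
            nn.Linear(feat_dim, ctx_dim), nn.GELU(), nn.Linear(ctx_dim, ctx_dim)
        )
        self.ctx_to_feat = nn.Linear(ctx_dim, feat_dim)

    def forward(self, frame_feats: torch.Tensor):
        pooled = frame_feats.mean(dim=1)                   # (B, feat_dim)
        ctx = self.to_ctx(pooled)                          # (B, ctx_dim)
        fused = frame_feats + self.ctx_to_feat(ctx).unsqueeze(1)  # (B, T, feat_dim)
        return ctx, fused


def fine_tune_step(speech_encoder, ctx_module, ctc_head,
                   waveforms, targets, target_lens, llm_ctx_emb,
                   distill_weight: float = 0.1):
    """One training step: CTC loss for ASR plus a distillation loss that pulls the
    context module's prediction toward the LLM-generated context embedding."""
    frame_feats = speech_encoder(waveforms)                # (B, T, feat_dim)
    pred_ctx, fused = ctx_module(frame_feats)

    # Distillation: cosine distance to the frozen, precomputed LLM context embedding.
    distill_loss = 1.0 - F.cosine_similarity(pred_ctx, llm_ctx_emb, dim=-1).mean()

    # Standard CTC fine-tuning loss on the context-conditioned features.
    logits = ctc_head(fused)                               # (B, T, vocab)
    log_probs = logits.log_softmax(dim=-1).transpose(0, 1) # (T, B, vocab)
    input_lens = torch.full((logits.size(0),), logits.size(1), dtype=torch.long)
    ctc_loss = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=0)

    return ctc_loss + distill_weight * distill_loss
```

At inference time only `speech_encoder`, `ctx_module`, and `ctc_head` are run; the distillation branch (and hence the LLM and any surrounding text) is dropped, matching the abstract's claim that only a very small additional context module is required.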
- “Advanced long-context end-to-end speech recognition using context-expanded transformers,” arXiv preprint arXiv:2104.09426, 2021.
- “Dialogue history integration into end-to-end signal-to-concept spoken language understanding systems,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8509–8513.
- “Conversational speech recognition by learning conversation-level characteristics,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6752–6756.
- “Leveraging acoustic contextual representation by audio-textual cross-modal learning for conversational ASR,” arXiv preprint arXiv:2207.01039, 2022.
- “Acoustic-to-word models with conversational context information,” arXiv preprint arXiv:1905.08796, 2019.
- “Cross-attention end-to-end ASR for two-party conversations,” arXiv preprint arXiv:1907.10726, 2019.
- “Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems,” in Proc. INTERSPEECH 2023, 2023, pp. 2223–2227.
- “Context-aware end-to-end ASR using self-attentive embedding and tensor fusion,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
- “Context-aware fine-tuning of self-supervised speech models,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
- “Stable Beluga models.”
- “Libri-Light: A Benchmark for ASR with Limited or No Supervision,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
- “SLUE: New benchmark tasks for spoken language understanding evaluation on natural speech,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 7927–7931.
- “Librispeech: An ASR corpus based on public domain audio books,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
- “fairseq: A fast, extensible toolkit for sequence modeling,” in NAACL demo, 2019.
- “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
- “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
- “wav2vec 2.0: A framework for self-supervised learning of speech representations,” NeurIPS, 2020.
- “DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing,” arXiv preprint arXiv:2111.09543, 2021.
- “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
- “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” arXiv preprint arXiv:1910.13461, 2019.
- Suwon Shon (31 papers)
- Kwangyoun Kim (18 papers)
- Prashant Sridhar (10 papers)
- Yi-Te Hsu (7 papers)
- Shinji Watanabe (416 papers)
- Karen Livescu (89 papers)