- The paper presents a novel method that replaces low-frequency tokens with audio-based tokens to improve disordered speech recognition.
- The study shows that reinforcement learning with human feedback, driven by a reward balancing Word Error Rate and Meaning Preservation, improves semantic accuracy without substantially compromising syntactic accuracy.
- The approach extends LLM capabilities to multimodal ASR, paving the way for inclusive, accessible speech technologies.
An Analysis of Speech Recognition with LLMs Adapted to Disordered Speech Using Reinforcement Learning
The paper "Speech Recognition with LLMs Adapted to Disordered Speech Using Reinforcement Learning" presents an innovative approach to automatic speech recognition (ASR) that leverages large language models (LLMs). Specifically, the research introduces a methodology for adapting LLMs to recognize and accurately transcribe disordered speech, employing reinforcement learning to fine-tune the models against semantic and syntactic accuracy measures.
Methodological Contributions
The authors propose a novel adaptation of LLMs for ASR by replacing low-frequency text tokens in the LLM's vocabulary with tokens derived from audio clusters. This transformation allows the LLM to integrate speech recognition capabilities with its inherent language understanding, leaving the transformer architecture essentially unmodified. The model is trained in two steps. First, a generic LLM is fine-tuned on a blend of LibriSpeech (a corpus of clean speech) and Euphonia (a corpus of impaired speech), covering both standard and disordered speech. Second, an additional tuning phase employs reinforcement learning with human feedback (RLHF) to strengthen the model's ability to preserve the semantic intent of the speech.
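The vocabulary-swap idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the frequency-based selection, and the direct ID reuse are all assumptions made for clarity. The core idea shown is that the rarest text-token IDs are repurposed as slots for discrete audio-cluster units, so speech can enter the model as ordinary token sequences.

```python
def remap_low_frequency_tokens(token_frequencies, num_audio_clusters):
    """Return a mapping: audio-cluster index -> reused text-token ID.

    token_frequencies: dict of token_id -> corpus frequency
    num_audio_clusters: number of discrete audio units to accommodate
    """
    # Sort token IDs by ascending frequency; the rarest text tokens
    # give up their embedding slots to the audio clusters.
    rarest = sorted(token_frequencies, key=token_frequencies.get)
    rarest = rarest[:num_audio_clusters]
    return {cluster: tok_id for cluster, tok_id in enumerate(rarest)}


def audio_to_token_ids(cluster_sequence, cluster_to_token):
    """Encode a sequence of audio-cluster indices as LLM token IDs,
    so the (unmodified) transformer can consume speech input."""
    return [cluster_to_token[c] for c in cluster_sequence]
```

In practice the repurposed embeddings would then be retrained on audio data during fine-tuning; here the mapping alone conveys the mechanism.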
Results and Discussion
One of the paper's significant findings is the efficacy of reinforcement learning for optimizing performance on disordered speech recognition. The research demonstrates that reward signals based on both Word Error Rate (WER) and Meaning Preservation (MP) scores considerably improve the model's adaptability to disordered speech. Notably, assigning equal weights to the two metrics during RLHF improves semantic accuracy without substantially compromising syntactic accuracy. This insight highlights the potential of balancing structural accuracy with semantic integrity in ASR systems.
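A combined reward of this kind can be sketched as an equal-weighted blend of a WER-based term and an MP score. The WER implementation below (word-level edit distance) is standard, but the blending form, function names, and the assumption that MP arrives as a score in [0, 1] are illustrative, not the paper's exact recipe.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1] / max(len(ref), 1)


def combined_reward(reference, hypothesis, mp_score, wer_weight=0.5):
    """Blend syntactic accuracy (1 - WER) with semantic accuracy (MP).

    wer_weight=0.5 corresponds to the equal-weighting setting
    highlighted in the results.
    """
    syntactic = 1.0 - word_error_rate(reference, hypothesis)
    return wer_weight * syntactic + (1.0 - wer_weight) * mp_score
```

The `wer_weight` knob makes the trade-off discussed above explicit: pushing it toward 1 rewards verbatim accuracy, while pushing it toward 0 rewards transcripts that preserve meaning even when the surface form diverges.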
The experiments reveal that the resulting LLM-ASR model outperforms conventional fine-tuning techniques, especially on the atypical speech patterns characteristic of speech disorders. The methodology achieves statistically significant reductions in error rates across various test scenarios, particularly on test sets with higher levels of speech-impairment severity.
Implications and Future Directions
From a theoretical standpoint, this work expands the capabilities of transformer-based LLMs, demonstrating their potential beyond textual tasks to audio-intensive applications. The hybrid approach of integrating discrete audio embeddings into LLMs without architectural changes offers a new dimension for multimodal AI systems, particularly benefiting accessibility-focused technology. By effectively utilizing RLHF, the research addresses a crucial gap in aligning machine-generated transcripts with human semantic intent, a common challenge in ASR systems.
This approach suggests several avenues for future research. Scaling these methods to larger and more diverse datasets, and testing across different languages and speech varieties, could help generalize the findings. Exploring advanced techniques for audio token discretization and integrating alternative reward models may further enhance performance and extend the practicality of these systems in real-world applications.
In conclusion, the authors have provided a compelling alternative to traditional ASR tuning strategies, showcasing how LLMs can be successfully adapted to process and understand complex speech scenarios through reinforcement learning. This paper lays a foundation for future studies aiming to optimize ASR systems under challenging linguistic conditions, thus holding promise for broadening the utility of LLMs in inclusive and accessible technological solutions.