ESPnet-SpeechLM: An Open Speech Language Model Toolkit (2502.15218v2)

Published 21 Feb 2025 in cs.CL, cs.SD, and eess.AS

Abstract: We present ESPnet-SpeechLM, an open toolkit designed to democratize the development of speech language models (SpeechLMs) and voice-driven agentic applications. The toolkit standardizes speech processing tasks by framing them as universal sequential modeling problems, encompassing a cohesive workflow of data preprocessing, pre-training, inference, and task evaluation. With ESPnet-SpeechLM, users can easily define task templates and configure key settings, enabling seamless and streamlined SpeechLM development. The toolkit ensures flexibility, efficiency, and scalability by offering highly configurable modules for every stage of the workflow. To illustrate its capabilities, we provide multiple use cases demonstrating how competitive SpeechLMs can be constructed with ESPnet-SpeechLM, including a 1.7B-parameter model pre-trained on both text and speech tasks, across diverse benchmarks. The toolkit and its recipes are fully transparent and reproducible at: https://github.com/espnet/espnet/tree/speechlm.

Summary

  • The paper introduces ESPnet-SpeechLM, an open-source toolkit designed to streamline the development of Speech Language Models and voice-driven applications.
  • ESPnet-SpeechLM offers a cohesive workflow from data preprocessing to evaluation, supporting numerous tasks and integrating tools from the text LLM ecosystem.
  • Performance demonstrations show ESPnet-SpeechLM achieving competitive results in tasks like Automatic Speech Recognition and Text-to-Speech compared to existing models.

The paper introduces ESPnet-SpeechLM, an open-source toolkit designed to streamline the development of speech language models (SpeechLMs) and voice-driven agentic applications by framing speech processing tasks as universal sequential modeling problems. The toolkit offers a cohesive workflow encompassing data preprocessing, pre-training, inference, and task evaluation. Users define task templates and configure key settings for SpeechLM development, and the toolkit provides configurable modules for each stage, ensuring flexibility, efficiency, and scalability.

The ESPnet-SpeechLM toolkit builds upon prior work in two main directions: the text LLM ecosystem, and open-sourced SpeechLMs and speech toolkits. Popular development tools from the text LLM ecosystem, such as DeepSpeed and FlashAttention, are integrated, and the toolkit is presented as a supplement to text LLM training frameworks, which provide only limited support for speech features. It connects SpeechLM research with speech processing techniques in the open-source community, building upon the existing ESPnet codebase to leverage community efforts and enable comparisons with non-SpeechLM work.

The toolkit supports a variety of features, including:

  • Task Templates: TextLM, AudioLM, Text-to-Speech, Automatic Speech Recognition, Machine Translation, Speech-to-Text Translation, Speech-to-Speech Translation, Text-to-Audio, Text-to-Music, Audio Caption, Singing Voice Synthesis, Speech Enhancement, Target Speaker Extraction, and Visual TTS (Text-to-Speech).
  • Tokenization:
    • Text: Subword models (SentencePiece, HuggingFace Tokenizers), G2P (~30 choices).
    • Audio: Codec (ESPnet-Codec, DAC, EnCodec, UniAudio), SSL (self-supervised learning) tokenizers (XEUS, S3PRL, FairSeq), and combined Codec_SSL.
    • Others: Music Score, Vision Token, Classification Label, Speaker Identity, and LLM Embeddings.
  • Modeling and Training:
    • Transformer Body: ESPnet Builtin, HuggingFace AutoModelForCausalLM.
    • Multi-Stream LM: Vall-E, MultiScale-Transformer, Parallel Interleave, Delay Interleave.
    • Efficiency: DeepSpeed, FlashAttention, Liger-Kernel.
  • Inference: Greedy Search, Beam Search, Top-k Sampling, Top-p Sampling (see the sampling sketch after this list).
  • Evaluation: VERSA, with 61 evaluation metrics for speech and audio.
  • Sharing: Task Template (ESPnet GitHub Repository), Datasets and Models (ESPnet HuggingFace Hub).
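
To make the decoding options concrete, the following is a minimal sketch of top-k/top-p (nucleus) sampling over a single logits vector. It is a generic NumPy illustration of the technique, not ESPnet-SpeechLM's actual implementation; the function name and default values are invented for the example.

```python
import numpy as np

def sample_top_k_top_p(logits: np.ndarray, k: int = 50, p: float = 0.9,
                       temperature: float = 1.0, rng=None) -> int:
    """Sample one token id after top-k and top-p filtering."""
    rng = rng or np.random.default_rng()
    logits = logits / temperature

    # Top-k: keep only the k highest-scoring tokens.
    k = min(k, logits.size)
    kth_value = np.sort(logits)[-k]
    logits = np.where(logits < kth_value, -np.inf, logits)

    # Softmax over the surviving logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p: keep the smallest set of tokens whose cumulative mass reaches p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    keep = order[:cutoff]

    # In a multimodal vocabulary, tokens of other modalities would be masked
    # out before sampling (the heuristic filtering noted below).
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))

# Toy usage: sample from a random 10-token vocabulary.
token_id = sample_top_k_top_p(np.random.default_rng(0).normal(size=10))
```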

The paper details the SpeechLM workflow, which begins with single-task scenarios and extends to multitask training. The task template defines the composition of the training sequence, specifying the conditions, targets, item names, and tokenizers for each data item. Preprocessing involves offline tokenization, handled automatically by ESPnet-SpeechLM, which generates a unified data.json file for each dataset and constructs a joint vocabulary. Training behavior is specified by a configuration file that supports flexible model architectures, multi-stream LM implementations, custom loss weights, and reinforcement learning from human feedback (RLHF). Inference methods include greedy search, beam search, and top-k/top-p sampling, with heuristics to filter out tokens from other modalities. Evaluation scripts adopt VERSA, a collection of more than 60 speech and audio evaluation metrics. Multitasking is achieved by fusing training sequences from different tasks within mini-batches, with adjustable sampling ratios.
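
As a concrete illustration of the task-template idea, here is a hypothetical Python rendering of an ASR template. The field names and tokenizer labels are invented for the example and modeled on the description above, not on the toolkit's exact schema.

```python
# Hypothetical ASR task template: each data item is declared as a condition
# (model input) or a target (model output), with an item name and the
# tokenizer applied to it. All field names here are illustrative only.
asr_template = {
    "task": "asr",
    "conditions": [
        {"name": "wav", "tokenizer": "codec_ssl"},       # input speech
    ],
    "targets": [
        {"name": "text", "tokenizer": "sentencepiece"},  # transcript
    ],
}

# Offline preprocessing would tokenize every item and emit one unified
# data.json per dataset, roughly of the form:
#   {"utt0001": {"wav": [312, 77, ...], "text": [5, 942, ...]}, ...}
# while a joint vocabulary is built across all tokenizers.
```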

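The multitask fusion step can likewise be sketched. Below is a minimal, self-contained illustration of drawing mini-batches across tasks with adjustable sampling ratios; the data pools and ratio values are invented for the example.

```python
import random

# Toy per-task pools; in practice these would be tokenized sequences loaded
# from each dataset's data.json. All names here are hypothetical.
task_data = {
    "asr":    [f"asr_seq_{i}" for i in range(100)],
    "tts":    [f"tts_seq_{i}" for i in range(100)],
    "textlm": [f"textlm_seq_{i}" for i in range(100)],
}
ratios = {"asr": 0.4, "tts": 0.3, "textlm": 0.3}  # adjustable sampling ratios

def draw_minibatch(batch_size: int = 8):
    """Fuse training sequences from different tasks into one mini-batch,
    choosing each slot's task in proportion to the configured ratios."""
    tasks = random.choices(list(ratios), weights=list(ratios.values()),
                           k=batch_size)
    return [random.choice(task_data[t]) for t in tasks]

batch = draw_minibatch()  # e.g. ['asr_seq_17', 'textlm_seq_3', ...]
```
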
The toolkit's performance is demonstrated through several use cases:

  • ASR (Automatic Speech Recognition): Achieves results comparable to Whisper-v3-large and OWSM v3.1-medium on English, despite having fewer parameters. For instance, on the LS-Clean dataset [librispeech], ESPnet-SpeechLM attains a 1.9% word error rate (WER), outperforming Whisper-small (3.3%) and OWSM v3.1-small (2.5%).
  • TTS (Text-to-Speech): Achieves competitive performance on LibriSpeech Test-Clean, with a WER of 3.1%, a speaker similarity (SPK_SIM) of 0.55, and a proxy MOS (Mean Opinion Score) of 4.03.
  • Multi-task Pre-trained SpeechLM: A 1.7B-parameter model covering ASR, TTS, TextLM, and AudioLM tasks. It achieves WERs of 2.8% / 5.9% for ASR; a WER of 6.0%, a SPK_SIM of 0.701, and a proxy MOS of 3.99 for TTS; and a perplexity of 16.4 for AudioLM. Its TextLM ability is close to LLaMA-3.2-1B [llama].

The authors express interest in future development of the ESPnet-SpeechLM toolkit, such as supporting more tokenization methods, task templates, modeling options, and LLM inference engines. They also plan to apply the toolkit to SpeechLM research, focusing on larger-scale models, paralinguistic information capture, conversational interaction, speech-based instruction following, and agent-like behaviors. Plans also include real-time and duplex designs, RLHF for SpeechLMs, and SpeechLMs trained from a flat start.

The multi-task pre-training data includes 213k hours of open-source speech and 115.69 billion tokens of text, drawn from general web content (FineWeb-EDU), multilingual text (the Multilingual CC News dataset), and code-centric data (the OpenCoder Annealing Corpus).
