ESPnet-SpeechLM Toolkit: Unified Speech & Text
- ESPnet-SpeechLM is an open-source toolkit that formulates both speech and text tasks as auto-regressive sequential prediction over discrete token streams.
- It features a modular design with YAML/JSON configurations, supporting diverse tokenization methods and scalable models from 360M to 1.7B parameters.
- Benchmark results in ASR, TTS, and singing voice synthesis demonstrate its effective multitask learning and reproducibility in speech processing.
ESPnet-SpeechLM is an open-source, highly modular toolkit for developing speech language models (SpeechLMs) and building voice-driven agentic applications. Built on ESPnet, it casts diverse speech and text tasks as a single auto-regressive sequential prediction problem over discrete token streams, which enables efficient multitask learning, standardized workflows, and reproducibility across speech processing domains (Tian et al., 21 Feb 2025). It has also been extended to specialized domains, including singing voice synthesis, which indicates that the framework adapts beyond its original tasks (Zhao et al., 16 Dec 2025).
1. Foundational Design and Problem Formulation
ESPnet-SpeechLM adopts the paradigm that both speech and text tasks can be formulated as auto-regressive modeling of discrete token sequences. Inputs ("conditions") and "targets" are tokenized into discrete streams, which are concatenated with explicit task and tokenizer markers into a unified multi-stream sequence $s = (s_1, \dots, s_T)$. In this sequence, tokenizer-indicator (TC) markers record which tokenizer produced each segment, and padding tokens keep the streams aligned.
A decoder-only Transformer predicts each frame from its predecessors, $p_\theta(s_t \mid s_{<t})$, and is trained with the standard cross-entropy loss:
$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{n=1}^{N} w_n \log p_\theta\left(s_t^{n} \mid s_{<t}\right)$$
where $N$ is the number of streams (e.g., codec, SSL), and $w_n$ are per-stream weights for balancing frame/token disparities.
Multitask learning emerges naturally: batches are sampled from a mixture of task-specific datasets and their losses are combined as
$$\mathcal{L}_{\text{total}} = \sum_{k} \lambda_k \mathcal{L}_k$$
where $\lambda_k$ denotes the relative weighting or sampling frequency of task $k$.
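The weighted multi-stream loss and the multitask combination described above can be sketched in plain Python. This is a minimal illustration, not the toolkit's implementation: the function names, the nested-list layout `logits[t][n]`, and the unnormalized sum over frames are all assumptions made for the example.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(x - m) for x in scores))
    return [x - log_z for x in scores]


def multistream_ce(logits, targets, stream_weights):
    """Weighted cross-entropy summed over frames and streams.

    logits[t][n]      : vocabulary scores for frame t, stream n
    targets[t][n]     : target token id for frame t, stream n
    stream_weights[n] : per-stream weight applied to stream n
    """
    total = 0.0
    for frame_logits, frame_targets in zip(logits, targets):
        for n, (scores, tok) in enumerate(zip(frame_logits, frame_targets)):
            total -= stream_weights[n] * log_softmax(scores)[tok]
    return total


def multitask_loss(task_losses, task_weights):
    """Combine per-task losses using relative task weights."""
    return sum(task_weights[k] * loss for k, loss in task_losses.items())


# Two frames, two streams, uniform scores over a 4-token vocabulary.
uniform = [0.0, 0.0, 0.0, 0.0]
logits = [[uniform, uniform], [uniform, uniform]]
targets = [[0, 1], [2, 3]]
loss = multistream_ce(logits, targets, stream_weights=[1.0, 0.5])
combined = multitask_loss({"asr": loss, "tts": 2.0}, {"asr": 1.0, "tts": 0.5})
```

With uniform scores every token costs `log(4)`, so the example loss is `2 * (1.0 + 0.5) * log(4)`, which makes the effect of the stream weights easy to check by hand.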
The standardized workflow comprises four phases: data preprocessing (tokenization, construction of data.json), (multitask) pretraining, inference (greedy search, beam search, or top-k/top-p sampling, each with modality filters), and task-specific evaluation (using VERSA scripts).
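As an illustration of sampling with a modality filter, the sketch below restricts top-k sampling to an allowed set of token ids. The vocabulary layout (ids 0-3 for text, 4-7 for codec tokens) and the function name are hypothetical, chosen only to make the example concrete.

```python
import math
import random


def filtered_top_k_sample(logits, allowed_ids, k, rng):
    """Sample one token id from the k highest-scoring allowed tokens."""
    candidates = sorted(allowed_ids, key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in candidates)
    # Softmax weights restricted to the surviving candidates.
    weights = [math.exp(logits[i] - m) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical layout: ids 0-3 are text tokens, ids 4-7 are codec tokens.
logits = [5.0, 1.0, 0.0, 2.0, 0.5, 3.0, 0.1, 0.2]
codec_ids = range(4, 8)
rng = random.Random(0)

greedy_codec = filtered_top_k_sample(logits, codec_ids, k=1, rng=rng)
sampled_codec = filtered_top_k_sample(logits, codec_ids, k=2, rng=rng)
```

Even though a text token has the highest raw score, the filter guarantees that only codec ids can be emitted; with `k=1` the procedure reduces to greedy decoding inside the allowed set.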
2. Modular Architecture and Configurability
All core ESPnet-SpeechLM components are configured via a single extensible YAML/JSON recipe, providing a reproducible and highly flexible experimentation environment.
- Data Loader: Supports one or more data.json files per task, with configurable sampling ratios.
- Tokenizers: Supports ~30 text tokenization strategies (SentencePiece, HuggingFace, G2P), audio tokenizers (ESPnet-Codec, DAC, EnCodec, and SSL features from XEUS, Fairseq, S3PRL), and additional modalities (music, vision, labels, speaker IDs, LLM embeddings).
- Model Definition: Allows ESPnet’s native Transformer or integration with HuggingFace’s AutoModelForCausalLM; supports multi-stream LM wrappers (Delay-interleave, Parallel-interleave, MultiScale, VALL-E style) and loss choices (weighted cross-entropy, DPO-based RLHF).
- Optimization/Training: Employs DeepSpeed, FlashAttention, and Liger-Kernel for throughput and scale, with extensive hyperparameter support (LR schedules, warm-up, gradient accumulation).
- Inference: Implements constrained generation, so only valid modality tokens are generated according to stream-indicator logic.
- Evaluation: Leverages VERSA for over 60 speech/audio automatic metrics.
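To make the delay-interleave option listed above more concrete, the sketch below applies the commonly used delay pattern, in which stream `n` is shifted right by `n` frames and the gaps are padded. This is an assumption about the general technique; the toolkit's own wrapper may differ in padding conventions and details, and `PAD` is a hypothetical padding id.

```python
PAD = -1  # hypothetical padding id used only in this sketch


def delay_interleave(frames, pad=PAD):
    """Shift stream n right by n frames.

    frames[t][n] is the token of stream n at frame t. The result has
    T + N - 1 rows so that every delayed token still fits.
    """
    n_frames = len(frames)
    n_streams = len(frames[0])
    out = [[pad] * n_streams for _ in range(n_frames + n_streams - 1)]
    for t in range(n_frames):
        for n in range(n_streams):
            out[t + n][n] = frames[t][n]
    return out


def undo_delay(delayed):
    """Invert delay_interleave and recover the aligned frames."""
    n_streams = len(delayed[0])
    n_frames = len(delayed) - n_streams + 1
    return [[delayed[t + n][n] for n in range(n_streams)] for t in range(n_frames)]


frames = [[1, 2, 3], [4, 5, 6]]
delayed = delay_interleave(frames)
# delayed == [[1, -1, -1], [4, 2, -1], [-1, 5, 3], [-1, -1, 6]]
restored = undo_delay(delayed)
```

The delay lets a single decoding step emit one token per stream while later streams of a frame can still condition on the earlier streams of that same frame.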
Task templates define conditions and targets for new task types, supporting new applications with minimal YAML modifications and corresponding directory alignment.
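The idea of a task template can be sketched as a small data structure that names the conditions and targets of each task. Everything below is hypothetical: the class names, the modality labels such as `codec_ssl` and `text_bpe`, and the marker strings are illustrative and do not reproduce the toolkit's actual template schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str      # data entry the item reads from, e.g. "wav.scp" or "text"
    modality: str  # illustrative label for the tokenizer family


@dataclass(frozen=True)
class TaskTemplate:
    conditions: tuple  # items given to the model as input
    targets: tuple     # items the model must predict


# Hypothetical templates for two of the tasks named in the text.
TEMPLATES = {
    "asr": TaskTemplate((Item("wav.scp", "codec_ssl"),), (Item("text", "text_bpe"),)),
    "tts": TaskTemplate((Item("text", "g2p"),), (Item("wav.scp", "codec_ssl"),)),
}


def layout(task):
    """Return the segment order: task marker, then conditions, then targets."""
    template = TEMPLATES[task]
    sequence = [f"<{task}>"]
    for item in template.conditions + template.targets:
        sequence += [f"<{item.modality}>", item.name]
    return sequence


asr_layout = layout("asr")
```

Under this sketch, adding a new task amounts to adding one more entry to `TEMPLATES`, which mirrors the "minimal modification" workflow the text describes.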
3. Scalability and Model Development
ESPnet-SpeechLM supports models from compact (∼360M parameters) to large-scale (1.7B parameters). The multitask 1.7B SpeechLM is jointly trained on four major tasks: ASR (speech→text), TTS (text→speech), TextLM (text continuation), and AudioLM (audio continuation), using the composite loss:
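$$\mathcal{L}_{\text{total}} = \sum_{k \in \{\text{ASR},\, \text{TTS},\, \text{TextLM},\, \text{AudioLM}\}} \lambda_k \mathcal{L}_k$$
where $\lambda_k$ is the relative weight of task $k$. To equalize contributions from modalities, the relative per-stream weights are text : SSL : codec = 1 : 0.5 : 0.0625.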
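A short worked example shows how these ratios can balance modalities. It assumes, purely for illustration, that one audio frame carries one SSL stream and eight codec streams; the stream count for the 1.7B model is not stated in this section.

```python
# Relative per-stream weights quoted in the text.
WEIGHTS = {"text": 1.0, "ssl": 0.5, "codec": 0.0625}


def position_weight(stream_kinds):
    """Total loss weight contributed by one sequence position."""
    return sum(WEIGHTS[kind] for kind in stream_kinds)


# Assumption for this sketch: 1 SSL stream + 8 codec streams per audio frame.
audio_frame = ["ssl"] + ["codec"] * 8
text_token = ["text"]

audio_weight = position_weight(audio_frame)  # 0.5 + 8 * 0.0625
text_weight = position_weight(text_token)
```

Under that assumption the eight codec streams together weigh as much as the single SSL stream, and a full audio frame weighs the same as one text token.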
Scaling experiments used 213k hours of speech (∼33B audio frames) and 115B text tokens over two full epochs on 24×H100 GPUs, achieving >35% model FLOPs utilization (MFU) owing to DeepSpeed and FlashAttention optimizations. All training details, configurations, and scripts are version-controlled for transparent, repeatable workflows (Tian et al., 21 Feb 2025).
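The relationship between MFU, hardware, and training time can be sketched with a back-of-the-envelope calculation. Three assumptions go beyond the text: training compute follows the common `6 * parameters * tokens` rule of thumb, each audio frame counts as one token, and the per-GPU peak is taken as roughly 989 TFLOP/s (dense BF16 on an H100 SXM). The result is an order-of-magnitude estimate, not a reported figure.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


def mfu(achieved_flops_per_s, n_gpus, peak_flops_per_gpu):
    """Model FLOPs utilization: achieved throughput over aggregate peak."""
    return achieved_flops_per_s / (n_gpus * peak_flops_per_gpu)


def wall_clock_seconds(total_flops, n_gpus, peak_flops_per_gpu, utilization):
    """Training time implied by a given utilization level."""
    return total_flops / (n_gpus * peak_flops_per_gpu * utilization)


# Figures quoted in the text.
N_PARAMS = 1.7e9
TOKENS_PER_EPOCH = 33e9 + 115e9  # audio frames + text tokens
EPOCHS = 2
N_GPUS = 24

# Assumption: dense BF16 peak of one H100 SXM, in FLOP/s.
H100_PEAK = 989e12

total = training_flops(N_PARAMS, TOKENS_PER_EPOCH * EPOCHS)
seconds = wall_clock_seconds(total, N_GPUS, H100_PEAK, utilization=0.35)
days = seconds / 86400
```

Because the text reports MFU above 35%, the value of `days` computed here is best read as a rough upper estimate under these assumptions.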
4. Benchmarks and Empirical Performance
ESPnet-SpeechLM achieves competitive or superior results compared to contemporary models across speech and natural language tasks. For single-task ASR, a 442M-parameter model achieves lower WER than Whisper-medium on standard English test sets. For TTS on LibriSpeech Test-Clean, ESPnet-SpeechLM outperforms Parler-TTS and CosyVoice on WER and matches or surpasses them on speaker similarity (SPK_SIM) and proxy mean-opinion-score (Proxy-MOS). The multitask 1.7B model is competitive with or exceeds models such as LLaMA-3.2, VoxtLM, and GLM-4-Voice in ASR, TTS, and text metrics, while requiring fewer parameters in some cases:
| Task/Metric | ESPnet-SpeechLM (1.7B) | GLM-4-Voice (9B) | VoxtLM (1.3B) |
|---|---|---|---|
| ASR WER % | 2.8 | 2.8 | 2.7 |
| TTS WER % | 6.0 | 5.6 | – |
| SPK_SIM | 0.701 | – | – |
| Proxy-MOS | 3.99 | – | – |
| AudioLM PPL ↓ | 16.4 | – | 40.9 |
These results underscore that unified sequential modeling effectively scales to new modalities and tasks without excessive model growth (Tian et al., 21 Feb 2025).
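For reference, the WER figures in the table are word error rates. A minimal sketch of the standard definition (word-level edit distance divided by reference length) follows; the evaluation scripts used for the reported numbers may apply text normalization that this sketch omits.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Single-row dynamic programming over the Levenshtein distance table.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
score = wer("the cat sat on the mat", "the cat sit on mat")
```

Multiplying the returned ratio by 100 gives the percentage form used in the table.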
5. Domain Adaptation: Singing Voice Synthesis
The ESPnet-SpeechLM framework is extensible to non-conventional speech applications, as exemplified by its application to singing voice synthesis (SVS) (Zhao et al., 16 Dec 2025). In this setting, both music score conditions (phoneme, pitch, duration) and singing waveforms are tokenized for modeling. The SVS pipeline entails:
- Tokenization:
  - Music score: Encoded as phoneme (Vph ≈ 70 for Mandarin) and pitch (Vpi = 128/256 for MIDI) sequences at a 50 Hz frame rate.
  - Waveform: Discretized using both codec tokens (n_codec = 8, Vcodec = 1024) and semantic SSL tokens (Vssl = 1024), producing 9 parallel streams per frame.
- Multi-stream Transformer LM:
  - Architecture: 24 layers, model dimension 1024, 16 attention heads, FFN size 4096, dropout 0.1.
  - Loss: Multi-stream cross-entropy, accumulated over all streams of each frame.
  - Optimizer: Adam with ZeRO-2.
- Conditional Flow Matching (CFM) for Mel-Spectrograms:
  - Maps Gaussian noise to mels given LM-predicted codec embeddings and pitch.
  - Training objective: regression of the predicted velocity field onto the target velocity along the noise-to-mel path.
  - U-Net style architecture, 10 conv blocks, FiLM conditioning.
- Mel-to-wave Vocoder:
  - HiFi-GAN trained/fine-tuned on target singing mels with the usual HiFi-GAN combination of adversarial, feature-matching, and mel-spectrogram reconstruction losses.
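The frame-level score encoding and the stream count above can be illustrated with a small sketch. The note format `(phoneme, midi_pitch, duration_seconds)` and the function names are hypothetical, and the sketch assumes the waveform tokens run at the same 50 Hz frame rate as the score.

```python
FRAME_RATE_HZ = 50  # frame rate quoted in the text for the music score


def expand_score(notes, frame_rate=FRAME_RATE_HZ):
    """Repeat each note's phoneme and pitch for its duration in frames.

    notes: list of (phoneme, midi_pitch, duration_seconds) tuples.
    Returns two frame-aligned lists: phonemes and pitches.
    """
    phonemes, pitches = [], []
    for phoneme, pitch, duration in notes:
        n_frames = round(duration * frame_rate)
        phonemes.extend([phoneme] * n_frames)
        pitches.extend([pitch] * n_frames)
    return phonemes, pitches


def tokens_per_second(n_codec=8, n_ssl=1, frame_rate=FRAME_RATE_HZ):
    """Tokens generated per second of audio across all parallel streams."""
    return (n_codec + n_ssl) * frame_rate


# Two notes: 0.1 s on MIDI pitch 60, then 0.3 s on MIDI pitch 62.
phonemes, pitches = expand_score([("n", 60, 0.1), ("i", 62, 0.3)])
rate = tokens_per_second()
```

With 8 codec streams and 1 SSL stream, each second of singing corresponds to 450 target tokens under the stated frame-rate assumption.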
This adaptation demonstrates model extensibility to multi-modality settings with bespoke tokenizations (Zhao et al., 16 Dec 2025).
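The conditional flow matching stage of the SVS pipeline can be illustrated with a toy sketch. It uses the common straight-line probability path, where the interpolant is `(1 - t) * x0 + t * x1` and the target velocity is `x1 - x0`; whether the cited system uses exactly this variant is an assumption, and the vectors here are plain Python lists standing in for mel frames.

```python
import random


def cfm_training_pair(x1, t, rng):
    """Build one training example on the straight-line path.

    x1: target vector (stands in for a mel-spectrogram frame)
    t : time in [0, 1]
    Returns the noise sample x0, the interpolant x_t, and the target velocity.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x0, xt, velocity


def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    squared = [(p - q) ** 2 for p, q in zip(predicted_velocity, target_velocity)]
    return sum(squared) / len(squared)


rng = random.Random(0)
mel = [0.2, -1.0, 0.7]
x0, xt, velocity = cfm_training_pair(mel, t=0.5, rng=rng)
perfect = cfm_loss(velocity, velocity)
```

In a real model the predicted velocity would come from the conditioned network (here, the U-Net with FiLM conditioning), and sampling would integrate that velocity from noise at `t = 0` to a mel-spectrogram at `t = 1`.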
6. Workflow, Reproducibility, and Open Resources
ESPnet-SpeechLM enforces transparency and reproducibility. Every step, from data preparation and tokenization through training to evaluation, is encoded in version-controlled shell scripts, YAML, and JSON config files. Researchers can fully reproduce published experiments or rapidly develop new tasks by creating minimal task templates and aligning folder structure. All recipes, configs, data prep scripts, and pretrained models are provided via the ESPnet GitHub repository and HuggingFace Hub.
A canonical workflow is:
- Install toolkit and dependencies
- Prepare dataset structured for ESPnet convention (e.g., wav.scp, text)
- Define task in YAML (task_templates)
- Preprocess data to data.json
- Train (single/multitask) model
- Run inference (constrained decoding)
- Evaluate with VERSA scripts
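The dataset-preparation step relies on Kaldi-style tables such as wav.scp and text, where each line holds an utterance id followed by a value. The sketch below parses two such tables and joins them by utterance id. The output records are illustrative only and do not reproduce the toolkit's data.json schema; the file paths are made up.

```python
def parse_kaldi_table(lines):
    """Parse lines of the form '<utt_id> <value>' into a dict."""
    table = {}
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        key = parts[0]
        table[key] = parts[1] if len(parts) > 1 else ""
    return table


def join_entries(wav_scp, text):
    """Keep utterances present in both tables, sorted by utterance id."""
    common = sorted(wav_scp.keys() & text.keys())
    return [{"utt_id": k, "wav": wav_scp[k], "text": text[k]} for k in common]


# Made-up example content for the two tables.
wav_lines = ["utt1 /data/audio/utt1.wav", "utt2 /data/audio/utt2.wav", ""]
text_lines = ["utt1 hello world", "utt3 unmatched transcript"]

entries = join_entries(parse_kaldi_table(wav_lines), parse_kaldi_table(text_lines))
```

Joining on the shared utterance id drops unmatched entries, which is a simple way to catch mismatched tables before tokenization.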
Each component is designed so that new modalities, tokenizers, or tasks can be added with concise changes, supporting rapid scaling and extensibility. Full transparency and reproducibility are explicit design tenets (Tian et al., 21 Feb 2025).
7. Impact and Research Directions
ESPnet-SpeechLM’s unification of speech and text tasks under a single auto-regressive, token-centric design creates a common platform for speech recognition, synthesis, translation, and more. The universal sequential modeling formulation, flexible multi-stream tokenization, and efficient parallelism mechanisms enable research on large-scale, multitask, and multi-modal SpeechLMs without codebase fragmentation. Successful adaptation to domains such as singing voice synthesis with minimal overhead suggests further applicability to music, cross-modal, and agentic settings (Tian et al., 21 Feb 2025, Zhao et al., 16 Dec 2025). A plausible implication is broader convergence across audio, speech, and language processing within a single, extensible modeling framework.