Fish Audio S2: Open-Source Multilingual TTS
- Fish Audio S2 is an open-source, multilingual text-to-speech system designed for multi-speaker, multi-turn generation with precise, instruction-following vocal control.
- It integrates advanced modules including a hierarchical audio tokenizer, dual autoregressive generators, and a reinforcement learning-based reward alignment pipeline.
- The system supports ultra-low latency streaming and long-form coherence, achieving state-of-the-art performance on benchmarks and facilitating flexible, expressive speech synthesis.
Fish Audio S2 is an open-source, multilingual text-to-speech (TTS) system architected to support multi-speaker, multi-turn generation and fine-grained, instruction-following vocal control through natural-language descriptions. The system integrates a multi-stage training and data curation pipeline, reinforcement learning (RL) for reward-based alignment, and a production-grade SGLang-based inference backend. The design objective prioritizes streaming synthesis at ultra-low latency, long-form coherence, and precise prosodic and stylistic control, all without necessitating per-voice retraining. Fish Audio S2 advances the open-source TTS frontier by releasing code, model weights, fine-tuning scripts, and interactive demos under unrestricted terms (Liao et al., 9 Mar 2026).
1. Architecture and Major Components
The Fish Audio S2 system comprises four principal modules:
- Audio Tokenizer (encoder):
- Implements a Descript Audio Codec (DAC) foundation with a 10-layer hierarchical Residual Vector Quantizer (RVQ). It utilizes causal convolutions and a sliding-window Transformer bottleneck for 44.1 kHz output at 21 Hz frame rate, with a total parameter count of 446M.
- Dual-Autoregressive Generator (decoder):
- Slow AR: Qwen3-4B (~4B parameters), autoregressively models mixed textual and semantic audio token streams.
- Fast AR: 4-layer Transformer (~50M parameters) autoregressively generates the nine remaining acoustic RVQ layers, conditioned on the output of the slow AR module.
- Control and Reward Modules:
- Reinforcement-Learning Alignment:
- A multi-reward instantiation of Group Relative Policy Optimization (GRPO) aligns generation quality along axes of semantic accuracy, acoustic preference, and speaker similarity.
| Module | Model/Architecture | Parameter Count |
|---|---|---|
| Audio Tokenizer | DAC (RVQ, Transformer) | 446M |
| Slow AR Generator | Qwen3-4B | ≈4B |
| Fast AR Generator | 4-layer Transformer | ~50M |
| ASR/Reward | Qwen3-Omni-30B-A3B; w2v-BERT + MLP head | 30B (ASR); w2v-BERT + MLP head (reward) |
The system eliminates the need for explicit/local control adapters by embedding vocal and stylistic cues inline within the token stream, leveraging joint multimodal modeling for controllable synthesis.
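The dual-autoregressive cascade above can be sketched as a decoding loop: the slow AR emits one semantic (layer-0) token per frame, and the fast AR fills the nine remaining RVQ layers conditioned on it. In this minimal, runnable sketch the real models (Qwen3-4B and the 4-layer Transformer) are replaced by deterministic stubs; only the control flow and the token-stream shapes mirror the description above.

```python
# Hypothetical sketch of Fish Audio S2's dual-AR decoding loop.
# The slow/fast models are stubs; shapes follow the stated architecture:
# a 10-layer RVQ at 21 Hz frame rate, 4096 semantic tokens.

NUM_RVQ_LAYERS = 10   # DAC-style hierarchical RVQ depth
FRAME_RATE_HZ = 21    # tokenizer frame rate for 44.1 kHz output
SEMANTIC_VOCAB = 4096

def slow_ar_step(text_tokens, semantic_history):
    """Stub for the slow AR (Qwen3-4B): predicts the next semantic token
    from the mixed text + semantic-token context."""
    return (len(text_tokens) + sum(semantic_history)) % SEMANTIC_VOCAB

def fast_ar_step(semantic_token):
    """Stub for the fast AR (4-layer Transformer): fills the 9 remaining
    acoustic RVQ layers, conditioned on this frame's semantic token."""
    frame = [semantic_token]
    for layer in range(1, NUM_RVQ_LAYERS):
        frame.append((semantic_token * layer) % 1024)  # one code per residual layer
    return frame

def synthesize(text_tokens, num_frames):
    """Generate `num_frames` complete RVQ frames (10 codes each)."""
    semantic_history, frames = [], []
    for _ in range(num_frames):
        s = slow_ar_step(text_tokens, semantic_history)
        semantic_history.append(s)
        frames.append(fast_ar_step(s))
    return frames

frames = synthesize([1, 2, 3], num_frames=FRAME_RATE_HZ)  # ~1 s of audio
```

The split keeps the expensive model on the low-rate semantic stream while the cheap model handles the 9x larger acoustic stream, which is what makes low-latency streaming feasible.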
2. Data Curation, Preprocessing, and Training Pipeline
Fish Audio S2's training relies on an integrated, multi-stage data pipeline:
- Stage 1: Source separation and VAD are employed to extract utterance-level segments from large-scale web video and podcast corpora.
- Stage 2: Low-quality or overlapped utterances are filtered out via scores from the trained speech-quality model.
- Stage 3: Each audio segment is annotated using the rich-transcription ASR; outputs serve as both supervised captions and reward signals for RL.
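The three curation stages can be sketched as a simple filter-then-annotate flow. The segmenter, speech-quality scorer, and rich-transcription ASR below are stand-in stubs (the paper does not expose their interfaces); only the staged filtering logic mirrors the pipeline described above.

```python
# Hypothetical sketch of the three-stage data curation pipeline.
# All three models are stubbed; the threshold value is an assumption.

def segment(recording):
    """Stage 1 stub: source separation + VAD yielding utterance segments."""
    return recording["utterances"]

def quality_score(utt):
    """Stage 2 stub: trained speech-quality model score in [0, 1]."""
    return utt["quality"]

def annotate(utt):
    """Stage 3 stub: rich-transcription ASR producing a supervised caption
    that also serves as an RL reward signal."""
    return {**utt, "caption": utt["text"].lower()}

def curate(recording, min_quality=0.6):
    kept = [u for u in segment(recording) if quality_score(u) >= min_quality]
    return [annotate(u) for u in kept]

corpus = {"utterances": [
    {"text": "HELLO WORLD", "quality": 0.9},
    {"text": "OVERLAPPED SPEECH", "quality": 0.3},  # dropped in stage 2
]}
clean = curate(corpus)
```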
The model is trained in three stages:
- Audio Tokenizer Pre-training: GAN-based, using multi-period, multi-resolution, and STFT discriminators, running for ~1M steps.
- Large-Scale Pre-training and Supervised Fine-Tuning: The base Qwen3-4B is extended with 4096 audio semantic tokens and structural control tokens. Training utilizes 10M hours of speech data across ≈80 languages, for a total of 500B tokens (with a 70:30 split of speech+audio to pure text).
- RL-Based Post-Training:
- Group-level GRPO without a value network.
- The reward is computed as $R = w_{\text{sem}} R_{\text{sem}} + w_{\text{ac}} R_{\text{ac}} + w_{\text{spk}} R_{\text{spk}}$, where $R_{\text{sem}}$ is derived from ASR confidence and instruction adherence, $R_{\text{ac}}$ from the speech-quality model, and $R_{\text{spk}}$ from the cosine similarity of a voiceprint model.
- KL divergence is computed using a CPU-resident LoRA swap (rank 16; MLP layers only).
Key hyperparameters include tokenizer pre-training for ~1M steps, context windows of 8192–16384 tokens, and group sizes of up to 16 for RL batch evaluation.
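The GRPO stage above can be illustrated with a short sketch: a weighted sum over the three reward axes, followed by group-relative normalization that replaces a learned value network. The weights and the exact functional form are assumptions; the paper specifies only the three axes (semantic accuracy, acoustic preference, speaker similarity).

```python
# Hedged sketch of the composite reward and GRPO-style group advantages.
# The equal weights are an assumption, not the paper's values.

def composite_reward(r_sem, r_ac, r_spk, w=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward axes, each assumed in [0, 1]."""
    return w[0] * r_sem + w[1] * r_ac + w[2] * r_spk

def group_relative_advantages(rewards):
    """Normalize rewards within a sampled group (mean 0, unit std),
    avoiding a separate value network as in GRPO."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# One group of sampled generations for the same prompt:
group = [composite_reward(0.9, 0.8, 0.7),
         composite_reward(0.5, 0.6, 0.4),
         composite_reward(0.7, 0.7, 0.6)]
adv = group_relative_advantages(group)
```

Advantages sum to zero within each group, so the policy update only needs relative preferences among samples, not absolute value estimates.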
3. Instruction-Following and Vocal Control
Fish Audio S2 encodes structural control and natural-language vocal instructions inline within the text input—examples include emotion ([angry]), prosody ([whispers]), and speaker changes. The model is trained to transduce these instructions to corresponding acoustic modulations at word-level granularity, supporting zero-shot control without per-voice retraining.
- Control tokens include both speaker markers and 4096 semantic audio tokens (derived from RVQ codebooks).
- Inference leverages SGLang and multi-token RadixCache keys to retain and reuse combined semantic and acoustic context over long sequences.
- Fine-grained instruction following is measured through a benchmarked Tag Activation Rate (TAR), along with acoustic naturalness and expressiveness scores across English and Chinese speech corpora, with major improvements over Fish Audio S1 (TAR, English: 0.626 → 0.881; Chinese: 0.942 → 0.984).
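The inline control format can be illustrated with a small prompt builder. Tag names like [angry] and [whispers] come from the examples above; the speaker-marker syntax and turn layout here are assumptions for illustration, not the model's exact template.

```python
# Illustrative sketch of inline vocal-control markup for multi-turn input.
# The <speaker:...> marker format is hypothetical.

def build_prompt(turns):
    """Interleave speaker markers and vocal-control tags inline with text,
    so control rides in the token stream rather than a separate adapter."""
    parts = []
    for speaker, tags, text in turns:
        tag_str = "".join(f"[{t}]" for t in tags)
        parts.append(f"<speaker:{speaker}> {tag_str}{text}")
    return "\n".join(parts)

prompt = build_prompt([
    ("alice", ["angry"], "Where were you last night?"),
    ("bob", ["whispers"], "I can explain everything."),
])
```

Because the tags sit at word-level positions in the text, the model can localize the requested modulation to the exact span that follows them.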
4. Inference Infrastructure and Production Deployment
Inference utilizes an SGLang-based backend that supports:
- Continuous batching, paged KV cache, CUDA graph replay, and RadixAttention for efficient context/prefix reuse.
- Customized I/O to handle interleaved text and acoustic tokens, with multi-token RadixCache keys maintaining semantic+acoustic state.
- CUDA MPS co-scheduling is implemented to run vocoder and LLM decoding in parallel.
On NVIDIA H200 hardware, measured performance is:
- Real-Time Factor (RTF): 0.195 (approximately 5× faster than real time).
- Time-To-First-Audio (TTFA): <100 ms.
- Token throughput: >3000 acoustic tokens/sec with RTF < 0.5 at high concurrency.
- Prefix-cache hit rate: 86.4% (facilitating reference-audio reuse in highly interactive settings).
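The latency metrics above have simple definitions worth making explicit: RTF is generation wall-clock time divided by synthesized audio duration (RTF < 1 means faster than real time, so 0.195 implies roughly 1/0.195 ≈ 5.1x), and TTFA is the delay until the first audio chunk is emitted. A minimal sketch:

```python
# Definitions behind the reported streaming metrics; the example
# timings are illustrative, not measured values.

def real_time_factor(gen_seconds, audio_seconds):
    """RTF = wall-clock generation time / duration of synthesized audio."""
    return gen_seconds / audio_seconds

def time_to_first_audio_ms(request_t, first_chunk_t):
    """TTFA in milliseconds: delay until the first audio chunk arrives."""
    return (first_chunk_t - request_t) * 1000.0

rtf = real_time_factor(gen_seconds=1.95, audio_seconds=10.0)       # 0.195
speedup = 1.0 / rtf                                                # ~5.1x
ttfa = time_to_first_audio_ms(request_t=0.0, first_chunk_t=0.095)  # 95 ms
```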
5. Evaluation and Comparative Results
Fish Audio S2 establishes new state-of-the-art results across several benchmarks:
| Benchmark | Metric & Task | Fish Audio S2 | Baseline |
|---|---|---|---|
| Seed-TTS-Eval (voice cloning) | WER (%), en/zh | 0.99 / 0.54 | 2.21 / 1.12 (CosyVoice3) |
| Minimax Multilingual (24 langs) | Languages with lowest WER | 11/24 | |
| Minimax Multilingual | Languages with highest SIM | 17/24 | |
| CV3-Eval (9 langs) | Avg. WER (%) | 3.01 | 3.96 (Fish Audio S1) |
| Long-Audio English | WER / SIM-mean | 4.38 / 0.523 | |
| Long-Audio Chinese | CER / SIM-mean | 5.95 / 0.557 | |
| Fish Audio Instruction Benchmark | TAR (En/Zh) | 0.881 / 0.984 | 0.626 / 0.942 (Fish Audio S1) |
| Fish Audio Instruction Benchmark | Naturalness (1–5, En/Zh) | 4.21 / 4.40 | 3.71 / 4.15 (Fish Audio S1) |
Audio Turing Test (LLM-juried): posterior mean 0.483 ± 0.068 (improves to 0.515 ± 0.061 with instruction rewriting). In emergent TTS evaluation across five dimensions, Fish Audio S2 achieves an overall win rate of 81.88% versus a gpt-4o-mini-tts baseline (Liao et al., 9 Mar 2026).
6. Open-Source Release and Accessibility
All code, pretrained weights, fine-tuning scripts, and an interactive inference engine are available at:
- GitHub: https://github.com/fishaudio/fish-speech
- Hugging Face: https://huggingface.co/fishaudio/s2-pro
- Interactive demos: https://fish.audio
Deployment utilizes production-ready APIs for streaming scenarios and supports both batch and fine-tuning usage with released scripts.
7. Significance and Research Context
Fish Audio S2 establishes a fully open, high-performance foundation for controllable, expressive, streaming TTS—addressing limitations of speaker adaptation, zero-shot style transfer, and fine-grained instruction following present in prior systems. The tight integration of large-scale RL-based alignment using composite, multi-axis reward modeling and scalable data curation sets a new reference for open-source multilingual TTS frameworks.
A plausible implication is enhanced adoption for research, real-time conversational agents, voice assistants, and accessibility technologies where nuanced vocal and stylistic control, rapid generation, and transparent scientific assessment are critical. Moreover, the model's architecture and open release enable direct extension for cross-modal or task-oriented speech synthesis pipelines (Liao et al., 9 Mar 2026).