
JoyTTS: LLM-Driven Voice Cloning

Updated 4 July 2025
  • JoyTTS is an end-to-end spoken chatbot framework that fuses a large language model with advanced text-to-speech for precise voice cloning.
  • The framework employs dual-modality tokenization and a two-phase training strategy to optimize dialog coherence and speech synthesis quality.
  • Its open-source, modular design enables personalized conversational agents and accessible interactive voice systems for diverse research applications.

JoyTTS is an end-to-end spoken chatbot framework that integrates a large language model (LLM) with advanced text-to-speech (TTS) technology, with a primary focus on high-fidelity voice cloning and open research extensibility. Built on the MiniCPM-o and CosyVoice2 models and trained on 2000 hours of conversational data, JoyTTS enables LLM-driven dialogue generation alongside speaker-adaptable speech synthesis, making it suitable for conversational agents where speaker identity and personalized interaction are essential. The training code, inference scripts, checkpoints, and models are publicly available for academic and industrial research.

1. System Architecture and Module Integration

JoyTTS consists of four principal modules, each serving a distinct but interconnected function in the LLM-TTS pipeline:

  1. Tokenizer Module: Converts both text and audio into tokens, enabling the dual-modality inputs crucial for voice cloning tasks.
  2. LLM-Chat Module: Implements the MiniCPM-o base (utilizing the Qwen-7B backbone). This module processes incoming text and audio tokens, outputting both the next text token (for dialog continuation) and the hidden state $h_i$ at each time step, providing semantic context for subsequent TTS conditioning.
  3. LLM-TTS Module: Replaces the GPT-SoVITS component in MiniCPM-o with CosyVoice2, leveraging both text token embeddings and mapped LLM hidden states. The TTS embedding for synthesis is computed as:

$$TTS_{\text{embed}} = \mathrm{Emb}(y_i) + \mathrm{MLP}(h_i)$$

where $\mathrm{Emb}(y_i)$ is the embedding of the output token $y_i$, and $\mathrm{MLP}(h_i)$ is a multi-layer perceptron that maps the hidden state $h_i$ from 3584 to 768 dimensions (a minimal sketch of this fusion appears at the end of this section).

  4. Generator Module: Converts speech tokens into mel spectrograms and subsequently into waveform audio, completing the speech output chain.

During TTS inference, JoyTTS uses both prompt text and prompt audio as inputs, computing embeddings that enable the system to closely mimic the vocal traits of the prompt audio, thereby achieving robust voice cloning. Training involves explicit pairing of text and audio prompts from multiple speakers to expose the model to a broad distribution of speaker characteristics.
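To make the embedding fusion concrete, the sketch below implements the LLM-TTS equation above in PyTorch. The vocabulary size, the MLP depth, and the activation are illustrative assumptions; only the 3584-to-768 projection is specified in the text.

```python
# Minimal sketch of the TTS embedding fusion, assuming PyTorch.
# Layer count and activation inside the MLP are illustrative, not the actual JoyTTS code.
import torch
import torch.nn as nn

class TTSEmbeddingFusion(nn.Module):
    """Combines the TTS token embedding with a projection of the LLM hidden state."""

    def __init__(self, vocab_size: int, llm_dim: int = 3584, tts_dim: int = 768):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, tts_dim)   # Emb(y_i)
        self.proj = nn.Sequential(                           # MLP(h_i): 3584 -> 768
            nn.Linear(llm_dim, tts_dim),
            nn.GELU(),
            nn.Linear(tts_dim, tts_dim),
        )

    def forward(self, y: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # y: (batch, seq) token ids emitted by LLM-Chat
        # h: (batch, seq, llm_dim) hidden states from LLM-Chat
        return self.token_emb(y) + self.proj(h)              # TTS_embed

# Example: fuse a batch of 2 sequences of length 16
fusion = TTSEmbeddingFusion(vocab_size=32000)
y = torch.randint(0, 32000, (2, 16))
h = torch.randn(2, 16, 3584)
tts_embed = fusion(y, h)  # shape (2, 16, 768)
```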

2. Training Procedure and Data

JoyTTS employs a two-phase training strategy:

  • Phase 1: Separate Module Pretraining. The LLM-Chat and LLM-TTS modules are pretrained independently. LLM-Chat focuses on dialog semantics, while LLM-TTS is optimized for high-fidelity speech synthesis conditioned on both text and hidden-state information.
  • Phase 2: Joint Training. The modules are integrated and jointly fine-tuned to maximize synergy between natural language understanding and audio generation. The loss is formulated as:

$$\text{Loss} = L_{\text{LLM-Chat}} + L_{\text{LLM-TTS}}$$

ensuring that both dialog accuracy and speech quality (including speaker identity preservation) are optimized simultaneously.
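As a rough illustration of Phase 2, the sketch below shows a single joint fine-tuning step that sums the two losses before backpropagation. The batch keys, the model call signatures, and the use of plain cross-entropy for both terms are assumptions made for illustration, not the actual JoyTTS training code.

```python
# Illustrative joint training step, assuming PyTorch and hypothetical model interfaces.
import torch
import torch.nn.functional as F

def joint_training_step(batch, llm_chat, llm_tts, optimizer):
    """One joint fine-tuning step: dialog loss plus speech-synthesis loss."""
    # LLM-Chat predicts the next text token and exposes its hidden states
    text_logits, hidden_states = llm_chat(batch["text_tokens"], batch["audio_tokens"])
    loss_chat = F.cross_entropy(text_logits.transpose(1, 2), batch["text_targets"])

    # LLM-TTS predicts speech tokens conditioned on text tokens and hidden states
    speech_logits = llm_tts(batch["text_tokens"], hidden_states)
    loss_tts = F.cross_entropy(speech_logits.transpose(1, 2), batch["speech_targets"])

    loss = loss_chat + loss_tts  # Loss = L_LLM-Chat + L_LLM-TTS
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```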

The training set comprises 2000 hours (400,000 samples) of conversational data drawn from open resources, specifically RedGPT and GeneratedChat0.4M. Audio for training is synthesized with CosyVoice2, and speaker diversity is maintained by randomly sampling prompt pairs from WenetSpeech4TTS during TTS rendering. Augmentation involves variable-length chunking and insertion of punctuation for nuanced audible segmentation.

3. Voice Cloning and Personalization

Voice cloning in JoyTTS is instantiated by conditioning the speech synthesis module on both the content of prompt text and the acoustic features of prompt audio, as encoded by the shared tokenizer and subsequent network layers. This dual conditioning provides the system with explicit cues for both linguistic content and speaker identity.

During training, diverse prompt pairs allow the network to learn associations between phonetic structure, prosody, and individual vocal timbre. At inference, users provide a brief audio sample to serve as the reference for cloning; subsequent synthesized responses are generated to match the vocal qualities—such as timbre, cadence, and prosodic style—of the given speaker.
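A hypothetical inference call might look like the sketch below. The `respond_in_cloned_voice` wrapper, the `chatbot.generate` interface, and its argument names are assumptions used purely to illustrate the dual prompt-text/prompt-audio conditioning; they are not the actual JoyTTS API.

```python
# Hypothetical end-to-end cloning call, for illustration only.
import torchaudio

def respond_in_cloned_voice(chatbot, query_audio_path, prompt_audio_path, prompt_text):
    """Generate a spoken reply whose voice matches the reference (prompt) speaker."""
    query_audio, sr = torchaudio.load(query_audio_path)    # the user's spoken query
    prompt_audio, _ = torchaudio.load(prompt_audio_path)   # short clip of the voice to clone

    # Assumed generate() interface: conditions synthesis on both the reference
    # audio (speaker identity) and its transcript (linguistic content).
    reply_text, reply_wav = chatbot.generate(
        query_audio=query_audio,
        prompt_audio=prompt_audio,
        prompt_text=prompt_text,
        sample_rate=sr,
    )
    torchaudio.save("reply.wav", reply_wav, sr)
    return reply_text
```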

Performance is assessed with standard TTS evaluation protocols. On the SEED test-zh benchmark, JoyTTS achieves:

  • Speaker Similarity (SS) = 0.73 (cf. CosyVoice2: 0.748, GPT-SoVITS: 0.55)
  • Word Error Rate (WER) = 5.09 (cf. CosyVoice2: 1.45, GPT-SoVITS: 5.13)

These results indicate that JoyTTS approaches the specialist CosyVoice2 model in speaker similarity (0.73 vs. 0.748) while keeping its word error rate on par with GPT-SoVITS, delivering high-fidelity voice reproduction with acceptable intelligibility.

4. Performance, Evaluation, and Latency

JoyTTS is tested on the SEED test-zh set, with evaluation focusing on both speaker similarity and intelligibility:

  • Speaker similarity is measured as the mean cosine similarity between speaker embeddings of the reference and synthesized utterances (a minimal sketch follows this list).
  • Word Error Rate measures how accurately an ASR system transcribes the synthesized speech, indicating output intelligibility.
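The similarity metric itself is straightforward to compute. The sketch below averages pairwise cosine similarities over a test set; the embedding dimensionality and the speaker-verification encoder that would produce the embeddings are assumptions.

```python
# Minimal sketch of the speaker-similarity (SS) metric: mean cosine similarity
# between speaker embeddings of reference and synthesized utterances.
import torch
import torch.nn.functional as F

def speaker_similarity(ref_embeddings: torch.Tensor, syn_embeddings: torch.Tensor) -> float:
    """ref_embeddings, syn_embeddings: (num_utterances, embed_dim) speaker embeddings."""
    sims = F.cosine_similarity(ref_embeddings, syn_embeddings, dim=-1)  # one score per pair
    return sims.mean().item()                                           # SS as reported in the table

# Example with random embeddings standing in for a speaker-verification model's output
ref = torch.randn(100, 192)
syn = torch.randn(100, 192)
print(f"SS = {speaker_similarity(ref, syn):.3f}")
```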

Performance summary:

Model        Speaker Similarity (SS) ↑    Word Error Rate (WER) ↓
GPT-SoVITS   0.55                         5.13
CosyVoice2   0.748                        1.45
JoyTTS       0.73                         5.09

On an NVIDIA RTX 4090D GPU, typical end-to-end inference latency is 1.8 seconds, making JoyTTS viable for interactive or near-real-time spoken dialog applications.

5. Open Source Codebase and Development

The full JoyTTS stack is open-source at https://github.com/jdh-algo/JoyTTS.git, containing:

  • All model code and architectural definitions
  • Training scripts for both pretraining and joint optimization
  • Inference pipelines for real-time deployment and evaluation
  • Model checkpoints compatible with the provided trainers

This availability supports reproducible research and simplifies further optimization or adaptation for new domains. The modular code structure allows for rapid experimentation in integrating alternative LLMs, modifying the TTS backbone, or extending multilingual capabilities.

6. Applications and Research Directions

JoyTTS’s architecture and fidelity enable multiple applications:

  • Conversational Voice Assistants: Personalized, speaker-adaptable agents for task completion, user support, or entertainment.
  • Accessibility Technologies: Synthetic voice generation for users with speech impairments, supporting personalized vocal identity.
  • Storytelling and Media: Creation of branded voice experiences in immersive multimedia content.
  • Human-Computer Interaction Research: Study of user engagement with LLM-powered conversational agents possessing individualized voice.
  • Emotional AI: A plausible extension is the addition of modules for emotion control or contextual style modulation supplied as inputs to the LLM.

Future development tracks identified include:

  • Integration of explicit emotion control variables for expressive speech synthesis
  • Latency reduction for real-time deployment at scale
  • Multi-language expansion by retraining with additional data
  • Reinforced robustness for operation in noisy or low-resource environments

7. Significance

JoyTTS demonstrates the feasibility of combining an LLM-centric approach with advanced, speaker-adaptable neural TTS, achieving both dialog coherence and high-fidelity voice cloning within a single end-to-end trainable pipeline. Its open licensing and modularity allow for benchmarking, extension, and adaptation in diverse spoken language AI research contexts. The robust performance and reproducible training recipes position JoyTTS as a foundational framework for future research and development in personalized, conversational speech synthesis.