- The paper introduces a novel framework that integrates text and speech to create role-playing agents with consistent, character-specific vocal traits.
- It employs a two-stage training strategy with a Speech-Language Collaborative Model and a dedicated Role Speech Decoder, achieving a competitive 289ms response latency.
- Experimental results demonstrate superior language understanding, enhanced voice synthesis quality, and improved human-evaluated immersion compared to existing models.
The paper "OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction" (2505.20277) introduces a novel framework for creating Role-Playing Agents (RPAs) that go beyond text-based interactions by incorporating distinct vocal traits for each character. The core idea is to make RPAs more immersive by enabling them to exhibit consistent personalities through both their language and their speech, including voice style and emotions.
The proposed system, OmniCharacter, aims to achieve this with low latency. It consists of two main components:
- Speech-Language Collaborative Model: This serves as the base, processing both language inputs (role profiles, dialogue history, and user text) and speech inputs (the user's spoken query).
  - Speech Encoder: Converts the raw audio input $X_n^S$ into a sequence of speech frames $M_n^S$. The paper uses Whisper-large-v3 for this.
  - Speech Adaptor: Reduces temporal redundancy in the encoded speech sequence by grouping $k$ consecutive frames, then transforms the compact sequence $Z_n^S$ into an embedding $E_n^S$ compatible with the LLM (see the sketch after this component list).
  - LLM: The paper utilizes Qwen2.5-7B-Instruct. It takes the concatenated text embeddings $E_n^T$ (from profile $P$, context $U_n$, and text query $X_n^T$) and speech embeddings $E_n^S$, and is trained auto-regressively to predict the textual response $O^T$ with the loss

$$\mathcal{L}_{\text{language}} = -\sum_{i=1}^{N_t} \log P\big(O_i^T \mid O_{<i}^T\big)$$
- Role Speech Decoder: This component generates role-specific speech responses with unique vocal traits.
  - Role-context Guided Speech Token Prediction: Instead of having the main LLM predict both text and speech tokens directly (which can lead to instability and hallucinations), a lightweight LLM (SpeechLLM, specifically Qwen2.5-0.5B-Instruct) is used. This SpeechLLM takes the hidden-state representations $H$ from the main LLM as context (passed through a linear projection $\phi$) to predict speech tokens $O^S$:

$$O^S = \mathrm{SpeechLLM}\big(O_m^S \mid O_{1:m-1}^S,\ \phi(H)\big)$$
It is trained with an auto-regressive objective:
$$\mathcal{L}_{\text{speech}} = -\sum_{i=1}^{N_s} \log P\big(O_i^S \mid O_{<i}^S\big)$$
A speech tokenizer with a 16K vocabulary size (from GLM-4-Voice) is used.
  - Role-aware Speech Synthesis: This module converts the predicted speech tokens $O^S$ into an audio waveform $Y_n^S$. It first decodes the speech tokens into a Mel spectrogram using a conditional flow matching (CFM) model, conditioned on the LLM context $H$, a speaker embedding $v$ (extracted from the character's voice using CAM++), and the speech tokens $O^S$:

$$Y_n^S = \text{OT-CFM}\big(p_0 \mid H, v, O^S\big)$$
A HiFi-GAN vocoder then synthesizes the final waveform from this Mel spectrogram.
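To make the adaptor and the two training objectives concrete, here is a minimal PyTorch-style sketch of the frame grouping and a shared next-token cross-entropy loss. The module names, tensor shapes, and the grouping factor `k=5` are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (assumed shapes, names, and k), not OmniCharacter's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechAdaptor(nn.Module):
    """Groups k consecutive speech frames and projects them into the LLM embedding space."""
    def __init__(self, frame_dim: int, llm_dim: int, k: int = 5):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(frame_dim * k, llm_dim)

    def forward(self, speech_frames: torch.Tensor) -> torch.Tensor:
        # speech_frames: (batch, T, frame_dim) from the speech encoder (e.g. Whisper).
        b, t, d = speech_frames.shape
        t = t - t % self.k                                   # drop the remainder so T divides by k
        grouped = speech_frames[:, :t].reshape(b, t // self.k, d * self.k)
        return self.proj(grouped)                            # (batch, T/k, llm_dim) -> E_n^S

def autoregressive_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy, the same form used for L_language (text tokens O^T)
    and L_speech (speech tokens O^S)."""
    # logits: (batch, N, vocab), targets: (batch, N)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),         # predict token i from tokens < i
        targets[:, 1:].reshape(-1),
    )
```

The same loss helper serves both stages; only the vocabulary (text tokens vs. the 16K speech-token vocabulary) and the model producing the logits differ.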
The overall architecture of OmniCharacter is illustrated below:
```mermaid
graph TD
    subgraph "User Input"
        U_Speech["User Speech Input X_n^S"]
        U_Text["User Text Input X_n^T"]
    end
    subgraph "System Input"
        RoleProfile["Role Profile P"]
        DialogContext["Dialogue Context C_n"]
    end
    subgraph "Speech-Language Collaborative Model"
        SpeechEncoder["Speech Encoder (Whisper-large-v3)"]
        SpeechAdaptor["Speech Adaptor τ"]
        TextEncoder["Text Encoder"]
        LLM["LLM (Qwen2.5-7B-Instruct)"]
        AgentTextResponse["Agent Text Response Y_n^T"]
    end
    subgraph "Role Speech Decoder"
        RoleContextGuidedSTP["Role-context Guided Speech Token Prediction (SpeechLLM: Qwen2.5-0.5B-Instruct)"]
        RoleAwareSpeechSynth["Role-aware Speech Synthesis (OT-CFM + HiFi-GAN Vocoder)"]
        SpeakerEmb["Speaker Embedding v"]
        AgentSpeechResponse["Agent Speech Response Y_n^S"]
    end
    U_Speech --> SpeechEncoder
    SpeechEncoder --> SpeechAdaptor
    SpeechAdaptor -- "E_n^S" --> LLM
    U_Text -- "E_n^T" --> LLM
    RoleProfile -- "E_n^T" --> LLM
    DialogContext -- "E_n^T" --> LLM
    LLM -- "Text Tokens O^T" --> AgentTextResponse
    LLM -- "LLM Hidden States H" --> RoleContextGuidedSTP
    LLM -- "LLM Hidden States H" --> RoleAwareSpeechSynth
    RoleContextGuidedSTP -- "Speech Tokens O^S" --> RoleAwareSpeechSynth
    SpeakerEmb --> RoleAwareSpeechSynth
    RoleAwareSpeechSynth -- "Waveform Y_n^S" --> AgentSpeechResponse
    AgentTextResponse --> Output["Output"]
    AgentSpeechResponse --> Output
```
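For a concrete picture of the data flow in the diagram, the following is a hedged end-to-end inference sketch. The interfaces (`SpeechLanguageModel`, `RoleSpeechDecoder`, `respond`) are hypothetical placeholders mirroring the components above, not the released API.

```python
# Hedged inference-flow sketch; interfaces are assumptions based on the architecture diagram.
from dataclasses import dataclass
from typing import Protocol, Tuple
import torch

class SpeechLanguageModel(Protocol):
    def __call__(self, text_emb: torch.Tensor, speech_emb: torch.Tensor
                 ) -> Tuple[str, torch.Tensor]:
        """Returns the text response O^T and the LLM hidden states H."""

class RoleSpeechDecoder(Protocol):
    def predict_tokens(self, hidden: torch.Tensor) -> torch.Tensor: ...
    def synthesize(self, tokens: torch.Tensor, hidden: torch.Tensor,
                   speaker_emb: torch.Tensor) -> torch.Tensor: ...

@dataclass
class Turn:
    text_response: str
    waveform: torch.Tensor

def respond(llm: SpeechLanguageModel, decoder: RoleSpeechDecoder,
            text_emb: torch.Tensor, speech_emb: torch.Tensor,
            speaker_emb: torch.Tensor) -> Turn:
    # 1. Collaborative model: profile/context/query embeddings + user speech -> text + H.
    text_out, hidden = llm(text_emb, speech_emb)
    # 2. SpeechLLM predicts role-specific speech tokens O^S conditioned on phi(H).
    speech_tokens = decoder.predict_tokens(hidden)
    # 3. OT-CFM + HiFi-GAN turn tokens into a waveform in the character's voice.
    waveform = decoder.synthesize(speech_tokens, hidden, speaker_emb)
    return Turn(text_response=text_out, waveform=waveform)
```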
Training Strategy:
A two-stage approach is used:
- Stage 1: Train the speech adaptor and LLM for text response generation from text and speech inputs ($\mathcal{L}_{\text{language}}$).
- Stage 2: Freeze the Stage 1 components and train only the linear projection and the SpeechLLM for speech token prediction ($\mathcal{L}_{\text{speech}}$). The role-aware speech synthesis module uses pre-trained weights from GLM-4-Voice and is fine-tuned on high-quality role speech data.
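A minimal sketch of the Stage 2 setup described above, assuming a PyTorch-style model whose attribute names (`speech_encoder`, `speech_adaptor`, `llm`, `projection`, `speech_llm`) are chosen here for illustration only:

```python
# Illustrative Stage 2 configuration: freeze Stage 1 components, train only the
# projection layer and the lightweight SpeechLLM on L_speech.
import torch

def configure_stage2(model: torch.nn.Module) -> list:
    # Attribute names are assumptions made for this sketch.
    for module in (model.speech_encoder, model.speech_adaptor, model.llm):
        for p in module.parameters():
            p.requires_grad = False            # Stage 1 components stay frozen
    trainable = []
    for module in (model.projection, model.speech_llm):
        for p in module.parameters():
            p.requires_grad = True             # only phi and the SpeechLLM are updated
            trainable.append(p)
    return trainable

# e.g. optimizer = torch.optim.AdamW(configure_stage2(model), lr=1e-4)
```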
OmniCharacter-10K Dataset:
To train and evaluate OmniCharacter, the authors created a new dataset, OmniCharacter-10K.
1. Character Profile Creation: 20 characters (10 Chinese, 10 English) from Genshin Impact were selected. Profiles detailing personality, voice style, relationships, and experiences were generated using an LLM and human-verified.
2. Dialogue Generation: Two LLMs interacted in a simulated conversation, one as the target character and the other as a user/different character, guided by profiles.
3. Speech Synthesis: Character speech was synthesized using a VITS model trained on 40K high-quality audio samples extracted from game assets. User speech was synthesized using CosyVoice, with a 50/50 male/female voice distribution.
4. Quality Verification: Text data was filtered to remove repetitive ABAB patterns and to keep only dialogues longer than three turns. Speech data was filtered by Word Error Rate (WER < 10, transcribed with Whisper-large-v3) and speaker similarity (> 0.8, measured with WavLLM).
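As a rough illustration of the speech-side filter, the sketch below applies the two reported thresholds; `transcribe` and `speaker_similarity` are hypothetical wrappers around the ASR and speaker-verification models, and the WER threshold of 10 (%) is expressed here as the fraction 0.10.

```python
# Hedged sketch of the speech quality filter; model wrappers are placeholders.
from typing import Callable, Iterable, List

import jiwer  # standard word-error-rate implementation

def filter_speech_samples(
    samples: Iterable[dict],
    transcribe: Callable[[str], str],               # e.g. a Whisper-large-v3 wrapper
    speaker_similarity: Callable[[str, str], float],
    wer_threshold: float = 0.10,                    # "WER < 10" (%) as a fraction
    sim_threshold: float = 0.80,                    # "speaker similarity > 0.8"
) -> List[dict]:
    kept = []
    for s in samples:  # each sample: {"audio": path, "text": reference, "ref_voice": path}
        wer = jiwer.wer(s["text"], transcribe(s["audio"]))
        sim = speaker_similarity(s["audio"], s["ref_voice"])
        if wer < wer_threshold and sim > sim_threshold:
            kept.append(s)
    return kept
```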
- Properties:
- Large Scale: 20 characters, 10,072 multi-turn dialogues, and 135K audio responses.
- Rich Annotations: Detailed character profiles and corresponding speech for dialogues.
- Dynamic Curation: Dialogues generated via chatbot interactions.
- The training set has 9,672 samples (360.3 speech hours), and the test set has 400 samples (14.84 speech hours). Over 80% of dialogues are longer than 10 turns.
Implementation and Deployment Considerations:
- Computational Requirements: Training was done on 8xA100 GPUs for 3 epochs.
- Latency: OmniCharacter achieves a speech response latency of 289ms, which is competitive (e.g., GPT-4o at 320ms, LLaMA-Omni at 226ms). The slight increase over some models is attributed to the additional modules for character-specific voice traits.
- Modularity: The two-stage training and distinct components (speech-LLM, role speech decoder) allow for potentially independent upgrades or fine-tuning of parts of the system.
- Scalability: While the dataset uses 20 characters, the framework is designed to potentially scale to more characters, provided sufficient character-specific voice data and profiles are available. The quality of speaker embeddings is crucial for distinguishing characters.
Applications:
This research enables more realistic and engaging RPAs for:
- Virtual Assistants: Assistants with distinct, persistent personalities and voices.
- AI-driven Storytelling: Characters in interactive narratives that can speak their lines with appropriate emotion and style.
- Intelligent NPCs in Video Games: NPCs that offer more immersive and believable interactions through unique voices and consistent personalities.
- Educational Tools: Creating engaging characters for language learning or interactive simulations.
Key Experimental Findings:
- Language Understanding: OmniCharacter outperforms existing RPAs and LLMs (including larger models) on CharacterEval and shows strong performance on SocialBench, suggesting that integrating audio enhances language understanding for role-playing.
- Speech-Language Collaboration:
- Metrics-based: Achieved higher scores for content and style in speech-to-text and speech-to-speech instruction-following tasks on the OmniCharacter-10K test set compared to SpeechGPT and LLaMA-Omni.
- Human-based: Outperformed baselines in human evaluations across fluency, consistency, emotional expression, clarity, appropriateness, and immersion.
- Voice Synthesis Quality:
- Achieved higher cosine similarity between synthesized speech and reference character speech than VITS, MaskGCT, and CosyVoice, indicating better preservation of character voice traits (see the sketch after this list).
- Demonstrated good discriminability between speech embeddings of different characters and users.
- Generalization: Showed comparable ASR (WER/CER) and TTS (WER) performance on general speech benchmarks like LibriSpeech, AISHELL-2, and LibriTTS.
- Impact of Audio: Ablation studies showed that including the audio modality significantly improves character consistency and conversational ability in RPAs.
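The voice-similarity metric above reduces to a cosine similarity between speaker embeddings. A minimal sketch, assuming embedding vectors have already been extracted with a speaker-verification model such as CAM++ (extractor not shown):

```python
# Minimal sketch of the speaker-similarity check between a synthesized utterance
# and the reference character voice.
import torch
import torch.nn.functional as F

def voice_similarity(synth_emb: torch.Tensor, ref_emb: torch.Tensor) -> float:
    """Both inputs are 1-D speaker-embedding vectors."""
    return F.cosine_similarity(synth_emb.unsqueeze(0), ref_emb.unsqueeze(0)).item()

# Higher scores mean the synthesized voice better preserves the character's traits;
# embeddings from different characters should score clearly lower.
```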
Limitations Acknowledged:
- Performance depends on the quality of underlying pre-trained LLMs and TTS systems.
- Currently limited to two-character dialogues; multi-role interactions are a future step.
- The OmniCharacter-10K dataset, while diverse, might not cover all real-world conversational nuances.
- Latency, though low, could be further optimized for truly seamless real-time use.
In essence, OmniCharacter offers a practical framework for building RPAs that are not just textually coherent but also vocally expressive and character-consistent. By separating the tasks of language understanding/text generation from guided speech token prediction and specialized speech synthesis, it tackles the challenge of generating personalized speech for diverse characters in an interactive setting. The provision of the OmniCharacter-10K dataset is also a significant contribution to advancing research in this domain.