- The paper introduces a novel framework that integrates text and speech to create role-playing agents with consistent, character-specific vocal traits.
- It employs a two-stage training strategy with a Speech-Language Collaborative Model and a dedicated Role Speech Decoder, achieving a competitive 289ms response latency.
- Experimental results demonstrate superior language understanding, enhanced voice synthesis quality, and improved human-evaluated immersion compared to existing models.
The paper "OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction" (2505.20277) introduces a novel framework for creating Role-Playing Agents (RPAs) that go beyond text-based interactions by incorporating distinct vocal traits for each character. The core idea is to make RPAs more immersive by enabling them to exhibit consistent personalities through both their language and their speech, including voice style and emotions.
The proposed system, OmniCharacter, aims to achieve this with low latency. It consists of two main components:
- Speech-Language Collaborative Model: This serves as the base, processing both language inputs (role profiles, dialogue history, and user text) and speech inputs (the user's spoken query).
  - Speech Encoder: Converts the raw audio input $X_n^S$ into a sequence of speech frames $M_n^S$. The paper uses Whisper-large-v3 for this.
  - Speech Adaptor: Reduces temporal redundancy in the encoded speech sequence by grouping $k$ consecutive frames, then transforms the compact sequence $Z_n^S$ into an embedding $E_n^S$ compatible with the LLM (see the sketch after this component list).
  - LLM: The paper utilizes Qwen2.5-7B-Instruct. It takes the concatenated text embeddings $E_n^T$ (from profile $P$, context $U_n$, and text query $X_n^T$) and speech embeddings $E_n^S$, and is trained auto-regressively to predict the textual response $O^T$ with the loss

$$\mathcal{L}_{\text{language}} = -\sum_{i=1}^{N_t} \log P\big(O_i^T \mid O_{<i}^T\big)$$
- Role Speech Decoder: This component generates role-specific speech responses with unique vocal traits.
  - Role-context Guided Speech Token Prediction: Instead of having the main LLM predict both text and speech tokens directly (which can lead to instability and hallucinations), a lightweight LLM (SpeechLLM, specifically Qwen2.5-0.5B-Instruct) is used. This SpeechLLM takes the hidden-state representations $H$ from the main LLM as context (passed through a linear projection $\phi$) to predict speech tokens $O^S$:

$$O^S = \mathrm{SpeechLLM}\big(O_m^S \mid O_{1:m-1}^S,\ \phi(H)\big)$$
It is trained with an auto-regressive objective:
$$\mathcal{L}_{\text{speech}} = -\sum_{i=1}^{N_s} \log P\big(O_i^S \mid O_{<i}^S\big)$$
A speech tokenizer with a 16K vocabulary size (from GLM-4-Voice) is used.
  - Role-aware Speech Synthesis: This module converts the predicted speech tokens $O^S$ into an audio waveform $Y_n^S$. It first decodes the speech tokens into a Mel spectrogram using a conditional flow matching (CFM) model, conditioned on the LLM context $H$, a speaker embedding $v$ (extracted from the character's voice using CAM++), and the speech tokens $O^S$:

$$Y_n^S = \text{OT-CFM}\big(p_0 \mid H, v, O^S\big)$$
A HiFi-GAN vocoder then synthesizes the final waveform from this Mel spectrogram.
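To make the adaptor and the two training objectives concrete, here is a minimal PyTorch-style sketch of the frame grouping and a shared next-token cross-entropy loss. The module names, tensor shapes, and the grouping factor `k=5` are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (assumed shapes, names, and k), not OmniCharacter's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechAdaptor(nn.Module):
    """Groups k consecutive speech frames and projects them into the LLM embedding space."""
    def __init__(self, frame_dim: int, llm_dim: int, k: int = 5):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(frame_dim * k, llm_dim)

    def forward(self, speech_frames: torch.Tensor) -> torch.Tensor:
        # speech_frames: (batch, T, frame_dim) from the speech encoder (e.g. Whisper).
        b, t, d = speech_frames.shape
        t = t - t % self.k                                   # drop the remainder so T divides by k
        grouped = speech_frames[:, :t].reshape(b, t // self.k, d * self.k)
        return self.proj(grouped)                            # (batch, T/k, llm_dim) -> E_n^S

def autoregressive_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy, the same form used for L_language (text tokens O^T)
    and L_speech (speech tokens O^S)."""
    # logits: (batch, N, vocab), targets: (batch, N)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),         # predict token i from tokens < i
        targets[:, 1:].reshape(-1),
    )
```

The same loss helper serves both stages; only the vocabulary (text tokens vs. the 16K speech-token vocabulary) and the model producing the logits differ.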
The overall architecture of OmniCharacter is illustrated below:
```mermaid
graph TD
    subgraph "User Input"
        U_Speech["User Speech Input X_n^S"]
        U_Text["User Text Input X_n^T"]
    end
    subgraph "System Input"
        RoleProfile["Role Profile P"]
        DialogContext["Dialogue Context C_n"]
    end
    subgraph "Speech-Language Collaborative Model"
        SpeechEncoder["Speech Encoder (Whisper-large-v3)"]
        SpeechAdaptor["Speech Adaptor τ"]
        TextEncoder["Text Encoder"]
        LLM["LLM (Qwen2.5-7B-Instruct)"]
        AgentTextResponse["Agent Text Response Y_n^T"]
    end
    subgraph "Role Speech Decoder"
        RoleContextGuidedSTP["Role-context Guided Speech Token Prediction (SpeechLLM: Qwen2.5-0.5B-Instruct)"]
        RoleAwareSpeechSynth["Role-aware Speech Synthesis (OT-CFM + HiFi-GAN Vocoder)"]
        SpeakerEmb["Speaker Embedding v"]
        AgentSpeechResponse["Agent Speech Response Y_n^S"]
    end
    U_Speech --> SpeechEncoder
    SpeechEncoder --> SpeechAdaptor
    SpeechAdaptor -- "E_n^S" --> LLM
    U_Text -- "E_n^T" --> LLM
    RoleProfile -- "E_n^T" --> LLM
    DialogContext -- "E_n^T" --> LLM
    LLM -- "Text Tokens O^T" --> AgentTextResponse
    LLM -- "LLM Hidden States H" --> RoleContextGuidedSTP
    LLM -- "LLM Hidden States H" --> RoleAwareSpeechSynth
    RoleContextGuidedSTP -- "Speech Tokens O^S" --> RoleAwareSpeechSynth
    SpeakerEmb --> RoleAwareSpeechSynth
    RoleAwareSpeechSynth -- "Waveform Y_n^S" --> AgentSpeechResponse
    AgentTextResponse --> Output["Output"]
    AgentSpeechResponse --> Output
```
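For a concrete picture of the data flow in the diagram, the following is a hedged end-to-end inference sketch. The interfaces (`SpeechLanguageModel`, `RoleSpeechDecoder`, `respond`) are hypothetical placeholders mirroring the components above, not the released API.

```python
# Hedged inference-flow sketch; interfaces are assumptions based on the architecture diagram.
from dataclasses import dataclass
from typing import Protocol, Tuple
import torch

class SpeechLanguageModel(Protocol):
    def __call__(self, text_emb: torch.Tensor, speech_emb: torch.Tensor
                 ) -> Tuple[str, torch.Tensor]:
        """Returns the text response O^T and the LLM hidden states H."""

class RoleSpeechDecoder(Protocol):
    def predict_tokens(self, hidden: torch.Tensor) -> torch.Tensor: ...
    def synthesize(self, tokens: torch.Tensor, hidden: torch.Tensor,
                   speaker_emb: torch.Tensor) -> torch.Tensor: ...

@dataclass
class Turn:
    text_response: str
    waveform: torch.Tensor

def respond(llm: SpeechLanguageModel, decoder: RoleSpeechDecoder,
            text_emb: torch.Tensor, speech_emb: torch.Tensor,
            speaker_emb: torch.Tensor) -> Turn:
    # 1. Collaborative model: profile/context/query embeddings + user speech -> text + H.
    text_out, hidden = llm(text_emb, speech_emb)
    # 2. SpeechLLM predicts role-specific speech tokens O^S conditioned on phi(H).
    speech_tokens = decoder.predict_tokens(hidden)
    # 3. OT-CFM + HiFi-GAN turn tokens into a waveform in the character's voice.
    waveform = decoder.synthesize(speech_tokens, hidden, speaker_emb)
    return Turn(text_response=text_out, waveform=waveform)
```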
Training Strategy:
A two-stage approach is used:
- Stage 1: Train the speech adaptor and LLM for text response generation from text and speech inputs ($\mathcal{L}_{\text{language}}$).
- Stage 2: Freeze the Stage 1 components and train only the linear projection and the SpeechLLM for speech token prediction ($\mathcal{L}_{\text{speech}}$). The role-aware speech synthesis module uses pre-trained weights from GLM-4-Voice and is fine-tuned on high-quality role speech data.
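A minimal sketch of the Stage 2 setup described above, assuming a PyTorch-style model whose attribute names (`speech_encoder`, `speech_adaptor`, `llm`, `projection`, `speech_llm`) are chosen here for illustration only:

```python
# Illustrative Stage 2 configuration: freeze Stage 1 components, train only the
# projection layer and the lightweight SpeechLLM on L_speech.
import torch

def configure_stage2(model: torch.nn.Module) -> list:
    # Attribute names are assumptions made for this sketch.
    for module in (model.speech_encoder, model.speech_adaptor, model.llm):
        for p in module.parameters():
            p.requires_grad = False            # Stage 1 components stay frozen
    trainable = []
    for module in (model.projection, model.speech_llm):
        for p in module.parameters():
            p.requires_grad = True             # only phi and the SpeechLLM are updated
            trainable.append(p)
    return trainable

# e.g. optimizer = torch.optim.AdamW(configure_stage2(model), lr=1e-4)
```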
OmniCharacter-10K Dataset:
To train and evaluate OmniCharacter, the authors created a new dataset, OmniCharacter-10K.
1. Character Profile Creation: 20 characters (10 Chinese, 10 English) from Genshin Impact were selected. Profiles detailing personality, voice style, relationships, and experiences were generated using an LLM and human-verified.
2. Dialogue Generation: Two LLMs interacted in a simulated conversation, one as the target character and the other as a user/different character, guided by profiles.
3. Speech Synthesis: Character speech was synthesized using a VITS model trained on 40K high-quality audio samples extracted from game assets. User speech was synthesized using CosyVoice, with a 50/50 male/female voice distribution.
4. Quality Verification: Text data was filtered to remove repetitive ABAB patterns and to keep only dialogues longer than three turns. Speech data was filtered by Word Error Rate (WER < 10, transcribed with Whisper-large-v3) and speaker similarity (> 0.8, measured with WavLLM).
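As a rough illustration of the speech-side filter, the sketch below applies the two reported thresholds; `transcribe` and `speaker_similarity` are hypothetical wrappers around the ASR and speaker-verification models, and the WER threshold of 10 (%) is expressed here as the fraction 0.10.

```python
# Hedged sketch of the speech quality filter; model wrappers are placeholders.
from typing import Callable, Iterable, List

import jiwer  # standard word-error-rate implementation

def filter_speech_samples(
    samples: Iterable[dict],
    transcribe: Callable[[str], str],               # e.g. a Whisper-large-v3 wrapper
    speaker_similarity: Callable[[str, str], float],
    wer_threshold: float = 0.10,                    # "WER < 10" (%) as a fraction
    sim_threshold: float = 0.80,                    # "speaker similarity > 0.8"
) -> List[dict]:
    kept = []
    for s in samples:  # each sample: {"audio": path, "text": reference, "ref_voice": path}
        wer = jiwer.wer(s["text"], transcribe(s["audio"]))
        sim = speaker_similarity(s["audio"], s["ref_voice"])
        if wer < wer_threshold and sim > sim_threshold:
            kept.append(s)
    return kept
```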
- Properties:
- Large Scale: 20 characters, 10,072 multi-turn dialogues, and 135K audio responses.
- Rich Annotations: Detailed character profiles and corresponding speech for dialogues.
- Dynamic Curation: Dialogues generated via chatbot interactions.
- The training set has 9,672 samples (360.3 speech hours), and the test set has 400 samples (14.84 speech hours). Over 80% of dialogues are longer than 10 turns.
Implementation and Deployment Considerations:
- Computational Requirements: Training was done on 8xA100 GPUs for 3 epochs.
- Latency: OmniCharacter achieves a speech response latency of 289ms, which is competitive (e.g., GPT-4o at 320ms, LLaMA-Omni at 226ms). The slight increase over some models is attributed to the additional modules for character-specific voice traits.
- Modularity: The two-stage training and distinct components (speech-LLM, role speech decoder) allow for potentially independent upgrades or fine-tuning of parts of the system.
- Scalability: While the dataset uses 20 characters, the framework is designed to potentially scale to more characters, provided sufficient character-specific voice data and profiles are available. The quality of speaker embeddings is crucial for distinguishing characters.
Applications:
This research enables more realistic and engaging RPAs for:
- Virtual Assistants: Assistants with distinct, persistent personalities and voices.
- AI-driven Storytelling: Characters in interactive narratives that can speak their lines with appropriate emotion and style.
- Intelligent NPCs in Video Games: NPCs that offer more immersive and believable interactions through unique voices and consistent personalities.
- Educational Tools: Creating engaging characters for language learning or interactive simulations.
Key Experimental Findings:
- Language Understanding: OmniCharacter outperforms existing RPAs and LLMs (including larger models) on CharacterEval and shows strong performance on SocialBench, suggesting that integrating audio enhances language understanding for role-playing.
- Speech-Language Collaboration:
- Metrics-based: Achieved higher scores for content and style in speech-to-text and speech-to-speech instruction-following tasks on the OmniCharacter-10K test set compared to SpeechGPT and LLaMA-Omni.
- Human-based: Outperformed baselines in human evaluations across fluency, consistency, emotional expression, clarity, appropriateness, and immersion.
- Voice Synthesis Quality:
- Achieved higher cosine similarity between synthesized speech and reference character speech than VITS, MaskGCT, and CosyVoice, indicating better preservation of character voice traits (see the sketch after this list).
- Demonstrated good discriminability between speech embeddings of different characters and users.
- Generalization: Showed comparable ASR (WER/CER) and TTS (WER) performance on general speech benchmarks like LibriSpeech, AISHELL-2, and LibriTTS.
- Impact of Audio: Ablation studies showed that including the audio modality significantly improves character consistency and conversational ability in RPAs.
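The voice-similarity metric above reduces to a cosine similarity between speaker embeddings. A minimal sketch, assuming embedding vectors have already been extracted with a speaker-verification model such as CAM++ (extractor not shown):

```python
# Minimal sketch of the speaker-similarity check between a synthesized utterance
# and the reference character voice.
import torch
import torch.nn.functional as F

def voice_similarity(synth_emb: torch.Tensor, ref_emb: torch.Tensor) -> float:
    """Both inputs are 1-D speaker-embedding vectors."""
    return F.cosine_similarity(synth_emb.unsqueeze(0), ref_emb.unsqueeze(0)).item()

# Higher scores mean the synthesized voice better preserves the character's traits;
# embeddings from different characters should score clearly lower.
```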
Limitations Acknowledged:
- Performance depends on the quality of underlying pre-trained LLMs and TTS systems.
- Currently limited to two-character dialogues; multi-role interactions are a future step.
- The OmniCharacter-10K dataset, while diverse, might not cover all real-world conversational nuances.
- Latency, though low, could be further optimized for truly seamless real-time use.
In essence, OmniCharacter offers a practical framework for building RPAs that are not just textually coherent but also vocally expressive and character-consistent. By separating the tasks of language understanding/text generation from guided speech token prediction and specialized speech synthesis, it tackles the challenge of generating personalized speech for diverse characters in an interactive setting. The provision of the OmniCharacter-10K dataset is also a significant contribution to advancing research in this domain.