Marco-Voice: Expressive Neural TTS
- Marco-Voice is a multifunctional neural TTS system that synthesizes highly expressive, emotionally controllable speech while preserving speaker identity across diverse contexts.
- It employs a unified architecture with dedicated modules for disentangling speaker and emotion embeddings using techniques like cross-orthogonality and in-batch contrastive learning.
- By leveraging rotational emotional embeddings and adaptive cross-attention, Marco-Voice achieves smooth, natural speech synthesis supported by rigorous objective and subjective evaluations.
Marco-Voice is a multifunctional neural text-to-speech (TTS) system designed to synthesize highly expressive, emotionally controllable speech while preserving speaker identity across diverse linguistic and emotional contexts. Its technical innovations center on disentangling speaker and emotion representations, enabling independent control, and on rotational embedding strategies for smooth emotion modulation. Marco-Voice advances the synthesis of both cloned and emotionally expressive speech through unified modeling and specialized architectural components, supported by rigorous objective and subjective evaluation, as well as a purpose-built emotional speech dataset.
1. Unified Architecture for Voice Cloning and Emotion Control
Marco-Voice employs a unified neural TTS architecture integrating both voice cloning and emotion-controllable synthesis within the same framework. The system is composed of several modular components:
- Input Encoders: Separate encoders process input text tokens and reference speech (for speaker identity and emotion/style extraction).
- Speaker and Emotion Embedding Modules: From the reference speech, distinct speaker (timbre) and emotion (style) embeddings are obtained via dedicated encoders.
- Text-to-Token LLM: This module fuses linguistic content with the conditioning information from speaker and emotion embeddings.
- Flow Matching Module: Conditioned on all preceding representations, this module generates the acoustic parameters required for high-quality vocoder-based speech synthesis.
- Adaptive Cross-Attention: Emotion embeddings are used as queries in a cross-attention mechanism, dynamically modulating the linguistic representation during decoding to ensure continuous, contextually appropriate emotional expression.
The overall system supports high-fidelity generation for both speaker-driven voice cloning and smooth, independently adjustable emotional style control, without sacrificing clarity or expressiveness.
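The adaptive cross-attention step can be sketched in a few lines: the emotion embedding forms the query, the linguistic hidden states supply keys and values, and the attention output is an emotion-modulated context vector. All names, shapes, and projection matrices below are illustrative assumptions, not taken from the paper's implementation.

```python
# Sketch of adaptive cross-attention with the emotion embedding as query
# (dimensions and weight names are hypothetical).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def emotion_cross_attention(emotion_emb, text_hidden, w_q, w_k, w_v):
    """emotion_emb: (d_e,) emotion/style vector; text_hidden: (T, d_h) linguistic states."""
    q = emotion_emb @ w_q                  # query from the emotion embedding, (d_k,)
    k = text_hidden @ w_k                  # keys from text states, (T, d_k)
    v = text_hidden @ w_v                  # values from text states, (T, d_v)
    scores = k @ q / np.sqrt(q.shape[-1])  # scaled dot-product scores, (T,)
    weights = softmax(scores)              # attention over text positions
    return weights @ v                     # emotion-modulated context, (d_v,)

rng = np.random.default_rng(0)
d_e, d_h, d_k, d_v, T = 8, 16, 8, 16, 5
ctx = emotion_cross_attention(
    rng.normal(size=d_e), rng.normal(size=(T, d_h)),
    rng.normal(size=(d_e, d_k)), rng.normal(size=(d_h, d_k)),
    rng.normal(size=(d_h, d_v)),
)
print(ctx.shape)  # (16,)
```

Because the query is the emotion embedding rather than a decoder state, the attention weights redistribute as the target emotion changes, which is what lets the same linguistic content be rendered with different affect.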
2. Speaker-Emotion Disentanglement and In-Batch Contrastive Learning
A key challenge in expressive speech synthesis is preventing entanglement between speaker identity (timbre) and emotional state. Marco-Voice addresses this using a dual strategy:
- Cross-Orthogonality Constraint: Separate encoders extract speaker and emotion embeddings from the same speech sample. Training imposes a cross-orthogonality loss, enforcing that the dot product between speaker and emotion embeddings is minimized (forcing these representations into orthogonal subspaces). This ensures that changes in emotion embeddings do not affect speaker identity, and vice versa.
- In-Batch Contrastive Learning: For each mini-batch during training, the emotion representation of each utterance is encouraged to be maximally distinct from the others (especially those corresponding to different emotions). The contrastive loss, defined as
  $$\mathcal{L}_{\text{contrast}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(z_i, e_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(z_i, e_j)/\tau)},$$
  where $z_i$ is the projection of the $i$-th sample's features and $e_j$ is the $j$-th emotion feature, further reinforces the separation of emotion style from speaker identity.
These complementary techniques enable reliable, independent manipulation of both speaker characteristics and emotional expression.
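The two objectives can be sketched together in numpy. The exact formulations (squared vs. absolute dot products, temperature $\tau$, similarity function) are assumptions; the sketch only shows the shape of each loss.

```python
# Minimal sketch of the two disentanglement losses (formulations assumed).
import numpy as np

def cross_orthogonality_loss(spk, emo):
    """Penalize overlap between speaker and emotion embeddings.
    spk, emo: (B, d). Mean squared per-sample dot product; zero when orthogonal."""
    dots = np.sum(spk * emo, axis=1)
    return float(np.mean(dots ** 2))

def in_batch_contrastive_loss(z, e, tau=0.1):
    """InfoNCE-style loss: each projection z_i should match its own
    emotion feature e_i and repel the other emotion features in the batch."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    sim = z @ e.T / tau                                   # (B, B) similarities
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # -log p(correct pair)

rng = np.random.default_rng(0)
spk, emo = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(cross_orthogonality_loss(spk, emo))   # positive scalar
print(in_batch_contrastive_loss(emo, emo))  # positive scalar
```

Minimizing the first loss pushes speaker and emotion vectors into orthogonal subspaces; minimizing the second spreads the emotion features apart within each batch.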
3. Rotational Emotional Embedding Integration
Marco-Voice introduces a rotational embedding technique for robust, smooth emotional conditioning:
- Rotational Emotional Direction: For each speaker, pairs of utterances, one emotional ($e_i^{\text{emo}}$) and one neutral ($e_i^{\text{neu}}$), are used. The normalized directional vector
  $$d_i = \frac{e_i^{\text{emo}} - e_i^{\text{neu}}}{\lVert e_i^{\text{emo}} - e_i^{\text{neu}} \rVert}$$
  represents the emotional offset in latent space.
- Aggregated Embedding: By averaging the normalized emotional directions across $N$ pairs,
  $$\bar{d} = \frac{1}{N}\sum_{i=1}^{N} d_i,$$
  a robust, smooth emotion embedding is obtained.
- Integration in Generation: These rotational embeddings are injected as conditioning signals at multiple stages in both the LLM and flow matching modules, providing fine-grained, smooth emotional control for the synthesized speech.
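The aggregation step above is a simple average of unit vectors, optionally renormalized. A minimal sketch (variable names and the final renormalization are assumptions):

```python
# Sketch: build a rotational emotion embedding from paired
# emotional/neutral utterance embeddings for one emotion category.
import numpy as np

def rotational_emotion_embedding(emo_embs, neu_embs):
    """emo_embs, neu_embs: (N, d) paired embeddings.
    Returns the averaged, renormalized emotional direction."""
    diffs = emo_embs - neu_embs                                  # per-pair offsets
    dirs = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)  # unit directions d_i
    mean_dir = dirs.mean(axis=0)                                 # aggregate over pairs
    return mean_dir / np.linalg.norm(mean_dir)                   # renormalize to unit length

rng = np.random.default_rng(1)
neu = rng.normal(size=(6, 8))
shift = np.zeros(8); shift[0] = 1.0            # synthetic "emotion" along axis 0
emo = neu + shift + 0.1 * rng.normal(size=(6, 8))
direction = rotational_emotion_embedding(emo, neu)
print(np.linalg.norm(direction))  # 1.0 (unit vector)
```

Averaging over many pairs cancels utterance-specific noise, which is why the aggregated direction varies smoothly and can be scaled to modulate emotion intensity.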
4. CSEMOTIONS High-Quality Emotional Speech Dataset
To support advanced training and evaluation for emotional TTS, the CSEMOTIONS dataset was constructed with the following characteristics:
- Content: 10 hours of Mandarin speech, recorded in a professional studio environment.
- Speakers: Six professional speakers (balanced by gender), each recording samples in seven distinct emotional categories.
- Design: Each utterance includes both emotional and neutral versions, enabling precise extraction of rotational emotion embeddings.
- Use Cases: The dataset was utilized for both model training and systematic benchmarking of emotional expressiveness in synthesized speech.
This resource facilitates rigorous, fine-grained evaluation and supports the development of new emotion modeling techniques.
5. Performance Metrics and Evaluation
Marco-Voice performance is assessed through a comprehensive suite of objective and subjective metrics:
| Metric | Description |
|---|---|
| Word Error Rate (WER) | Recognition error rate of ASR models run on synthesized speech |
| Speaker Similarity | Similarity between synthesized and reference speech, measured via speaker verification embeddings |
| DNS-MOS | Perceptual speech quality, measured with the Deep Noise Suppression MOS predictor |
| Error Counts | Counts of insertion and deletion errors in ASR transcripts |
| Subjective Ratings | Human ratings of clarity, expressiveness, naturalness, satisfaction, and speaker identity |
Subjective human listening tests consistently place Marco-Voice at the upper end of naturalness, rhythmic quality, emotional expression, and speaker similarity, outperforming baseline systems such as CosyVoice1 and CosyVoice2 on all major axes.
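Of the objective metrics above, WER is the most mechanical: it is the word-level edit distance between the ASR transcript and the reference text, divided by the reference length. A standard self-contained implementation (not the paper's evaluation code):

```python
# Illustrative WER via word-level Levenshtein distance, the standard
# computation behind ASR-based metrics.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the cat"))      # one deletion -> 1/3
```

The insertion and deletion branches of the recurrence are exactly the per-type error counts reported in the table above.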
6. Mathematical Formalisms
The work features several mathematically explicit methodologies used both for model design and evaluation:
- Rotational Emotion Embedding: the normalized per-pair direction $d_i = \frac{e_i^{\text{emo}} - e_i^{\text{neu}}}{\lVert e_i^{\text{emo}} - e_i^{\text{neu}} \rVert}$, averaged across pairs to yield the aggregated embedding $\bar{d}$.
- Cross-Orthogonality Loss: computed over mini-batch speaker embeddings $S$ and emotion embeddings $E$, e.g. as $\mathcal{L}_{\text{orth}} = \lVert S E^{\top} \rVert_F^2$ (Frobenius norm of the pairwise dot products) or as the average of the per-sample dot products.
- In-Batch Contrastive Loss: the InfoNCE-style objective $\mathcal{L}_{\text{contrast}} = -\frac{1}{N}\sum_{i} \log \frac{\exp(\mathrm{sim}(z_i, e_i)/\tau)}{\sum_{j} \exp(\mathrm{sim}(z_i, e_j)/\tau)}$.
These losses are critical to the reliable separation and control of speaker and emotional attributes.
7. Comparative Analysis and Significance
Marco-Voice demonstrates clear empirical advantages:
- Faithful Speaker Cloning: Objective and subjective evaluations confirm improved preservation of speaker identity over previous systems.
- Enhanced Emotional Expressivity: Rotational embedding and adaptive attention mechanisms yield more natural prosody and nuanced emotional speech.
- Superior Overall Quality: Metrics such as lower WER, higher DNS-MOS, and top subjective ratings consolidate its position as an advancement in expressive neural TTS.
Collectively, the technical strategies and empirical results of Marco-Voice signify a substantial advance in controllable, expressive speech synthesis, with implications for a broad range of human–machine communication scenarios.