Marco-Voice: Expressive Neural TTS
- Marco-Voice is a multifunctional neural TTS system that synthesizes highly expressive, emotionally controllable speech while preserving speaker identity across diverse contexts.
- It employs a unified architecture with dedicated modules for disentangling speaker and emotion embeddings using techniques like cross-orthogonality and in-batch contrastive learning.
- By leveraging rotational emotional embeddings and adaptive cross-attention, Marco-Voice achieves smooth, natural speech synthesis supported by rigorous objective and subjective evaluations.
Marco-Voice is a multifunctional neural text-to-speech (TTS) system designed to synthesize highly expressive, emotionally controllable speech while preserving speaker identity across diverse linguistic and emotional contexts. Its technical innovations center on disentangling speaker and emotion representations, enabling independent control, and on rotational embedding strategies for smooth emotion modulation. Marco-Voice advances the synthesis of both cloned and emotionally expressive speech through unified modeling and specialized architectural components, supported by rigorous objective and subjective evaluation, as well as a purpose-built emotional speech dataset.
1. Unified Architecture for Voice Cloning and Emotion Control
Marco-Voice employs a unified neural TTS architecture integrating both voice cloning and emotion-controllable synthesis within the same framework. The system is composed of several modular components:
- Input Encoders: Separate encoders process input text tokens and reference speech (for speaker identity and emotion/style extraction).
- Speaker and Emotion Embedding Modules: From the reference speech, distinct speaker (timbre) and emotion (style) embeddings are obtained via dedicated encoders.
- Text-to-Token LLM: This module fuses linguistic content with the conditioning information from speaker and emotion embeddings.
- Flow Matching Module: Conditioned on all preceding representations, this module generates the acoustic parameters required for high-quality vocoder-based speech synthesis.
- Adaptive Cross-Attention: Emotion embeddings are used as queries in a cross-attention mechanism, dynamically modulating the linguistic representation during decoding to ensure continuous, contextually appropriate emotional expression.
The overall system supports high-fidelity generation for both speaker-driven voice cloning and smooth, independently adjustable emotional style control, without sacrificing clarity or expressiveness.
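The adaptive cross-attention step can be sketched in a few lines: the emotion embedding forms the query, the linguistic hidden states supply keys and values, and the attention output is an emotion-modulated context vector. All names, shapes, and projection matrices below are illustrative assumptions, not taken from the paper's implementation.

```python
# Sketch of adaptive cross-attention with the emotion embedding as query
# (dimensions and weight names are hypothetical).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def emotion_cross_attention(emotion_emb, text_hidden, w_q, w_k, w_v):
    """emotion_emb: (d_e,) emotion/style vector; text_hidden: (T, d_h) linguistic states."""
    q = emotion_emb @ w_q                  # query from the emotion embedding, (d_k,)
    k = text_hidden @ w_k                  # keys from text states, (T, d_k)
    v = text_hidden @ w_v                  # values from text states, (T, d_v)
    scores = k @ q / np.sqrt(q.shape[-1])  # scaled dot-product scores, (T,)
    weights = softmax(scores)              # attention over text positions
    return weights @ v                     # emotion-modulated context, (d_v,)

rng = np.random.default_rng(0)
d_e, d_h, d_k, d_v, T = 8, 16, 8, 16, 5
ctx = emotion_cross_attention(
    rng.normal(size=d_e), rng.normal(size=(T, d_h)),
    rng.normal(size=(d_e, d_k)), rng.normal(size=(d_h, d_k)),
    rng.normal(size=(d_h, d_v)),
)
print(ctx.shape)  # (16,)
```

Because the query is the emotion embedding rather than a decoder state, the attention weights redistribute as the target emotion changes, which is what lets the same linguistic content be rendered with different affect.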
2. Speaker-Emotion Disentanglement and In-Batch Contrastive Learning
A key challenge in expressive speech synthesis is preventing entanglement between speaker identity (timbre) and emotional state. Marco-Voice addresses this using a dual strategy:
- Cross-Orthogonality Constraint: Separate encoders extract speaker and emotion embeddings from the same speech sample. Training imposes a cross-orthogonality loss, enforcing that the dot product between speaker and emotion embeddings is minimized (forcing these representations into orthogonal subspaces). This ensures that changes in emotion embeddings do not affect speaker identity, and vice versa.
- In-Batch Contrastive Learning: For each mini-batch during training, the emotion representation of each utterance is encouraged to be maximally distinct from the others (especially those corresponding to different emotions). The contrastive loss, defined as
  $$\mathcal{L}_{\text{contrast}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(z_i, e_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(z_i, e_j)/\tau)},$$
  where $z_i$ is the projection of the $i$-th sample's features and $e_j$ is the $j$-th emotion feature, further reinforces the separation of emotion style from speaker identity.
These complementary techniques enable reliable, independent manipulation of both speaker characteristics and emotional expression.
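The two objectives can be sketched together in numpy. The exact formulations (squared vs. absolute dot products, temperature $\tau$, similarity function) are assumptions; the sketch only shows the shape of each loss.

```python
# Minimal sketch of the two disentanglement losses (formulations assumed).
import numpy as np

def cross_orthogonality_loss(spk, emo):
    """Penalize overlap between speaker and emotion embeddings.
    spk, emo: (B, d). Mean squared per-sample dot product; zero when orthogonal."""
    dots = np.sum(spk * emo, axis=1)
    return float(np.mean(dots ** 2))

def in_batch_contrastive_loss(z, e, tau=0.1):
    """InfoNCE-style loss: each projection z_i should match its own
    emotion feature e_i and repel the other emotion features in the batch."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    sim = z @ e.T / tau                                   # (B, B) similarities
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # -log p(correct pair)

rng = np.random.default_rng(0)
spk, emo = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(cross_orthogonality_loss(spk, emo))   # positive scalar
print(in_batch_contrastive_loss(emo, emo))  # positive scalar
```

Minimizing the first loss pushes speaker and emotion vectors into orthogonal subspaces; minimizing the second spreads the emotion features apart within each batch.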
3. Rotational Emotional Embedding Integration
Marco-Voice introduces a rotational embedding technique for robust, smooth emotional conditioning:
- Rotational Emotional Direction: For each speaker, pairs of utterances, one emotional ($e_i^{\text{emo}}$) and one neutral ($e_i^{\text{neu}}$), are used. The normalized directional vector
  $$d_i = \frac{e_i^{\text{emo}} - e_i^{\text{neu}}}{\lVert e_i^{\text{emo}} - e_i^{\text{neu}} \rVert}$$
  represents the emotional offset in latent space.
- Aggregated Embedding: By averaging the normalized emotional directions across $N$ pairs,
  $$\bar{d} = \frac{1}{N}\sum_{i=1}^{N} d_i,$$
  a robust, smooth emotion embedding is obtained.
- Integration in Generation: These rotational embeddings are injected as conditioning signals at multiple stages in both the LLM and flow matching modules, providing fine-grained, smooth emotional control for the synthesized speech.
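The aggregation step above is a simple average of unit vectors, optionally renormalized. A minimal sketch (variable names and the final renormalization are assumptions):

```python
# Sketch: build a rotational emotion embedding from paired
# emotional/neutral utterance embeddings for one emotion category.
import numpy as np

def rotational_emotion_embedding(emo_embs, neu_embs):
    """emo_embs, neu_embs: (N, d) paired embeddings.
    Returns the averaged, renormalized emotional direction."""
    diffs = emo_embs - neu_embs                                  # per-pair offsets
    dirs = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)  # unit directions d_i
    mean_dir = dirs.mean(axis=0)                                 # aggregate over pairs
    return mean_dir / np.linalg.norm(mean_dir)                   # renormalize to unit length

rng = np.random.default_rng(1)
neu = rng.normal(size=(6, 8))
shift = np.zeros(8); shift[0] = 1.0            # synthetic "emotion" along axis 0
emo = neu + shift + 0.1 * rng.normal(size=(6, 8))
direction = rotational_emotion_embedding(emo, neu)
print(np.linalg.norm(direction))  # 1.0 (unit vector)
```

Averaging over many pairs cancels utterance-specific noise, which is why the aggregated direction varies smoothly and can be scaled to modulate emotion intensity.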
4. CSEMOTIONS High-Quality Emotional Speech Dataset
To support advanced training and evaluation for emotional TTS, the CSEMOTIONS dataset was constructed with the following characteristics:
- Content: 10 hours of Mandarin speech, recorded in a professional studio environment.
- Speakers: Six professional speakers (balanced by gender), each recording samples in seven distinct emotional categories.
- Design: Each utterance includes both emotional and neutral versions, enabling precise extraction of rotational emotion embeddings.
- Use Cases: The dataset was utilized for both model training and systematic benchmarking of emotional expressiveness in synthesized speech.
This resource facilitates rigorous, fine-grained evaluation and supports the development of new emotion modeling techniques.
5. Performance Metrics and Evaluation
Marco-Voice performance is assessed through a comprehensive suite of objective and subjective metrics:
| Metric | Description |
|---|---|
| Word Error Rate (WER) | Recognition error rate of ASR models run on synthesized speech |
| Speaker Similarity | Similarity between synthesized and reference speech, measured via speaker verification embeddings |
| DNS-MOS | Perceptual speech quality, measured with the Deep Noise Suppression MOS predictor |
| Error Counts | Counts of insertion and deletion errors in ASR transcripts |
| Subjective Ratings | Human ratings of clarity, expressiveness, naturalness, satisfaction, and speaker identity |
Subjective human listening tests consistently place Marco-Voice at the upper end of naturalness, rhythmic quality, emotional expression, and speaker similarity, outperforming baseline systems such as CosyVoice1 and CosyVoice2 on all major axes.
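Of the objective metrics above, WER is the most mechanical: it is the word-level edit distance between the ASR transcript and the reference text, divided by the reference length. A standard self-contained implementation (not the paper's evaluation code):

```python
# Illustrative WER via word-level Levenshtein distance, the standard
# computation behind ASR-based metrics.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the cat"))      # one deletion -> 1/3
```

The insertion and deletion branches of the recurrence are exactly the per-type error counts reported in the table above.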
6. Mathematical Formalisms
The work features several mathematically explicit methodologies used both for model design and evaluation:
- Rotational Emotion Embedding: the normalized per-pair direction $d_i = \frac{e_i^{\text{emo}} - e_i^{\text{neu}}}{\lVert e_i^{\text{emo}} - e_i^{\text{neu}} \rVert}$, averaged across pairs to yield the aggregated embedding $\bar{d}$.
- Cross-Orthogonality Loss: computed over mini-batch speaker embeddings $S$ and emotion embeddings $E$, e.g. as $\mathcal{L}_{\text{orth}} = \lVert S E^{\top} \rVert_F^2$ (Frobenius norm of the pairwise dot products) or as the average of the per-sample dot products.
- In-Batch Contrastive Loss: the InfoNCE-style objective $\mathcal{L}_{\text{contrast}} = -\frac{1}{N}\sum_{i} \log \frac{\exp(\mathrm{sim}(z_i, e_i)/\tau)}{\sum_{j} \exp(\mathrm{sim}(z_i, e_j)/\tau)}$.
These losses are critical to the reliable separation and control of speaker and emotional attributes.
7. Comparative Analysis and Significance
Marco-Voice demonstrates clear empirical advantages:
- Faithful Speaker Cloning: Objective and subjective evaluations confirm improved preservation of speaker identity over previous systems.
- Enhanced Emotional Expressivity: Rotational embedding and adaptive attention mechanisms yield more natural prosody and nuanced emotional speech.
- Superior Overall Quality: Metrics such as lower WER, higher DNS-MOS, and top subjective ratings consolidate its position as an advancement in expressive neural TTS.
Collectively, the technical strategies and empirical results of Marco-Voice signify a substantial advance in controllable, expressive speech synthesis, with implications for a broad range of human–machine communication scenarios.