Expressive Voice Conversion

Updated 30 March 2026

Expressive voice conversion is a technique that transforms a source speaker's utterance so that it adopts a target speaker's identity together with their emotional and prosodic style, while preserving the linguistic content.
It typically uses multi-path encoder-decoder architectures with dedicated encoders for content, speaker identity, and style, trained with adversarial objectives and mutual information minimization to keep these factors disentangled.
System performance is evaluated using metrics like MCD, F0 correlation, and MOS, though challenges remain in accurately capturing extreme expressivity and time-varying nuances.

Expressive voice conversion (EVC) addresses the problem of transforming a speech utterance so that both the speaker identity and the expressive attributes of a target speaker are transferred to the source speaker’s content, with particular emphasis on emotional style, prosody, and paralinguistic characteristics. Standard voice conversion typically focuses on speaker timbre or identity alone. Expressive voice conversion instead aims to generate speech that captures both the emotional nuances and the speaker-specific realization of expressive cues.

1. Problem Scope and Formalization

Expressive voice conversion can be mathematically described as the mapping

$y_{\text{out}} = \mathcal{F}(x_{\text{src}}, x_{\text{target-style}})$

where $x_{\text{src}}$ is an utterance spoken with the linguistic content to be preserved, and $x_{\text{target-style}}$ is a reference from which speaker identity and expressive style (e.g., emotion, prosody) must be extracted. The goal is for $y_{\text{out}}$ to preserve the linguistic content of $x_{\text{src}}$ while inheriting the target’s speaker-dependent expressive characteristics, including F0, energy, rhythm, and emotion-induced spectrotemporal modulations (Schnell et al., 2021, Du et al., 2021, Du et al., 2021, Akti et al., 4 Jun 2025).

EVC as used here is distinguished from (a) speaker identity conversion, which does not explicitly model expressive or emotional factors, and (b) emotional voice conversion (sometimes also abbreviated EVC, in a narrower sense), which typically operates within a single speaker and manipulates only the emotional class or style, not its speaker-specific realization.
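One way to make the mapping $\mathcal{F}$ more concrete, assuming the three-encoder decomposition described in the next section, is to write it as a decoder applied to separately encoded factors. The symbols $E_c$, $E_s$, $E_e$, and $D$ are notation introduced here for illustration only and are not taken from the cited papers:

$y_{\text{out}} = D\big(E_c(x_{\text{src}}),\ E_s(x_{\text{target-style}}),\ E_e(x_{\text{target-style}})\big)$

Under this reading, $E_c$ is a content encoder applied to the source, $E_s$ and $E_e$ are speaker and style encoders applied to the reference, and $D$ is the decoder. Plain speaker identity conversion would roughly correspond to dropping $E_e$, while within-speaker emotional voice conversion would roughly correspond to keeping the source speaker's own $E_s(x_{\text{src}})$.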

2. Core Model Architectures and Disentanglement Strategies

EVC frameworks generally aim to disentangle content, speaker identity, and expressive style. Canonical approaches instantiate this as a multi-path encoding/decoding architecture:

Content Encoder: Extracts linguistic information, typically from PPGs, ASR bottlenecks, HuBERT/mHuBERT units, or self-supervised latent representations. Discrete content units or vectors obtained via quantization or fixed averaging promote information bottlenecks and strip non-linguistic information (Akti et al., 4 Jun 2025, Martín-Cortinas et al., 13 May 2025, Schnell et al., 2021, Gan et al., 2022).
Speaker Encoder: Often realized with a ResNet, ECAPA-TDNN, or LSTM-based model trained under speaker verification losses (e.g., GE2E), producing a global (utterance-level) embedding (Neekhara et al., 2021, Martín-Cortinas et al., 13 May 2025, Du et al., 2021).
Style (Emotion/Prosody) Encoder: Style is modeled by learned embeddings derived from SER networks (Du et al., 2021), ECAPA-TDNN (Akti et al., 4 Jun 2025), or explicit pitch and energy contours (Du et al., 2021). Fusion of content and style is often handled via feature-wise modulation (conditional LayerNorm, AdaIN), cross-attention, or concatenation (Ning et al., 2022, Du et al., 2021, Akti et al., 4 Jun 2025).
Decoder: Typically an autoregressive, parallel, or diffusion-based mel-spectrogram generator conditioned on all or a subset of the above representations, paired with a neural vocoder (e.g., HiFi-GAN, WaveGlow, Parallel WaveGAN) for waveform synthesis.
Explicit Disentanglement: Mutual-information minimization is frequently applied to encourage statistical independence between speaker, style, and content codes (Du et al., 2021, Akti et al., 4 Jun 2025). Prosody filtering via phone-aligned or random downsampling removes residual speaker and content information from prosody codes (Gan et al., 2022). Adversarial objectives or gradient reversal/inversion blocks are used to limit emotion leakage (Schnell et al., 2021).
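The feature-wise modulation mentioned under the style encoder (AdaIN in particular) can be illustrated with a minimal sketch. It assumes content features are a list of frames with a fixed number of channels, and that the per-channel scale `gamma` and shift `beta` have already been predicted from a style embedding. The function name and the data layout are illustrative assumptions, not the implementation of any cited system.

```python
import math


def adain(content, gamma, beta, eps=1e-5):
    """Adaptive instance normalization over the time axis.

    content: list of T frames, each a list of C channel values.
    gamma, beta: per-channel scale and shift (length C), assumed to be
    produced elsewhere by a style encoder (hypothetical here).
    """
    n_frames = len(content)
    n_channels = len(content[0])
    out = [[0.0] * n_channels for _ in range(n_frames)]
    for c in range(n_channels):
        column = [frame[c] for frame in content]
        mean = sum(column) / n_frames
        var = sum((v - mean) ** 2 for v in column) / n_frames
        std = math.sqrt(var + eps)
        for t in range(n_frames):
            # Remove the content's own channel statistics, then impose
            # the statistics dictated by the style code.
            out[t][c] = gamma[c] * (column[t] - mean) / std + beta[c]
    return out


if __name__ == "__main__":
    frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    styled = adain(frames, gamma=[2.0, 0.5], beta=[0.0, 1.0])
    for frame in styled:
        print([round(v, 3) for v in frame])
```

After modulation, each channel's mean over time equals the corresponding `beta` and its standard deviation is approximately `gamma`, which is why this kind of layer is a natural place to inject global style while leaving the temporal pattern of the content features intact.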

3. Prosody and Style Representation

Expressive style is captured through several non-exclusive mechanisms:

Disentangled Emotional Embeddings: Continuous or categorical embeddings derived from pre-trained SER networks encode global or segmental emotion style (Du et al., 2021, Du et al., 2021, Du et al., 2024).
Quantized Prosody Tokens: VQ-VAE or codebook-based tokenization of F0, energy, and sometimes rhythm enables interpretable, speaker-normalized style control and transfer (Zuo et al., 8 Feb 2025, Gan et al., 2022).
Local-Global Prosody Fusion: Local F0 or prosodic features are injected via cross-attention (Akti et al., 4 Jun 2025), multi-head self-attention (Ning et al., 2022), or gating/fusion layers (Wang et al., 26 Jan 2026, Ning et al., 2022, Du et al., 2024).
Expressive Guidance and Control: In diffusion-based models, specific guidance mechanisms (e.g., “expressive guidance” combining positive and negative score evaluations) are introduced to resolve ambiguities during denoising (Chou et al., 2024).
Ablation evidence: Removing local F0 from the fusion process or weakly enforcing style-content independence degrades emotion classification accuracy and speaker similarity (Akti et al., 4 Jun 2025, Ning et al., 2022).
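As a rough illustration of the quantized prosody tokens described above, the sketch below z-normalizes log-F0 per utterance and maps each voiced frame to its nearest codebook entry. In a VQ-VAE the codebook is learned; here a small fixed codebook stands in for it, and the function names, the unvoiced token value, and the codebook itself are assumptions made only for this example.

```python
import math


def normalize_log_f0(f0_hz):
    """Z-score log-F0 over voiced frames; unvoiced frames (f0 <= 0) become None."""
    voiced = [math.log(f) for f in f0_hz if f > 0]
    if not voiced:
        return [None] * len(f0_hz)
    mean = sum(voiced) / len(voiced)
    std = math.sqrt(sum((v - mean) ** 2 for v in voiced) / len(voiced)) or 1.0
    return [(math.log(f) - mean) / std if f > 0 else None for f in f0_hz]


def quantize(values, codebook, unvoiced_token=-1):
    """Map each value to the index of the nearest codebook entry."""
    tokens = []
    for v in values:
        if v is None:
            tokens.append(unvoiced_token)
        else:
            tokens.append(min(range(len(codebook)), key=lambda k: abs(codebook[k] - v)))
    return tokens


if __name__ == "__main__":
    codebook = [-1.0, 0.0, 1.0]  # hypothetical; a real system would learn this
    low_voice = [100.0, 0.0, 200.0]
    high_voice = [200.0, 0.0, 400.0]  # same contour, one octave higher
    print(quantize(normalize_log_f0(low_voice), codebook))
    print(quantize(normalize_log_f0(high_voice), codebook))
```

Both contours yield the same token sequence because the normalization removes the speaker's overall pitch level and range. This is one simple sense in which such tokens can be called speaker-normalized.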

4. Model Classes and Learning Paradigms

EVC research demonstrates a diversity of architectural choices, unified by their core objective of disentangled, controllable conversion:

VAE/Flow/Autoregressive Models: StyleVC (Du et al., 2021), ConsistencyVC (Guo et al., 2023), and similar frameworks deploy VAE or normalizing-flow backbones with explicit style-content-speaker factorization and MI constraints.
Adversarial GAN Frameworks: JES-StarGAN extends StarGAN-VC to emotional style by introducing continuous style embeddings, adversarial and cycle-consistency losses (Du et al., 2021).
Non-parallel and Zero-shot Methods: ZSDEVC (Chou et al., 2024), DEVC (Du et al., 2024), and related conditional diffusion or flow-matching models support any-to-any conversion, including unseen speakers and/or unseen expressive styles, leveraging large self-supervised speech models, information bottlenecking, MI minimization, and explicit fusion of local/global style cues.
Pitch/Prosody-token-based Zero-shot VC: PFlow-VC (Zuo et al., 8 Feb 2025) and related models inject discrete pitch tokens with masked prompting to enable in-context expressive style transfer, and demonstrate that explicit control of prosodic tokens improves both emotion fidelity and voice quality.

5. Evaluation Protocols and Metrics

Evaluation of EVC models incorporates both objective and subjective criteria:

Objective metrics: Mel-cepstral distortion (MCD), character/word error rate (CER/WER), speaker embedding cosine similarity (SECS), emotion embedding cosine similarity (EECS), F0 frame error (FFE), and emotion classification accuracy (ECA) (Schnell et al., 2021, Du et al., 2024, Akti et al., 4 Jun 2025, Du et al., 31 Oct 2025).
Subjective metrics: Mean Opinion Score (MOS) for speech naturalness, speaker similarity, and emotion or style similarity; ABX preference or forced-choice tests (Schnell et al., 2021, Ning et al., 2022, Du et al., 2021, Akti et al., 4 Jun 2025).
Prosody correlation: F0 correlation (Pearson’s $r$, $\rho_{F0}$) between source and converted utterances as a proxy for expressiveness (Martín-Cortinas et al., 13 May 2025, Ning et al., 2022).
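Three of the objective metrics listed above can be sketched in a few lines. The sketch assumes that the two utterances are already time-aligned frame by frame (in practice dynamic time warping is commonly applied first), that mel-cepstral frames exclude the 0th energy coefficient, and that unvoiced frames are marked with F0 of 0. The function names are illustrative.

```python
import math


def mel_cepstral_distortion(ref, conv):
    """Mean MCD in dB over aligned frames of mel-cepstral coefficients."""
    const = 10.0 / math.log(10.0) * math.sqrt(2.0)
    per_frame = [
        const * math.sqrt(sum((a - b) ** 2 for a, b in zip(r, c)))
        for r, c in zip(ref, conv)
    ]
    return sum(per_frame) / len(per_frame)


def pearson_f0(f0_a, f0_b):
    """Pearson correlation of F0 over frames that are voiced in both contours."""
    pairs = [(a, b) for a, b in zip(f0_a, f0_b) if a > 0 and b > 0]
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_a * var_b)


def cosine_similarity(u, v):
    """Cosine similarity between two embeddings, as used for SECS or EECS."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    print(mel_cepstral_distortion([[1.0, 0.0]], [[0.0, 0.0]]))
    print(pearson_f0([100.0, 0.0, 120.0, 140.0], [200.0, 210.0, 240.0, 280.0]))
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

SECS and EECS differ only in which pre-trained encoder produces the embeddings passed to `cosine_similarity`. The choice of encoder is outside the scope of this sketch.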

Reported results across numerous systems indicate that contemporary EVC models can achieve audio quality on par with real recordings in most conditions, with degradation appearing mainly for extremely high-intensity or underrepresented emotions (Schnell et al., 2021, Du et al., 2024). Zero-shot and multilingual models relying on self-supervised content units and robust bottlenecking generalize well across speakers and accents while retaining high prosody fidelity (Martín-Cortinas et al., 13 May 2025, Wang et al., 26 Jan 2026, Akti et al., 4 Jun 2025).

6. Data Regimes and Practical Considerations

Data Requirements: Systems typically combine large-scale neutral or expressive corpora with limited labeled emotional data. For example, EmoCat achieves high-quality conversion in German using only 45 min of emotional recordings, supported by 10 hours of English expressive data (Schnell et al., 2021).
Annotation Strategies: Massive semi-automated resources such as the NaturalVoices dataset provide emotion, speaker, and prosody labels at scale for robust benchmarking (Du et al., 31 Oct 2025).
Training Protocols: Progressive training, domain adaptation (e.g., LoRA-based experts in OneVoice (Wang et al., 26 Jan 2026)), curriculum balancing, and class-frequency-aware loss weighting address class imbalance and the scarcity of expressive data.
Speaker and Prosody Disentanglement: Adversarial objectives, random erasing, and explicit mutual information losses are crucial for preventing information leakage and maintaining conversion fidelity (Schnell et al., 2021, Jiang et al., 7 Aug 2025, Ning et al., 2022).
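Class-frequency-aware loss weighting, mentioned under training protocols, can take several forms. A minimal sketch of one common variant, inverse-frequency weighting, is given below. The emotion labels, function names, and the normalization of the weighted mean are assumptions for illustration and do not describe the scheme used by any particular cited system.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * count), so rare classes count more."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return {label: total / (n_classes * count) for label, count in counts.items()}


def weighted_mean_loss(losses, labels, weights):
    """Average per-utterance losses, weighted by the class weight of each label."""
    weighted_sum = sum(weights[label] * loss for loss, label in zip(losses, labels))
    weight_total = sum(weights[label] for label in labels)
    return weighted_sum / weight_total


if __name__ == "__main__":
    # Hypothetical imbalanced batch: 8 neutral utterances, 2 angry ones.
    labels = ["neutral"] * 8 + ["angry"] * 2
    weights = inverse_frequency_weights(labels)
    print(weights)
    losses = [0.1] * 8 + [1.0] * 2
    print(weighted_mean_loss(losses, labels, weights))
```

With these weights each class contributes equally to the total, so the two angry utterances together carry as much weight as the eight neutral ones.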

7. Limitations, Open Challenges, and Future Directions

Current limitations include:

Expressive intensity is not matched at the most extreme levels (e.g., highly expressive utterances, rare emotions) (Schnell et al., 2021).
Explicit fine-grained control over time-varying style and emotion remains underexplored (Du et al., 2024).
Many models assume frame-level or utterance-level style uniformity, lacking segmental or time-varying style granularity (Du et al., 31 Oct 2025, Du et al., 2024).
Robustness to adverse acoustic conditions, including noise and music artifacts, has been addressed by simulation-based training in singing VC but remains open for conversational settings (Zheng et al., 23 Oct 2025).
Universal, truly language-agnostic EVC with minimal emotional data for new expressivities remains an open problem.

Future research is expected to advance by integrating hierarchical style representations, richer self-supervised bottlenecks, and enhanced training protocols (e.g. segmental or context-conditioned prosody encoders), expanding expressivity control and improving cross-lingual, cross-domain generalization (Wang et al., 26 Jan 2026, Martín-Cortinas et al., 13 May 2025, Akti et al., 4 Jun 2025, Zuo et al., 8 Feb 2025, Du et al., 31 Oct 2025).