
Expressive Speech-to-Speech Translation

Updated 28 September 2025
  • Expressive S2ST is a translation approach that directly maps source speech to target speech, retaining speaker identity, emotion, and prosodic features.
  • It leverages end-to-end architectures, textless unit representations, and diffusion models to accurately align linguistic content with expressive cues.
  • Robust evaluation using metrics like ASR-BLEU and MOS demonstrates significant improvements in naturalness, expressivity, and real-world application performance.

Expressive Speech-to-Speech Translation (S2ST) refers to automated systems that map spoken utterances in a source language directly to spoken translations in a target language, with an explicit emphasis on preserving speaker identity, vocal style, emotion, and prosody in the synthesized output. While early S2ST focused mainly on word-level translation accuracy and intelligibility, recent advances have established expressive S2ST as a critical area for cross-lingual communication, aiming to retain nuances of speaker intent and affect alongside linguistic content (Cheng et al., 25 Sep 2025).

1. Defining Expressive S2ST: Objectives and Scope

Expressive S2ST extends conventional S2ST by targeting faithful transfer of paralinguistic and stylistic features. The key objectives include:

  • Accurate translation of linguistic content: The system must render the semantics of the source utterance correctly in the target language.
  • Preservation of speaker identity: The generated speech should remain identifiable as being ‘from’ the source speaker, even in another language.
  • Transfer of emotional style and prosodic features: Prosody (pitch, duration, intonation, emphasis), vocal timbre, and affective states must be mapped, preventing the "flattening" typical of text-only systems.
  • Temporal and durational consistency: Output utterance duration should remain consistent with the source, essential for applications such as dubbing and real-time interpretation.

While the ultimate goal is seamless cross-lingual and cross-cultural communication retaining all expressive aspects, research has identified data scarcity, efficient architecture design, and effective cross-modal alignment as persistent challenges (Cheng et al., 25 Sep 2025, Min et al., 1 Feb 2025).

2. Model Architectures and Technical Approaches

Direct End-to-End Architectures

Early work (Jia et al., 2019) demonstrated an attention-based sequence-to-sequence architecture that directly maps source speech spectrograms (e.g. 80-channel log-mel) to target spectrograms without explicit intermediate text. The encoder typically uses deep BLSTM stacks; decoders employ autoregressive LSTM or Transformer modules with multi-head attention. Auxiliary losses for predicting phoneme sequences at the decoder are essential for aligning semantic and acoustic content and for stable training.
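A minimal PyTorch sketch of this style of architecture is given below. Layer sizes, the attention arrangement, and the single-layer decoder are illustrative assumptions, not the configurations used in the cited systems.

```python
import torch
import torch.nn as nn

class ToyDirectS2ST(nn.Module):
    """Illustrative spectrogram-to-spectrogram translator with an auxiliary phoneme head."""

    def __init__(self, n_mels=80, hidden=256, n_phonemes=100):
        super().__init__()
        # Encoder: stacked bidirectional LSTM over source log-mel frames.
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        # Decoder: unidirectional LSTM over teacher-forced (shifted) target frames.
        self.decoder = nn.LSTM(n_mels, 2 * hidden, batch_first=True)
        # Cross-attention from decoder states to encoder states.
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.mel_out = nn.Linear(2 * hidden, n_mels)
        # Auxiliary phoneme classifier used only as a training-time alignment signal.
        self.phoneme_out = nn.Linear(2 * hidden, n_phonemes)

    def forward(self, src_mels, tgt_mels_shifted):
        enc, _ = self.encoder(src_mels)          # (B, T_src, 2H)
        dec, _ = self.decoder(tgt_mels_shifted)  # (B, T_tgt, 2H)
        ctx, _ = self.attn(dec, enc, enc)        # attend to source acoustics
        return self.mel_out(ctx), self.phoneme_out(ctx)

# Training would combine a spectrogram regression loss with the auxiliary phoneme loss:
#   loss = mse(pred_mels, tgt_mels) + aux_weight * cross_entropy(phoneme_logits, phonemes)
```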

Translatotron 2 (Jia et al., 2021) introduced a modular separation of speech encoding, linguistic decoding (phoneme-based), and autoregressive acoustic synthesis, all connected by a shared attention module. Duration-based acoustic modeling, implemented via L² losses on utterance length, and the use of Conformer encoders with data augmentation have become common, offering robustness and accurate alignment.
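One hedged reading of the duration-based objective is a squared-error penalty tying the predicted and reference utterance lengths; the exact Translatotron 2 formulation may differ.

```python
import torch

def duration_consistency_loss(pred_durations, ref_durations):
    """Squared-error penalty on total utterance length (illustrative only).

    `pred_durations` and `ref_durations` hold per-token or per-frame durations;
    summing them gives the utterance lengths whose mismatch is penalized.
    """
    pred_len = pred_durations.sum(dim=-1)
    ref_len = ref_durations.sum(dim=-1)
    return torch.mean((pred_len - ref_len) ** 2)
```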

Unit-Based and Textless Approaches

Recent approaches leverage “textless” representations, converting speech to discrete units such as HuBERT or VQ-VAE tokens (Popuri et al., 2022, Dong et al., 2023, Min et al., 1 Feb 2025). In these systems:

  • Source speech is mapped to semantic units via self-supervised pre-trained encoders.
  • Translation is performed at the unit-sequence level (e.g., speech-to-unit translation, S2UT), followed by unit-to-speech (U2S) synthesis using HiFi-GAN or other neural vocoders (see the sketch after this list).
  • Expressivity is preserved and transferred via conditioning mechanisms (style embeddings, emotion tokens, local pitch contours).
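A sketch of the unit-level translation stage and the surrounding pipeline is shown below. Component names such as hubert_encode and hifigan_vocoder are hypothetical stand-ins for a pretrained unit extractor and a neural vocoder, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class ToyUnitTranslator(nn.Module):
    """Illustrative speech-to-unit translation (S2UT) stage: source units -> target units."""

    def __init__(self, n_units=1000, d_model=256, n_styles=8):
        super().__init__()
        self.src_emb = nn.Embedding(n_units, d_model)
        self.tgt_emb = nn.Embedding(n_units, d_model)
        # Global style/emotion embedding added to every source position (expressivity conditioning).
        self.style_emb = nn.Embedding(n_styles, d_model)
        self.transformer = nn.Transformer(d_model, nhead=4,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
        self.out = nn.Linear(d_model, n_units)

    def forward(self, src_units, tgt_units_shifted, style_id):
        src = self.src_emb(src_units) + self.style_emb(style_id).unsqueeze(1)
        tgt = self.tgt_emb(tgt_units_shifted)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(hidden)  # next-unit logits at each target position

# End-to-end pipeline sketch (component names are assumptions, not a real API):
#   src_units = hubert_encode(src_wave)                      # speech -> discrete semantic units
#   logits    = ToyUnitTranslator()(src_units, tgt_in, sid)  # unit-level translation (teacher-forced)
#   tgt_units = logits.argmax(-1)
#   tgt_wave  = hifigan_vocoder(tgt_units)                   # unit-to-speech synthesis
```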

PolyVoice (Dong et al., 2023) and chain-of-thought S2ST models (Gong et al., 30 May 2024, Cheng et al., 25 Sep 2025) integrate semantic and acoustic modeling via decoder-only Transformer blocks, leveraging LLM pretraining and interleaved text–speech training for improved transfer learning and expressive output.

Diffusion and Advanced Generative Models

Emerging models (Mishra et al., 4 May 2025) utilize diffusion-based generative processes for Mel spectrogram generation, with conditional modeling on phoneme sequences and speaker/accent embeddings. This framework supports simultaneous accent adaptation and translation, optimizing over joint phoneme and prosodic cues and producing high-quality, accent-corrected speech.
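A hedged sketch of one conditional denoising-diffusion training step on mel spectrograms follows; the noise schedule, the denoiser interface, and the conditioning inputs are simplified assumptions rather than the cited model's design.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, mels, phoneme_cond, speaker_cond, n_steps=1000):
    """One DDPM-style training step: predict the noise added to clean mel spectrograms.

    `denoiser` is any network taking (noisy_mels, timestep, phoneme_cond, speaker_cond)
    and returning a tensor shaped like `mels`; all conditioning details are placeholders.
    """
    betas = torch.linspace(1e-4, 0.02, n_steps)          # linear noise schedule
    alphas_cum = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, n_steps, (mels.size(0),))       # random timestep per sample
    a = alphas_cum[t].view(-1, 1, 1)                     # broadcast over (T, n_mels)
    noise = torch.randn_like(mels)
    noisy = a.sqrt() * mels + (1.0 - a).sqrt() * noise   # forward diffusion

    pred_noise = denoiser(noisy, t, phoneme_cond, speaker_cond)
    return F.mse_loss(pred_noise, noise)                 # epsilon-prediction objective
```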

3. Data Resources and Expressive Benchmarking

Multilingual and Expressive Corpora

  • CVSS Corpus (Jia et al., 2022, Sarim et al., 3 Mar 2025): A large, 21-language S2ST corpus derived from Common Voice and CoVoST2, with two target speech variants: canonical (CVSS-C) and voice-transferred (CVSS-T). CVSS-T supports research on voice identity and prosody preservation.
  • Movie-Aligned and Dubbed Speech (Min et al., 1 Feb 2025): Datasets constructed from movie and TV show audio, rigorously aligned for paralinguistic features and duration, enable training and evaluation on segments with matched prosody and emotion.
  • UniST Dataset (Cheng et al., 25 Sep 2025): A 44.8k-hour corpus synthesized via a scalable, expressive TTS pipeline, maintaining high fidelity in speaker identity, emotion, and timing.

Evaluation Measures

  • ASR-BLEU: Transcribes the generated target speech with an ASR system and computes BLEU against reference translations (a sketch of this and the speaker-similarity computation follows this list).
  • MOS (Mean Opinion Score): Subjective perceptual naturalness and similarity assessment, including style and emotion retention.
  • VSim and Speaker Similarity: Cosine similarity between speaker style embeddings (e.g., WavLM, ECAPA-TDNN) to assess identity transfer.
  • Duration/Compliance (SLC): Metrics for timing alignment between source and output.
  • Acoustic Feature Distance: Use of eGeMAPS and other feature sets for quantitative prosody/emotion match (Duret et al., 2023).
  • Robustness under Noise: Objective (ASR-BLEU, AutoPCP, SNR) and subjective (speaker-MOS) assessments in noisy conditions (Hwang et al., 4 Jun 2024).
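A minimal sketch of how ASR-BLEU and cosine speaker similarity are typically computed is shown below; asr_transcribe and speaker_embed are hypothetical callables wrapping an ASR model and a speaker encoder (e.g., ECAPA-TDNN), and sacrebleu is one common BLEU implementation.

```python
import torch
import torch.nn.functional as F
import sacrebleu  # one widely used BLEU implementation; other toolkits work as well

def asr_bleu(generated_waves, reference_texts, asr_transcribe):
    """ASR-BLEU: transcribe generated target speech, then score BLEU against references."""
    hypotheses = [asr_transcribe(wave) for wave in generated_waves]
    return sacrebleu.corpus_bleu(hypotheses, [reference_texts]).score

def speaker_similarity(src_wave, gen_wave, speaker_embed):
    """Cosine similarity between speaker embeddings of source and generated speech."""
    e_src = speaker_embed(src_wave)   # e.g., an ECAPA-TDNN or WavLM-based embedding
    e_gen = speaker_embed(gen_wave)
    return F.cosine_similarity(e_src, e_gen, dim=-1).item()
```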

4. Expressive Modeling: Strategies and Innovations

Explicit Prosody and Style Conditioning

  • Global Style and Local Prosody: Encoders extract global (emotion/state) and local (pitch, emphasis) features. Models such as that of (Min et al., 1 Feb 2025) introduce loss terms to enforce style and pitch consistency.
  • Emotion Embedding: Multilingual emotion embeddings (96-dimensional, extracted with Wav2Vec2-XLSR or similar) are used as conditioning factors for both duration and pitch predictors during synthesis (Duret et al., 2023).
  • Chain-of-Thought Prompting: Cross-modal, stepwise reasoning is used to align audio semantics to text before decoding the final speech, facilitating transfer of translation capacity from text LLMs and a more faithful transfer of expressive content (Cheng et al., 25 Sep 2025, Gong et al., 30 May 2024).
  • Scheduled Interleaved Speech-Text Training: Text tokens are progressively replaced with speech units during training according to a decreasing schedule, easing LLMs' adaptation from text to speech representations (Futami et al., 12 Jun 2025); see the sketch after this list.
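The scheduled interleaving idea can be sketched as follows: each text token is kept or replaced by its aligned speech units with a text-retention probability that decays over training. The alignment format and linear schedule are illustrative assumptions.

```python
import random

def interleave_text_and_units(aligned_pairs, text_keep_prob):
    """Replace text tokens with their aligned speech units with probability (1 - text_keep_prob).

    `aligned_pairs` is a hypothetical list of (text_token, speech_unit_ids) pairs;
    real systems obtain such alignments from a forced aligner or CTC segmentation.
    """
    sequence = []
    for text_token, unit_ids in aligned_pairs:
        if random.random() < text_keep_prob:
            sequence.append(text_token)   # keep the text token
        else:
            sequence.extend(unit_ids)     # substitute its speech units
    return sequence

def text_keep_schedule(step, total_steps):
    """Linearly decay the text-retention probability from 1.0 to 0.0 over training."""
    return max(0.0, 1.0 - step / total_steps)

# Example: early in training the model mostly sees text, later mostly speech units.
pairs = [("hello", [17, 42]), ("world", [5, 5, 99])]
print(interleave_text_and_units(pairs, text_keep_schedule(step=100, total_steps=1000)))
```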

Privacy and Regulation

Preset-Voice Matching (PVM) (Platnick et al., 18 Jul 2024) avoids direct voice cloning by matching the input voice to the closest consenting preset voice in the target language, based on feature codes such as gender and emotion, thereby meeting privacy regulations and reducing misuse risk.
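A hedged sketch of the matching step: pick the closest consenting preset voice by comparing speaker/style embeddings. The cosine-similarity criterion and embedding dimensionality are assumptions, not necessarily the PVM paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def match_preset_voice(input_embedding, preset_embeddings, preset_ids):
    """Return the ID of the consenting preset voice closest to the input speaker.

    `input_embedding` (D,) and `preset_embeddings` (N, D) are speaker/style
    embeddings covering attributes such as gender and emotion; cosine similarity
    is an illustrative matching criterion.
    """
    sims = F.cosine_similarity(preset_embeddings, input_embedding.unsqueeze(0), dim=-1)
    return preset_ids[int(sims.argmax())]

# Usage sketch: the matched preset voice, not the input voice, drives target-language TTS,
# so no cloned voice of the source speaker is ever synthesized.
presets = torch.randn(4, 192)  # four consenting preset-voice embeddings (random placeholders)
chosen = match_preset_voice(torch.randn(192), presets,
                            ["en_f_calm", "en_m_calm", "en_f_excited", "en_m_excited"])
```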

Robustness to Real-World Conditions

Noise-robust unit-to-speech architectures (Hwang et al., 4 Jun 2024) integrate DINO-based self-distillation strategies during pretraining, keeping paralinguistic/expressive embeddings invariant to channel noise and supporting field deployment under adverse conditions.
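As a rough sketch, a DINO-style objective trains a student encoder on a noise-augmented view to match the output distribution of an EMA teacher that sees clean audio; the temperatures, centering, and momentum below are generic DINO conventions, not necessarily the cited system's settings.

```python
import torch
import torch.nn.functional as F

def dino_distillation_loss(student_logits, teacher_logits, center,
                           t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher and student output distributions.

    Teacher logits come from clean audio through an EMA copy of the student;
    student logits come from a noise-augmented view of the same utterance.
    """
    teacher_probs = F.softmax((teacher_logits - center) / t_teacher, dim=-1).detach()
    student_logp = F.log_softmax(student_logits / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

# EMA teacher update, applied after each optimizer step (momentum value is illustrative):
# for p_t, p_s in zip(teacher.parameters(), student.parameters()):
#     p_t.data.mul_(0.996).add_(p_s.data, alpha=0.004)
```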

5. Performance, Application, and Impact

Recent advances demonstrate that direct expressive S2ST models can closely approach, and occasionally match, strong cascaded baselines in BLEU, MOS, and speaker similarity, particularly when leveraging pre-training, chain-of-thought prompting, and expressive unit-level modeling (Cheng et al., 25 Sep 2025, Min et al., 1 Feb 2025, Jia et al., 2022).

Key findings include:

  • BLEU differences between state-of-the-art direct and cascaded systems can be as low as 0.1 to 0.7 points with proper pre-training and initialization (Jia et al., 2022, Sarim et al., 3 Mar 2025).
  • MOS scores for naturalness and emotional similarity approach those of high-quality TTS and enterprise systems when chain-of-thought prompting is used (Cheng et al., 25 Sep 2025).
  • Empirical improvements in expressivity and rhythm (up to 55% over vanilla baselines) have been observed via explicit modeling of style and prosody (Min et al., 1 Feb 2025).
  • Parameter-efficient architectures exist (e.g., a single autoregressive Transformer with multi-stream acoustic modeling (Gong et al., 30 May 2024)), reducing model complexity without degrading quality.
  • Scheduled interleaved training with LLMs offers significant performance boosts in low-resource languages (Futami et al., 12 Jun 2025).

Applications include real-time international communication (conferences, live dubbing), accessible education, culturally faithful media localization, and privacy-sensitive domains where regulated voice transfer is essential (Platnick et al., 18 Jul 2024). Expressive S2ST is also central to cross-cultural accessibility and to minimizing the loss of social-emotional cues in AI-mediated conversation.

6. Open Challenges and Future Directions

Key challenges identified in recent literature (Gupta et al., 13 Nov 2024, Sarim et al., 3 Mar 2025, Min et al., 1 Feb 2025, Cheng et al., 25 Sep 2025) include:

  • Parallel Expressive Data Scarcity: Large-scale, semantically and expressively aligned corpora remain rare; ongoing efforts target scalable synthesis and weakly supervised mining.
  • Architectural Innovation: Improved modeling of prosody, emotion, and voice timbre; development of better prompt strategies and non-autoregressive decoding for latency-critical scenarios.
  • Generalization to Low-Resource and Unwritten Languages: Further leveraging self-supervised, “textless” modeling and zero-shot learning via data augmentation and cross-modal pretraining (Chen et al., 2022, Dong et al., 2023).
  • Evaluating Expressiveness: Need for automatic metrics that go beyond BLEU, such as BLASER for text-free evaluation or direct comparison of eGeMAPS-based acoustic features.
  • Privacy, Security, and Regulation: Mitigating the risk of voice cloning misuse through frameworks like PVM, and evolving protocols as regulatory environments mature (Platnick et al., 18 Jul 2024).
  • Scalability and Efficiency: Expanding to broader language sets, multilingual directions, and further reducing resource and compute requirements, possibly integrating LLM-based chain-of-thought frameworks.

Table: Expressive S2ST System Types and Key Features

| System Type | Expressivity Mechanism | Representative Papers |
| --- | --- | --- |
| Seq2Seq, spectrogram-based (direct) | Speaker encoder, auxiliary phoneme losses | (Jia et al., 2019, Jia et al., 2021) |
| Discrete-unit / unit-based (textless) | Prosody/pitch/style token conditioning | (Popuri et al., 2022, Min et al., 1 Feb 2025) |
| Chain-of-thought prompted LLM | Stagewise semantic + style reasoning | (Gong et al., 30 May 2024, Cheng et al., 25 Sep 2025) |
| Diffusion-based joint S2ST | Conditional diffusion, accent adaptation | (Mishra et al., 4 May 2025) |
| Cascaded, regulated (PVM) | Feature-matched preset voices | (Platnick et al., 18 Jul 2024) |

7. Conclusion

Expressive speech-to-speech translation stands at the intersection of advances in end-to-end neural modeling, self-supervised representation learning, and cross-modal prompt engineering. With increasingly unified architectures, scalable expressive datasets, and innovative training regimens (e.g., chain-of-thought and interleaved text–speech adaptation), modern systems are closing the quality and fidelity gap with cascade pipelines while preserving paralinguistic content. Challenges in data resource availability, robust evaluation, and privacy regulation remain areas of active research, but the field has begun to deliver solutions poised for adoption in high-impact, real-world applications.
