- The paper presents a novel prompt-driven TTS framework that integrates emotion and intensity encoders into the FastSpeech 2 architecture.
- It employs LLM-based prompt control with HuBERT feature extraction to adjust prosody at both global and local levels, enhancing speech expressiveness.
- The framework outperforms competing models, achieving higher emotion classification accuracy while introducing minimal additional distortion, as confirmed by both objective metrics and human evaluations.
PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control
Introduction
Speech synthesis has progressed from statistical techniques to deep learning methods, producing text-to-speech systems that sound increasingly realistic and humanlike. While these systems achieve a high degree of naturalness, replicating expressive qualities such as emotion, intonation, and speaking style remains a significant challenge. The paper "ProEmo: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control" (2501.06276) addresses this challenge by proposing a framework for expressive speech synthesis that uses prompt-based methods to control emotion and intensity across multiple speakers. The approach leverages large language models (LLMs) to manipulate speech prosody, integrating emotion and intensity cues via embeddings, which yields more expressive and varied speech.
Methodology
The core of the proposed framework augments the FastSpeech 2 (FS2) architecture with emotion and intensity encoders to enhance the naturalness and expressiveness of synthesized speech. The pipeline proceeds in stages: the multispeaker TTS backbone is first pre-trained, and the emotion and intensity encoders are then integrated during fine-tuning. The framework uses HuBERT for feature extraction, with distinct heads dedicated to emotion classification and intensity regression.
Figure 1: Overview of the proposed expressive speech generation framework, composed of four modules: an FS2-based TTS backbone, a HuBERT-based emotion encoder, a HuBERT-based intensity encoder, and GPT-4 prompting for prosody control.
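To make the encoder design concrete, here is a minimal sketch of a HuBERT-based encoder with separate emotion-classification and intensity-regression heads. The pooling strategy, layer sizes, class count, and pretrained checkpoint name are assumptions for illustration, not details taken from the paper.

```python
# Sketch (not the authors' implementation) of a HuBERT encoder with two heads:
# emotion classification and intensity regression.
import torch
import torch.nn as nn
from transformers import HubertModel

class EmotionIntensityEncoder(nn.Module):
    def __init__(self, num_emotions: int = 5, emb_dim: int = 256):
        super().__init__()
        # Checkpoint name is an assumption; any HuBERT variant could be used.
        self.hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
        hidden = self.hubert.config.hidden_size          # 768 for hubert-base
        self.proj = nn.Linear(hidden, emb_dim)           # embedding fed to the TTS backbone
        self.emotion_head = nn.Linear(emb_dim, num_emotions)  # emotion classification head
        self.intensity_head = nn.Linear(emb_dim, 1)           # intensity regression head

    def forward(self, waveform: torch.Tensor):
        # waveform: (batch, samples) at 16 kHz
        frames = self.hubert(waveform).last_hidden_state      # (batch, frames, hidden)
        utterance = frames.mean(dim=1)                         # mean-pool over time (assumption)
        emb = torch.tanh(self.proj(utterance))
        return emb, self.emotion_head(emb), self.intensity_head(emb).squeeze(-1)
```

During fine-tuning, the resulting embedding would be conditioned into the FS2 backbone while the two heads supply the auxiliary emotion and intensity objectives.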
Prompt control, implemented through LLM-suggested global and local modifications, enables fine-grained control over prosody at both the utterance and word levels. These modifications keep the synthesized speech in the desired emotional tone while adjusting pitch, duration, and energy for expressive effect.
Figure 2: Introduction to Prompt Control: The scaling factors suggested by the LLM directly affect the Variance Adaptor.
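The following hedged sketch illustrates the idea of applying LLM-suggested scaling factors to the variance adaptor's pitch, energy, and duration predictions at the global (utterance) and local (word) levels. The data structures, factor names, and word-to-frame mapping are hypothetical; the paper's exact prompt format and interface may differ.

```python
# Illustrative application of LLM-suggested prosody scaling factors.
from dataclasses import dataclass

@dataclass
class ProsodyScale:
    pitch: float = 1.0
    energy: float = 1.0
    duration: float = 1.0

def apply_prompt_control(pitch, energy, duration, global_scale, word_scales, word2frames):
    """pitch/energy/duration: per-frame predictions as mutable lists of floats.
    global_scale: utterance-level ProsodyScale.
    word_scales: {word_index: ProsodyScale} for local, word-level control.
    word2frames: {word_index: list of frame indices covered by that word}."""
    # Global (utterance-level) scaling.
    for i in range(len(pitch)):
        pitch[i] *= global_scale.pitch
        energy[i] *= global_scale.energy
        duration[i] *= global_scale.duration
    # Local (word-level) scaling on top of the global factors.
    for w, scale in word_scales.items():
        for i in word2frames[w]:
            pitch[i] *= scale.pitch
            energy[i] *= scale.energy
            duration[i] *= scale.duration
    return pitch, energy, duration
```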
Experimental Setup
The experiments use the LibriTTS and Emotional Speech Database (ESD) corpora to pre-train and fine-tune the FS2 model. The evaluation spans several configurations, testing the incorporation of the emotion and intensity encoders as well as the prompt controls. Emotion Classification Accuracy (ECA), Mel Cepstral Distortion (MCD), Word Error Rate (WER), and Character Error Rate (CER) objectively assess synthesis quality and expressiveness. Subjective listening tests, including Mean Opinion Score (MOS) and Perceptual Intensity Ranking (PIR), corroborate the objective findings by assessing perceived expressiveness and intensity levels.
Figure 3: Perceptual Intensity Ranking. H, M, L: high, medium, and low intensities.
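As an example of the objective metrics listed above, a minimal MCD computation might look as follows. The frame-alignment step (e.g. DTW) and the number of cepstral coefficients are evaluation choices assumed here, not specified by the paper.

```python
# Hedged sketch of Mel Cepstral Distortion (MCD) between reference and synthesized speech.
import numpy as np

def mel_cepstral_distortion(ref_mcep: np.ndarray, syn_mcep: np.ndarray) -> float:
    """ref_mcep, syn_mcep: (frames, coeffs) mel-cepstral coefficients,
    assumed already time-aligned and excluding the 0th (energy) coefficient."""
    diff = ref_mcep - syn_mcep
    # Standard MCD formula: (10 / ln 10) * sqrt(2 * sum_k (c_k - c'_k)^2), averaged over frames.
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```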
Results
The incorporation of emotion and intensity embeddings notably enhances emotion classification accuracy in the generated speech. FS2 models with the added encoders consistently outperform simpler configurations and competing architectures such as Daft-Exprt, showing superior expressiveness, particularly when local-level prompt control is employed. Objective metrics show little added distortion, indicating that the additional encoders integrate without degrading signal quality. Human evaluations support these findings, highlighting improvements in perceived expressiveness and coherence.
Figure 4: t-SNE of embeddings for the ESD validation set. (a) Intensity embeddings computed using the relative ranking function, r(.). (b) Joint emotion and intensity embeddings.
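A plot like Figure 4 can be produced with a generic t-SNE projection of the learned embeddings, as sketched below. This assumes the embeddings and labels have already been extracted; the relative ranking function r(.) itself follows the paper and is not reproduced here.

```python
# Generic t-SNE visualization of emotion/intensity embeddings (sketch).
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embeddings(embeddings: np.ndarray, labels: np.ndarray, title: str):
    """embeddings: (N, D) array of learned embeddings; labels: (N,) emotion or intensity labels."""
    points = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(embeddings)
    for lab in np.unique(labels):
        mask = labels == lab
        plt.scatter(points[mask, 0], points[mask, 1], s=5, label=str(lab))
    plt.legend()
    plt.title(title)
    plt.show()
```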
Conclusion
The presented architecture and methodology offer a substantial advance in expressive speech synthesis, demonstrating how LLM-driven, prompt-based control can raise the naturalness and variability of synthetic speech. Through the combined use of emotion and intensity encoders and refined prompt-control mechanisms, the framework models nuanced emotional expression in synthesized speech. Future work could extend this approach to multilingual settings, real-time applications, and robustness across diverse emotional cues.