- The paper introduces FluentSpeech, which employs a context-aware diffusion model and a dedicated stutter predictor to automate stutter removal.
- It effectively mitigates over-smoothing, enhancing speech naturalness and fluency as demonstrated on VCTK, LibriTTS, and the SASE dataset.
- The model achieves superior performance with improved MCD, STOI, and PESQ metrics while significantly reducing manual intervention in speech editing.
Overview of FluentSpeech: Stutter-Oriented Automatic Speech Editing
The paper "FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models" addresses critical challenges in the domain of speech editing, focusing on the automatic removal of stutters. The researchers identify three primary limitations of existing speech editing methods when applied to stuttered speech: over-smoothing in the edited speech, lack of robustness due to noise from stutters, and the manual effort required for identifying stutter regions. To address these, they present FluentSpeech, a generative model that automates the detection and removal of stutters while generating natural and fluent speech.
The paper introduces several innovative components. First, the authors employ a context-aware diffusion model to refine mel-spectrogram modifications, avoiding the over-smoothing issue prevalent in non-probabilistic models. Second, they incorporate a stutter predictor module that localizes stutter regions and injects stutter-related information into the hidden sequences, making the system robust to discrepancies between the textual transcript and the actual speech content. Finally, they create the stutter-oriented automatic speech editing (SASE) dataset, a new benchmark for training and evaluating models on this task.
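To make the context-aware editing idea concrete, here is a toy sketch (not the authors' implementation) of diffusion-style infilling: only the frames flagged as stutter are regenerated from noise, while the surrounding context frames are held fixed and condition every reverse step. The denoiser here is a deliberately trivial stand-in for the learned network.

```python
import numpy as np

def toy_denoiser(noisy, context, mask, t):
    """Stand-in for the learned denoising network: it simply pulls the
    masked frames toward the mean of the unmasked context frames."""
    target = context[~mask].mean(axis=0)
    return noisy + 0.5 * (target - noisy)

def diffusion_infill(mel, mask, steps=50, rng=None):
    """Regenerate masked frames by iterative denoising, keeping context fixed."""
    rng = np.random.default_rng(rng)
    x = rng.standard_normal(mel.shape)           # start masked region from noise
    for t in reversed(range(steps)):
        x_hat = toy_denoiser(x, mel, mask, t)    # predict cleaner frames
        noise = rng.standard_normal(mel.shape) * (t / steps) * 0.1
        x = x_hat + noise                        # stochastic reverse step
        x[~mask] = mel[~mask]                    # clamp context frames each step
    return x

mel = np.tile(np.linspace(0.0, 1.0, 8), (20, 1))   # 20 frames x 8 mel bins
mask = np.zeros(20, dtype=bool)
mask[8:12] = True                                  # frames flagged as stutter
edited = diffusion_infill(mel, mask, rng=0)
```

The clamping step is what makes the procedure "context-aware" in spirit: the unedited frames are never altered, and the regenerated region is driven toward consistency with them.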
Key Contributions and Results
The paper makes several notable contributions to the field:
- Context-Aware Diffusion Model: By utilizing contextual features in the modification of mel-spectrograms, FluentSpeech mitigates the over-smoothing problem and achieves a high level of expressiveness and sound quality in the edited speech.
- Stutter Predictor Module: This module's design enables the automatic identification of stutter regions, allowing FluentSpeech to selectively apply corrections, significantly reducing the labor required in traditional methods.
- SASE Dataset: The introduction of a dataset containing 40 hours of annotated spontaneous speech with time-aligned stutter labels provides a crucial resource for future research and validation in automatic stutter removal.
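The stutter predictor's output stage can be pictured as follows (a hypothetical sketch; the paper does not specify this post-processing): frame-level stutter probabilities are thresholded and merged into contiguous regions, which are then handed to the editing model.

```python
import numpy as np

def stutter_regions(frame_probs, threshold=0.5):
    """Return [start, end) frame index pairs where P(stutter) > threshold."""
    flags = np.asarray(frame_probs) > threshold
    regions, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i                     # a stutter region opens
        elif not flag and start is not None:
            regions.append((start, i))    # the region closes
            start = None
    if start is not None:                 # region runs to the final frame
        regions.append((start, len(flags)))
    return regions

probs = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.6, 0.9, 0.2]
print(stutter_regions(probs))  # [(2, 5), (6, 8)]
```

Automating this localization step is what removes the manual region-marking effort that the paper identifies as a limitation of prior methods.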
Experiments on the VCTK and LibriTTS datasets show that FluentSpeech surpasses existing state-of-the-art models on objective metrics of speech quality and intelligibility such as MCD, STOI, and PESQ. It achieves these results with fewer parameters than peer models, underscoring its computational efficiency.
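Of the metrics above, mel-cepstral distortion (MCD) has a simple closed form. A sketch of it, under the common convention that excludes the 0th (energy) coefficient, is MCD = (10 / ln 10) · sqrt(2 · Σ_d (c_d − c'_d)²), averaged over aligned frames:

```python
import numpy as np

def mcd(ref_mcep, syn_mcep):
    """Mean MCD in dB between two aligned (frames x coeffs) cepstral arrays."""
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]           # drop energy coefficient
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(per_frame.mean())

rng = np.random.default_rng(0)
a = rng.standard_normal((100, 13))   # e.g. 13 mel-cepstral coefficients/frame
print(mcd(a, a))                     # identical sequences -> 0.0 dB
```

Lower MCD indicates the edited spectrum is closer to the reference; STOI and PESQ, by contrast, are higher-is-better intelligibility and quality scores.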
On the SASE dataset, FluentSpeech demonstrates improved robustness and fluency, marking a significant step forward in automatic stutter removal. These gains are supported both by objective results and by subjective evaluations of naturalness and fluency from human listeners.
Implications and Future Directions
The research presented in this paper offers several implications for both theoretical exploration and practical applications. The application of diffusion models conditioned on complex contextual information could inspire further innovations in various speech and audio-related synthesis tasks. Additionally, the context-aware methodology presented herein could be extended beyond stutter removal to more general speech refinement tasks, aiding in the development of automated systems capable of handling diverse speech imperfections.
Looking forward, future research could focus on enhancing the stutter predictor's architecture to further improve accuracy and efficiency. Moreover, extending FluentSpeech's capabilities to process multilingual stuttering speech presents an intriguing direction for research, given the global diversity of stuttering patterns and their linguistic contexts.
The work on FluentSpeech is foundational rather than exhaustive. It lays solid groundwork for subsequent studies and applications in speech editing, particularly those aimed at improving communication accessibility and media-production efficiency. Adapting components of the model to real-time processing could also be explored, offering greater immediacy in practical settings such as live broadcasting and telecommunication.