
ECTSpeech: One-Step Diffusion TTS

Updated 9 October 2025
  • ECTSpeech is a one-step speech synthesis framework that fine-tunes a pretrained diffusion model via Easy Consistency Tuning to achieve efficient and high-fidelity audio generation.
  • It incorporates a Multi-Scale Gate Module (MSGate) to adaptively fuse features at various scales, enhancing the reconstruction of spectral details.
  • Experimental results on LJSpeech demonstrate that ECTSpeech closely matches state-of-the-art audio quality while significantly reducing training complexity and inference time.

ECTSpeech is a one-step speech synthesis framework that applies the Easy Consistency Tuning (ECT) strategy to a pretrained diffusion model for text-to-speech (TTS) generation. This approach is designed to address the high computational cost and slow inference of conventional diffusion-based TTS systems, allowing high-quality audio to be generated efficiently in a single sampling step. ECTSpeech further incorporates a novel Multi-Scale Gate Module (MSGate) in the denoising network, enhancing multi-scale feature fusion for improved speech fidelity. The framework demonstrates competitive performance on LJSpeech, closely matching state-of-the-art multi-step and consistency-distilled methods, while substantially reducing training complexity and inference latency (Zhu et al., 7 Oct 2025).

1. Motivation and Key Concepts

Diffusion models have established state-of-the-art performance in audio and speech synthesis by producing high-fidelity samples through iterative denoising, where a learnable neural network sequentially removes noise from an input. However, standard approaches require many denoising steps, incurring significant inference delays. Previous attempts to reduce inference steps rely on consistency models distilled from teacher diffusion models (e.g., CoMoSpeech), but this introduces additional training stages and dependencies.

ECTSpeech eliminates the need for a separate teacher or distillation procedure by directly fine-tuning a pretrained diffusion model. The ECT strategy incrementally tightens consistency constraints in the model, enabling robust one-step generation while simplifying the training pipeline and minimizing additional training cost or complexity.

2. Easy Consistency Tuning (ECT) Technique

ECTSpeech’s core innovation is its two-stage training regime, leveraging Easy Consistency Tuning:

  • Stage 1 (Diffusion Pretraining):

An Elucidated Diffusion Model (EDM) is trained to reconstruct a clean mel-spectrogram $x_0$ from a noisy input $x_t = x_0 + t\epsilon$, optimizing

$$L_{\mathrm{EDM}} = \| f_\theta(x_t, t, \mu) - x_0 \|^2$$

where $f_\theta$ denotes the U-Net-based denoising network, $t$ is the noise level, and $\mu$ is the conditioning input produced by the TTS front-end.

  • Stage 2 (Consistency Fine-Tuning):

Only the denoising network is updated, with all other modules (text encoder, duration predictor, etc.) frozen. For each training sample, two noise levels $(t, r)$ with $r \le t$ are sampled:

$$x_t = x_0 + t\epsilon, \qquad x_r = x_0 + r\epsilon$$

The consistency loss is

$$L_{\mathrm{ECT}} = \| f_\theta(x_t, t, \mu) - f_\theta^{\mathrm{sg}}(x_r, r, \mu) \|^2$$

Here, $f_\theta^{\mathrm{sg}}$ indicates that the output at the lower noise level $r$ is detached from the computational graph for stability. Initially $r$ is kept close to $t$, so the constraint is nearly the denoising objective the pretrained model already satisfies; as training progresses, $r$ is annealed toward 0, incrementally enforcing consistency over broader noise intervals. This facilitates a gradual and stable transition to reliable one-step denoising (see the training sketch after this list).

  • Mask-Based Normalization:

A mask-based normalization scheme is applied in the loss computation so that padded frames are excluded and samples of disparate lengths contribute fairly to the gradient.
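
The two-stage objective can be summarized in a brief PyTorch-style sketch. This is an illustrative reconstruction rather than the authors' code: the `denoiser` signature, the noise-level sampling distribution, the $(t, r)$ curriculum, and the mask layout are assumptions.

```python
import math
import torch


def masked_mse(pred, target, mask):
    """Mask-based normalization: average the squared error only over valid
    (non-padded) frames of each sample, so short and long utterances
    contribute comparably to the gradient."""
    err = ((pred - target) ** 2) * mask
    return (err.sum(dim=(1, 2)) / mask.sum(dim=(1, 2)).clamp(min=1.0)).mean()


def edm_loss(denoiser, x0, mu, mask, sigma_min=0.002, sigma_max=80.0):
    """Stage 1: denoising objective L_EDM on the pretrained diffusion model.

    x0:   clean mel-spectrogram, shape (B, n_mels, T)
    mu:   conditioning from the TTS front-end, aligned to T
    mask: float mask of shape (B, 1, T), 1 for valid frames, 0 for padding
    """
    b = x0.size(0)
    # Sample one noise level per utterance (log-uniform is a stand-in here;
    # the paper's exact sampling distribution may differ).
    t = torch.exp(torch.empty(b, 1, 1, device=x0.device)
                  .uniform_(math.log(sigma_min), math.log(sigma_max)))
    eps = torch.randn_like(x0)
    x_t = x0 + t * eps                        # x_t = x_0 + t * eps
    return masked_mse(denoiser(x_t, t, mu), x0, mask)


def ect_loss(denoiser, x0, mu, mask, t, r):
    """Stage 2: consistency loss L_ECT between two noise levels r <= t.

    A curriculum supplies (t, r): early in fine-tuning r is close to t, and it
    is annealed toward 0 so consistency is enforced over wider intervals.
    """
    eps = torch.randn_like(x0)                # shared noise for both corrupted views
    x_t = x0 + t * eps
    x_r = x0 + r * eps
    pred_t = denoiser(x_t, t, mu)
    with torch.no_grad():                     # f_theta^sg: target branch is detached
        pred_r = denoiser(x_r, r, mu)
    return masked_mse(pred_t, pred_r, mask)
```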

3. Multi-Scale Gate Module (MSGate)

The MSGate is designed to enhance the denoising network’s multi-scale feature representation. Standard U-Net skip connections pass features from encoder to decoder, but lack adaptive fusion. MSGate addresses this via:

  • Four parallel branches:
    • 1×1 convolution (channel-wise local structure)
    • 3×3 convolution (local spatial patterns)
    • 5×5 convolution (broader context)
    • Global pooling with 1×1 convolution and upsampling (global context)
  • The concatenated outputs are fused via a $1\times1$ convolution and passed through a sigmoid activation to yield adaptive gating weights:

$$y = h \odot \sigma\big(W_{\text{fuse}}([h_{1\times1}; h_{3\times3}; h_{5\times5}; h_{\text{global}}])\big)$$

where $h$ is the skip-connection input, $h_{1\times1}$, $h_{3\times3}$, $h_{5\times5}$, and $h_{\text{global}}$ are the four branch outputs, $\sigma$ is the sigmoid, and $W_{\text{fuse}}$ is a learned projection. The gate adaptively emphasizes informative features at different scales, which is critical for capturing both fine details and broader structures in speech spectra.
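
A minimal PyTorch sketch of such a gated skip connection is given below. It assumes 2D (frequency × time) feature maps and equal channel counts across branches; these details, and the module's exact placement in the U-Net, are assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MSGate(nn.Module):
    """Multi-scale gating applied to a U-Net skip connection (illustrative sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.branch1 = nn.Conv2d(channels, channels, kernel_size=1)             # channel-wise local structure
        self.branch3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # local spatial patterns
        self.branch5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)  # broader context
        self.branch_g = nn.Conv2d(channels, channels, kernel_size=1)            # applied after global pooling
        self.fuse = nn.Conv2d(4 * channels, channels, kernel_size=1)            # W_fuse

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: skip-connection feature map of shape (B, C, F, T)
        h1 = self.branch1(h)
        h3 = self.branch3(h)
        h5 = self.branch5(h)
        # Global branch: pool to 1x1, project, then upsample back to the input resolution.
        hg = F.interpolate(self.branch_g(F.adaptive_avg_pool2d(h, 1)),
                           size=h.shape[-2:], mode="nearest")
        gate = torch.sigmoid(self.fuse(torch.cat([h1, h3, h5, hg], dim=1)))
        return h * gate   # y = h ⊙ σ(W_fuse([h_1x1; h_3x3; h_5x5; h_global]))
```

In this sketch, `MSGate(channels)(skip)` would replace passing the raw skip feature directly to the corresponding decoder stage.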

4. Experimental Results and Benchmarks

Experiments on the LJSpeech dataset (13,100 utterances, 80-dim mel-spectrograms, single-speaker) demonstrate:

  • Audio Quality:

Pretrained ECTSpeech with 50-step sampling: MOS $= 4.35 \pm 0.09$; one-step ECTSpeech (after consistency tuning): MOS $= 4.16 \pm 0.08$.

  • Objective Measures:

Fréchet Audio Distance (FAD) and Fréchet Distance (FD) indicate that the distribution of ECTSpeech's outputs closely matches that of the ground-truth audio.

  • Ablation Studies:

Removing MSGate or mask-based normalization degrades performance; absent consistency tuning, FD increases sharply to 7.97, and MOS drops to 3.41.

  • Efficiency:

The model omits teacher-student distillation, reducing overall training iterations and complexity. Single-step inference offers substantial speedup relative to multi-step diffusion or consistency-distilled models without a significant drop in audio quality.
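
To make the latency argument concrete, one-step generation under the EDM-style corruption $x_t = x_0 + t\epsilon$ reduces to a single forward pass of the denoiser. The sketch below is an assumption-laden illustration: the `denoiser` signature, the conditioning shape, and the `sigma_max` value are not taken from the paper.

```python
import torch


@torch.no_grad()
def one_step_synthesis(denoiser, mu, sigma_max=80.0):
    """Single-step generation: one forward pass from pure noise to a mel-spectrogram.

    mu is the conditioning produced by the frozen front-end (text encoder,
    duration predictor, length regulator), assumed here to have the same
    (B, n_mels, T) shape as the target mel; sigma_max is an assumed value.
    """
    x_T = sigma_max * torch.randn_like(mu)                            # noise at the largest noise level
    t = torch.full((mu.size(0), 1, 1), sigma_max, device=mu.device)   # per-sample noise level
    mel = denoiser(x_T, t, mu)                                        # one call replaces the 50-step sampler
    return mel                                                        # a vocoder then converts mel to waveform
```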

5. Architectural Design and Training Regime

ECTSpeech employs a standard TTS front-end (text encoder, duration predictor, length regulator) with a U-Net-based acoustic model, integrating MSGate modules at skip connections.

Training details:

  • Hardware: NVIDIA RTX 3090 GPU
  • Optimization: Adam optimizer (pretraining: lr $= 1\times10^{-4}$; consistency tuning: lr $= 1\times10^{-5}$)
  • Training regime: 1.7 million iterations (diffusion pretraining), 170k iterations (ECT fine-tuning)
  • Frozen Modules: Consistency tuning updates only the denoising network; all other modules are frozen.

This design ensures efficient resource utilization and stability, facilitating direct adaptation of a multi-step diffusion model to a robust single-step generator.
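
As a concrete illustration of the frozen-module setup, the sketch below freezes the front-end and builds an optimizer over the denoising network alone; the attribute names (`model.denoiser`, etc.) are placeholders, not the paper's code.

```python
import torch


def configure_consistency_finetuning(model, lr=1e-5):
    """Freeze the TTS front-end and optimize only the denoising network (Stage 2)."""
    for p in model.parameters():
        p.requires_grad = False                  # text encoder, duration predictor, length regulator, ...
    for p in model.denoiser.parameters():        # U-Net denoiser with MSGate skip connections
        p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)    # lr = 1e-5 for ECT tuning; 1e-4 was used for pretraining
```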

6. Implications, Limitations, and Future Directions

ECTSpeech demonstrates that direct fine-tuning of a pretrained diffusion model using ECT loss achieves state-of-the-art audio quality in one-step TTS synthesis without requiring additional teacher models or complex distillation pipelines. MSGate enables improved spectral detail reconstruction via adaptive multi-scale feature fusion.

Potential avenues for further research include:

  • Extending the framework to multi-speaker TTS and integrating emotional speech synthesis capabilities.
  • Exploring lighter denoising network architectures to reduce computational overhead.
  • Applying ECT to broader generative modeling domains or investigating alternative consistency regularization schedules.

The approach significantly advances the practicality of diffusion-based TTS by unifying high fidelity, efficient inference, and reduced training complexity (Zhu et al., 7 Oct 2025).

