Spectral Codecs: Improving Non-Autoregressive Speech Synthesis with Spectrogram-Based Audio Codecs

Published 7 Jun 2024 in eess.AS | (2406.05298v2)

Abstract: Historically, most speech models in machine-learning have used the mel-spectrogram as a speech representation. Recently, discrete audio tokens produced by neural audio codecs have become a popular alternate speech representation for speech synthesis tasks such as text-to-speech (TTS). However, the data distribution produced by such codecs is too complex for some TTS models to predict, typically requiring large autoregressive models to get good quality. Most existing audio codecs use Residual Vector Quantization (RVQ) to compress and reconstruct the time-domain audio signal. We propose a spectral codec which uses Finite Scalar Quantization (FSQ) to compress the mel-spectrogram and reconstruct the time-domain audio signal. A study of objective audio quality metrics and subjective listening tests suggests that our spectral codec has comparable perceptual quality to equivalent audio codecs. We show that FSQ, and the use of spectral speech representations, can both improve the performance of parallel TTS models.

Abstract PDF HTML Upgrade to Chat

Authors (5)

Citations (5)

View on Semantic Scholar

Summary

The paper introduces a spectral codec that leverages FSQ quantization of mel-spectrograms to enable more effective non-autoregressive TTS synthesis.
It employs a modified HiFi-GAN architecture with multi-band FSQ to simplify latent representations and enhance audio reconstruction.
Results demonstrate comparable reconstruction quality and superior token accuracy, highlighting practical improvements for efficient TTS systems.

Spectral Codecs for Non-Autoregressive Speech Synthesis

Introduction

The development of neural Text-to-Speech (TTS) systems predominantly utilizes the mel-spectrogram as the speech representation, which is advantageous due to its compressibility into a latent space conducive for probabilistic prediction models such as VAEs. Traditional systems faced limitations due to over-smoothness resulting from regression losses. In recent times, discrete audio tokens produced by neural audio codecs have emerged as prominent speech representations for TTS tasks. These tokens, however, present complex data distributions that non-autoregressive models struggle to predict effectively. This work introduces a spectral codec that quantizes mel-spectrogram features to improve audio quality in non-autoregressive TTS.

Methodology

Spectral Codec Architecture

The proposed spectral codec utilizes a modified version of the HiFi-GAN architecture (Figure 1). The architecture encodes mel-spectrograms through residual blocks and quantizes them using Finite Scalar Quantizer (FSQ), allowing non-autoregressive models to predict them with relative ease. Three quantization methods were explored: standard RVQ, FSQ, and multi-band FSQ. The multi-band approach enhances performance by distributing mel-bands equally among separate encoders, creating distinct codebooks for each set.

Figure 1: Architecture of the proposed multi-band spectral codec. Disjoint mel-bands are encoded separately and quantized using an FSQ. All codebook embeddings are concatenated and given as input to a HiFi-GAN decoder which predicts the final waveform.

Audio Codec Architecture

An inverted HiFi-GAN decoder is employed to create a symmetric encoder-decoder audio codec. This codec quantizes time-domain audio signals, similar to spectral codecs but operates over audio waveforms. The architecture includes symmetric upsampling and downsampling rates, maintaining competitive performance across reconstructed time-domain signals.

Training Objective and Evaluation

The models were trained with multi-resolution mel-spectrogram and STFT losses, incorporating discriminator setups to enhance phase estimation. Evaluation employed a combination of instrumental metrics and informal listening tests to assess codec efficiency. TTS models were trained with FastPitch, adjusted for softmax prediction of FSQ-encoded tokens to test synthesis quality.

Results and Discussion

Codec Performance

In evaluating codec performances (Table 1), the spectral codec displayed similar perceptual quality to traditional audio codecs across metrics like ViSQOL and mel-spectrogram reconstruction. While time-domain losses indicated variability, the overall differences were minimal in reconstructions, emphasizing spectral codecs' potential for comparable quality despite the reduced complexity in token generation.

Figure 2: Mel-spectrogram examples: (a) ground truth speech, (b) TTS output generated from the audio codec with RVQ, and (c) TTS output generated from the multi-band spectral codec.

TTS Performance

TTS performance indicated that non-autoregressive models trained on spectral codecs produced significantly higher quality audio compared to those trained on time-domain codecs (Table 2). Spectral codecs using FSQ provided superior synthesis results, achieving clearer and more accurate audio output. These models exhibited enhanced token accuracy, underscoring the advantage of a well-structured flat codebook in FSQ for TTS systems.

Conclusion

This study introduced a spectral codec paradigm that improves TTS outputs by simplifying the latent space predictions via FSQ. Comparisons between mel-spectrogram quantization and time-domain signal compression revealed comparable reconstruction quality, while the former facilitated superior audio synthesis in non-autoregressive models. Future work will expand on these findings across additional speech synthesis systems to further validate the proposed model's efficacy in varying contexts. The implications of shifting towards spectral codecs can influence the design of future high-efficiency TTS architectures.

Markdown Report Issue