- The paper introduces VocBench, a unified benchmarking framework that standardizes evaluations for various neural vocoder types in speech synthesis.
- It compares autoregressive, GAN-based, and diffusion models across single- and multi-speaker datasets using both subjective (MOS) and objective (FAD, SSIM) metrics.
- Results highlight that GAN vocoders offer superior synthesis speed and competitive quality, while speaker generalizability remains a key challenge.
VocBench: A Neural Vocoder Benchmark for Speech Synthesis
The paper "VocBench: A Neural Vocoder Benchmark for Speech Synthesis" introduces VocBench, a comprehensive framework for benchmarking state-of-the-art neural vocoders in speech synthesis. It addresses the difficulty of fairly comparing disparate vocoder designs by providing a unified testing environment.
Context and Contributions
Neural vocoders are an integral part of speech synthesis pipelines, converting spectral representations such as Mel-Spectrograms into time-domain waveforms. As the field has moved from classical digital signal processing (DSP) methods to deep learning, neural vocoders have branched into families such as autoregressive models, GAN-based models, and diffusion models.
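To make the vocoder's task concrete, the sketch below computes the log-mel spectrogram that a vocoder would take as input and invert back to a waveform. It is a minimal NumPy implementation; the parameters (1024-point FFT, hop of 256, 80 mel bands at 22.05 kHz) are common illustrative defaults, not necessarily VocBench's exact configuration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=1024, sr=22050):
    """Triangular mel filters mapping an FFT power spectrum to n_mels bands."""
    fft_freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(x, n_fft=1024, hop=256, n_mels=80, sr=22050):
    """Frame the signal, take |FFT|^2, project onto mel bands, take the log."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(mel_filterbank(n_mels, n_fft, sr) @ power.T + 1e-10)

# Example: one second of a 440 Hz tone.
sr = 22050
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
mel = log_mel_spectrogram(tone)
print(mel.shape)  # (80, 83)
```

A neural vocoder learns the inverse of this (lossy, phase-discarding) mapping, which is what makes the task non-trivial.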
VocBench aims to harmonize the evaluation landscape by offering consistent datasets, training configurations, and evaluation metrics. It focuses on two evaluation themes: reconstructing waveforms from Mel-Spectrograms, and generalizing to speakers not present in the training data.
Experimental Setup and Evaluation
VocBench employs three datasets—LJ Speech, LibriTTS, and VCTK—covering both single-speaker and multi-speaker scenarios. These datasets are used to train and evaluate six vocoders spanning the major methodological families: autoregressive (WaveNet, WaveRNN), GAN-based (MelGAN, Parallel WaveGAN), and diffusion-based (WaveGrad, DiffWave).
The evaluation combines a subjective metric, the Mean Opinion Score (MOS), with objective ones: Structural Similarity Index Measure (SSIM), Fréchet Audio Distance (FAD), Log-mel Spectrogram Mean Squared Error (LS-MSE), and Peak Signal-to-Noise Ratio (PSNR). The results show that no single vocoder type dominates on every criterion; each family exhibits distinct strengths depending on the metric.
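Two of the objective metrics are straightforward to compute once reference and synthesized log-mel spectrograms are aligned. The sketch below uses common formulations of LS-MSE and PSNR; the exact normalization and peak definition in the paper may differ.

```python
import numpy as np

def ls_mse(ref_logmel, syn_logmel):
    """Mean squared error between two log-mel spectrograms (lower is better)."""
    return float(np.mean((ref_logmel - syn_logmel) ** 2))

def psnr(ref, syn):
    """Peak signal-to-noise ratio in dB (higher is better)."""
    mse = np.mean((ref - syn) ** 2)
    peak = np.max(np.abs(ref))
    return float(10.0 * np.log10(peak ** 2 / mse))

# Toy example: an 80-band x 100-frame "log-mel" matrix plus small noise.
rng = np.random.default_rng(0)
ref = rng.standard_normal((80, 100))
syn = ref + 0.1 * rng.standard_normal((80, 100))
print(ls_mse(ref, syn))  # ≈ 0.01, the injected noise variance
print(psnr(ref, syn))
```

In practice, both metrics require the reference and synthesized utterances to be time-aligned frame by frame, which is why they are computed on copy-synthesis (ground-truth mel) outputs.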
Numerical Findings
A critical insight from the benchmarking process is that GAN-based models generally outperform autoregressive models in terms of synthesis speed and maintain competitive quality scores, as evident from the MOS and FAD metrics. Among GAN vocoders, Parallel WaveGAN shows strong results particularly in computational efficiency, while diffusion models like DiffWave excel in synthesis quality for complex datasets such as VCTK.
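Synthesis-speed comparisons of this kind are typically reported as a real-time factor (RTF): wall-clock synthesis time divided by the duration of the generated audio, with values below 1 meaning faster than real time. A minimal sketch, with a hypothetical `vocode` function standing in for an actual model call:

```python
import time
import numpy as np

def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = wall-clock synthesis time / audio duration; < 1 is faster than real time."""
    return synthesis_seconds / audio_seconds

def vocode(mel, hop=256):
    # Stand-in for a real model: a parallel (GAN-style) vocoder emits
    # all samples in one pass, one hop's worth per mel frame.
    return np.zeros(mel.shape[1] * hop)

sr = 22050
mel = np.zeros((80, 860))               # ~10 s of audio at hop 256, 22.05 kHz
t0 = time.perf_counter()
wav = vocode(mel)
elapsed = time.perf_counter() - t0
rtf = real_time_factor(elapsed, len(wav) / sr)
print(rtf)  # well below 1 for this trivial stand-in
```

Autoregressive vocoders must run this loop one sample at a time, which is why their RTF is typically orders of magnitude higher than that of parallel GAN-based models.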
Speaker generalizability remains a nuanced challenge, particularly when single-speaker and multi-speaker datasets are compared. Vocoders tend to achieve more robust scores on single-speaker setups such as LJ Speech than on multi-speaker datasets, underlining the need for stronger generalization in future vocoder research.
Implications and Future Directions
VocBench offers a rigorous reference point for gauging vocoder advancements, promoting research into improved vocoder architectures adept at balancing computational efficiency with synthesis quality. The framework's open-source nature invites contributions from the broader community, encouraging methodological innovations and the expansion of benchmarks that reflect advancements in neural vocoder technologies.
Looking forward, anticipated directions include optimizing vocoder designs for heterogeneous computational environments and improving cross-speaker generalizability. Future research could explore hybrid model architectures or transfer learning to boost performance across varied speech synthesis contexts. As a shared tool, VocBench can facilitate data-driven refinement and empirical exploration in the neural vocoder domain.