HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution
Speech super-resolution (SR) remains an active research area in audio processing, driven by applications such as speech quality enhancement and the restoration of historical recordings. The paper "HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution" introduces an architecture that targets a key limitation of existing SR methods: degraded output quality in out-of-domain scenarios.
Overview of HiFi-SR Model
The proposed model, HiFi-SR, tackles the SR problem with a generative adversarial network (GAN) built around a hybrid transformer-convolutional generator. The architecture operates end to end, avoiding the intermediate mel-spectrogram representations and separate vocoder stages common in prior work. By unifying the prediction of latent representations and their synthesis into time-domain waveforms under a single adversarial training objective, the model aims to deliver consistent, high-fidelity outputs.
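To make the unified idea concrete, here is a minimal PyTorch sketch of a single generator training step. The loss weight, the mel_fn feature extractor, and the discriminator interface are illustrative placeholders rather than the paper's exact configuration; the point is only that every loss is computed on the final waveform and backpropagated through the whole generator, with no separately trained vocoder stage.

```python
# Minimal sketch of an end-to-end adversarial generator step (placeholders, not
# the authors' implementation): losses are computed on the output waveform and
# backpropagated through the entire generator in one pass.
import torch
import torch.nn.functional as F

def generator_step(generator, discriminator, mel_fn, wav_lr, wav_hr, opt_g):
    wav_fake = generator(wav_lr)                       # low-res in, high-res out, one pass
    # Least-squares adversarial loss over the discriminator's output scores.
    adv = sum((1 - s).pow(2).mean() for s in discriminator(wav_fake))
    # Spectral reconstruction loss on the final waveform (mel_fn is a placeholder).
    rec = F.l1_loss(mel_fn(wav_fake), mel_fn(wav_hr))
    loss = adv + 45.0 * rec                            # weighting is illustrative only
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()
```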
Transformer-Convolutional Generator
At the core of HiFi-SR is a transformer-convolutional generator. The transformer component, built on the MossFormer2 framework, encodes the long-range dependencies needed to predict high-frequency structure in speech. The convolutional component, based on the HiFi-GAN generator, then translates these latent representations into high-resolution waveforms and accommodates input speech at a range of sampling rates.
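The sketch below shows a simplified version of this two-part design. A generic nn.TransformerEncoder stands in for MossFormer2 and a reduced transposed-convolution stack stands in for the full HiFi-GAN upsampler; layer sizes and the upsampling factor are illustrative assumptions.

```python
# Simplified transformer-convolutional generator (stand-ins only: a generic
# TransformerEncoder replaces MossFormer2, a short transposed-conv stack
# replaces the full HiFi-GAN decoder).
import torch
import torch.nn as nn

class TransformerConvGenerator(nn.Module):
    def __init__(self, d_model: int = 256, n_layers: int = 4, upsample: int = 4):
        super().__init__()
        # Frame the low-resolution waveform into a latent sequence.
        self.embed = nn.Conv1d(1, d_model, kernel_size=16, stride=8, padding=4)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        # Transformer: long-range dependencies across the latent frames.
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Convolutional decoder: HiFi-GAN-style upsampling back to a waveform.
        self.upsampler = nn.Sequential(
            nn.ConvTranspose1d(d_model, d_model // 2, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(d_model // 2, d_model // 4,
                               kernel_size=2 * upsample, stride=upsample,
                               padding=upsample // 2),
            nn.LeakyReLU(0.1),
            nn.Conv1d(d_model // 4, 1, kernel_size=7, padding=3),
            nn.Tanh(),
        )

    def forward(self, wav_lr: torch.Tensor) -> torch.Tensor:
        # wav_lr: (batch, 1, samples at the input sampling rate)
        x = self.embed(wav_lr)                     # (batch, d_model, frames)
        x = self.transformer(x.transpose(1, 2))    # attend over time
        return self.upsampler(x.transpose(1, 2))   # (batch, 1, upsampled samples)
```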
Discriminative Mechanism
To improve output fidelity, the authors introduce a multi-band, multi-scale time-frequency discriminator (MBD), used alongside the established multi-scale discriminator (MSD) and multi-period discriminator (MPD). The MBD operates directly on complex STFT representations, so both amplitude and phase information contribute to the adversarial signal, which helps suppress the high-frequency distortions typical of generative models.
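The sketch below illustrates the general idea of a discriminator acting on the complex STFT, with the spectrum split into frequency bands that are scored separately. The band count, STFT settings, and layer configuration are assumptions for illustration, not the paper's exact design.

```python
# Illustrative complex-STFT, multi-band discriminator (hyperparameters and band
# splits are assumptions): real and imaginary parts are stacked as channels so
# both amplitude and phase shape the adversarial feedback.
import torch
import torch.nn as nn

class ComplexSTFTDiscriminator(nn.Module):
    def __init__(self, n_fft: int = 1024, hop: int = 256, n_bands: int = 4):
        super().__init__()
        self.n_fft, self.hop, self.n_bands = n_fft, hop, n_bands
        self.band_nets = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(2, 32, kernel_size=(3, 9), padding=(1, 4)),
                nn.LeakyReLU(0.1),
                nn.Conv2d(32, 32, kernel_size=(3, 9), stride=(1, 2), padding=(1, 4)),
                nn.LeakyReLU(0.1),
                nn.Conv2d(32, 1, kernel_size=(3, 3), padding=(1, 1)),
            )
            for _ in range(n_bands)
        )

    def forward(self, wav: torch.Tensor):
        # wav: (batch, samples); the complex STFT retains phase information.
        spec = torch.stft(wav, self.n_fft, self.hop,
                          window=torch.hann_window(self.n_fft, device=wav.device),
                          return_complex=True)
        x = torch.stack([spec.real, spec.imag], dim=1)   # (batch, 2, freq, time)
        bands = torch.chunk(x, self.n_bands, dim=2)      # split along frequency
        return [net(band) for net, band in zip(self.band_nets, bands)]
```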
Experimental Results
HiFi-SR is evaluated against several prominent SR models, including NVSR and AudioSR. Across various test sets, HiFi-SR consistently performs best, reflected in lower log-spectral distance (LSD) scores and higher subjective preference in ABX listening tests. Notably, HiFi-SR achieves an average LSD of 0.82 on the VCTK dataset, surpassing rivals such as NVSR, which scored 0.85. The model's robustness is further validated on out-of-domain datasets such as EXPRESSO and VocalSet, where HiFi-SR also maintains its lead.
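For reference, LSD is commonly computed as the frame-wise root-mean-square difference between log power spectra, averaged over frames, with lower values indicating better reconstruction. The sketch below uses this common definition with illustrative STFT parameters, which may differ from the paper's evaluation setup.

```python
# Log-spectral distance (LSD) under its common definition; STFT parameters are
# illustrative and may differ from the paper's evaluation configuration.
import torch

def log_spectral_distance(ref: torch.Tensor, est: torch.Tensor,
                          n_fft: int = 2048, hop: int = 512) -> float:
    window = torch.hann_window(n_fft)

    def log_power(x: torch.Tensor) -> torch.Tensor:
        spec = torch.stft(x, n_fft, hop, window=window, return_complex=True)
        return torch.log10(spec.abs().pow(2) + 1e-10)

    diff = log_power(ref) - log_power(est)            # (freq, frames)
    # RMS over frequency per frame, then mean over frames; lower is better.
    return diff.pow(2).mean(dim=0).sqrt().mean().item()
```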
Implications and Future Directions
The results highlight HiFi-SR’s potential as a versatile, high-quality SR solution. The unified transformer-convolutional generator eliminates the representation inconsistencies associated with separate-stage models. By handling input sampling rates from 4 kHz to 32 kHz and generalizing across domains, the model can be applied in diverse scenarios, from enhancing voice signals in teleconferencing to restoring archival audio.
Moving forward, HiFi-SR sets a precedent for integrated neural architectures in SR tasks. Future research may explore optimizing the computational efficiency of such models, improving real-time adaptability, or incorporating additional perceptual metrics to refine subjective audio quality. Exploring generalization to broader audio domains, including music and complex soundscapes, could further extend the utility of the HiFi-SR framework.