HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec (2305.02765v2)

Published 4 May 2023 in cs.SD and eess.AS

Abstract: Audio codec models are widely used in audio communication as a crucial technique for compressing audio into discrete representations. Nowadays, audio codec models are increasingly utilized in generation fields as intermediate representations. For instance, AudioLM is an audio generation model that uses the discrete representation of SoundStream as a training target, while VALL-E employs the Encodec model as an intermediate feature to aid TTS tasks. Despite their usefulness, two challenges persist: (1) training these audio codec models can be difficult due to the lack of publicly available training processes and the need for large-scale data and GPUs; (2) achieving good reconstruction performance requires many codebooks, which increases the burden on generation models. In this study, we propose a group-residual vector quantization (GRVQ) technique and use it to develop a novel High Fidelity Audio Codec model, HiFi-Codec, which only requires 4 codebooks. We train all the models using publicly available TTS data such as LibriTTS, VCTK, AISHELL, and more, with a total duration of over 1000 hours, using 8 GPUs. Our experimental results show that HiFi-Codec outperforms Encodec in terms of reconstruction performance despite requiring only 4 codebooks. To facilitate research in audio codec and generation, we introduce AcademiCodec, the first open-source audio codec toolkit that offers training codes and pre-trained models for Encodec, SoundStream, and HiFi-Codec. Code and pre-trained models can be found at: https://github.com/yangdongchao/AcademiCodec

HiFi-Codec: Group-residual Vector Quantization for High Fidelity Audio Codec

This paper introduces HiFi-Codec, a high-fidelity audio codec model built on a novel Group-residual Vector Quantization (GRVQ) approach. The authors target two primary challenges: making audio codec models practical to train on publicly available data, and achieving strong reconstruction quality with fewer codebooks.

Core Contributions and Methodology

The GRVQ method proposed in this paper is central to the development of HiFi-Codec. The approach splits the encoder's latent features into groups and applies residual vector quantization to each group separately. This increases the information captured by the first layer of codebooks, which allows the total number of codebooks to be reduced relative to existing models such as Encodec and SoundStream.
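
The paper's implementation is released in AcademiCodec, but the core idea can be sketched compactly. Below is a minimal PyTorch sketch of group-residual vector quantization, assuming 2 groups with 2 residual layers each (4 codebooks in total); the class names, codebook size, and feature dimension are illustrative placeholders, not the authors' exact code.

```python
import torch
import torch.nn as nn


class SimpleVQ(nn.Module):
    """One vector-quantization layer: nearest-codeword lookup with a straight-through estimator."""

    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x):                                    # x: (batch, time, dim)
        w = self.codebook.weight                             # (K, dim)
        # Squared Euclidean distance between each feature vector and every codeword.
        d = (x.pow(2).sum(-1, keepdim=True)
             - 2 * x @ w.t()
             + w.pow(2).sum(-1))                             # (batch, time, K)
        idx = d.argmin(dim=-1)                               # (batch, time)
        q = self.codebook(idx)                               # (batch, time, dim)
        # Straight-through estimator so gradients reach the encoder.
        return x + (q - x).detach(), idx


class GRVQ(nn.Module):
    """Split features into groups and run residual VQ independently within each group."""

    def __init__(self, dim=256, groups=2, layers_per_group=2, codebook_size=1024):
        super().__init__()
        assert dim % groups == 0
        self.groups = groups
        self.quantizers = nn.ModuleList(
            nn.ModuleList(SimpleVQ(codebook_size, dim // groups)
                          for _ in range(layers_per_group))
            for _ in range(groups)
        )

    def forward(self, x):                                    # x: (batch, time, dim)
        outputs, codes = [], []
        for group_feats, group_vqs in zip(x.chunk(self.groups, dim=-1), self.quantizers):
            residual = group_feats
            quantized = torch.zeros_like(group_feats)
            for vq in group_vqs:                             # residual VQ inside the group
                q, idx = vq(residual)
                residual = residual - q
                quantized = quantized + q
                codes.append(idx)
            outputs.append(quantized)
        # 2 groups x 2 residual layers = 4 codebooks describing each frame.
        return torch.cat(outputs, dim=-1), codes


# Toy usage: quantize a batch of encoder features.
grvq = GRVQ(dim=256, groups=2, layers_per_group=2, codebook_size=1024)
features = torch.randn(1, 100, 256)        # (batch, frames, dim)
quantized, codes = grvq(features)          # len(codes) == 4
```

Because each group's first-layer codebook only has to describe a slice of the feature dimensions, more of the signal can be captured early, which is the intuition behind needing fewer codebooks overall.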

The architectural framework of HiFi-Codec consists of three primary components: an encoder, a GRVQ layer, and a decoder, drawing inspiration from convolutional architectures used in Encodec and SoundStream. The encoder and decoder are structured as convolutional networks with sequential modeling, designed to efficiently compress and reconstruct audio signals.
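
To make this layout concrete, the following is a rough sketch of an encoder, quantizer, and decoder pipeline, assuming strided 1-D convolutions for downsampling and transposed convolutions for upsampling; the layer widths, kernel sizes, and strides are placeholders rather than the paper's configuration.

```python
import torch.nn as nn


class TinyCodec(nn.Module):
    """Encoder -> quantizer -> decoder, mirroring the three-part layout described above."""

    def __init__(self, quantizer: nn.Module, dim: int = 256):
        super().__init__()
        # Encoder: strided 1-D convolutions downsample the waveform in time.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim // 4, kernel_size=7, padding=3), nn.ELU(),
            nn.Conv1d(dim // 4, dim // 2, kernel_size=8, stride=4, padding=2), nn.ELU(),
            nn.Conv1d(dim // 2, dim, kernel_size=8, stride=4, padding=2),
        )
        self.quantizer = quantizer   # e.g. the GRVQ module sketched earlier
        # Decoder: transposed convolutions mirror the encoder to reconstruct audio.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(dim, dim // 2, kernel_size=8, stride=4, padding=2), nn.ELU(),
            nn.ConvTranspose1d(dim // 2, dim // 4, kernel_size=8, stride=4, padding=2), nn.ELU(),
            nn.Conv1d(dim // 4, 1, kernel_size=7, padding=3),
        )

    def forward(self, wav):                       # wav: (batch, 1, samples)
        z = self.encoder(wav)                     # (batch, dim, frames)
        zq, codes = self.quantizer(z.transpose(1, 2))
        audio = self.decoder(zq.transpose(1, 2))  # (batch, 1, ~samples)
        return audio, codes
```

Passing the GRVQ module sketched earlier as the quantizer gives the full encode, quantize, and decode path.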

Training is guided by a combination of time-domain and frequency-domain reconstruction losses, adversarial losses from discriminators, and a GRVQ commitment loss. The model uses multi-scale STFT-based, multi-period, and multi-scale discriminators to improve perceptual quality.
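
As an illustration only, the overall generator objective can be expressed as a weighted sum of these terms; the weights below are placeholders and not values reported in the paper.

```python
def generator_loss(l_time, l_freq, l_adv, l_commit,
                   w_time=1.0, w_freq=1.0, w_adv=3.0, w_commit=1.0):
    """Weighted sum of the training objectives listed above (weights are illustrative)."""
    return (w_time * l_time        # time-domain reconstruction loss (e.g. L1 on waveforms)
            + w_freq * l_freq      # frequency-domain loss (e.g. multi-scale mel/STFT)
            + w_adv * l_adv        # adversarial loss from the discriminators
            + w_commit * l_commit) # commitment loss from the GRVQ layers
```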

Results and Evaluation

The experimental evaluation demonstrates that HiFi-Codec outperforms Encodec in reconstruction quality while utilizing only four codebooks, as indicated by strong numerical results in PESQ and STOI metrics. The paper employs publicly available TTS datasets such as LibriTTS, VCTK, and AISHELL, ensuring a diverse training corpus.
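
For readers who want to run a similar comparison, per-utterance PESQ and STOI can be computed with the third-party pesq and pystoi packages; this snippet is an assumption about tooling and file names, not the paper's evaluation code.

```python
import soundfile as sf
from pesq import pesq          # pip install pesq
from pystoi import stoi        # pip install pystoi

ref, sr = sf.read("reference.wav")        # ground-truth speech (16 kHz mono assumed)
deg, _ = sf.read("reconstructed.wav")     # codec output at the same sampling rate

print("PESQ (wideband):", pesq(sr, ref, deg, "wb"))    # higher is better
print("STOI:", stoi(ref, deg, sr, extended=False))     # higher is better
```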

Implications and Future Directions

Practically, HiFi-Codec offers a more resource-efficient option for audio codec applications, potentially easing the computational burden on generation models because fewer codebooks are needed. Theoretically, the GRVQ method marks a step forward in vector quantization techniques, suggesting new avenues for compression algorithms beyond audio codecs.

This research encourages further developments in codec models tailored for generation tasks, an area with expanding interest due to the rise of tasks like Text-to-Speech (TTS) and music generation. The open-source release of training codes and pre-trained models for Encodec, SoundStream, and HiFi-Codec under the AcademiCodec toolkit enhances replicability and facilitates future work in this domain.

In future explorations, refining the HiFi-Codec model with larger datasets, incorporating subjective evaluation metrics, and rigorously testing it on downstream generation tasks will be crucial. These developments could strengthen the model's robustness and applicability across various audio-related tasks.

By addressing and overcoming the typical bottlenecks in audio codec technology, this paper contributes significantly to the field, presenting an efficient, high-performance solution pivotal for both theoretical research and practical applications in audio processing and generation domains.

Authors (6)
  1. Dongchao Yang (51 papers)
  2. Songxiang Liu (28 papers)
  3. Rongjie Huang (62 papers)
  4. Jinchuan Tian (33 papers)
  5. Chao Weng (61 papers)
  6. Yuexian Zou (119 papers)
Citations (96)