HiFi-Codec: High-Fidelity Neural Audio Codec
- HiFi-Codec is a high-fidelity neural audio codec that uses group-residual vector quantization to compress audio efficiently while preserving quality.
- It reduces model complexity by partitioning latent features into groups, achieving comparable audio quality with fewer codebooks than traditional methods.
- The open-source AcademiCodec toolkit and extensive TTS training data facilitate accessible research and practical integration into audio synthesis applications.
HiFi-Codec denotes a class of high-fidelity neural audio codecs, with the canonical reference being "HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec" (Yang et al., 2023). This codec achieves high-quality audio reconstruction with significantly fewer codebooks through Group-Residual Vector Quantization (GRVQ), enabling both efficient compression for telecommunication and practical integration into audio generation backends. HiFi-Codec is distributed as part of the open-source AcademiCodec toolkit.
1. Motivation and Background
The primary motivation for HiFi-Codec emerges from the need to provide perceptually lossless (or near-lossless) audio compression while addressing the practical bottlenecks seen in prior neural audio codecs. Standard approaches—such as SoundStream and Encodec—require substantial model complexity and a deep stack of residual vector quantization (RVQ) codebooks (8–12) to approach high-fidelity reconstruction. This hierarchical structure leads to inefficiencies: the earliest codebooks capture most of the semantic and acoustic content, with later codebooks modeling sparse residuals at increasing cost to model size and inference throughput. Furthermore, this overhead carries into generative audio modeling, where each codebook’s explicit token stream inflates sequence lengths and complicates downstream modeling (Yang et al., 2023).
HiFi-Codec specifically seeks to:
- Reduce the number of codebooks without sacrificing audio quality.
- Make training and deployment more accessible, relying on publicly available speech datasets and modest GPU resources (8 GPUs, ~1000 hours of speech) (Yang et al., 2023).
- Facilitate open research via the AcademiCodec toolkit with pre-trained models, recipes, and code (Yang et al., 2023).
2. Architectural Innovations: Group-Residual Vector Quantization
The architectural core of HiFi-Codec is the Group-Residual Vector Quantization (GRVQ) method. Standard RVQ applies sequential quantization stages across the entire latent feature space, leading to inefficiency as additional codebooks store diminishing amounts of information.
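HiFi-Codec resolves this by splitting the encoder's high-dimensional latent representation $z$ into $G$ channel-wise groups $z_1, \dots, z_G$. Each group independently passes through $N_q$ stages of RVQ, and the per-group results are concatenated.
For each group $g = 1, \dots, G$, with initial residual $r_{g,0} = z_g$:
$$q_{g,i} = Q_{g,i}(r_{g,i-1}), \qquad r_{g,i} = r_{g,i-1} - q_{g,i}, \qquad \hat{z}_g = \sum_{i=1}^{N_q} q_{g,i}$$
The total quantized latent is then
$$\hat{z} = \mathrm{concat}(\hat{z}_1, \dots, \hat{z}_G)$$
HiFi-Codec uses $G$ groups and $N_q$ quantization stages per group, for a total of $G \cdot N_q = 4$ codebooks. Each quantizer $Q_{g,i}$ performs a nearest-centroid lookup in a codebook of size $K$ (Yang et al., 2023).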
This structural partitioning ensures that the information captured in the early codebooks of each group is maximized, eliminating the “wasted” codebook capacity intrinsic to standard deep RVQ hierarchies.
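The grouping and per-group residual quantization described in this section can be illustrated with a minimal NumPy sketch. It is not the AcademiCodec implementation: the function names, the random codebooks, and the sizes `T`, `D`, `G`, `N_Q`, and `K` are hypothetical values chosen only so that the number of groups times the number of stages equals the four codebooks mentioned above.

```python
import numpy as np

def nearest_centroid(x, codebook):
    """Return (indices, quantized vectors) for each row of x."""
    # Squared Euclidean distance between every frame and every centroid.
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

def rvq(x, codebooks):
    """Residual VQ: each stage quantizes the residual left by the previous one."""
    residual = x
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:
        idx, q = nearest_centroid(residual, cb)
        quantized = quantized + q
        residual = residual - q
        codes.append(idx)
    return quantized, codes

def grvq(z, grouped_codebooks):
    """Group-residual VQ: split channels into groups, run RVQ per group, concatenate."""
    groups = np.split(z, len(grouped_codebooks), axis=-1)
    outputs, all_codes = [], []
    for z_g, cbs in zip(groups, grouped_codebooks):
        q_g, codes_g = rvq(z_g, cbs)
        outputs.append(q_g)
        all_codes.extend(codes_g)
    return np.concatenate(outputs, axis=-1), all_codes

# Hypothetical sizes for illustration only (not taken from the paper).
rng = np.random.default_rng(0)
T, D, G, N_Q, K = 50, 16, 2, 2, 64
z = rng.normal(size=(T, D))
codebooks = [[rng.normal(size=(K, D // G)) for _ in range(N_Q)] for _ in range(G)]

z_hat, codes = grvq(z, codebooks)
print(z_hat.shape, len(codes))  # one index stream per codebook
```

Each entry of `codes` is one token stream of length `T`, so a downstream generative model sees `G * N_Q` streams per frame. In a trained codec the codebooks are learned rather than sampled at random.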
3. End-to-End System Design
HiFi-Codec follows an encoder–quantizer–decoder GAN design:
Encoder:
- 1D convolutional stem (kernel size 7)
- A stack of 2 residual blocks (two 1D-conv layers with kernel size 3 plus skip), each followed by strided convolution for downsampling (stride 3, kernel 4)
- Channel doubling at each downsampling stage
- Two-layer LSTM for temporal context encoding
- Final conv layer and projection to the latent $z$
GRVQ Quantizer: As described above.
Decoder:
- Architecturally symmetric to the encoder, replacing each strided downsampling with transposed convolutional upsampling, in reverse order.
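To make the encoder/decoder symmetry concrete, the sketch below traces channel counts and sequence lengths through strided convolutions and their transposed counterparts. It uses the standard output-length formulas for 1D convolution and transposed convolution (dilation 1, no output padding) with the kernel size 4 and stride 3 quoted above. The number of stages, the starting channel count, the input length, and the zero padding are assumptions made for illustration, and the stem, residual blocks, and LSTM are taken to preserve length.

```python
def conv1d_out_len(length, kernel, stride, padding=0):
    """Output length of a strided 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1

def conv_transpose1d_out_len(length, kernel, stride, padding=0):
    """Output length of a 1D transposed convolution (dilation 1, no output padding)."""
    return (length - 1) * stride - 2 * padding + kernel

def encoder_shapes(length, channels, n_stages, kernel=4, stride=3):
    """(channels, length) after each downsampling stage, doubling channels each time."""
    shapes = [(channels, length)]
    for _ in range(n_stages):
        length = conv1d_out_len(length, kernel, stride)
        channels *= 2
        shapes.append((channels, length))
    return shapes

def decoder_shapes(length, channels, n_stages, kernel=4, stride=3):
    """(channels, length) after each upsampling stage, halving channels each time."""
    shapes = [(channels, length)]
    for _ in range(n_stages):
        length = conv_transpose1d_out_len(length, kernel, stride)
        channels //= 2
        shapes.append((channels, length))
    return shapes

# Hypothetical configuration: 3 stages, 32 base channels, 121 input frames.
enc = encoder_shapes(length=121, channels=32, n_stages=3)
latent_channels, latent_len = enc[-1]
dec = decoder_shapes(length=latent_len, channels=latent_channels, n_stages=3)

print(enc)  # [(32, 121), (64, 40), (128, 13), (256, 4)]
print(dec)  # [(256, 4), (128, 13), (64, 40), (32, 121)]
```

With these formulas the decoder recovers the input length exactly only when each strided convolution divides evenly, which is why the example uses 121 frames. Practical implementations usually pad or trim to handle arbitrary lengths.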
Discriminators:
- Multi-scale STFT discriminator (MS-STFT)
- Multi-period discriminator (MPD)
- Multi-scale discriminator (MSD)
These impose losses in both the time and frequency domains, targeting perceptual consistency and artifact minimization (Yang et al., 2023).
4. Training Paradigm and Loss Functions
Training is carried out on over 1,000 hours of publicly available TTS datasets such as LibriTTS, VCTK, and AISHELL. Typical batches are 16–32 audio waveforms per GPU, over ~1M training steps. Training is feasible with 8 consumer-grade GPUs (Yang et al., 2023).
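The generator's objective is a weighted sum of multi-domain losses:
$$\mathcal{L}_G = \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{adv}\,\mathcal{L}_{adv} + \lambda_{feat}\,\mathcal{L}_{feat} + \lambda_{commit}\,\mathcal{L}_{commit}$$
where:
- $\mathcal{L}_{rec}$ combines L1 time-domain waveform loss and multi-window mel-spectrogram L1 loss,
- $\mathcal{L}_{adv}$ is a hinge-GAN loss over the discriminators,
- $\mathcal{L}_{feat}$ is a feature-matching loss over discriminator intermediate layers,
- $\mathcal{L}_{commit}$ is the GRVQ commitment loss, attracting encoder outputs to centroids.
Hyperparameters $\lambda_{rec}, \lambda_{adv}, \lambda_{feat}, \lambda_{commit}$ are independently tuned for balance (Yang et al., 2023).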
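The weighted combination of loss terms can be sketched in NumPy as follows. This is a simplified illustration, not the training code. The function names and the `rec`, `adv`, `feat`, and `commit` keys are hypothetical. The spectral term uses plain magnitude spectrograms instead of mel spectrograms. The adversarial term uses the `max(0, 1 - D(x_hat))` form of the generator hinge loss, which is one common variant. The weights are placeholders, and random arrays stand in for discriminator outputs.

```python
import numpy as np

def stft_mag(x, win, hop):
    """Magnitude spectrogram from Hann-windowed frames (no padding)."""
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop : i * hop + win] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * np.hanning(win), axis=-1))

def reconstruction_loss(x, x_hat, windows=(64, 128, 256)):
    """Time-domain L1 plus spectrogram L1 at several window sizes (no mel filterbank)."""
    loss = np.mean(np.abs(x - x_hat))
    for win in windows:
        hop = win // 4
        loss += np.mean(np.abs(stft_mag(x, win, hop) - stft_mag(x_hat, win, hop)))
    return loss

def hinge_generator_loss(fake_logits):
    """Generator hinge loss averaged over a list of discriminator outputs."""
    return np.mean([np.mean(np.maximum(0.0, 1.0 - d)) for d in fake_logits])

def feature_matching_loss(real_feats, fake_feats):
    """Mean L1 distance between discriminator feature maps for real and generated audio."""
    return np.mean([np.mean(np.abs(r - f)) for r, f in zip(real_feats, fake_feats)])

def commitment_loss(z, z_hat):
    """Squared distance between encoder outputs and their quantized values.
    In a real training loop z_hat would be treated as a constant (stop-gradient)."""
    return np.mean((z - z_hat) ** 2)

def generator_loss(terms, weights):
    """Weighted sum of the individual loss terms."""
    return sum(weights[name] * terms[name] for name in terms)

# Random stand-ins for audio, latents, and discriminator outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=2048)
x_hat = x + 0.01 * rng.normal(size=2048)
z = rng.normal(size=(50, 16))
z_hat = z + 0.1 * rng.normal(size=(50, 16))
fake_logits = [rng.normal(size=20) for _ in range(3)]
real_feats = [rng.normal(size=(4, 30)) for _ in range(3)]
fake_feats = [f + 0.05 * rng.normal(size=f.shape) for f in real_feats]

terms = {
    "rec": reconstruction_loss(x, x_hat),
    "adv": hinge_generator_loss(fake_logits),
    "feat": feature_matching_loss(real_feats, fake_feats),
    "commit": commitment_loss(z, z_hat),
}
weights = {"rec": 1.0, "adv": 1.0, "feat": 1.0, "commit": 1.0}  # placeholder values
total = generator_loss(terms, weights)
print(total)
```

Keeping each term as a separate function makes it straightforward to log the terms individually, which helps when tuning the weights against one another.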
5. Comparative Evaluation and Results
HiFi-Codec is benchmarked against Encodec (Facebook, 12 codebooks), an 8-codebook version of Encodec (“ours”), and SoundStream (12 codebooks, replicated setting). On audio sampled at 24 kHz, the reported results are:
| Method | Codebooks | PESQ | STOI |
|---|---|---|---|
| Encodec (Fb) | 12 | 3.21 | 0.95 |
| Encodec (ours) | 8 | 3.62 | 0.94 |
| SoundStream | 12 | 3.26 | 0.95 |
| HiFi-Codec | 4 | 3.63 | 0.95 |
| HiFi-Codec | 8 | 3.92 | 0.95 |
Despite using just 4 codebooks, HiFi-Codec reaches a PESQ of 3.63 and a STOI of 0.95, matching or exceeding baselines that use two to three times as many codebooks (PESQ 3.21–3.62, STOI 0.94–0.95). This supports the efficiency of GRVQ empirically (Yang et al., 2023).
6. Open-Source Ecosystem: The AcademiCodec Toolkit
HiFi-Codec, along with Encodec and SoundStream reimplementations, is provided in the open-source AcademiCodec toolkit (https://github.com/yangdongchao/AcademiCodec). The toolkit provides:
- Training code for all three codecs, including recipes for various bitrates and codebook configurations.
- Pre-trained models for immediate evaluation or downstream integration.
- Scripts for encoding, decoding, and token extraction for generative modeling.
By releasing both code and pre-trained models, the AcademiCodec toolkit lowers the barriers to neural codec research and application outside proprietary industrial pipelines.
7. Limitations and Future Directions
The current HiFi-Codec implementation is constrained in several respects:
- Trained only on ≈1,000 hours of TTS speech; there is no evaluation on general-domain audio such as music or environmental noise. Generalization remains to be established.
- Evaluation is strictly objective (PESQ, STOI); large-scale MUSHRA or MOS studies are not reported.
- Performance as a tokenization backend for downstream generation tasks (e.g., TTS, audio inpainting) has not yet been empirically validated.
- Use of only four codebooks is documented for speech; the scalability to more complex signals or lower bitrates is an open research avenue.
The introduction of GRVQ and the open-source release through AcademiCodec are the main contributions of the work. A plausible implication is that more efficient, widely accessible high-fidelity neural codecs could enable new audio synthesis and compression applications at lower compute cost and latency (Yang et al., 2023).