HiFi-Codec: High-Fidelity Neural Audio Codec

Updated 14 June 2026
  • HiFi-Codec is a high-fidelity neural audio codec that uses group-residual vector quantization to compress audio efficiently while preserving quality.
  • It reduces model complexity by partitioning latent features into groups, achieving comparable audio quality with fewer codebooks than traditional methods.
  • The open-source AcademiCodec toolkit and extensive TTS training data facilitate accessible research and practical integration into audio synthesis applications.

HiFi-Codec denotes a class of high-fidelity neural audio codecs, with the canonical reference being "HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec" (Yang et al., 2023). The codec achieves high-quality audio reconstruction with substantially fewer codebooks through Group-Residual Vector Quantization (GRVQ), enabling both efficient compression for telecommunication and practical integration into audio generation backends. HiFi-Codec is released as part of the open-source AcademiCodec toolkit.

1. Motivation and Background

The primary motivation for HiFi-Codec emerges from the need to provide perceptually lossless (or near-lossless) audio compression while addressing the practical bottlenecks seen in prior neural audio codecs. Standard approaches—such as SoundStream and Encodec—require substantial model complexity and a deep stack of residual vector quantization (RVQ) codebooks (8–12) to approach high-fidelity reconstruction. This hierarchical structure leads to inefficiencies: the earliest codebooks capture most of the semantic and acoustic content, with later codebooks modeling sparse residuals at increasing cost to model size and inference throughput. Furthermore, this overhead carries into generative audio modeling, where each codebook’s explicit token stream inflates sequence lengths and complicates downstream modeling (Yang et al., 2023).

HiFi-Codec specifically seeks to:

  • Reduce the number of codebooks without sacrificing audio quality.
  • Make training and deployment more accessible, relying on publicly available speech datasets and modest GPU resources (8 GPUs, ~1000 hours of speech) (Yang et al., 2023).
  • Facilitate open research via the AcademiCodec toolkit with pre-trained models, recipes, and code (Yang et al., 2023).

2. Architectural Innovations: Group-Residual Vector Quantization

The architectural core of HiFi-Codec is the Group-Residual Vector Quantization (GRVQ) method. Standard RVQ applies sequential quantization stages across the entire latent feature space, leading to inefficiency as additional codebooks store diminishing amounts of information.
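The diminishing contribution of later RVQ stages can be illustrated with a toy example. The sketch below is not the paper's implementation: the 2-D input, the hand-built grid codebooks, and the helper names (`make_codebooks`, `rvq_residual_energy`) are all assumptions chosen so the effect is easy to see. Each stage quantizes the residual left by the previous one, and the remaining energy shrinks quickly.

```python
def make_codebooks(num_stages, scale=1.0, shrink=3.0):
    """Build toy 2-D codebooks: a 3x3 grid per stage, finer at each stage.

    These hand-built grids are hypothetical stand-ins for learned codebooks.
    """
    codebooks = []
    for _ in range(num_stages):
        grid = (-scale, 0.0, scale)
        codebooks.append([(a, b) for a in grid for b in grid])
        scale /= shrink
    return codebooks


def nearest(codebook, v):
    """Return the codeword closest to v in squared Euclidean distance."""
    return min(codebook, key=lambda c: sum((ci - vi) ** 2 for ci, vi in zip(c, v)))


def rvq_residual_energy(v, codebooks):
    """Run residual VQ on v and return the residual energy after each stage."""
    residual = list(v)
    energies = []
    for codebook in codebooks:
        q = nearest(codebook, residual)
        # The next stage only sees what this stage failed to represent.
        residual = [r - qi for r, qi in zip(residual, q)]
        energies.append(sum(r * r for r in residual))
    return energies


if __name__ == "__main__":
    v = [0.8, -0.37]
    print("input energy:", sum(x * x for x in v))
    for stage, e in enumerate(rvq_residual_energy(v, make_codebooks(3)), start=1):
        print(f"residual energy after stage {stage}: {e:.5f}")
```

In this toy setting most of the input energy is removed by the first stage, which mirrors the observation that deeper stages of a plain RVQ stack encode progressively less.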

HiFi-Codec resolves this by splitting the encoder’s high-dimensional latent representation $\boldsymbol{z} \in \mathbb{R}^{D \times T'}$ into $G$ channel-wise groups. Each group is independently passed through $N_q$ stages of RVQ, and the quantized groups are then concatenated:

$$\boldsymbol{z} \rightarrow [\boldsymbol{z}_1, \dots, \boldsymbol{z}_G], \quad \boldsymbol{z}_i \in \mathbb{R}^{D/G \times T'}$$

For each group $i$:

$$\boldsymbol{r}_i^{(0)} = \boldsymbol{z}_i,\quad \boldsymbol{r}_i^{(j)} = \boldsymbol{r}_i^{(j-1)} - q_{i,j}(\boldsymbol{r}_i^{(j-1)}),\quad j = 1, \ldots, N_q$$

$$\hat{\boldsymbol{z}}_i = \sum_{j=1}^{N_q} q_{i,j}(\boldsymbol{r}_i^{(j-1)})$$

The total quantized latent is then

$$\boldsymbol{z}_q = [\hat{\boldsymbol{z}}_1, \dots, \hat{\boldsymbol{z}}_G] \in \mathbb{R}^{D \times T'}$$

HiFi-Codec uses $G=2$ groups and $N_q=2$ quantization stages per group, for a total of 4 codebooks. Each quantizer $q_{i,j}$ performs a nearest-centroid lookup in its own codebook (Yang et al., 2023).

This structural partitioning ensures that the information captured in the early codebooks of each group is maximized, eliminating the “wasted” codebook capacity intrinsic to standard deep RVQ hierarchies.
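The group-wise procedure described in this section can be sketched for a single latent frame as follows. This is a minimal illustration under stated assumptions, not the AcademiCodec code: the function names, the random (untrained) codebooks, the latent width of 8, and the codebook size of 16 are all hypothetical. A full latent of shape D × T' would apply the same steps to every time frame.

```python
import random


def nearest_index(codebook, v):
    """Index of the codeword closest to v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - x) ** 2 for c, x in zip(codebook[k], v)))


def grvq_encode(z, codebooks):
    """Quantize one latent frame z (length D) with group-residual VQ.

    codebooks[i][j] is the codebook of group i, stage j; every codeword has
    length D / G. Returns (indices, z_q), where indices[i][j] is the code
    chosen by stage j of group i and z_q is the concatenated reconstruction.
    """
    num_groups = len(codebooks)
    d = len(z) // num_groups
    indices, z_q = [], []
    for i, group_codebooks in enumerate(codebooks):
        residual = z[i * d:(i + 1) * d]          # r_i^(0) = z_i
        z_hat = [0.0] * d
        group_indices = []
        for codebook in group_codebooks:          # stages j = 1..N_q
            k = nearest_index(codebook, residual)
            group_indices.append(k)
            z_hat = [a + c for a, c in zip(z_hat, codebook[k])]
            residual = [r - c for r, c in zip(residual, codebook[k])]
        indices.append(group_indices)
        z_q.extend(z_hat)                         # concatenate the groups
    return indices, z_q


def grvq_decode(indices, codebooks):
    """Rebuild z_q from the code indices alone."""
    z_q = []
    for group_codebooks, group_indices in zip(codebooks, indices):
        z_hat = [0.0] * len(group_codebooks[0][0])
        for codebook, k in zip(group_codebooks, group_indices):
            z_hat = [a + c for a, c in zip(z_hat, codebook[k])]
        z_q.extend(z_hat)
    return z_q


def random_codebooks(rng, num_groups, num_stages, size, dim):
    """Untrained Gaussian codebooks, for illustration only."""
    return [[[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(size)]
             for _ in range(num_stages)] for _ in range(num_groups)]


if __name__ == "__main__":
    rng = random.Random(0)
    D, G, N_Q, SIZE = 8, 2, 2, 16                 # hypothetical sizes
    codebooks = random_codebooks(rng, G, N_Q, SIZE, D // G)
    z = [rng.gauss(0.0, 1.0) for _ in range(D)]
    indices, z_q = grvq_encode(z, codebooks)
    print("codes per frame:", indices)            # G lists of N_q indices each
```

With G = 2 and N_q = 2 the frame is described by four indices, and the decoder needs only those indices plus the shared codebooks to rebuild the quantized latent.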

3. End-to-End System Design

HiFi-Codec follows an encoder–quantizer–decoder GAN design:

Encoder:

  • 1D convolutional stem (kernel size 7)
  • A stack of residual blocks (two 1D-conv layers with kernel size 3 plus a skip connection), each followed by a strided convolution for downsampling
  • Channel doubling at each downsampling stage
  • Two-layer LSTM for temporal context encoding
  • Final conv layer and projection to the latent $\boldsymbol{z}$
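To see how the downsampling stages shape the latent, the following sketch tracks channel count and frame count through the encoder. The stride values, the base channel width, and the helper name `encoder_shapes` are assumptions for illustration only; the text does not state them, and exact frame counts depend on the padding used by each convolution.

```python
def encoder_shapes(num_samples, base_channels, strides):
    """Track (channels, frames) after the stem and after each downsampling stage.

    Assumes each strided convolution divides the length by its stride exactly
    (padding effects are ignored) and that channels double at every stage.
    """
    channels, frames = base_channels, num_samples
    shapes = [(channels, frames)]
    for stride in strides:
        channels *= 2
        frames //= stride
        shapes.append((channels, frames))
    return shapes


if __name__ == "__main__":
    # Hypothetical configuration: 1 s of 24 kHz audio, 32 stem channels,
    # four stages with strides 2, 4, 5, 8 (total downsampling factor 320).
    for channels, frames in encoder_shapes(24000, 32, (2, 4, 5, 8)):
        print(f"channels={channels:4d} frames={frames}")
```

Under these assumed strides, one second of audio is reduced to 75 latent frames, so the temporal length T' of the latent equals the input length divided by the product of the strides.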

GRVQ Quantizer: As described above.

Decoder:

  • Architecturally symmetric to the encoder, replacing each strided downsampling with transposed convolutional upsampling, in reverse order.

Discriminators:

4. Training Paradigm and Loss Functions

Training is carried out on over 1,000 hours of publicly available TTS datasets such as LibriTTS, VCTK, and AISHELL. Typical batches are 16–32 audio waveforms per GPU, over ~1M training steps. Training is feasible with 8 consumer-grade GPUs (Yang et al., 2023).

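The generator’s objective is a weighted sum of multi-domain losses:

$$\mathcal{L}_G = \lambda_{\mathrm{rec}} \mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{adv}} \mathcal{L}_{\mathrm{adv}} + \lambda_{\mathrm{fm}} \mathcal{L}_{\mathrm{fm}} + \lambda_{\mathrm{commit}} \mathcal{L}_{\mathrm{commit}}$$

where:

  • $\mathcal{L}_{\mathrm{rec}}$ combines an L1 time-domain waveform loss and a multi-window mel-spectrogram L1 loss,
  • $\mathcal{L}_{\mathrm{adv}}$ is a hinge-GAN loss over the discriminators,
  • $\mathcal{L}_{\mathrm{fm}}$ is a feature-matching loss over discriminator intermediate layers,
  • $\mathcal{L}_{\mathrm{commit}}$ is the GRVQ commitment loss, which pulls encoder outputs toward their assigned centroids.

The weights $\lambda_{\mathrm{rec}}, \lambda_{\mathrm{adv}}, \lambda_{\mathrm{fm}}, \lambda_{\mathrm{commit}}$ are independently tuned for balance (Yang et al., 2023).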
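The way the generator terms combine can be sketched in plain Python. This is an illustration under assumptions rather than the training code: the term names (`rec`, `adv`, `fm`, `commit`), the weights, and the numeric inputs are hypothetical, and the hinge term uses the common formulation max(0, 1 - D(x̂)) averaged over discriminators, which may differ in detail from the original recipe.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def hinge_generator_loss(fake_scores_per_disc):
    """Average over discriminators of mean(max(0, 1 - D_k(x_hat)))."""
    return mean(mean(max(0.0, 1.0 - s) for s in scores)
                for scores in fake_scores_per_disc)


def feature_matching_loss(real_feats, fake_feats):
    """Mean L1 distance between matching (flattened) discriminator features."""
    return mean(mean(abs(r - f) for r, f in zip(real, fake))
                for real, fake in zip(real_feats, fake_feats))


def commitment_loss(z, z_q):
    """Mean squared error between encoder output z and its quantized value z_q.

    In a real training loop z_q would be treated as a constant (stop-gradient),
    so that only the encoder is pulled toward the centroids.
    """
    return mean((a - b) ** 2 for a, b in zip(z, z_q))


def generator_loss(terms, weights):
    """Weighted sum of the named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())


if __name__ == "__main__":
    terms = {
        "rec": 0.8,  # placeholder for the waveform + mel L1 reconstruction loss
        "adv": hinge_generator_loss([[1.0, 2.0], [0.0, 0.5]]),
        "fm": feature_matching_loss([[1.0, 2.0]], [[0.5, 2.5]]),
        "commit": commitment_loss([1.0, 2.0], [1.0, 0.0]),
    }
    weights = {"rec": 1.0, "adv": 1.0, "fm": 1.0, "commit": 1.0}  # hypothetical
    print("generator loss:", generator_loss(terms, weights))
```

Keeping each term as a separate function makes it straightforward to rebalance the weights without touching the individual loss definitions.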

5. Comparative Evaluation and Results

HiFi-Codec is benchmarked against Encodec (Facebook, 12 codebooks), an 8-codebook version of Encodec (“ours”), and SoundStream (12 codebooks, replicated setting). Evaluated on a 24 kHz sample rate, HiFi-Codec achieves the following:

| Method | Codebooks | PESQ | STOI |
|---|---|---|---|
| Encodec (Fb) | 12 | 3.21 | 0.95 |
| Encodec (ours) | 8 | 3.62 | 0.94 |
| SoundStream | 12 | 3.26 | 0.95 |
| HiFi-Codec | 4 | 3.63 | 0.95 |
| HiFi-Codec | 8 | 3.92 | 0.95 |

With only 4 codebooks, HiFi-Codec matches or exceeds the PESQ and STOI of baselines that use two to three times as many codebooks (PESQ 3.63 versus 3.21 to 3.62), which empirically supports the efficiency of GRVQ (Yang et al., 2023).
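The practical effect of the codebook count on downstream token streams can be estimated with simple arithmetic. The sketch below assumes a codebook size of 1024 and a latent frame rate of 75 frames per second; neither value is taken from the text, and the helper name `token_budget` is hypothetical. It also assumes all codecs share the same frame rate, so only the number of codebooks varies.

```python
import math


def token_budget(num_codebooks, codebook_size, frame_rate_hz):
    """Return (tokens per second, bits per second) for a multi-codebook codec."""
    tokens_per_second = num_codebooks * frame_rate_hz
    bits_per_second = tokens_per_second * math.log2(codebook_size)
    return tokens_per_second, bits_per_second


if __name__ == "__main__":
    CODEBOOK_SIZE = 1024   # assumption, not taken from the text
    FRAME_RATE_HZ = 75     # assumption, not taken from the text
    for n in (4, 8, 12):
        tokens, bits = token_budget(n, CODEBOOK_SIZE, FRAME_RATE_HZ)
        print(f"{n:2d} codebooks: {tokens} tokens/s, {bits / 1000:.1f} kbps")
```

Under these assumptions a 4-codebook codec emits one third of the tokens of a 12-codebook codec for the same audio, which is the sequence-length saving that matters for generative models built on the codes.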

6. Open-Source Ecosystem: The AcademiCodec Toolkit

HiFi-Codec, along with Encodec and SoundStream reimplementations, is provided in the open-source AcademiCodec toolkit (https://github.com/yangdongchao/AcademiCodec). The toolkit provides:

  • Training code for all three codecs, including recipes for various bitrates and codebook configurations.
  • Pre-trained models for immediate evaluation or downstream integration.
  • Scripts for encoding, decoding, and token extraction for generative modeling.

By releasing both code and pre-trained models, HiFi-Codec significantly reduces barriers to neural codec research and application outside proprietary industrial pipelines.

7. Limitations and Future Directions

The current HiFi-Codec implementation is constrained in several respects:

  • Trained only on ≈1,000 hours of TTS speech; there is no evaluation on general-domain audio such as music or environmental noise. Generalization remains to be established.
  • Evaluation is strictly objective (PESQ, STOI); large-scale MUSHRA or MOS studies are not reported.
  • Performance as a tokenization backend for downstream generation tasks (e.g., TTS, audio inpainting) has not yet been empirically validated.
  • Use of only four codebooks is documented for speech; the scalability to more complex signals or lower bitrates is an open research avenue.

The introduction of GRVQ and the open-source release through AcademiCodec are the work's main contributions. A plausible implication is that more efficient, widely accessible high-fidelity neural codecs can enable new audio synthesis and compression applications while reducing compute and latency costs (Yang et al., 2023).
