EnCodec: High-Fidelity Neural Audio Codec

Updated 20 February 2026
  • EnCodec is a high-fidelity neural audio codec that uses convolutional encoder-decoder architectures, residual vector quantization, and adversarial training for efficient audio compression.
  • It employs multi-scale spectral losses and streamable, low-latency design to support various bitrates and real-time inference across speech, music, and noisy signals.
  • The system integrates seamlessly with downstream applications like ASR and personalized TTS, outperforming traditional codecs in quality and efficiency.

EnCodec is a high-fidelity, neural audio codec developed for real-time streaming audio compression, leveraging residual vector quantization (RVQ) and end-to-end adversarial training. The system achieves state-of-the-art quality across a broad range of bitrates and content domains—including speech, noisy speech, and music—by combining a powerful convolutional encoder-decoder architecture with multi-scale spectral losses and a multi-scale short-time Fourier transform discriminator. EnCodec's design enables fine-grained control over bitrate, real-time inference, and seamless integration into downstream tasks such as automatic speech recognition (ASR), personalized TTS, and cross-modal applications.

1. System Architecture and Quantization

EnCodec follows an encoder–quantizer–decoder paradigm, employing fully convolutional or SEANet-based stacks with strided downsampling for the encoder, and symmetric deconvolutional stacks for the decoder. The encoder processes raw audio input (e.g., a 24 kHz waveform) through a sequence of 1D convolutional residual blocks, followed by bidirectional or causal LSTMs for contextual modeling. The final encoder output is a low-rate latent representation $z \in \mathbb{R}^{T_{\mathrm{enc}} \times D_{\mathrm{enc}}}$, where $T_{\mathrm{enc}}$ is the downsampled temporal dimension and $D_{\mathrm{enc}}$ is the embedding dimension (typically $D_{\mathrm{enc}} = 64$ or $128$ for 24/48 kHz audio).

RVQ is used for quantization: $N_{\mathrm{cb}}$ codebooks, each with $D_{\mathrm{cb}}$ entries (often $1024$), are applied in sequence. At each timestep, the residual from the previous quantization stage is quantized by the next codebook, forming a tuple of code indices $(e_1, \ldots, e_{N_{\mathrm{cb}}})$. The effective bitrate is
$$\text{bitrate} = N_{\mathrm{cb}} \cdot \log_2 D_{\mathrm{cb}} \cdot F_{\mathrm{enc}},$$
where $F_{\mathrm{enc}} = f_s / f_{\mathrm{down}}$ is the encoded frame rate for input sample rate $f_s$ and downsampling factor $f_{\mathrm{down}}$ (Défossez et al., 2022, Dhawan et al., 2024).
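The residual cascade and the bitrate formula above can be sketched in plain Python. This is a toy illustration only: real EnCodec learns its codebooks during training, and the total stride of 320 used below is the commonly cited 24 kHz configuration, stated here as an assumption.

```python
import math

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_i, best_d = 0, float("inf")
    for i, c in enumerate(codebook):
        d = sum((a - b) ** 2 for a, b in zip(vec, c))
        if d < best_d:
            best_i, best_d = i, d
    return best_i

def rvq_encode(codebooks, vec):
    """Quantize vec with a cascade of codebooks; each stage quantizes
    the residual left by the previous stage. Returns the code indices."""
    residual = list(vec)
    indices = []
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices

def bitrate_bps(n_codebooks, cb_size, sample_rate, downsample):
    """bitrate = N_cb * log2(D_cb) * F_enc, with F_enc = f_s / f_down."""
    frame_rate = sample_rate / downsample
    return n_codebooks * math.log2(cb_size) * frame_rate

# 8 codebooks of 1024 entries, 24 kHz audio, assumed total stride 320:
# 75 frames/s * 10 bits * 8 codebooks = 6000 bps = 6 kbps.
print(bitrate_bps(8, 1024, 24000, 320))  # 6000.0
```

Note how the coarse-to-fine ordering falls out of the cascade: the first codebook captures most of the energy, and each later codebook only refines what is left.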

The decoder receives the quantized latent, combines (sums or concatenates) the selected codebook vectors, and applies a mirrored stack of transposed convolutions and residual blocks to reconstruct the waveform $\hat{x}$. The architecture is strictly causal for streaming variants, with left-padding only.

2. Training Losses and Discriminator Structure

The default EnCodec training objective is a weighted sum of a time-domain loss, a multi-resolution spectral loss, and an adversarial loss from a multi-scale STFT discriminator (MS-STFTD):
$$\mathcal{L} = \lambda_{\mathrm{time}}\|x - \hat{x}\|_1 + \lambda_{\mathrm{freq}}\sum_{k}\| \mathrm{STFT}_k(x) - \mathrm{STFT}_k(\hat{x}) \|_1 + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}},$$
with typical weights $\lambda_{\mathrm{time}} = 0.1$, $\lambda_{\mathrm{freq}} = 1$, $\lambda_{\mathrm{adv}} = 1$ (Dhawan et al., 2024, Défossez et al., 2022). The adversarial loss uses one or more GAN discriminators operating on the real and imaginary components of multi-scale STFTs.
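As a toy illustration of how the terms combine, the following stdlib-only sketch substitutes a naive DFT for a real STFT and stubs out the adversarial term (which would require a trained discriminator); the window sizes and signal values are illustrative, not EnCodec's.

```python
import cmath

def dft_mag(frame):
    """Magnitude spectrum of one frame via a naive DFT (illustrative only)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) for k in range(n // 2 + 1)]

def spectral_l1(x, x_hat, win):
    """L1 distance between magnitude spectra of non-overlapping frames."""
    total = 0.0
    for start in range(0, len(x) - win + 1, win):
        mx = dft_mag(x[start:start + win])
        my = dft_mag(x_hat[start:start + win])
        total += sum(abs(a - b) for a, b in zip(mx, my))
    return total

def encodec_style_loss(x, x_hat, adv_loss=0.0,
                       l_time=0.1, l_freq=1.0, l_adv=1.0, wins=(8, 16)):
    """Weighted sum: time-domain L1 + multi-resolution spectral L1 + adversarial."""
    time_l1 = sum(abs(a - b) for a, b in zip(x, x_hat))
    freq = sum(spectral_l1(x, x_hat, w) for w in wins)
    return l_time * time_l1 + l_freq * freq + l_adv * adv_loss

x = [0.1 * i for i in range(32)]
print(encodec_style_loss(x, x))  # identical signals -> 0.0
```

The multiple window sizes are the point of the multi-resolution term: short windows penalize timing errors, long windows penalize tonal errors, and the L1 sum covers both.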

For some variants (e.g., FunCodec (Du et al., 2023)), feature-matching and commitment losses are included:

  • Feature matching encourages generator outputs to match internal discriminator activations.
  • Commitment loss constrains encoded vectors to remain close to RVQ outputs.

A novel loss balancer decouples the relative importance of each component from the scale of its gradients, stabilizing adversarial training and improving convergence (Défossez et al., 2022).
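A minimal sketch of the balancing idea, assuming its simplest form (rescale each term's gradient so its contribution is set by the weight rather than the raw gradient scale; the loss names and gradient values are made-up toy data):

```python
def balance(grads, weights, ref_norm=1.0):
    """grads: {name: gradient vector}; weights: {name: lambda}.
    Returns a combined gradient where each term contributes with a norm
    proportional to its weight, regardless of its raw scale."""
    dim = len(next(iter(grads.values())))
    total = [0.0] * dim
    for name, g in grads.items():
        norm = sum(v * v for v in g) ** 0.5 or 1.0  # guard against zero norm
        scale = weights[name] * ref_norm / norm
        total = [t + scale * v for t, v in zip(total, g)]
    return total

# A huge adversarial gradient no longer drowns out a tiny spectral one:
g = balance({"freq": [0.0, 0.001], "adv": [1000.0, 0.0]},
            {"freq": 1.0, "adv": 1.0})
print(g)  # [1.0, 1.0]
```

This is why the paper reports more stable adversarial training: without balancing, the effective weight of each loss depends on gradient magnitudes that drift during training.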

3. Rate Control, Bit Allocation, and Entropy Coding

EnCodec achieves bitrate selection by varying $N_{\mathrm{cb}}$ (the number of codebooks). For example, at 24 kHz, $N_{\mathrm{cb}} = 8$ yields $6$ kbps, while $N_{\mathrm{cb}} = 32$ reaches $24$ kbps. Each codebook index supplies $b = \log_2 D_{\mathrm{cb}}$ bits. Structured quantization dropout enables operation at multiple bitrates by randomly masking codebooks during training (Du et al., 2023).
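Structured quantization dropout can be sketched as truncating the ordered code sequence to a random prefix during training, so one model serves every bitrate at inference simply by dropping trailing codebooks (toy values; a real implementation operates on batched tensors):

```python
import random

def quantization_dropout(codes, rng):
    """codes: per-codebook index sequences, ordered coarse -> fine.
    Keep only the first k codebooks, with k sampled uniformly, so the
    decoder learns to reconstruct from any prefix of the cascade."""
    k = rng.randint(1, len(codes))
    return codes[:k]

rng = random.Random(0)
codes = [[1, 2], [3, 4], [5, 6], [7, 8]]  # 4 codebooks, 2 timesteps
kept = quantization_dropout(codes, rng)
assert kept == codes[:len(kept)]  # always a coarse-to-fine prefix
```

Because the residual cascade orders codebooks coarse-to-fine, truncation degrades quality gracefully rather than catastrophically.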

An optional causal Transformer model (5 layers, 8 heads, model dimension $200$) can be trained on the sequence of code indices to model residual entropy and losslessly compress further, yielding $25$–$40\%$ bitrate savings (e.g., $3.0\;\mathrm{kbps} \rightarrow 1.9\;\mathrm{kbps}$) with negligible quality drop and real-time throughput (Défossez et al., 2022).
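The savings follow from entropy coding: a code index the model predicts with probability $p$ costs about $-\log_2 p$ bits under arithmetic coding, instead of a flat $\log_2 D_{\mathrm{cb}}$ bits. A back-of-envelope sketch with made-up probabilities:

```python
import math

def cross_entropy_bits(probs):
    """Total arithmetic-coding cost (in bits) of a code sequence, given
    the probability the entropy model assigned to each emitted code."""
    return sum(-math.log2(p) for p in probs)

flat_bits = 4 * math.log2(1024)  # 4 codes at a flat 10 bits each = 40 bits
modeled_bits = cross_entropy_bits([0.5, 0.25, 0.125, 0.125])  # 1+2+3+3 = 9
print(flat_bits, modeled_bits)  # 40.0 9.0
```

The better the Transformer predicts the next code, the closer the coded size gets to the true entropy of the index stream, which is where the reported $25$–$40\%$ savings come from.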

4. Evaluation Metrics and Benchmarks

EnCodec is evaluated with both objective and subjective measures:

  • Objective distortion: log-spectral distance (LSD), SI-SNR, multi-resolution STFT error.
  • Perceptual quality: ViSQOL (an approximate MOS proxy), PESQ, and MUSHRA (subjective).
  • Downstream ASR: word error rate (WER) as a function of bitrate.

Reported results indicate:

  • At $6$ kbps (8 codebooks, 24 kHz), dev-clean WER is $2.23\%$ and dev-other $6.02\%$ (Dhawan et al., 2024).
  • Subjective MUSHRA: EnCodec at $3$ kbps ($67$ pts) outperforms Lyra-v2 at $6$ kbps ($66$ pts) and approaches Opus at $12$ kbps ($76$ pts); at $6$ kbps, EnCodec outperforms Opus and EVS across clean speech, noisy speech, and music (Défossez et al., 2022).
  • FunCodec reproduces these scores and, at 100 tokens/sec on LibriTTS, matches or exceeds Facebook's EnCodec and other open toolkits in ViSQOL (Du et al., 2023).

Performance degrades gracefully as bitrate decreases, with low-bitrate operation retaining practical usability.

5. Implementation, Inference, and Integration

EnCodec is implemented for both streaming and offline use. Streamability is achieved via causal convolutions, with ~13 ms latency at 24 kHz; non-streamable (batch-normalized) variants operate at higher latency, suitable for music and stereo (Défossez et al., 2022). FunCodec extends EnCodec with:
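The ~13 ms figure is consistent with one latent frame of lookahead: a causal encoder must buffer one frame's worth of samples before emitting codes. Assuming the commonly cited total stride of 320 samples at 24 kHz (an assumption, not stated in this section):

```python
def frame_latency_ms(sample_rate, downsample):
    """Time (ms) covered by one latent frame: the minimum buffering a
    causal, frame-by-frame codec needs before it can emit codes."""
    return 1000.0 * downsample / sample_rate

print(round(frame_latency_ms(24000, 320), 1))  # 13.3
```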

  • Python API and bash scripts for encoding/decoding audio to indexes or reconstructed waveforms.
  • Seamless ASR integration: replacing fbank features with EnCodec embeddings in FunASR pipelines.
  • Multi-bitrate training and batch inference at $50$–$400$ tokens/sec (Du et al., 2023).

Training recipes specify default hyperparameters: Adam(W) optimizer, learning rates of $2 \times 10^{-4}$–$3 \times 10^{-4}$, batch sizes of 32–128, and up to $300$k updates.

6. Applications and Downstream Use Cases

EnCodec's discrete or continuous embeddings serve as input to ASR systems, enabling competitive or superior WER at much lower bitrate and with reduced data requirements, as demonstrated in Codec-ASR, which further optimizes codebook initialization, embedding aggregation, and training augmentations for improved ASR robustness (Dhawan et al., 2024). Personalized TTS and semantic speech models, such as codec LLMs for VALL-E, leverage EnCodec's tokens directly, improving speaker preservation and sample efficiency (Du et al., 2023).

Cross-modal generation frameworks such as LAV utilize EnCodec embeddings as audio intermediates, mapping them via a linear projection into the style space of vision generators (e.g., StyleGAN2), preserving semantic richness for audio-driven visual synthesis (Jung et al., 15 May 2025).

7. Comparative Performance, Limitations, and Extensions

EnCodec outperforms classical transform codecs (Opus, EVS) and previous neural codecs (SoundStream, Lyra-v2) across different rates and modalities (Défossez et al., 2022, Dhawan et al., 2024, Du et al., 2023). Key factors include the flexibility of RVQ, streamable low-latency operation, adversarial training for perceptual fidelity, and practical entropy coding for further compression.

Limitations include potential model overhead at very high bitrates (>50 kbps), susceptibility to artifacts at extremely low bitrates (<2 kbps), and a small but nonzero increase in inference latency from normalization and entropy buffering. Promising research directions include more efficient sequence models for quantization, adaptive bitrate control via reinforcement learning or content-aware quantization, and fully unified entropy models across codebooks (Défossez et al., 2022).

EnCodec's code, pre-trained models, and reproducible training recipes are open-sourced, with major toolkits (e.g., FunCodec) providing integration into diverse speech and audio pipelines (Du et al., 2023).
