
Jukebox-5B Encoder: Hierarchical VQ-VAE

Updated 3 December 2025
  • Jukebox-5B Encoder is a hierarchical VQ-VAE architecture that converts 44.1 kHz audio into discrete tokens for generative music modeling.
  • It uses three independently trained VQ-VAE modules with distinct hop factors and residual networks to progressively downsample audio while preserving spectral and temporal fidelity.
  • The encoder is optimized with multi-resolution STFT spectral loss, codebook loss, and commitment loss, ensuring robust reconstruction quality for subsequent autoregressive processing.

Jukebox-5B is a hierarchical vector-quantized variational autoencoder (VQ-VAE) encoder architecture that forms the foundational compression stage of the Jukebox generative music model. Its purpose is to transform high-fidelity, 44.1 kHz raw audio into a sequence of discrete tokens suitable for autoregressive modeling with long-term coherence. The Jukebox-5B encoder employs three separate VQ-VAE modules, each operating at a different temporal scale, to progressively compress the audio while optimizing both time-domain and spectral fidelity (Dhariwal et al., 2020).

1. Multi-Level VQ-VAE Encoder Structure

The Jukebox-5B encoder consists of three independently trained, one-dimensional VQ-VAE autoencoders operating on downmixed mono audio at 44.1 kHz. Each level is responsible for a distinct hop factor: 8× (bottom), 32× (middle), and 128× (top), leading to output code rates of approximately 5.5 kHz, 1.4 kHz, and 345 Hz, respectively.
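
As a quick arithmetic check of the hop factors and code rates quoted above (a sketch, not part of any reference implementation):

```python
# Token rates implied by the per-level hop factors at 44.1 kHz.
SAMPLE_RATE = 44_100

def code_rate(hop: int) -> float:
    """Discrete tokens emitted per second by a level with the given hop factor."""
    return SAMPLE_RATE / hop

for name, hop in [("bottom", 8), ("middle", 32), ("top", 128)]:
    # bottom ~5.5 kHz, middle ~1.4 kHz, top ~345 Hz, matching the text
    print(f"{name:>6}: hop {hop:>3}x -> {code_rate(hop):8.1f} tokens/s")
```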

Downsampling Block (used by all levels):

  • Strided convolution: Conv1D with filter width 4, stride 2, and padding 1, channel width $W$ ($W = 64$ for the bottom level, $W = 32$ for middle/top).
  • Residual network: stack of $M$ non-causal, dilated residual convolutional layers:
    • $M = 8$ (bottom), $M = 4$ (middle), $M = 4$ (top).
    • Each block: $\text{Conv1D}(W, W, 3, d, d) \rightarrow \text{ReLU} \rightarrow \text{Conv1D}(W, W, 3, d, d) \rightarrow \text{Add}$, where the dilation $d$ cycles through $\{1, 3, 9, \dots\}$.
  • Channel-mixing convolution: $\text{Conv1D}(W, 64, 3, 1, 1)$, mapping to the codebook embedding size $D = 64$.

Each encoder stack consists of multiple such downsampling blocks: 3 (bottom), 5 (middle), and 7 (top), yielding per-code receptive fields of approximately 120 ms (bottom), 480 ms (middle), and 2 s (top).
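
The downsampling block described above can be sketched in PyTorch as follows. The module names (`ResBlock`, `DownBlock`) and the period-3 dilation cycle are illustrative assumptions, not the reference Jukebox code:

```python
# Minimal PyTorch sketch of one downsampling block (hypothetical names).
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Non-causal dilated residual layer: Conv -> ReLU -> Conv -> Add."""
    def __init__(self, width: int, dilation: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(width, width, 3, dilation=dilation, padding=dilation),
            nn.ReLU(),
            nn.Conv1d(width, width, 3, dilation=dilation, padding=dilation),
        )

    def forward(self, x):
        return x + self.net(x)  # residual Add

class DownBlock(nn.Module):
    """Strided conv (2x downsample) followed by M dilated residual layers."""
    def __init__(self, in_ch: int, width: int, num_res: int):
        super().__init__()
        self.down = nn.Conv1d(in_ch, width, 4, stride=2, padding=1)
        # Dilations cycle through 1, 3, 9, ... (period-3 cycle assumed here)
        self.res = nn.Sequential(*[ResBlock(width, 3 ** (i % 3))
                                   for i in range(num_res)])

    def forward(self, x):
        return self.res(self.down(x))

x = torch.randn(1, 1, 1024)    # (batch, channels, samples)
y = DownBlock(1, 64, 8)(x)     # bottom-level settings: W=64, M=8
print(y.shape)                 # time axis halved: (1, 64, 512)
```

Stacking 3, 5, or 7 such blocks gives the 8×, 32×, and 128× hop factors, since each block halves the time axis.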

Discrete bottleneck: after the channel-mixing convolution, latent vectors $h_s \in \mathbb{R}^{64}$ are quantized via nearest-neighbor lookup into a codebook $C \in \mathbb{R}^{2048 \times 64}$, selecting the embedding with minimum $\ell_2$ distance.
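
The nearest-neighbor lookup reduces to an argmin over squared $\ell_2$ distances; a NumPy sketch with random stand-in values for the codebook and latents:

```python
# Nearest-neighbour codebook lookup for the discrete bottleneck
# (NumPy sketch; random values, but K=2048 and D=64 as in the text).
import numpy as np

rng = np.random.default_rng(0)
K, D = 2048, 64
codebook = rng.normal(size=(K, D))   # C in R^{2048 x 64}
h = rng.normal(size=(10, D))         # 10 latent vectors h_s

# Squared L2 distance from every h_s to every codebook entry e_j
d2 = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (10, K)
codes = d2.argmin(axis=1)            # z_s = argmin_j ||h_s - e_j||_2
z_q = codebook[codes]                # quantized latents e_{z_s}

print(codes.shape, z_q.shape)        # (10,) (10, 64)
```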

2. Quantization and Objective Functions

The encoder EθE_\theta produces per-hop latent representations hs=Eθ(x)sh_s = E_\theta(x)_s. Quantization converts hsh_s to zq(x)s=ekz_q(x)_s = e_k, where k=argminjhsej2k = \arg\min_j \|h_s - e_j\|_2 for codebook embeddings {ej}\{e_j\}. Jukebox-5B employs four principal loss terms, summed for each level:

  1. Time-domain reconstruction loss:

    $$L_{rec} = \frac{1}{T} \sum_{t=1}^{T} \|x_t - \hat{y}_t\|_2^2$$

  2. Multi-resolution STFT spectral loss:

    $$L_{spec} = \sum_{\ell=1}^{3} \left\| |\mathrm{STFT}_\ell(x)| - |\mathrm{STFT}_\ell(\hat{y})| \right\|_2$$

    with parameter sets $(n_{\mathrm{fft}}, \text{hop}, \text{window}) \in \{(2048, 240, 1200),\ (1024, 120, 600),\ (512, 50, 240)\}$.
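
A minimal NumPy sketch of the multi-resolution magnitude-STFT loss, using the parameter triples above; the Hann window is an assumption:

```python
# Multi-resolution STFT magnitude loss (NumPy sketch; Hann window assumed).
import numpy as np

def stft_mag(x, n_fft, hop, win_len):
    """Magnitude spectrogram: windowed frames -> zero-padded real FFT."""
    win = np.hanning(win_len)
    frames = [np.fft.rfft(x[s:s + win_len] * win, n=n_fft)
              for s in range(0, len(x) - win_len + 1, hop)]
    return np.abs(np.array(frames))

def spectral_loss(x, y, params=((2048, 240, 1200),
                                (1024, 120, 600),
                                (512, 50, 240))):
    # Sum of L2 norms of magnitude differences over the three resolutions
    return sum(np.linalg.norm(stft_mag(x, *p) - stft_mag(y, *p))
               for p in params)

t = np.linspace(0, 1, 44_100, endpoint=False)
x = np.sin(2 * np.pi * 440 * t)          # 1 s, 440 Hz tone
print(spectral_loss(x, x))               # identical signals -> 0.0
print(spectral_loss(x, 0.5 * x) > 0)     # attenuated copy -> True
```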

  3. Codebook loss (EMA update, $\gamma = 0.99$):

    $$L_{vq} = \frac{1}{S} \sum_s \|\mathrm{sg}[h_s] - e_{z_s}\|_2^2$$

  4. Commitment loss:

    $$L_{commit} = \frac{1}{S} \sum_s \|h_s - \mathrm{sg}[e_{z_s}]\|_2^2$$

The total loss per VQ-VAE level is $L = L_{rec} + L_{spec} + L_{vq} + \beta L_{commit}$, with $\beta = 0.02$.

Codebook embeddings are updated via exponential moving average as in VQ-VAE 2. To mitigate code underutilization, any “dead code” below a usage threshold is reinitialized with a random encoder output.
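
The EMA update and dead-code restart can be sketched as follows; the tiny codebook dimensions and the usage threshold of 0.01 are illustrative assumptions for the demo, not values from the paper:

```python
# EMA codebook update with dead-code restarts (NumPy sketch, gamma=0.99).
import numpy as np

rng = np.random.default_rng(1)
K, D, gamma = 8, 4, 0.99          # tiny codebook for illustration
codebook = rng.normal(size=(K, D))
ema_count = np.ones(K)            # EMA of per-code usage counts
ema_sum = codebook.copy()         # EMA of assigned encoder outputs

def ema_update(h, codes):
    """Update codebook toward the mean of latents assigned to each code."""
    global codebook
    for j in range(K):
        mask = codes == j
        ema_count[j] = gamma * ema_count[j] + (1 - gamma) * mask.sum()
        ema_sum[j] = gamma * ema_sum[j] + (1 - gamma) * h[mask].sum(axis=0)
    codebook = ema_sum / np.maximum(ema_count, 1e-8)[:, None]
    # Dead-code restart: reinitialize rarely used codes from encoder outputs
    dead = ema_count < 0.01       # usage threshold (assumed value)
    codebook[dead] = h[rng.integers(0, len(h), dead.sum())]

h = rng.normal(size=(64, D))      # a batch of encoder outputs
codes = ((h[:, None] - codebook[None]) ** 2).sum(-1).argmin(1)
ema_update(h, codes)
print(codebook.shape)             # (8, 4)
```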

3. Data Preparation and Training Regimen

The corpus comprises 1.2 million songs at 44.1 kHz, 32-bit float stereo, randomly downmixed to mono within [–1, 1]. Training segments of ≈9 s (393,216 samples) are randomly extracted. Each VQ-VAE is trained with batch size 256 using the Adam optimizer (learning rate $3 \times 10^{-4}$, no weight decay) for 384,618 steps (≈3 days on 256 V100 GPUs). Weight initialization uses a scale of 0.02. All levels share codebook size $K = 2048$ and embedding dimension $D = 64$, with commitment loss coefficient $\beta = 0.02$ and EMA $\gamma = 0.99$.

4. Empirical Performance Characteristics

Each VQ-VAE level contains approximately 2 million parameters; codebook memory per level is $2048 \times 64 \times 4$ bytes $\approx 0.5$ MB. Reconstruction fidelity, measured as spectral convergence (dB, lower is better), is:

Level    Hop    Spectral convergence (with restarts)
Bottom   8×     –23.0 dB
Middle   32×    –12.4 dB
Top      128×   –8.3 dB

Removing the spectral loss at the top level degrades performance to –6.3 dB. A collapsed single-level hierarchy (VQ-VAE-2 style) increases spectral convergence by at least +3 dB. Bottom-level codebook-size ablation: $K = 256$ gives –15.9 dB, $K = 2048$ gives –23.0 dB, and no quantization ("continuous") gives –40.5 dB.

5. Bottlenecks and Ablation Observations

Large hop factors (32×, 128×) at middle/top levels reduce representation of high-frequency structure. While multi-resolution spectral loss recovers some high-frequency content, it can also introduce perceptible artifacts (“scratchiness”). Codebook utilization is actively maintained via embedding reinitialization to avoid dead code vectors.

Single-hierarchy VQ-VAEs are less effective, as evidenced by notably worse spectral convergence. Increased codebook size at the bottom level significantly improves fidelity. Continuous-valued bottlenecks yield far superior reconstruction but cannot be directly sampled by the subsequent autoregressive transformer.

6. Encoder Hyperparameter Specification

All key architectural parameters are summarized below for reproducibility:

Feature                                 Value / Setting
Sample rate                             44,100 Hz
Segment length                          393,216 samples (≈9 s)
Residual block width (W)                64 (bottom), 32 (middle/top)
Residual blocks per downsampling (M)    8 (bottom), 4 (middle), 4 (top)
Hop factors                             8× (bottom), 32× (middle), 128× (top)
Embedding width (D)                     64
Codebook size (K)                       2048
STFT (n_fft, hop, window)               (2048, 240, 1200), (1024, 120, 600), (512, 50, 240)
β (commitment), EMA γ                   0.02, 0.99
Optimizer, batch size, steps, lr        Adam, 256, 384,618, 3×10⁻⁴

A reimplementation intended for parity with Jukebox-5B must adhere exactly to these hyperparameters and block arrangements (Dhariwal et al., 2020).

References

  1. Dhariwal, P., Jun, H., Payne, C., Kim, J. W., Radford, A., & Sutskever, I. (2020). Jukebox: A Generative Model for Music. arXiv:2005.00218.
