Jukebox-5B Encoder: Hierarchical VQ-VAE
- Jukebox-5B Encoder is a hierarchical VQ-VAE architecture that converts 44.1 kHz audio into discrete tokens for generative music modeling.
- It uses three independently trained VQ-VAE modules with distinct hop factors and residual networks to progressively downsample audio while preserving spectral and temporal fidelity.
- The encoder is optimized with multi-resolution STFT spectral loss, codebook loss, and commitment loss, ensuring robust reconstruction quality for subsequent autoregressive processing.
Jukebox-5B is a hierarchical vector-quantized variational autoencoder (VQ-VAE) encoder architecture that forms the foundational compression stage of the Jukebox generative music model. Its purpose is to transform high-fidelity, 44.1 kHz raw audio into a sequence of discrete tokens suitable for autoregressive modeling with long-term coherence. The Jukebox-5B encoder employs three separate VQ-VAE modules, each operating at a different temporal scale, to progressively compress the audio while optimizing both time-domain and spectral fidelity (Dhariwal et al., 2020).
1. Multi-Level VQ-VAE Encoder Structure
The Jukebox-5B encoder consists of three independently trained, one-dimensional VQ-VAE autoencoders operating on downmixed mono audio at 44.1 kHz. Each level is responsible for a distinct hop factor: 8× (bottom), 32× (middle), and 128× (top), leading to output code rates of approximately 5.5 kHz, 1.4 kHz, and 345 Hz, respectively.
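The code rates above follow directly from the hop factors; a minimal arithmetic check (function name is illustrative):

```python
# Each VQ-VAE level emits one discrete token per hop, so the token rate
# at a given level is simply sample_rate / hop_factor.
SAMPLE_RATE = 44_100  # Hz

def code_rate_hz(hop_factor: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Tokens per second produced by a level with the given hop factor."""
    return sample_rate / hop_factor

rates = {name: code_rate_hz(hop)
         for name, hop in [("bottom", 8), ("middle", 32), ("top", 128)]}
# bottom ≈ 5512.5 Hz, middle ≈ 1378.1 Hz, top ≈ 344.5 Hz
```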
Downsampling Block (used by all levels):
- Strided convolution: filter width 4, stride 2, and padding 1, with channel width W = 64 for the bottom level and W = 32 for the middle/top levels.
- Residual network: stack of non-causal, dilated residual convolutional layers:
- Depth: 8 layers (bottom), 4 layers (middle), 4 layers (top).
- Each block: a dilated 3×1 convolution followed by a 1×1 convolution, with the dilation cycling through an increasing schedule across the stack.
- Channel-mixing convolution: a final 1×1 convolution mapping the channel width W to the codebook embedding size D = 64.
Each encoder stack consists of multiple such downsampling blocks: 3 (bottom), 5 (middle), and 7 (top), giving overall hop factors of 2³ = 8, 2⁵ = 32, and 2⁷ = 128 and receptive fields of approximately 120 ms (bottom), 480 ms (middle), and 2 s (top) per code.
Discrete bottleneck: after the channel-mixing convolution, each latent vector is quantized via nearest-neighbor lookup into a codebook C = {e_1, …, e_K} with K = 2048 entries, selecting the embedding with minimum Euclidean distance.
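The nearest-neighbor lookup in the discrete bottleneck can be sketched in NumPy as follows (a simplified illustration; the helper name and shapes are assumptions, and a real implementation would run on GPU inside the training graph):

```python
import numpy as np

def quantize(h, codebook):
    """Map each latent vector h_t (shape [T, D]) to the index of the
    closest codebook embedding e_k (shape [K, D]) under squared
    Euclidean distance, returning tokens and quantized latents."""
    # ||h - e||^2 = ||h||^2 - 2 h.e + ||e||^2, computed for all pairs at once
    d2 = (np.sum(h**2, axis=1, keepdims=True)
          - 2.0 * h @ codebook.T
          + np.sum(codebook**2, axis=1))
    codes = np.argmin(d2, axis=1)      # [T] discrete tokens
    return codes, codebook[codes]      # tokens and their embeddings

rng = np.random.default_rng(0)
C = rng.normal(size=(2048, 64))        # K = 2048 codes, D = 64 dims
h = rng.normal(size=(16, 64))          # 16 per-hop latents
codes, z = quantize(h, C)
```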
2. Quantization and Objective Functions
The encoder produces per-hop latent representations h_t ∈ R^D. Quantization converts h_t to e_{z_t}, where z_t = argmin_k ||h_t − e_k||₂ over the codebook embeddings e_1, …, e_K. Jukebox-5B employs four principal loss terms, summed for each level:
- Time-domain reconstruction loss: L_recon = (1/T) Σ_t ||x_t − x̂_t||²₂, the sample-wise squared error between input and reconstruction.
- Multi-resolution STFT spectral loss: L_spec = Σ_i || |STFT_i(x)| − |STFT_i(x̂)| ||₂,
with parameter sets (bins, hop, window) ∈ {(2048, 240, 1200), (1024, 120, 600), (512, 50, 240)}.
- Codebook loss (in practice replaced by an EMA update with decay γ = 0.99): L_code = ||sg[h_t] − e_{z_t}||²₂, where sg[·] denotes the stop-gradient operator.
- Commitment loss: L_commit = ||h_t − sg[e_{z_t}]||²₂.
The total loss per VQ-VAE level is L = L_recon + L_spec + β L_commit, with β = 0.02.
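As a concrete sketch of these objectives, the NumPy code below computes numeric values for the reconstruction, spectral, and commitment terms (stop-gradients, which control which parameters each term trains, have no NumPy analogue; the helper names and the simple framed-FFT STFT are illustrative assumptions):

```python
import numpy as np

# STFT parameter sets from the spectral loss above: (fft bins, hop, window)
STFT_PARAMS = ((2048, 240, 1200), (1024, 120, 600), (512, 50, 240))

def stft_mag(x, n_fft, hop, win_len):
    """Magnitude STFT via framed, Hann-windowed real FFTs."""
    win = np.hanning(win_len)
    frames = np.stack([x[i:i + win_len] * win
                       for i in range(0, len(x) - win_len + 1, hop)])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

def spectral_loss(x, x_hat):
    """Multi-resolution STFT magnitude L2 distance."""
    return sum(np.linalg.norm(stft_mag(x, *p) - stft_mag(x_hat, *p))
               for p in STFT_PARAMS)

def level_loss(x, x_hat, h, e_sel, beta=0.02):
    """Total per-level loss: recon + spectral + beta * commitment.
    (The codebook term is handled by the EMA update, not this sum.)"""
    l_recon = np.mean((x - x_hat) ** 2)
    l_commit = np.mean(np.sum((h - e_sel) ** 2, axis=1))
    return l_recon + spectral_loss(x, x_hat) + beta * l_commit
```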
Codebook embeddings are updated via exponential moving average as in VQ-VAE 2. To mitigate code underutilization, any “dead code” below a usage threshold is reinitialized with a random encoder output.
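A simplified sketch of the EMA codebook update with dead-code restarts is given below; the usage threshold, the restart rule, and the single-EMA formulation are assumptions (the full VQ-VAE-2-style update also tracks an EMA of per-code latent sums):

```python
import numpy as np

def ema_codebook_update(codebook, counts, h, codes, gamma=0.99, thresh=1.0):
    """One EMA step plus dead-code random restarts (simplified).
    codebook: [K, D]; counts: [K] EMA usage counts;
    h: [T, D] encoder outputs; codes: [T] assigned indices."""
    K, D = codebook.shape
    onehot = np.zeros((len(codes), K))
    onehot[np.arange(len(codes)), codes] = 1.0
    n = onehot.sum(axis=0)                     # usage this batch
    counts = gamma * counts + (1 - gamma) * n  # EMA cluster sizes
    sums = onehot.T @ h                        # per-code latent sums
    # move each used code toward the mean of the latents assigned to it
    used = n > 0
    codebook[used] = (gamma * codebook[used]
                      + (1 - gamma) * sums[used] / n[used, None])
    # dead-code restart: reinit rarely used codes with random encoder outputs
    dead = counts < thresh
    if dead.any():
        codebook[dead] = h[np.random.randint(0, len(h), dead.sum())]
    return codebook, counts
```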
3. Data Preparation and Training Regimen
The corpus comprises 1.2 million songs at 44.1 kHz, stored as 32-bit float stereo and randomly downmixed to mono in [−1, 1]. Training segments of ~9 s (393,216 samples) are randomly extracted. Each VQ-VAE is trained with batch size 256 and the Adam optimizer (learning rate 3 × 10⁻⁴, no weight decay) for 384,618 steps (≈3 days on 256 V100 GPUs). Weights are initialized with a scale of 0.02. All levels share codebook size K = 2048 and embedding dimension D = 64, with commitment loss coefficient β = 0.02 and EMA decay γ = 0.99.
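The segment-extraction step can be sketched as follows; the random-weight downmix is an assumption, since the source only states that stereo is "randomly downmixed to mono":

```python
import numpy as np

SEGMENT = 393_216  # ~9 s at 44.1 kHz

def training_segment(stereo, rng):
    """Downmix a [N, 2] float stereo clip to mono in [-1, 1] and crop a
    random 9 s training segment. The random channel weighting below is
    one plausible reading of 'randomly downmixed'."""
    w = rng.uniform(0.0, 1.0)                    # random channel weight
    mono = w * stereo[:, 0] + (1.0 - w) * stereo[:, 1]
    mono = np.clip(mono, -1.0, 1.0)
    start = rng.integers(0, len(mono) - SEGMENT + 1)
    return mono[start:start + SEGMENT]

rng = np.random.default_rng(0)
clip = rng.uniform(-1, 1, size=(SEGMENT * 2, 2)).astype(np.float32)
seg = training_segment(clip, rng)
```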
4. Empirical Performance Characteristics
Each VQ-VAE level contains approximately 2 million parameters; codebook memory per level is 2048 × 64 × 4 bytes ≈ 0.5 MB. Reconstruction fidelity, measured as spectral convergence (in dB, lower is better), is:
| Level | Hop | Spectral Conv. (with restarts) |
|---|---|---|
| Bottom | 8 | –23.0 dB |
| Middle | 32 | –12.4 dB |
| Top | 128 | –8.3 dB |
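The per-level codebook memory can be verified directly, assuming 32-bit float embeddings:

```python
# Codebook storage: K entries of D float32 values each.
K, D, BYTES_PER_FLOAT32 = 2048, 64, 4
codebook_bytes = K * D * BYTES_PER_FLOAT32   # total bytes per level
codebook_mib = codebook_bytes / 2**20        # convert to MiB
```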
Removing the spectral loss at the top level degrades spectral convergence to –6.3 dB. A collapsed single-level hierarchy (VQ-VAE-2 style) worsens spectral convergence by at least +3 dB. In the bottom-level codebook-size ablation, shrinking the codebook worsens spectral convergence, while removing quantization entirely ("continuous" bottleneck) reaches –40.5 dB.
5. Bottlenecks and Ablation Observations
Large hop factors (32×, 128×) at middle/top levels reduce representation of high-frequency structure. While multi-resolution spectral loss recovers some high-frequency content, it can also introduce perceptible artifacts (“scratchiness”). Codebook utilization is actively maintained via embedding reinitialization to avoid dead code vectors.
Single-hierarchy VQ-VAEs are less effective, as evidenced by notably worse spectral convergence. Increased codebook size at the bottom level significantly improves fidelity. Continuous-valued bottlenecks yield far superior reconstruction but cannot be directly sampled by the subsequent autoregressive transformer.
6. Encoder Hyperparameter Specification
All key architectural parameters are summarized below for reproducibility:
| Feature | Value / Setting |
|---|---|
| Sample rate | 44,100 Hz |
| Segment length | 393,216 samples (~9 s) |
| Residual block width (W) | 64 (bottom), 32 (middle/top) |
| Residual blocks per downsampling block | 8 (bottom), 4 (middle), 4 (top) |
| Hop lengths | 8 (bottom), 32 (middle), 128 (top) |
| Embedding width (D) | 64 |
| Codebook size (K) | 2048 |
| STFT bins/hop/window | (2048,240,1200), (1024,120,600), (512,50,240) |
| β (commitment), γ (EMA decay) | 0.02, 0.99 |
| Optimizer, batch size, steps, lr | Adam, 256, 384,618, 3 × 10⁻⁴ |
A reimplementation intended for parity with Jukebox-5B must adhere exactly to these hyperparameters and block arrangements (Dhariwal et al., 2020).