
Jukebox-5B Encoder: Hierarchical VQ-VAE

Updated 3 December 2025
  • Jukebox-5B Encoder is a hierarchical VQ-VAE architecture that converts 44.1 kHz audio into discrete tokens for generative music modeling.
  • It uses three independently trained VQ-VAE modules with distinct hop factors and residual networks to progressively downsample audio while preserving spectral and temporal fidelity.
  • The encoder is optimized with multi-resolution STFT spectral loss, codebook loss, and commitment loss, ensuring robust reconstruction quality for subsequent autoregressive processing.

Jukebox-5B is a hierarchical vector-quantized variational autoencoder (VQ-VAE) encoder architecture that forms the foundational compression stage of the Jukebox generative music model. Its purpose is to transform high-fidelity, 44.1 kHz raw audio into a sequence of discrete tokens suitable for autoregressive modeling with long-term coherence. The Jukebox-5B encoder employs three separate VQ-VAE modules, each operating at a different temporal scale, to progressively compress the audio while optimizing both time-domain and spectral fidelity (Dhariwal et al., 2020).

1. Multi-Level VQ-VAE Encoder Structure

The Jukebox-5B encoder consists of three independently trained, one-dimensional VQ-VAE autoencoders operating on downmixed mono audio at 44.1 kHz. Each level is responsible for a distinct hop factor: 8× (bottom), 32× (middle), and 128× (top), leading to output code rates of approximately 5.5 kHz, 1.4 kHz, and 345 Hz, respectively.
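
As a quick arithmetic check of the hop factors and code rates quoted above (a sketch, not part of any reference implementation):

```python
# Token rates implied by the per-level hop factors at 44.1 kHz.
SAMPLE_RATE = 44_100

def code_rate(hop: int) -> float:
    """Discrete tokens emitted per second by a level with the given hop factor."""
    return SAMPLE_RATE / hop

for name, hop in [("bottom", 8), ("middle", 32), ("top", 128)]:
    # bottom ~5.5 kHz, middle ~1.4 kHz, top ~345 Hz, matching the text
    print(f"{name:>6}: hop {hop:>3}x -> {code_rate(hop):8.1f} tokens/s")
```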

Downsampling Block (used by all levels):

  • Strided convolution: Conv1D with filter width 4, stride 2, and padding 1, channel width $W$ ($W = 64$ for the bottom level, $W = 32$ for middle/top).
  • Residual network: stack of $M$ non-causal, dilated residual convolutional layers:
    • $M = 8$ (bottom), $M = 4$ (middle), $M = 4$ (top).
    • Each block: $\text{Conv1D}(W, W, 3, d, d) \rightarrow \text{ReLU} \rightarrow \text{Conv1D}(W, W, 3, d, d) \rightarrow \text{Add}$, where the dilation $d$ cycles through $\{1, 3, 9, \dots\}$.
  • Channel-mixing convolution: $\text{Conv1D}(W, 64, 3, 1, 1)$, mapping to the codebook embedding size $D = 64$.

Each encoder stack consists of multiple such downsampling blocks: 3 (bottom), 5 (middle), and 7 (top), yielding per-code receptive fields of approximately 120 ms (bottom), 480 ms (middle), and 2 s (top).
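
The downsampling block described above can be sketched in PyTorch as follows. The module names (`ResBlock`, `DownBlock`) and the period-3 dilation cycle are illustrative assumptions, not the reference Jukebox code:

```python
# Minimal PyTorch sketch of one downsampling block (hypothetical names).
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Non-causal dilated residual layer: Conv -> ReLU -> Conv -> Add."""
    def __init__(self, width: int, dilation: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(width, width, 3, dilation=dilation, padding=dilation),
            nn.ReLU(),
            nn.Conv1d(width, width, 3, dilation=dilation, padding=dilation),
        )

    def forward(self, x):
        return x + self.net(x)  # residual Add

class DownBlock(nn.Module):
    """Strided conv (2x downsample) followed by M dilated residual layers."""
    def __init__(self, in_ch: int, width: int, num_res: int):
        super().__init__()
        self.down = nn.Conv1d(in_ch, width, 4, stride=2, padding=1)
        # Dilations cycle through 1, 3, 9, ... (period-3 cycle assumed here)
        self.res = nn.Sequential(*[ResBlock(width, 3 ** (i % 3))
                                   for i in range(num_res)])

    def forward(self, x):
        return self.res(self.down(x))

x = torch.randn(1, 1, 1024)    # (batch, channels, samples)
y = DownBlock(1, 64, 8)(x)     # bottom-level settings: W=64, M=8
print(y.shape)                 # time axis halved: (1, 64, 512)
```

Stacking 3, 5, or 7 such blocks gives the 8×, 32×, and 128× hop factors, since each block halves the time axis.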

Discrete bottleneck: after the channel-mixing convolution, latent vectors $h_s \in \mathbb{R}^{64}$ are quantized via nearest-neighbor lookup into a codebook $C \in \mathbb{R}^{2048 \times 64}$, selecting the embedding with minimum $\ell_2$ distance.
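
The nearest-neighbor lookup reduces to an argmin over squared $\ell_2$ distances; a NumPy sketch with random stand-in values for the codebook and latents:

```python
# Nearest-neighbour codebook lookup for the discrete bottleneck
# (NumPy sketch; random values, but K=2048 and D=64 as in the text).
import numpy as np

rng = np.random.default_rng(0)
K, D = 2048, 64
codebook = rng.normal(size=(K, D))   # C in R^{2048 x 64}
h = rng.normal(size=(10, D))         # 10 latent vectors h_s

# Squared L2 distance from every h_s to every codebook entry e_j
d2 = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (10, K)
codes = d2.argmin(axis=1)            # z_s = argmin_j ||h_s - e_j||_2
z_q = codebook[codes]                # quantized latents e_{z_s}

print(codes.shape, z_q.shape)        # (10,) (10, 64)
```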

2. Quantization and Objective Functions

The encoder EθE_\theta produces per-hop latent representations hs=Eθ(x)sh_s = E_\theta(x)_s. Quantization converts hsh_s to zq(x)s=ekz_q(x)_s = e_k, where k=argminjhsej2k = \arg\min_j \|h_s - e_j\|_2 for codebook embeddings {ej}\{e_j\}. Jukebox-5B employs four principal loss terms, summed for each level:

  1. Time-domain reconstruction loss:

    $$L_{rec} = \frac{1}{T} \sum_{t=1}^{T} \|x_t - \hat{y}_t\|_2^2$$

  2. Multi-resolution STFT spectral loss:

    $$L_{spec} = \sum_{\ell=1}^{3} \left\| |\mathrm{STFT}_\ell(x)| - |\mathrm{STFT}_\ell(\hat{y})| \right\|_2$$

    with parameter sets $(n_{\mathrm{fft}}, \text{hop}, \text{window}) \in \{(2048, 240, 1200),\ (1024, 120, 600),\ (512, 50, 240)\}$.
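
A minimal NumPy sketch of the multi-resolution magnitude-STFT loss, using the parameter triples above; the Hann window is an assumption:

```python
# Multi-resolution STFT magnitude loss (NumPy sketch; Hann window assumed).
import numpy as np

def stft_mag(x, n_fft, hop, win_len):
    """Magnitude spectrogram: windowed frames -> zero-padded real FFT."""
    win = np.hanning(win_len)
    frames = [np.fft.rfft(x[s:s + win_len] * win, n=n_fft)
              for s in range(0, len(x) - win_len + 1, hop)]
    return np.abs(np.array(frames))

def spectral_loss(x, y, params=((2048, 240, 1200),
                                (1024, 120, 600),
                                (512, 50, 240))):
    # Sum of L2 norms of magnitude differences over the three resolutions
    return sum(np.linalg.norm(stft_mag(x, *p) - stft_mag(y, *p))
               for p in params)

t = np.linspace(0, 1, 44_100, endpoint=False)
x = np.sin(2 * np.pi * 440 * t)          # 1 s, 440 Hz tone
print(spectral_loss(x, x))               # identical signals -> 0.0
print(spectral_loss(x, 0.5 * x) > 0)     # attenuated copy -> True
```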

  3. Codebook loss (EMA update, $\gamma = 0.99$):

    $$L_{vq} = \frac{1}{S} \sum_s \|\mathrm{sg}[h_s] - e_{z_s}\|_2^2$$

  4. Commitment loss:

    $$L_{commit} = \frac{1}{S} \sum_s \|h_s - \mathrm{sg}[e_{z_s}]\|_2^2$$

The total loss per VQ-VAE level is $L = L_{rec} + L_{spec} + L_{vq} + \beta L_{commit}$, with $\beta = 0.02$.

Codebook embeddings are updated via exponential moving average as in VQ-VAE 2. To mitigate code underutilization, any “dead code” below a usage threshold is reinitialized with a random encoder output.
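
The EMA update and dead-code restart can be sketched as follows; the tiny codebook dimensions and the usage threshold of 0.01 are illustrative assumptions for the demo, not values from the paper:

```python
# EMA codebook update with dead-code restarts (NumPy sketch, gamma=0.99).
import numpy as np

rng = np.random.default_rng(1)
K, D, gamma = 8, 4, 0.99          # tiny codebook for illustration
codebook = rng.normal(size=(K, D))
ema_count = np.ones(K)            # EMA of per-code usage counts
ema_sum = codebook.copy()         # EMA of assigned encoder outputs

def ema_update(h, codes):
    """Update codebook toward the mean of latents assigned to each code."""
    global codebook
    for j in range(K):
        mask = codes == j
        ema_count[j] = gamma * ema_count[j] + (1 - gamma) * mask.sum()
        ema_sum[j] = gamma * ema_sum[j] + (1 - gamma) * h[mask].sum(axis=0)
    codebook = ema_sum / np.maximum(ema_count, 1e-8)[:, None]
    # Dead-code restart: reinitialize rarely used codes from encoder outputs
    dead = ema_count < 0.01       # usage threshold (assumed value)
    codebook[dead] = h[rng.integers(0, len(h), dead.sum())]

h = rng.normal(size=(64, D))      # a batch of encoder outputs
codes = ((h[:, None] - codebook[None]) ** 2).sum(-1).argmin(1)
ema_update(h, codes)
print(codebook.shape)             # (8, 4)
```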

3. Data Preparation and Training Regimen

The corpus comprises 1.2 million songs at 44.1 kHz, 32-bit float stereo, randomly downmixed to mono within [–1, 1]. Training segments of ≈9 s (393,216 samples) are randomly extracted. Each VQ-VAE is trained with batch size 256 using the Adam optimizer (learning rate $3 \times 10^{-4}$, no weight decay) for 384,618 steps (≈3 days on 256 V100 GPUs). Weight initialization uses a scale of 0.02. All levels share codebook size $K = 2048$ and embedding dimension $D = 64$, with commitment loss coefficient $\beta = 0.02$ and EMA $\gamma = 0.99$.

4. Empirical Performance Characteristics

Each VQ-VAE level contains approximately 2 million parameters; codebook memory per level is $2048 \times 64 \times 4$ bytes $\approx 0.5$ MB. Reconstruction fidelity, measured as spectral convergence (dB, lower is better), is:

Level    Hop    Spectral convergence (with restarts)
Bottom   8×     –23.0 dB
Middle   32×    –12.4 dB
Top      128×   –8.3 dB

Removing the spectral loss at the top level degrades performance to –6.3 dB. A collapsed single-level hierarchy (VQ-VAE-2 style) increases spectral convergence by at least +3 dB. Bottom-level codebook-size ablation: $K = 256$ gives –15.9 dB, $K = 2048$ gives –23.0 dB, and no quantization ("continuous") gives –40.5 dB.

5. Bottlenecks and Ablation Observations

Large hop factors (32×, 128×) at middle/top levels reduce representation of high-frequency structure. While multi-resolution spectral loss recovers some high-frequency content, it can also introduce perceptible artifacts (“scratchiness”). Codebook utilization is actively maintained via embedding reinitialization to avoid dead code vectors.

Single-hierarchy VQ-VAEs are less effective, as evidenced by notably worse spectral convergence. Increased codebook size at the bottom level significantly improves fidelity. Continuous-valued bottlenecks yield far superior reconstruction but cannot be directly sampled by the subsequent autoregressive transformer.

6. Encoder Hyperparameter Specification

All key architectural parameters are summarized below for reproducibility:

Feature                                 Value / Setting
Sample rate                             44,100 Hz
Segment length                          393,216 samples (≈9 s)
Residual block width (W)                64 (bottom), 32 (middle/top)
Residual blocks per downsampling (M)    8 (bottom), 4 (middle), 4 (top)
Hop factors                             8× (bottom), 32× (middle), 128× (top)
Embedding width (D)                     64
Codebook size (K)                       2048
STFT (n_fft, hop, window)               (2048, 240, 1200), (1024, 120, 600), (512, 50, 240)
β (commitment), EMA γ                   0.02, 0.99
Optimizer, batch size, steps, lr        Adam, 256, 384,618, 3×10⁻⁴

A reimplementation intended for parity with Jukebox-5B must adhere exactly to these hyperparameters and block arrangements (Dhariwal et al., 2020).

References

  1. Dhariwal, P., Jun, H., Payne, C., Kim, J. W., Radford, A., & Sutskever, I. (2020). Jukebox: A Generative Model for Music. arXiv:2005.00218.
