
Jukebox-5B Encoder: Hierarchical VQ-VAE

Updated 3 December 2025
  • Jukebox-5B Encoder is a hierarchical VQ-VAE architecture that converts 44.1 kHz audio into discrete tokens for generative music modeling.
  • It uses three independently trained VQ-VAE modules with distinct hop factors and residual networks to progressively downsample audio while preserving spectral and temporal fidelity.
  • The encoder is optimized with multi-resolution STFT spectral loss, codebook loss, and commitment loss, ensuring robust reconstruction quality for subsequent autoregressive processing.

Jukebox-5B is a hierarchical vector-quantized variational autoencoder (VQ-VAE) encoder architecture that forms the foundational compression stage of the Jukebox generative music model. Its purpose is to transform high-fidelity, 44.1 kHz raw audio into a sequence of discrete tokens suitable for autoregressive modeling with long-term coherence. The Jukebox-5B encoder employs three separate VQ-VAE modules, each operating at a different temporal scale, to progressively compress the audio while optimizing both time-domain and spectral fidelity (Dhariwal et al., 2020).

1. Multi-Level VQ-VAE Encoder Structure

The Jukebox-5B encoder consists of three independently trained, one-dimensional VQ-VAE autoencoders operating on downmixed mono audio at 44.1 kHz. Each level is responsible for a distinct hop factor: 8× (bottom), 32× (middle), and 128× (top), leading to output code rates of approximately 5.5 kHz, 1.4 kHz, and 345 Hz, respectively.
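These code rates follow directly from the 44.1 kHz sample rate and the per-level hop factors:

```python
SAMPLE_RATE_HZ = 44_100

# Hop factor per VQ-VAE level: each discrete code summarizes this many input samples.
HOPS = {"bottom": 8, "middle": 32, "top": 128}

# Output code rate = input sample rate / hop factor.
code_rates = {level: SAMPLE_RATE_HZ / hop for level, hop in HOPS.items()}

for level, rate in code_rates.items():
    print(f"{level:>6}: {rate:8.1f} codes/s")
# bottom: 5512.5 (~5.5 kHz), middle: 1378.1 (~1.4 kHz), top: 344.5 (~345 Hz)
```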

Downsampling Block (used by all levels):

  • Strided convolution: $\mathrm{Conv1D}$ with filter width 4, stride 2, and padding 1, channel width $W$ ($W = 64$ for the bottom level, $W = 32$ for middle/top).
  • Residual network: stack of $M$ non-causal, dilated residual convolutional layers:
    • $M = 8$ (bottom), $M = 4$ (middle), $M = 4$ (top).
    • Each block: $\mathrm{Conv1D}(W, W, 3, d, d) \rightarrow \mathrm{ReLU} \rightarrow \mathrm{Conv1D}(W, W, 3, d, d) \rightarrow \mathrm{Add}$, where the dilation $d$ increases across the stack (dilation growth rate 3).
  • Channel-mixing convolution: a final $\mathrm{Conv1D}$ mapping the width-$W$ features to the codebook embedding size $D = 64$.

Each encoder stack consists of multiple such downsampling blocks: 3 (bottom), 5 (middle), and 7 (top), yielding receptive fields of approximately 2 s (bottom), 120 ms (middle), and 480 ms (top) per code.
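The relationship between downsampling-block count and hop factor can be checked directly (per-level settings from this section; the dictionary layout is illustrative, not the Jukebox codebase's own configuration format):

```python
# Per-level encoder configuration as described above.
LEVELS = {
    #          downsampling blocks, residual depth M, channel width W
    "bottom": {"down_blocks": 3, "res_depth": 8, "width": 64},
    "middle": {"down_blocks": 5, "res_depth": 4, "width": 32},
    "top":    {"down_blocks": 7, "res_depth": 4, "width": 32},
}

def hop_factor(down_blocks: int) -> int:
    """Each downsampling block halves the time axis via its stride-2 convolution."""
    return 2 ** down_blocks

hops = {name: hop_factor(cfg["down_blocks"]) for name, cfg in LEVELS.items()}
print(hops)  # {'bottom': 8, 'middle': 32, 'top': 128}
```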

Discrete bottleneck: After the channel-mixing convolution, latent vectors $h_s \in \mathbb{R}^{64}$ are quantized via nearest-neighbor lookup into a codebook $C = \{e_k\}_{k=1}^{2048}$, selecting the embedding with minimum $\ell_2$ distance.
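The nearest-neighbor lookup can be sketched in NumPy; the array names and toy batch size are illustrative, and only the 2048 × 64 codebook shape comes from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 2048, 64                            # codebook size and embedding width
codebook = rng.normal(size=(K, D))         # e_k, one row per codebook entry
latents = rng.normal(size=(10, D))         # h_s, one row per hop (10 hops here)

# Squared L2 distance to every codebook entry, using the expansion
# ||h - e||^2 = ||h||^2 - 2 h·e + ||e||^2 to avoid a (10, 2048, 64) intermediate.
d2 = (
    (latents ** 2).sum(axis=1, keepdims=True)
    - 2.0 * latents @ codebook.T
    + (codebook ** 2).sum(axis=1)
)
codes = d2.argmin(axis=1)      # z_s: one discrete token per hop
quantized = codebook[codes]    # e_{z_s}: embeddings passed to the decoder

print(codes.shape, quantized.shape)  # (10,) (10, 64)
```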

2. Quantization and Objective Functions

The encoder $E$ produces per-hop latent representations $h_s = E(x)_s \in \mathbb{R}^{64}$. Quantization maps each $h_s$ to $e_{z_s}$ with $z_s = \arg\min_k \lVert h_s - e_k \rVert_2$, where $e_k \in \mathbb{R}^{64}$ are the codebook embeddings ($k = 1, \dots, 2048$). Jukebox-5B employs four principal loss terms, summed for each level:

  1. Time-domain reconstruction loss:

    $\mathcal{L}_{\mathrm{recons}} = \frac{1}{T} \sum_t \lVert x_t - \hat{x}_t \rVert_2^2$

  2. Multi-resolution STFT spectral loss:

    $\mathcal{L}_{\mathrm{spec}} = \sum_i \big\lVert \, |\mathrm{STFT}_i(x)| - |\mathrm{STFT}_i(\hat{x})| \, \big\rVert_2$

    with parameter sets $(N_{\mathrm{FFT}}, \mathrm{hop}, \mathrm{window}) \in \{(2048, 240, 1200),\ (1024, 120, 600),\ (512, 50, 240)\}$.

  3. Codebook loss (applied via EMA update, decay $\gamma = 0.99$):

    $\mathcal{L}_{\mathrm{codebook}} = \frac{1}{S} \sum_s \lVert \mathrm{sg}[h_s] - e_{z_s} \rVert_2^2$

  4. Commitment loss:

    $\mathcal{L}_{\mathrm{commit}} = \frac{1}{S} \sum_s \lVert h_s - \mathrm{sg}[e_{z_s}] \rVert_2^2$

The total loss per VQ-VAE level is $\mathcal{L} = \mathcal{L}_{\mathrm{recons}} + \mathcal{L}_{\mathrm{spec}} + \mathcal{L}_{\mathrm{codebook}} + \beta \, \mathcal{L}_{\mathrm{commit}}$, with $\beta = 0.02$; $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator.
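The four loss terms can be sketched numerically in NumPy (a toy illustration, not the training implementation: the minimal STFT helper and all array sizes are simplifying assumptions, and the stop-gradient distinction only matters under automatic differentiation, so it appears here only in comments):

```python
import numpy as np

rng = np.random.default_rng(1)
T, S, D = 4096, 16, 64
x = rng.normal(size=T)                  # input audio segment (toy length)
x_hat = x + 0.01 * rng.normal(size=T)   # decoder reconstruction (toy)
h = rng.normal(size=(S, D))             # encoder latents h_s
e = h + 0.1 * rng.normal(size=(S, D))   # matched codebook vectors e_{z_s}

# 1. Time-domain reconstruction loss (mean squared error).
l_recons = np.mean((x - x_hat) ** 2)

def stft_mag(sig, n_fft, hop, win_len):
    """Magnitude STFT via NumPy's FFT (Hann window; no padding, minimal)."""
    win = np.hanning(win_len)
    frames = [
        np.abs(np.fft.rfft(sig[i : i + win_len] * win, n=n_fft))
        for i in range(0, len(sig) - win_len + 1, hop)
    ]
    return np.stack(frames)

# 2. Multi-resolution spectral loss over the three (n_fft, hop, window) sets.
stft_params = [(2048, 240, 1200), (1024, 120, 600), (512, 50, 240)]
l_spec = sum(
    np.linalg.norm(stft_mag(x, *p) - stft_mag(x_hat, *p)) for p in stft_params
)

# 3./4. Codebook and commitment losses. Under autodiff, sg[.] decides which
# side receives gradients; numerically both evaluate to the same MSE.
l_codebook = np.mean(np.sum((h - e) ** 2, axis=1))  # sg[h_s] vs e_{z_s}
l_commit = np.mean(np.sum((h - e) ** 2, axis=1))    # h_s vs sg[e_{z_s}]

beta = 0.02
total = l_recons + l_spec + l_codebook + beta * l_commit
```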

Codebook embeddings are updated via exponential moving average as in VQ-VAE 2. To mitigate code underutilization, any “dead code” below a usage threshold is reinitialized with a random encoder output.
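A minimal sketch of the EMA update and dead-code restart, assuming a VQ-VAE-2-style running count/sum formulation (the tiny codebook, restart threshold, and function names are illustrative, not Jukebox's actual code):

```python
import numpy as np

rng = np.random.default_rng(2)
K, D, gamma = 8, 4, 0.99       # tiny codebook for illustration (real: 2048 x 64)
codebook = rng.normal(size=(K, D))
ema_count = np.ones(K)         # running usage count per entry
ema_sum = codebook.copy()      # running sum of assigned encoder outputs

def ema_update(latents, codes):
    """One EMA codebook step: decay old statistics, mix in this batch."""
    global codebook
    onehot = np.eye(K)[codes]                                   # (S, K) assignments
    ema_count[:] = gamma * ema_count + (1 - gamma) * onehot.sum(axis=0)
    ema_sum[:] = gamma * ema_sum + (1 - gamma) * onehot.T @ latents
    codebook = ema_sum / ema_count[:, None]

def restart_dead_codes(latents, threshold=1.0):
    """Reinitialize under-used entries with random encoder outputs."""
    dead = ema_count < threshold
    if dead.any():
        codebook[dead] = latents[rng.integers(0, len(latents), dead.sum())]
    return int(dead.sum())

latents = rng.normal(size=(32, D))
codes = rng.integers(0, K, size=32)
ema_update(latents, codes)
n_restarted = restart_dead_codes(latents, threshold=0.5)
```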

3. Data Preparation and Training Regimen

The corpus comprises 1.2 million songs at 44.1 kHz, 32-bit float stereo, randomly downmixed to mono within [–1, 1]. Training segments of ~9 s (393,216 samples) are randomly extracted. Each VQ-VAE is trained with a batch size of 256 using the Adam optimizer (learning rate $3 \times 10^{-4}$, no weight decay) for 384,618 steps (≈3 days on 256 × V100 GPUs). Weight initialization uses a scale of 0.02. All levels share codebook size $K = 2048$ and embedding dimension $D = 64$, with commitment loss coefficient $\beta = 0.02$ and EMA decay $\gamma = 0.99$.
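The segment and token counts implied by these settings can be verified quickly:

```python
SAMPLE_RATE = 44_100
SEGMENT = 393_216              # training segment length in samples

print(SEGMENT / SAMPLE_RATE)   # ~8.92 s per training segment

# Discrete tokens produced per segment at each level:
tokens = {hop: SEGMENT // hop for hop in (8, 32, 128)}
print(tokens)  # {8: 49152, 32: 12288, 128: 3072}
```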

4. Empirical Performance Characteristics

Each VQ-VAE level contains approximately 2 million parameters; codebook memory per level is $2048 \times 64 \times 4 = 524{,}288$ bytes $\approx 0.5$ MB (float32). Reconstruction fidelity, measured as spectral convergence (dB, lower is better), is:

Level    Hop   Spectral convergence (with restarts)
Bottom   8×    –23.0 dB
Middle   32×   –12.4 dB
Top      128×  –8.3 dB

Removing the spectral loss at the top level degrades performance to –6.3 dB. A collapsed single-level hierarchy (VQ-VAE-2 style) worsens spectral convergence by at least 3 dB. In the bottom-level codebook-size ablation, reconstruction fidelity improves with codebook size, and removing quantization entirely (a "continuous" bottleneck) reaches –40.5 dB.
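Spectral convergence is commonly defined as the Frobenius-norm ratio between the reference and reconstructed magnitude spectrograms, expressed in dB; a minimal sketch (the toy spectrogram shapes and noise levels are assumptions, not the paper's evaluation settings):

```python
import numpy as np

def spectral_convergence_db(mag_ref: np.ndarray, mag_est: np.ndarray) -> float:
    """20*log10(||S_ref - S_est||_F / ||S_ref||_F); more negative is better."""
    num = np.linalg.norm(mag_ref - mag_est)
    den = np.linalg.norm(mag_ref)
    return 20.0 * np.log10(num / den)

rng = np.random.default_rng(3)
S_ref = np.abs(rng.normal(size=(100, 513)))            # toy reference spectrogram
S_good = S_ref + 0.01 * rng.normal(size=S_ref.shape)   # close reconstruction
S_bad = S_ref + 0.5 * rng.normal(size=S_ref.shape)     # poor reconstruction

print(spectral_convergence_db(S_ref, S_good))  # strongly negative (~-40 dB)
print(spectral_convergence_db(S_ref, S_bad))   # closer to 0 (~-6 dB)
```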

5. Bottlenecks and Ablation Observations

Large hop factors (32×, 128×) at middle/top levels reduce representation of high-frequency structure. While multi-resolution spectral loss recovers some high-frequency content, it can also introduce perceptible artifacts (“scratchiness”). Codebook utilization is actively maintained via embedding reinitialization to avoid dead code vectors.

Single-hierarchy VQ-VAEs are less effective, as evidenced by notably worse spectral convergence. Increased codebook size at the bottom level significantly improves fidelity. Continuous-valued bottlenecks yield far superior reconstruction but cannot be directly sampled by the subsequent autoregressive transformer.

6. Encoder Hyperparameter Specification

All key architectural parameters are summarized below for reproducibility:

Feature                                        Value / Setting
Sample rate                                    44,100 Hz
Segment length                                 393,216 samples (~9 s)
Residual block width ($W$)                     64 (bottom), 32 (middle/top)
Residual blocks per downsampling ($M$)         8 (bottom), 4 (middle), 4 (top)
Hop lengths                                    8 (bottom), 32 (middle), 128 (top)
Embedding width ($D$)                          64
Codebook size ($K$)                            2048
STFT bins / hop / window                       (2048, 240, 1200), (1024, 120, 600), (512, 50, 240)
Commitment $\beta$, EMA decay $\gamma$         0.02, 0.99
Optimizer, batch size, steps, learning rate    Adam, 256, 384,618, $3 \times 10^{-4}$

A reimplementation intended for parity with Jukebox-5B must adhere exactly to these hyperparameters and block arrangements (Dhariwal et al., 2020).

References

  1. Dhariwal, P., Jun, H., Payne, C., Kim, J. W., Radford, A., & Sutskever, I. (2020). Jukebox: A Generative Model for Music. arXiv:2005.00341.
