Complex-Valued RVQ-VAE Audio Codec
- The paper introduces a novel codec architecture that incorporates complex-valued neural operations and residual vector quantization to preserve intrinsic amplitude-phase relationships.
- It employs a multi-stage RVQ and a complex encoder-decoder design with residual blocks and axial attention, ensuring efficient and robust audio reconstruction.
- Empirical results demonstrate that the codec achieves high-fidelity performance while reducing training steps by nearly an order of magnitude compared to rival models.
A complex-valued residual vector quantization variational autoencoder (RVQ-VAE) audio codec is an end-to-end neural audio coding architecture operating directly on complex-valued spectrograms. Unlike previous frequency-domain neural codecs that represent phase as a separate real-valued channel or ignore it, this approach maintains amplitude-phase coupling throughout the analysis, quantization, and synthesis processes. Eliminating adversarial discriminators and diffusion post-filters, the complex-valued RVQ-VAE achieves high-fidelity, robust phase modeling and exceptional computational efficiency, as exemplified by the EuleroDec system (Cerovaz et al., 24 Jan 2026).
1. Complex-Spectral Representation and Encoder
The codec begins with a complex-valued representation of the input waveform using the short-time Fourier transform (STFT). Given a signal $x[n]$, an analysis window $w$ of length $N$, and hop size $H$, the STFT produces
$X_{f,t} = \sum_{m=0}^{N-1} x[tH + m]\, w[m]\, e^{-i 2\pi f m / N} = M_{f,t} e^{i\phi_{f,t}}, \qquad X \in \mathbb{C}^{F \times T}$
where $M_{f,t}$ is the magnitude and $\phi_{f,t}$ the phase. Treating $X$ as a single complex tensor intrinsically preserves magnitude-phase relationships.
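The complex decomposition above can be illustrated with a minimal NumPy sketch (a naive framed STFT, not the paper's exact front end; the 440 Hz test tone is purely illustrative):

```python
import numpy as np

def stft(x, n_fft=512, hop=64):
    """Complex STFT via a Hann window; a minimal sketch of the analysis stage."""
    w = np.hanning(n_fft)
    starts = range(0, len(x) - n_fft + 1, hop)
    # each column is one frame; result is C^{F x T} with F = n_fft//2 + 1
    return np.stack([np.fft.rfft(x[s:s + n_fft] * w) for s in starts], axis=1)

x = np.sin(2 * np.pi * 440 * np.arange(24000) / 24000)  # 1 s of 440 Hz at 24 kHz
X = stft(x)
M, phi = np.abs(X), np.angle(X)  # magnitude and phase, jointly carried by X
```

The identity $X = M e^{i\phi}$ holds exactly for every time-frequency bin, which is what the complex tensor representation preserves end to end.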
A complex encoder maps to a lower-resolution, multi-channel complex feature map:
$Z = E_\theta(X) \in \mathbb{C}^{C \times F_r \times T_r}$
The encoder comprises five dilated complex residual blocks with progressively increasing dilations, followed by a complex convolution. Progressive downsampling is carried out in four stages, with anisotropic strides (2,2), (2,1), (2,2), (2,2) and channel widths 48, 64, 96, 128.
All encoder layers operate in the complex domain, including:
- Complex convolutional layers
- RMS or batch normalization with joint whitening over real/imaginary components
- modReLU or complex GELU activations
- Axial self-attention over frequency or time axis
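Of these components, the modReLU activation is the most distinctive: it thresholds the magnitude while leaving the phase untouched. A minimal NumPy sketch (the bias `b` is a learned per-channel parameter in practice; the value here is illustrative):

```python
import numpy as np

def modrelu(z, b=-0.1):
    """modReLU: f(z) = ReLU(|z| + b) * z / |z|. Phase-preserving by construction."""
    mag = np.abs(z)
    scale = np.maximum(mag + b, 0.0) / np.maximum(mag, 1e-9)
    return scale * z

z = np.array([0.05 + 0.05j, 1.0 + 1.0j])
out = modrelu(z)  # small-magnitude entries are zeroed; large ones keep their angle
```

Because the nonlinearity acts only on the modulus, the amplitude-phase coupling that the codec is designed to preserve survives every activation.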
2. Residual Vector Quantization (RVQ)
Following encoding, the feature map is collapsed along the frequency axis and projected linearly:
$z_e \in \mathbb{C}^{B \times C \times F_r \times T_r} \longrightarrow z_e^\flat \in \mathbb{C}^{B \times D \times T_r}, \quad D = C \cdot F_r$
This flattening is followed by a learned linear projection over the complex features.
An $S$-stage RVQ is applied with codebooks $\{e^{(k)}\}_{k=1}^{S}$, each containing $K$ entries $e_j^{(k)} \in \mathbb{C}^D$. For each time frame $n$, the quantization proceeds iteratively:
- Initial residual: $r^{(0)} = z_e^\flat[:, n]$
- For each stage $k = 1, \dots, S$: select $a_n^{(k)} = \arg\min_j \|r^{(k-1)} - e_j^{(k)}\|^2$ and update $r^{(k)} = r^{(k-1)} - e_{a_n^{(k)}}^{(k)}$
- The quantized vector is
$z_{q,n} = \sum_{k=1}^S e_{a_n^{(k)}}^{(k)} \in \C^D$
After inverting the projection and un-flattening, this yields $z_q \in \mathbb{C}^{B\times C\times F_r\times T_r}$.
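The iterative stage loop above can be sketched in NumPy; the values of `S`, `K`, and `D` here are illustrative (the paper's codebook size is not reproduced), and real training would use learned rather than random codebooks:

```python
import numpy as np

rng = np.random.default_rng(0)
S, K, D = 12, 64, 128  # stages, codebook size (illustrative), flattened dimension
codebooks = rng.standard_normal((S, K, D)) + 1j * rng.standard_normal((S, K, D))

def rvq(z, codebooks):
    """Residual VQ over one complex frame: each stage quantizes the previous residual."""
    r, z_q, codes = z, np.zeros_like(z), []
    for cb in codebooks:
        k = int(np.argmin(np.sum(np.abs(r[None, :] - cb) ** 2, axis=1)))  # nearest entry
        z_q, r = z_q + cb[k], r - cb[k]   # accumulate quantized part, shrink residual
        codes.append(k)
    return z_q, codes

z = rng.standard_normal(D) + 1j * rng.standard_normal(D)
z_q, codes = rvq(z, codebooks)  # z_q is the sum of the S selected entries
```

The transmitted representation per frame is just the `S` code indices; the decoder rebuilds `z_q` by summing the corresponding codebook entries.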
The codebooks are trained using:
- A commitment loss $\mathcal{L}_{\mathrm{cmt}} = \|z_e^\flat - \mathrm{sg}[z_q]\|_2^2$ with stop-gradient operator $\mathrm{sg}[\cdot]$,
- Codebook updates via an exponential moving average (EMA) of assignment counts and feature sums, with the decay rate increasing over epochs. Infrequently used ("dead") codebook entries are re-initialized from a random current mini-batch embedding plus small complex Gaussian noise.
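The EMA update with dead-code refresh can be sketched as follows; the decay `gamma`, the usage threshold, and the noise scale are illustrative constants, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(1)
K, D, gamma, dead_thresh = 8, 4, 0.99, 1e-3  # illustrative hyperparameters

codebook = rng.standard_normal((K, D)) + 1j * rng.standard_normal((K, D))
ema_count = np.ones(K)
ema_sum = codebook.copy()

def ema_update(batch, codes):
    """EMA codebook update: track assignment counts and feature sums, then
    refresh rarely used entries from the current mini-batch plus small noise."""
    global codebook
    counts = np.bincount(codes, minlength=K).astype(float)
    sums = np.zeros_like(codebook)
    np.add.at(sums, codes, batch)                      # per-code feature sums
    ema_count[:] = gamma * ema_count + (1 - gamma) * counts
    ema_sum[:] = gamma * ema_sum + (1 - gamma) * sums
    codebook = ema_sum / np.maximum(ema_count, 1e-9)[:, None]
    dead = ema_count < dead_thresh                     # "dead" codes
    n_dead = int(dead.sum())
    if n_dead:
        picks = batch[rng.integers(0, len(batch), n_dead)]
        noise = 0.01 * (rng.standard_normal((n_dead, D))
                        + 1j * rng.standard_normal((n_dead, D)))
        codebook[dead] = picks + noise
        ema_count[dead], ema_sum[dead] = 1.0, codebook[dead]

batch = rng.standard_normal((32, D)) + 1j * rng.standard_normal((32, D))
codes = rng.integers(0, K, 32)
ema_update(batch, codes)
```

Keeping the counts and sums as separate EMAs (rather than averaging embeddings directly) is what lets rarely-hit entries decay toward the refresh threshold.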
3. Decoder, Loss Functions, and Reconstruction
The decoder mirrors the encoder structure, using transposed complex convolutions (mirroring encoder downsampling strides), complex residual blocks, axial attention, and a final complex convolution to produce a reconstructed spectrogram $\hat X \in \C^{F \times T}$. Inversion of the STFT (overlap-add synthesis) yields the time-domain waveform $\hat x[n]$.
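The overlap-add inversion can be sketched in NumPy alongside a matching naive STFT; this is a self-contained illustration of the synthesis path, not the paper's exact implementation:

```python
import numpy as np

def stft(x, n_fft=512, hop=64):
    w = np.hanning(n_fft)
    starts = range(0, len(x) - n_fft + 1, hop)
    return np.stack([np.fft.rfft(x[s:s + n_fft] * w) for s in starts], axis=1)

def istft(X, n_fft=512, hop=64):
    """Overlap-add synthesis with window-square normalization; a minimal sketch."""
    w = np.hanning(n_fft)
    T = X.shape[1]
    out = np.zeros((T - 1) * hop + n_fft)
    norm = np.zeros_like(out)
    for t in range(T):
        frame = np.fft.irfft(X[:, t], n=n_fft)  # recovers the windowed segment
        out[t * hop:t * hop + n_fft] += frame * w
        norm[t * hop:t * hop + n_fft] += w ** 2
    return out / np.maximum(norm, 1e-9)

x = np.random.default_rng(0).standard_normal(24000)
x_hat = istft(stft(x))  # interior samples reconstruct x up to float precision
```

Edges lack full window overlap, so reconstruction is exact only where enough frames cover each sample; streaming implementations handle the boundaries with latency buffers.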
The training objective combines multiple loss terms:
- Waveform loss $\mathcal{L}_{\mathrm{wav}}$ between $x$ and $\hat x$
- Spectrogram magnitude loss $\mathcal{L}_{\mathrm{mag}}$ between $|X|$ and $|\hat X|$
- Multi-resolution mel loss $\mathcal{L}_{\mathrm{mel}}$, complex spectral convergence $\mathcal{L}_{\mathrm{sc}}$, and an optional group-delay distortion (GDD) term $\mathcal{L}_{\mathrm{gdd}}$
- Commitment loss $\mathcal{L}_{\mathrm{cmt}}$ as above
The full objective is the weighted sum
$\mathcal{L} = \lambda_{\mathrm{wav}}\mathcal{L}_{\mathrm{wav}} + \lambda_{\mathrm{mag}}\mathcal{L}_{\mathrm{mag}} + \lambda_{\mathrm{mel}}\mathcal{L}_{\mathrm{mel}} + \lambda_{\mathrm{sc}}\mathcal{L}_{\mathrm{sc}} + \lambda_{\mathrm{gdd}}\mathcal{L}_{\mathrm{gdd}} + \beta\,\mathcal{L}_{\mathrm{cmt}}$
with empirically tuned per-term weights.
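A sketch of the composite objective, assuming L1 for the waveform term and L2 for the magnitude term (the paper's exact norms and weights are not reproduced), and omitting the multi-resolution mel and group-delay terms for brevity:

```python
import numpy as np

def combined_loss(x, x_hat, X, X_hat, z_e, z_q, lam):
    """Weighted sum of codec loss terms; norm choices and weights are illustrative."""
    l_wav = np.mean(np.abs(x - x_hat))                     # waveform L1
    l_mag = np.mean((np.abs(X) - np.abs(X_hat)) ** 2)      # magnitude L2
    l_sc = np.linalg.norm(X - X_hat) / np.linalg.norm(X)   # complex spectral convergence
    l_cmt = np.mean(np.abs(z_e - z_q) ** 2)                # commitment (sg[z_q] in training)
    return (lam["wav"] * l_wav + lam["mag"] * l_mag
            + lam["sc"] * l_sc + lam["cmt"] * l_cmt)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
x_hat = x + 0.01 * rng.standard_normal(1024)
X, X_hat = np.fft.rfft(x), np.fft.rfft(x_hat)
z = rng.standard_normal(16) + 1j * rng.standard_normal(16)
lam = {"wav": 1.0, "mag": 1.0, "sc": 1.0, "cmt": 0.25}
loss = combined_loss(x, x_hat, X, X_hat, z, z, lam)
```

Note that the spectral-convergence term operates on the full complex difference $X - \hat X$, so it penalizes phase errors that a magnitude-only loss would miss.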
Crucially, no adversarial discriminators (GANs) or diffusion post-filters are needed to achieve high perceptual and phase fidelity.
4. Architectural and Training Hyperparameters
The STFT uses a 512-sample window, hop size 64, and a Hann window at a 24 kHz sampling rate. Encoder details:
- Five complex residual blocks with progressively increasing dilations
- A complex projection layer
- Downsampling: 4 stages, channels=[48,64,96,128], kernels [(6,6),(6,1),(4,4),(4,4)], strides [(2,2),(2,1),(2,2),(2,2)], paddings [(2,2),(2,0),(1,1),(1,1)]
- Axial attention and feed-forward module prior to quantization
The vector quantizer uses $S = 12$ stages (one codebook per stage) over the flattened dimension $D = C \cdot F_r$.
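These hyperparameters imply the bitrate via frames-per-second × stages × bits-per-code. A hedged back-of-the-envelope check, assuming a codebook size of $K = 1024$ (not stated in the excerpt) and that the second stride component of each stage acts on the time axis (2·1·2·2 = 8× temporal downsampling):

```python
import math

# Hypothetical bitrate accounting; K and the time-downsampling factor are assumptions.
sr, hop, t_down = 24_000, 64, 8        # sample rate, STFT hop, temporal downsampling
frames_per_s = sr / hop / t_down       # quantized frames per second (~46.9)
S, K = 12, 1024                        # 12 RVQ stages; K = 1024 assumed for illustration
bitrate = frames_per_s * S * math.log2(K)  # bits per second
```

Under these assumptions the rate lands at about 5.6 kbps, close to the 6 kbps operating point; the actual codebook size and downsampling layout in the paper may differ.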
Decoder structure mirrors the encoder. Codebooks are warmed up with 30 steps of random centroid sampling plus small noise. Dead-code refresh uses a usage threshold on the EMA counts and small complex Gaussian noise.
Optimizer: AdamW with a linearly warmed-up, cosine-decayed learning rate. Training uses batch size 16 on LibriTTS (100 h), with 35k steps (6 kbps) or 41k steps (12 kbps), approximately an order of magnitude fewer than GAN-based baselines. Training runs on a single NVIDIA RTX 3090.
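The warm-up-then-cosine schedule can be sketched as follows; the base rate, warm-up length, and decay floor are illustrative, since the excerpt does not state them:

```python
import math

def lr_schedule(step, base_lr=2e-4, warmup=1000, total=35_000, floor_frac=0.01):
    """Linear warm-up to base_lr, then cosine decay to floor_frac * base_lr.
    All constants here are illustrative, not the paper's values."""
    if step < warmup:
        return base_lr * step / warmup
    t = (step - warmup) / (total - warmup)       # progress in [0, 1]
    cos = 0.5 * (1 + math.cos(math.pi * t))      # 1 -> 0 over the decay phase
    return base_lr * (floor_frac + (1 - floor_frac) * cos)
```

Such a schedule pairs naturally with the EMA codebook warm-up: both avoid committing the model to early, noisy statistics.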
5. Empirical Results and Ablations
Performance metrics include SI-SDR, PESQ, ESTOI, and GDD. On out-of-domain LibriTTS-other at 6 kbps:
| Model | Iters | SI-SDR ↑ | PESQ ↑ | GDD ↓ | ESTOI ↑ |
|---|---|---|---|---|---|
| EuleroDec | 35 k | 7.58 | 2.16 | 270 | 0.742 |
| APCodec | 700 k | 0.35 | 1.91 | 596 | 0.769 |
| EnCodec | 500 k | 5.59 | 2.69 | 604 | 0.861 |
EuleroDec matches or surpasses these baselines while requiring up to 20 times fewer training steps and using neither adversarial nor diffusion modules. Consistent trends are observed at higher bitrates (12 kbps) and in in-domain evaluations.
Ablations yield the following insights:
- Removing time-axial attention reduces parameter count from 2.35M to 2.07M and slightly degrades metrics: SI-SDR 7.58→7.52, PESQ 2.16→2.05, STOI 0.74→0.72.
- Replacing complex autoencoders (AE) with split real-valued AEs: complex AE (22-d hidden) achieves LSD 0.49, PESQ 3.48; split real AE (36-d) yields LSD 0.72, PESQ 2.06. A capacity-matched “extra complex” AE is intermediate. This confirms the efficacy of truly complex-valued layers.
6. Implementation, Stability, and Efficiency
Optimal codec behavior relies on careful codebook management: warmup of EMA decay and centroid initialization prevent premature code clustering, while dead-code refresh maintains high codebook utilization. Complex layers’ phase equivariance and covariance normalization accelerate learning and generalization, negating the need for adversarial losses.
The architecture combines a compact 2.35 M-parameter network with 12 compact codebooks, attaining convergence in 35–41k steps, versus 500–700k for contemporary VQ-GANs.
A typical implementation pipeline is as follows:
- Compute complex STFT of input waveform
- Encode spectrogram via complex-valued neural network (CVNN) with residual blocks, modReLU, complex BatchNorm, and axial attention
- Collapse spatial dimensions and project to a $D$-dimensional complex vector per frame
- Apply $S$-stage complex RVQ with commitment loss and EMA codebook updates
- Decode via transposed complex layers to output spectrogram
- Invert STFT, compute combined loss, and optimize with AdamW
This approach yields streaming-capable, high-fidelity, phase-coherent neural audio coding without adversarial or diffusion-based post-processors (Cerovaz et al., 24 Jan 2026).
7. Significance and Context
Preserving complex-valued amplitude-phase coupling across all stages overcomes major limitations of previous spectral-domain codecs, which rely on real-valued encodings. This design eliminates the need for adversarial or diffusion-based components, markedly reduces training cost, and maintains or exceeds the fidelity, phase-coherence, and generalization of much larger and longer-trained baselines. The empirical results and ablation studies underscore the importance of true complex-valued modeling and structured RVQ in efficient, robust neural audio compression for both music and speech domains (Cerovaz et al., 24 Jan 2026).