
Complex-Valued RVQ-VAE Audio Codec

Updated 28 January 2026
  • The paper introduces a novel codec architecture that incorporates complex-valued neural operations and residual vector quantization to preserve intrinsic amplitude-phase relationships.
  • It employs a multi-stage RVQ and a complex encoder-decoder design with residual blocks and axial attention, ensuring efficient and robust audio reconstruction.
  • Empirical results demonstrate that the codec achieves high-fidelity performance while reducing training steps by nearly an order of magnitude compared to rival models.

A complex-valued residual vector quantization variational autoencoder (RVQ-VAE) audio codec is an end-to-end neural audio coding architecture operating directly on complex-valued spectrograms. Unlike previous frequency-domain neural codecs that represent phase as a separate real-valued channel or ignore it, this approach maintains amplitude-phase coupling throughout the analysis, quantization, and synthesis processes. By eliminating adversarial discriminators and diffusion post-filters, the complex-valued RVQ-VAE achieves high fidelity, robust phase modeling, and exceptional computational efficiency, as exemplified by the EuleroDec system (Cerovaz et al., 24 Jan 2026).

1. Complex-Spectral Representation and Encoder

The codec begins with a complex-valued representation of the input waveform via the short-time Fourier transform (STFT). Given a signal $x[n]$, hop size $H$, and window $w[m]$ of length $N$, the STFT produces

$X_{f,t} = \sum_{m=0}^{N-1} x[tH + m]\, w[m]\, e^{-i 2\pi f m / N} = M_{f,t}\, e^{i\phi_{f,t}}, \qquad X \in \mathbb{C}^{F \times T}$

where $M_{f,t} = |X_{f,t}|$ is the magnitude and $\phi_{f,t} = \arg X_{f,t}$ is the phase. By treating $X$ as a single complex tensor, the model intrinsically preserves magnitude-phase relationships.
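As a concrete illustration, the analysis step can be sketched in a few lines of NumPy. The framing parameters ($N = 512$, hop 64, Hann window, 24 kHz input) follow the hyperparameters reported later in this article; the function name and test signal are illustrative.

```python
import numpy as np

def stft(x, n_fft=512, hop=64):
    """Framed one-sided complex STFT with a Hann window (minimal sketch)."""
    w = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[t * hop : t * hop + n_fft] * w for t in range(n_frames)])
    return np.fft.rfft(frames, axis=-1).T  # shape (F, T) with F = n_fft // 2 + 1

x = np.random.default_rng(0).standard_normal(24000)  # 1 s of noise at 24 kHz
X = stft(x)
M, phi = np.abs(X), np.angle(X)  # magnitude and phase
# X factors exactly as M * exp(i * phi) -- the coupling the codec preserves
```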

A complex encoder $E_\theta$ maps $X$ to a lower-resolution, multi-channel complex feature map:

$Z = E_\theta(X) \in \mathbb{C}^{C \times F_r \times T_r}$

The encoder comprises five dilated complex residual blocks (dilations $(1,1)$, $(3,3)$, $(3,5)$, $(3,7)$, $(1,1)$), followed by a $3 \times 7$ complex convolution. Progressive downsampling is carried out in four stages, with anisotropic strides $(2,2)$, $(2,1)$, $(2,2)$, $(2,2)$ and channel widths $[48, 64, 96, 128]$.

All encoder layers operate in the complex domain, including:

  • Complex convolutional layers
  • RMS or batch normalization with $2 \times 2$ joint whitening over real/imaginary components
  • modReLU or complex GELU activations
  • Axial self-attention over frequency or time axis
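Of these, modReLU is the simplest to illustrate: it thresholds the magnitude while leaving the phase untouched, which is why it suits phase-preserving complex networks. A minimal NumPy sketch (in the model the bias is a learned per-channel parameter; a scalar stands in here):

```python
import numpy as np

def modrelu(z, bias=-0.1, eps=1e-8):
    """modReLU: ReLU-style gate on |z| + bias, with the phase z/|z| preserved."""
    mag = np.abs(z)
    return np.maximum(mag + bias, 0.0) * z / (mag + eps)

z = np.array([0.05 + 0.05j, 1.0 + 1.0j])
out = modrelu(z)
# The low-magnitude entry is zeroed; the other keeps its 45-degree phase
```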

2. Residual Vector Quantization (RVQ)

Following encoding, the feature map is collapsed along the frequency axis and projected linearly:

$z_e \in \mathbb{C}^{B \times C \times F_r \times T_r} \longrightarrow z_e^\flat \in \mathbb{C}^{B \times D \times T_r}, \quad D = C \cdot F_r$

This is performed by a learned projection $z_e^\flat = W_{\mathrm{in}} z_e$.

An $S$-stage RVQ is applied with $S$ codebooks $\{\mathcal{E}^{(k)}\}_{k=1}^{S}$ (each of size $K$). For each time frame $n$, quantization proceeds iteratively:

  • Initial residual: $R_n^{(0)} = z_{e,n}$
  • For each stage $k = 1, \dots, S$:

$a_n^{(k)} = \arg\min_{i=1,\dots,K} \big\| R_n^{(k-1)} - e_i^{(k)} \big\|_2^2; \qquad R_n^{(k)} = R_n^{(k-1)} - e_{a_n^{(k)}}^{(k)}$

  • The quantized vector is

$z_{q,n} = \sum_{k=1}^{S} e_{a_n^{(k)}}^{(k)} \in \mathbb{C}^D$

After inverting $W_{\mathrm{in}}$ and un-flattening, this yields $z_q \in \mathbb{C}^{B \times C \times F_r \times T_r}$.
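The stage-wise argmin-and-subtract loop can be sketched directly over complex vectors. Sizes below are toy values (the paper uses $S = 12$, $K = 2048$), and the random codebooks are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
S, K, D = 4, 16, 8  # stages, codebook size, dim (toy; paper: S=12, K=2048)
codebooks = [rng.standard_normal((K, D)) + 1j * rng.standard_normal((K, D))
             for _ in range(S)]

def rvq_encode(z):
    """Greedy residual VQ: at each stage, quantize what remains of the residual."""
    residual, codes, z_q = z.copy(), [], np.zeros_like(z)
    for cb in codebooks:
        # complex squared Euclidean distance to every codeword
        d = np.sum(np.abs(residual[None, :] - cb) ** 2, axis=-1)
        i = int(np.argmin(d))
        codes.append(i)
        z_q += cb[i]
        residual -= cb[i]
    return codes, z_q

z = rng.standard_normal(D) + 1j * rng.standard_normal(D)
codes, z_q = rvq_encode(z)
# z_q is the sum of one selected codeword per stage, as in the equation above
```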

The codebooks are trained using:

  • A commitment loss with a stop-gradient operator,

$\mathcal{L}_{\mathrm{commit}} = \beta \frac{1}{N} \sum_{n=1}^{N} \big\Vert z_{e,n} - \mathrm{sg}\big[z_{q,n}\big] \big\Vert_2^2$

  • Codebook updates via EMA of assignments and feature sums, with decay increasing over epochs. Infrequently used ("dead") codebook entries are re-initialized from a random current mini-batch embedding with small complex Gaussian noise $\mathcal{CN}(0, \sigma^2 I)$.
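A minimal sketch of the EMA update and dead-code refresh, with toy sizes; the refresh criterion is simplified to a usage-count threshold (the paper instead uses a refresh probability $p_{\mathrm{refresh}}$):

```python
import numpy as np

class EMACodebook:
    """EMA-updated complex codebook with dead-code refresh (sketch, toy sizes)."""

    def __init__(self, K=8, D=4, gamma=0.99, seed=0):
        self.rng = np.random.default_rng(seed)
        self.gamma, self.K, self.D = gamma, K, D
        self.codebook = (self.rng.standard_normal((K, D))
                         + 1j * self.rng.standard_normal((K, D)))
        self.count = np.ones(K)            # EMA of assignment counts
        self.accum = self.codebook.copy()  # EMA of assigned-feature sums

    def update(self, assignments, features):
        """Decay old statistics, fold in this batch, then recompute centroids."""
        counts = np.bincount(assignments, minlength=self.K)
        sums = np.zeros((self.K, self.D), dtype=complex)
        np.add.at(sums, assignments, features)
        self.count = self.gamma * self.count + (1 - self.gamma) * counts
        self.accum = self.gamma * self.accum + (1 - self.gamma) * sums
        self.codebook = self.accum / self.count[:, None]

    def refresh_dead(self, features, threshold=0.5, sigma=1e-3):
        """Re-seed rarely used entries from random batch embeddings plus noise."""
        dead = self.count < threshold
        n = int(dead.sum())
        if n:
            picks = features[self.rng.integers(0, len(features), n)]
            noise = sigma * (self.rng.standard_normal((n, self.D))
                             + 1j * self.rng.standard_normal((n, self.D)))
            self.codebook[dead] = picks + noise

cb = EMACodebook()
feats = (np.random.default_rng(1).standard_normal((32, 4))
         + 1j * np.random.default_rng(2).standard_normal((32, 4)))
cb.update(np.zeros(32, dtype=int), feats)  # all frames assigned to code 0
cb.refresh_dead(feats)
```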

3. Decoder, Loss Functions, and Reconstruction

The decoder $D_\theta$ mirrors the encoder structure, using transposed complex convolutions (mirroring the encoder's downsampling strides), complex residual blocks, axial attention, and a final $3 \times 7$ complex convolution to produce a reconstructed spectrogram $\hat X \in \mathbb{C}^{F \times T}$. Inverting the STFT (overlap-add synthesis) yields the time-domain waveform $\hat x[n]$.
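The analysis/synthesis pair can be sketched in NumPy to confirm that windowed overlap-add with squared-window normalization inverts the STFT exactly away from the signal edges. Framing parameters follow the paper; the implementation details are illustrative:

```python
import numpy as np

def stft(x, n_fft=512, hop=64):
    """Framed one-sided complex STFT with a Hann window."""
    w = np.hanning(n_fft)
    T = 1 + (len(x) - n_fft) // hop
    return np.stack([np.fft.rfft(x[t * hop : t * hop + n_fft] * w)
                     for t in range(T)]).T

def istft(X, n_fft=512, hop=64):
    """Windowed overlap-add synthesis with squared-window normalization."""
    w = np.hanning(n_fft)
    T = X.shape[1]
    out = np.zeros(n_fft + hop * (T - 1))
    wsum = np.zeros_like(out)
    frames = np.fft.irfft(X.T, n=n_fft, axis=-1)
    for t in range(T):
        out[t * hop : t * hop + n_fft] += frames[t] * w
        wsum[t * hop : t * hop + n_fft] += w ** 2
    return out / np.maximum(wsum, 1e-8)

x = np.random.default_rng(0).standard_normal(8192)
y = istft(stft(x))
# Away from the edges the analysis/synthesis pair reconstructs x numerically
```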

The training objective combines multiple loss terms:

  • Waveform $L_1$ loss:

$\mathcal{L}_{\mathrm{rec}} = \Vert x - \hat x \Vert_1$

  • Spectrogram-magnitude $L_1$ loss:

$\mathcal{L}_{\mathrm{STFT}} = \big\Vert |\mathrm{STFT}(x)| - |\mathrm{STFT}(\hat x)| \big\Vert_1$

  • Multi-resolution mel $L_1$, complex spectral convergence, and optional group-delay distortion (GDD)
  • Commitment loss as above

The full objective is

$\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{STFT}}\, \mathcal{L}_{\mathrm{STFT}} + \cdots + \lambda_{\mathrm{VQ}}\, \mathcal{L}_{\mathrm{commit}}$

Empirical weights are typically $\lambda_{\mathrm{mel}} = 80$, $\lambda_{\mathrm{cplx}} = 80$, $\lambda_{\mathrm{mrs}} = 50$, $\lambda_{\mathrm{VQ}} = 0.1$, and $\beta = 0.05$.
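A reduced version of the objective can be sketched as follows. A single FFT stands in for the multi-resolution STFT, the mel, complex spectral-convergence, and GDD terms are omitted, and the inputs are illustrative:

```python
import numpy as np

def codec_loss(x, x_hat, z_e, z_q, lam_stft=1.0, beta=0.05):
    """Reduced combined objective (sketch): waveform L1 + spectral-magnitude L1
    + commitment. The full model adds mel, complex spectral-convergence, and
    GDD terms over multiple STFT resolutions."""
    l_rec = np.mean(np.abs(x - x_hat))                 # waveform L1
    mag, mag_hat = np.abs(np.fft.rfft(x)), np.abs(np.fft.rfft(x_hat))
    l_stft = np.mean(np.abs(mag - mag_hat))            # magnitude L1
    l_commit = beta * np.mean(np.abs(z_e - z_q) ** 2)  # commitment (encoder side)
    return l_rec + lam_stft * l_stft + l_commit

x = np.sin(np.linspace(0, 20 * np.pi, 1024))
z = np.ones(16, dtype=complex)
# Perfect reconstruction with matched codes drives every term to zero
```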

Crucially, no adversarial discriminators (GANs) or diffusion post-filters are needed to achieve high perceptual and phase fidelity.

4. Architectural and Training Hyperparameters

The STFT uses $N = 512$ (window length 512, hop 64, Hann window) at 24 kHz. Encoder details:

  • 5 complex residual blocks with dilations $(1,1)$, $(3,3)$, $(3,5)$, $(3,7)$, $(1,1)$
  • Complex $3 \times 7$ projection
  • Downsampling: 4 stages, channels=[48,64,96,128], kernels [(6,6),(6,1),(4,4),(4,4)], strides [(2,2),(2,1),(2,2),(2,2)], paddings [(2,2),(2,0),(1,1),(1,1)]
  • Axial attention and feed-forward module prior to quantization
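Standard convolution arithmetic, $\mathrm{out} = \lfloor (n + 2p - k)/s \rfloor + 1$, lets one check the feature-map size implied by these kernels, strides, and paddings. The 257 frequency bins follow from $N = 512$; the frame count 368 is illustrative:

```python
def conv_out(n, k, s, p):
    """Standard conv output length: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

F, T = 257, 368  # rfft bins for n_fft=512; frame count is illustrative
kernels  = [(6, 6), (6, 1), (4, 4), (4, 4)]
strides  = [(2, 2), (2, 1), (2, 2), (2, 2)]
paddings = [(2, 2), (2, 0), (1, 1), (1, 1)]
for (kf, kt), (sf, st), (pf, pt) in zip(kernels, strides, paddings):
    F, T = conv_out(F, kf, sf, pf), conv_out(T, kt, st, pt)
# The frequency axis shrinks 257 -> 16 and the time axis by 8x overall
```

With $F_r = 16$ this is consistent with the flattened dimension $D = 128 \cdot F_r$ quoted for the quantizer.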

The vector quantizer uses $S = 12$ stages, $K = 2048$ entries per codebook, and flattened dimension $D = 128 \cdot F_r$.

Decoder structure mirrors the encoder. Codebooks are warmed up with 30 steps of random centroid sampling plus small noise. Dead-code refresh uses $p_{\mathrm{refresh}} = 0.015$ and $\sigma = 10^{-3}$.

Optimizer: AdamW with $(\beta_1 = 0.9, \beta_2 = 0.99, \mathrm{wd} = 7 \cdot 10^{-4})$ and learning rate $3 \times 10^{-4}$ (linear warm-up, cosine decay to $10^{-2}$ of peak). Training uses batch size 16 on LibriTTS-100h, with $\sim$35k steps (6 kbps) or $\sim$41k steps (12 kbps), approximately an order of magnitude fewer than GAN-based baselines. Runtime is $\mathrm{RTF} \approx 0.344$ on an NVIDIA RTX 3090.
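The warm-up-plus-cosine schedule can be sketched as below. The warm-up length is an assumption (the source gives the schedule shape and floor, not its knee point):

```python
import math

def lr_schedule(step, total_steps=35_000, warmup=1_000, peak=3e-4, floor_frac=1e-2):
    """Linear warm-up, then cosine decay to peak * floor_frac.
    (warmup=1000 is an assumed value, not stated in the paper.)"""
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    cos = 0.5 * (1 + math.cos(math.pi * t))
    return peak * (floor_frac + (1 - floor_frac) * cos)

# Ramps 0 -> 3e-4 over warm-up, then decays to 3e-6 at the final step
```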

5. Empirical Results and Ablations

Performance metrics include SI-SDR, PESQ, ESTOI, and GDD. On out-of-domain LibriTTS-other at 6 kbps:

| Model | Iters | SI-SDR ↑ | PESQ ↑ | GDD ↓ | ESTOI ↑ |
|---|---|---|---|---|---|
| EuleroDec | 35k | 7.58 | 2.16 | 270 | 0.742 |
| APCodec | 700k | 0.35 | 1.91 | 596 | 0.769 |
| EnCodec | 500k | 5.59 | 2.69 | 604 | 0.861 |

EuleroDec leads these baselines on SI-SDR and GDD and remains competitive on PESQ and ESTOI, while requiring up to 20 times fewer training steps and using neither adversarial nor diffusion modules. Consistent trends are observed at higher bitrates (12 kbps) and in in-domain evaluations.

Ablations yield the following insights:

  • Removing time-axial attention reduces parameter count from 2.35M to 2.07M and slightly degrades metrics: SI-SDR 7.58→7.52, PESQ 2.16→2.05, STOI 0.74→0.72.
  • Replacing complex autoencoders (AE) with split real-valued AEs: complex AE (22-d hidden) achieves LSD 0.49, PESQ 3.48; split real AE (36-d) yields LSD 0.72, PESQ 2.06. A capacity-matched “extra complex” AE is intermediate. This confirms the efficacy of truly complex-valued layers.

6. Implementation, Stability, and Efficiency

Optimal codec behavior relies on careful codebook management: warm-up of the EMA decay and centroid initialization prevent premature code clustering, while dead-code refresh maintains high codebook utilization. The complex layers' phase equivariance and $2 \times 2$ covariance normalization accelerate learning and generalization, obviating the need for adversarial losses.

The architecture combines a compact $\approx$2.3M-parameter network with 12 small codebooks, attaining convergence in $\sim$35–40k steps, versus 500–700k for contemporary VQ-GANs.

A typical implementation pipeline is as follows:

  • Compute complex STFT of input waveform
  • Encode spectrogram via complex-valued neural network (CVNN) with residual blocks, modReLU, complex BatchNorm, and axial attention
  • Collapse spatial dimensions and project to a $D$-dimensional complex vector per frame
  • Apply $S$-stage complex RVQ with commitment loss and EMA updates
  • Decode via transposed complex layers to output spectrogram
  • Invert STFT, compute combined loss, and optimize with AdamW

This approach yields streaming-capable, high-fidelity, phase-coherent neural audio coding without adversarial or diffusion-based post-processors (Cerovaz et al., 24 Jan 2026).

7. Significance and Context

Preserving complex-valued amplitude-phase coupling across all stages overcomes major limitations of previous spectral-domain codecs, which rely on real-valued encodings. This design eliminates the need for adversarial or diffusion-based components, markedly reduces training cost, and maintains or exceeds the fidelity, phase-coherence, and generalization of much larger and longer-trained baselines. The empirical results and ablation studies underscore the importance of true complex-valued modeling and structured RVQ in efficient, robust neural audio compression for both music and speech domains (Cerovaz et al., 24 Jan 2026).
