Complex-Valued RVQ-VAE Audio Codec
- The paper introduces a novel codec architecture that incorporates complex-valued neural operations and residual vector quantization to preserve intrinsic amplitude-phase relationships.
- It employs a multi-stage RVQ and a complex encoder-decoder design with residual blocks and axial attention, ensuring efficient and robust audio reconstruction.
- Empirical results demonstrate that the codec achieves high-fidelity performance while reducing training steps by nearly an order of magnitude compared to rival models.
A complex-valued residual vector quantization variational autoencoder (RVQ-VAE) audio codec is an end-to-end neural audio coding architecture operating directly on complex-valued spectrograms. Unlike previous frequency-domain neural codecs that represent phase as a separate real-valued channel or ignore it, this approach maintains amplitude-phase coupling throughout the analysis, quantization, and synthesis processes. Eliminating adversarial discriminators and diffusion post-filters, the complex-valued RVQ-VAE achieves high-fidelity, robust phase modeling and exceptional computational efficiency, as exemplified by the EuleroDec system (Cerovaz et al., 24 Jan 2026).
1. Complex-Spectral Representation and Encoder
The codec begins with a complex-valued representation of the input waveform using the short-time Fourier transform (STFT). Given a signal $x[n]$, an analysis window $w$ of length $N$, and hop size $H$, the STFT produces
$X_{f,t} = \sum_{m=0}^{N-1} x[tH + m]\, w[m]\, e^{-i 2\pi f m / N} = M_{f,t} e^{i\phi_{f,t}}, \qquad X \in \mathbb{C}^{F \times T}$
where $M_{f,t}$ is the magnitude and $\phi_{f,t}$ the phase. Treating $X$ as a single complex tensor intrinsically preserves magnitude-phase relationships.
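The complex decomposition above can be illustrated with a minimal NumPy sketch (a naive framed STFT, not the paper's exact front end; the 440 Hz test tone is purely illustrative):

```python
import numpy as np

def stft(x, n_fft=512, hop=64):
    """Complex STFT via a Hann window; a minimal sketch of the analysis stage."""
    w = np.hanning(n_fft)
    starts = range(0, len(x) - n_fft + 1, hop)
    # each column is one frame; result is C^{F x T} with F = n_fft//2 + 1
    return np.stack([np.fft.rfft(x[s:s + n_fft] * w) for s in starts], axis=1)

x = np.sin(2 * np.pi * 440 * np.arange(24000) / 24000)  # 1 s of 440 Hz at 24 kHz
X = stft(x)
M, phi = np.abs(X), np.angle(X)  # magnitude and phase, jointly carried by X
```

The identity $X = M e^{i\phi}$ holds exactly for every time-frequency bin, which is what the complex tensor representation preserves end to end.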
A complex encoder maps to a lower-resolution, multi-channel complex feature map:
$Z = E_\theta(X) \in \mathbb{C}^{C \times F_r \times T_r}$
The encoder comprises five dilated complex residual blocks with progressively increasing dilations, followed by a complex convolution. Progressive downsampling is carried out in four stages, with anisotropic strides (2,2), (2,1), (2,2), (2,2) and channel widths 48, 64, 96, 128.
All encoder layers operate in the complex domain, including:
- Complex convolutional layers
- RMS or batch normalization with joint whitening over real/imaginary components
- modReLU or complex GELU activations
- Axial self-attention over frequency or time axis
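Of these components, the modReLU activation is the most distinctive: it thresholds the magnitude while leaving the phase untouched. A minimal NumPy sketch (the bias `b` is a learned per-channel parameter in practice; the value here is illustrative):

```python
import numpy as np

def modrelu(z, b=-0.1):
    """modReLU: f(z) = ReLU(|z| + b) * z / |z|. Phase-preserving by construction."""
    mag = np.abs(z)
    scale = np.maximum(mag + b, 0.0) / np.maximum(mag, 1e-9)
    return scale * z

z = np.array([0.05 + 0.05j, 1.0 + 1.0j])
out = modrelu(z)  # small-magnitude entries are zeroed; large ones keep their angle
```

Because the nonlinearity acts only on the modulus, the amplitude-phase coupling that the codec is designed to preserve survives every activation.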
2. Residual Vector Quantization (RVQ)
Following encoding, the feature map is collapsed along the frequency axis and projected linearly:
$z_e \in \mathbb{C}^{B \times C \times F_r \times T_r} \longrightarrow z_e^\flat \in \mathbb{C}^{B \times D \times T_r}, \quad D = C \cdot F_r$
This flattening is followed by a learned linear projection over the complex features.
An $S$-stage RVQ is applied with codebooks $\{e^{(k)}\}_{k=1}^{S}$, each containing $K$ entries $e_j^{(k)} \in \mathbb{C}^D$. For each time frame $n$, the quantization proceeds iteratively:
- Initial residual: $r^{(0)} = z_e^\flat[:, n]$
- For each stage $k = 1, \dots, S$: select $a_n^{(k)} = \arg\min_j \|r^{(k-1)} - e_j^{(k)}\|^2$ and update $r^{(k)} = r^{(k-1)} - e_{a_n^{(k)}}^{(k)}$
- The quantized vector is
$z_{q,n} = \sum_{k=1}^S e_{a_n^{(k)}}^{(k)} \in \C^D$
After inverting the projection and un-flattening, this yields $z_q \in \mathbb{C}^{B\times C\times F_r\times T_r}$.
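The iterative stage loop above can be sketched in NumPy; the values of `S`, `K`, and `D` here are illustrative (the paper's codebook size is not reproduced), and real training would use learned rather than random codebooks:

```python
import numpy as np

rng = np.random.default_rng(0)
S, K, D = 12, 64, 128  # stages, codebook size (illustrative), flattened dimension
codebooks = rng.standard_normal((S, K, D)) + 1j * rng.standard_normal((S, K, D))

def rvq(z, codebooks):
    """Residual VQ over one complex frame: each stage quantizes the previous residual."""
    r, z_q, codes = z, np.zeros_like(z), []
    for cb in codebooks:
        k = int(np.argmin(np.sum(np.abs(r[None, :] - cb) ** 2, axis=1)))  # nearest entry
        z_q, r = z_q + cb[k], r - cb[k]   # accumulate quantized part, shrink residual
        codes.append(k)
    return z_q, codes

z = rng.standard_normal(D) + 1j * rng.standard_normal(D)
z_q, codes = rvq(z, codebooks)  # z_q is the sum of the S selected entries
```

The transmitted representation per frame is just the `S` code indices; the decoder rebuilds `z_q` by summing the corresponding codebook entries.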
The codebooks are trained using:
- A commitment loss $\mathcal{L}_{\mathrm{cmt}} = \|z_e^\flat - \mathrm{sg}[z_q]\|_2^2$ with stop-gradient operator $\mathrm{sg}[\cdot]$,
- Codebook updates via an exponential moving average (EMA) of assignment counts and feature sums, with the decay rate increasing over epochs. Infrequently used ("dead") codebook entries are re-initialized from a random current mini-batch embedding plus small complex Gaussian noise.
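The EMA update with dead-code refresh can be sketched as follows; the decay `gamma`, the usage threshold, and the noise scale are illustrative constants, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(1)
K, D, gamma, dead_thresh = 8, 4, 0.99, 1e-3  # illustrative hyperparameters

codebook = rng.standard_normal((K, D)) + 1j * rng.standard_normal((K, D))
ema_count = np.ones(K)
ema_sum = codebook.copy()

def ema_update(batch, codes):
    """EMA codebook update: track assignment counts and feature sums, then
    refresh rarely used entries from the current mini-batch plus small noise."""
    global codebook
    counts = np.bincount(codes, minlength=K).astype(float)
    sums = np.zeros_like(codebook)
    np.add.at(sums, codes, batch)                      # per-code feature sums
    ema_count[:] = gamma * ema_count + (1 - gamma) * counts
    ema_sum[:] = gamma * ema_sum + (1 - gamma) * sums
    codebook = ema_sum / np.maximum(ema_count, 1e-9)[:, None]
    dead = ema_count < dead_thresh                     # "dead" codes
    n_dead = int(dead.sum())
    if n_dead:
        picks = batch[rng.integers(0, len(batch), n_dead)]
        noise = 0.01 * (rng.standard_normal((n_dead, D))
                        + 1j * rng.standard_normal((n_dead, D)))
        codebook[dead] = picks + noise
        ema_count[dead], ema_sum[dead] = 1.0, codebook[dead]

batch = rng.standard_normal((32, D)) + 1j * rng.standard_normal((32, D))
codes = rng.integers(0, K, 32)
ema_update(batch, codes)
```

Keeping the counts and sums as separate EMAs (rather than averaging embeddings directly) is what lets rarely-hit entries decay toward the refresh threshold.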
3. Decoder, Loss Functions, and Reconstruction
The decoder mirrors the encoder structure, using transposed complex convolutions (mirroring encoder downsampling strides), complex residual blocks, axial attention, and a final complex convolution to produce a reconstructed spectrogram $\hat X \in \C^{F \times T}$. Inversion of the STFT (overlap-add synthesis) yields the time-domain waveform $\hat x[n]$.
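The overlap-add inversion can be sketched in NumPy alongside a matching naive STFT; this is a self-contained illustration of the synthesis path, not the paper's exact implementation:

```python
import numpy as np

def stft(x, n_fft=512, hop=64):
    w = np.hanning(n_fft)
    starts = range(0, len(x) - n_fft + 1, hop)
    return np.stack([np.fft.rfft(x[s:s + n_fft] * w) for s in starts], axis=1)

def istft(X, n_fft=512, hop=64):
    """Overlap-add synthesis with window-square normalization; a minimal sketch."""
    w = np.hanning(n_fft)
    T = X.shape[1]
    out = np.zeros((T - 1) * hop + n_fft)
    norm = np.zeros_like(out)
    for t in range(T):
        frame = np.fft.irfft(X[:, t], n=n_fft)  # recovers the windowed segment
        out[t * hop:t * hop + n_fft] += frame * w
        norm[t * hop:t * hop + n_fft] += w ** 2
    return out / np.maximum(norm, 1e-9)

x = np.random.default_rng(0).standard_normal(24000)
x_hat = istft(stft(x))  # interior samples reconstruct x up to float precision
```

Edges lack full window overlap, so reconstruction is exact only where enough frames cover each sample; streaming implementations handle the boundaries with latency buffers.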
The training objective combines multiple loss terms:
- Waveform loss $\mathcal{L}_{\mathrm{wav}}$ between $x$ and $\hat x$
- Spectrogram magnitude loss $\mathcal{L}_{\mathrm{mag}}$ between $|X|$ and $|\hat X|$
- Multi-resolution mel loss $\mathcal{L}_{\mathrm{mel}}$, complex spectral convergence $\mathcal{L}_{\mathrm{sc}}$, and an optional group-delay distortion (GDD) term $\mathcal{L}_{\mathrm{gdd}}$
- Commitment loss $\mathcal{L}_{\mathrm{cmt}}$ as above
The full objective is the weighted sum
$\mathcal{L} = \lambda_{\mathrm{wav}}\mathcal{L}_{\mathrm{wav}} + \lambda_{\mathrm{mag}}\mathcal{L}_{\mathrm{mag}} + \lambda_{\mathrm{mel}}\mathcal{L}_{\mathrm{mel}} + \lambda_{\mathrm{sc}}\mathcal{L}_{\mathrm{sc}} + \lambda_{\mathrm{gdd}}\mathcal{L}_{\mathrm{gdd}} + \beta\,\mathcal{L}_{\mathrm{cmt}}$
with empirically tuned per-term weights.
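A sketch of the composite objective, assuming L1 for the waveform term and L2 for the magnitude term (the paper's exact norms and weights are not reproduced), and omitting the multi-resolution mel and group-delay terms for brevity:

```python
import numpy as np

def combined_loss(x, x_hat, X, X_hat, z_e, z_q, lam):
    """Weighted sum of codec loss terms; norm choices and weights are illustrative."""
    l_wav = np.mean(np.abs(x - x_hat))                     # waveform L1
    l_mag = np.mean((np.abs(X) - np.abs(X_hat)) ** 2)      # magnitude L2
    l_sc = np.linalg.norm(X - X_hat) / np.linalg.norm(X)   # complex spectral convergence
    l_cmt = np.mean(np.abs(z_e - z_q) ** 2)                # commitment (sg[z_q] in training)
    return (lam["wav"] * l_wav + lam["mag"] * l_mag
            + lam["sc"] * l_sc + lam["cmt"] * l_cmt)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
x_hat = x + 0.01 * rng.standard_normal(1024)
X, X_hat = np.fft.rfft(x), np.fft.rfft(x_hat)
z = rng.standard_normal(16) + 1j * rng.standard_normal(16)
lam = {"wav": 1.0, "mag": 1.0, "sc": 1.0, "cmt": 0.25}
loss = combined_loss(x, x_hat, X, X_hat, z, z, lam)
```

Note that the spectral-convergence term operates on the full complex difference $X - \hat X$, so it penalizes phase errors that a magnitude-only loss would miss.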
Crucially, no adversarial discriminators (GANs) or diffusion post-filters are needed to achieve high perceptual and phase fidelity.
4. Architectural and Training Hyperparameters
The STFT uses a 512-sample window, hop size 64, and a Hann window at a 24 kHz sampling rate. Encoder details:
- Five complex residual blocks with progressively increasing dilations
- A complex projection layer
- Downsampling: 4 stages, channels=[48,64,96,128], kernels [(6,6),(6,1),(4,4),(4,4)], strides [(2,2),(2,1),(2,2),(2,2)], paddings [(2,2),(2,0),(1,1),(1,1)]
- Axial attention and feed-forward module prior to quantization
The vector quantizer uses $S = 12$ stages (one codebook per stage) over the flattened dimension $D = C \cdot F_r$.
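These hyperparameters imply the bitrate via frames-per-second × stages × bits-per-code. A hedged back-of-the-envelope check, assuming a codebook size of $K = 1024$ (not stated in the excerpt) and that the second stride component of each stage acts on the time axis (2·1·2·2 = 8× temporal downsampling):

```python
import math

# Hypothetical bitrate accounting; K and the time-downsampling factor are assumptions.
sr, hop, t_down = 24_000, 64, 8        # sample rate, STFT hop, temporal downsampling
frames_per_s = sr / hop / t_down       # quantized frames per second (~46.9)
S, K = 12, 1024                        # 12 RVQ stages; K = 1024 assumed for illustration
bitrate = frames_per_s * S * math.log2(K)  # bits per second
```

Under these assumptions the rate lands at about 5.6 kbps, close to the 6 kbps operating point; the actual codebook size and downsampling layout in the paper may differ.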
Decoder structure mirrors the encoder. Codebooks are warmed up with 30 steps of random centroid sampling plus small noise. Dead-code refresh uses a usage threshold on the EMA counts and small complex Gaussian noise.
Optimizer: AdamW with a linearly warmed-up, cosine-decayed learning rate. Training uses batch size 16 on LibriTTS (100 h), with 35k steps (6 kbps) or 41k steps (12 kbps), approximately an order of magnitude fewer than GAN-based baselines. Training runs on a single NVIDIA RTX 3090.
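The warm-up-then-cosine schedule can be sketched as follows; the base rate, warm-up length, and decay floor are illustrative, since the excerpt does not state them:

```python
import math

def lr_schedule(step, base_lr=2e-4, warmup=1000, total=35_000, floor_frac=0.01):
    """Linear warm-up to base_lr, then cosine decay to floor_frac * base_lr.
    All constants here are illustrative, not the paper's values."""
    if step < warmup:
        return base_lr * step / warmup
    t = (step - warmup) / (total - warmup)       # progress in [0, 1]
    cos = 0.5 * (1 + math.cos(math.pi * t))      # 1 -> 0 over the decay phase
    return base_lr * (floor_frac + (1 - floor_frac) * cos)
```

Such a schedule pairs naturally with the EMA codebook warm-up: both avoid committing the model to early, noisy statistics.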
5. Empirical Results and Ablations
Performance metrics include SI-SDR, PESQ, ESTOI, and GDD. On out-of-domain LibriTTS-other at 6 kbps:
| Model | Iters | SI-SDR ↑ | PESQ ↑ | GDD ↓ | ESTOI ↑ |
|---|---|---|---|---|---|
| EuleroDec | 35 k | 7.58 | 2.16 | 270 | 0.742 |
| APCodec | 700 k | 0.35 | 1.91 | 596 | 0.769 |
| EnCodec | 500 k | 5.59 | 2.69 | 604 | 0.861 |
EuleroDec matches or surpasses these baselines while requiring up to 20 times fewer training steps and using neither adversarial nor diffusion modules. Consistent trends are observed at higher bitrates (12 kbps) and in in-domain evaluations.
Ablations yield the following insights:
- Removing time-axial attention reduces parameter count from 2.35M to 2.07M and slightly degrades metrics: SI-SDR 7.58→7.52, PESQ 2.16→2.05, STOI 0.74→0.72.
- Replacing complex autoencoders (AE) with split real-valued AEs: complex AE (22-d hidden) achieves LSD 0.49, PESQ 3.48; split real AE (36-d) yields LSD 0.72, PESQ 2.06. A capacity-matched “extra complex” AE is intermediate. This confirms the efficacy of truly complex-valued layers.
6. Implementation, Stability, and Efficiency
Optimal codec behavior relies on careful codebook management: warmup of EMA decay and centroid initialization prevent premature code clustering, while dead-code refresh maintains high codebook utilization. Complex layers’ phase equivariance and covariance normalization accelerate learning and generalization, negating the need for adversarial losses.
The architecture combines a compact 2.35 M-parameter network with 12 compact codebooks, attaining convergence in 35–41k steps, versus 500–700k for contemporary VQ-GANs.
A typical implementation pipeline is as follows:
- Compute complex STFT of input waveform
- Encode spectrogram via complex-valued neural network (CVNN) with residual blocks, modReLU, complex BatchNorm, and axial attention
- Collapse spatial dimensions and project to a $D$-dimensional complex vector per frame
- Apply $S$-stage complex RVQ with commitment loss and EMA codebook updates
- Decode via transposed complex layers to output spectrogram
- Invert STFT, compute combined loss, and optimize with AdamW
This approach yields streaming-capable, high-fidelity, phase-coherent neural audio coding without adversarial or diffusion-based post-processors (Cerovaz et al., 24 Jan 2026).
7. Significance and Context
Preserving complex-valued amplitude-phase coupling across all stages overcomes major limitations of previous spectral-domain codecs, which rely on real-valued encodings. This design eliminates the need for adversarial or diffusion-based components, markedly reduces training cost, and maintains or exceeds the fidelity, phase-coherence, and generalization of much larger and longer-trained baselines. The empirical results and ablation studies underscore the importance of true complex-valued modeling and structured RVQ in efficient, robust neural audio compression for both music and speech domains (Cerovaz et al., 24 Jan 2026).