RAVE: Real-Time Neural Audio Synthesis
- The paper introduces RAVE, a VAE-based model achieving real-time, 48 kHz audio synthesis with notable improvements in speed and perceived quality.
- RAVE employs a multi-band PQMF decomposition and a tailored encoder-decoder network, enabling flexible fidelity–compactness trade-offs and precise timbre transfer.
- Extensions of the model support pitch-conditioned synthesis, speech conversion, and ultra-low latency designs, demonstrating diverse applications in audio compression and voice conversion.
The RAVE (Realtime Audio Variational autoEncoder) model is a neural architecture for efficient, high-quality audio waveform synthesis combining a variational autoencoder (VAE) backbone with a multi-band time-domain representation. Engineered for both generative and interactive applications, RAVE achieves real-time, 48 kHz neural synthesis with explicit latent control, outperforming previous autoregressive and feedforward approaches in speed and perceived quality. Via a two-stage training procedure integrating adversarial finetuning, RAVE allows flexible fidelity–compactness trade-offs, fine-grained timbre transfer, and end-to-end waveform compression. Extensions adapt RAVE for pitch-conditional synthesis, low-latency deployment, and speech conversion at high sampling rates (Caillon et al., 2021, Lee et al., 2022, Caspe et al., 14 Mar 2025, Bargum et al., 29 Aug 2024).
1. Model Architecture
Multi-Band Waveform Decomposition
RAVE employs a 16-band pseudo-quadrature mirror filter (PQMF) bank to decompose a raw waveform sampled at 48 kHz into $M = 16$ subbands. Each analysis filter $h_i$ is a cosine-modulated version of a prototype lowpass filter $h$ with cutoff $\pi/2M$. The subbands are downsampled by $M$ before encoding; synthesis uses time-reversed filters and upsampling by $M$, yielding near-perfect reconstruction while drastically reducing computational cost.
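A minimal sketch of such a cosine-modulated filterbank in NumPy/SciPy is shown below; the filter length, modulation phase, and the use of `scipy.signal.firwin` for the prototype are illustrative assumptions, not RAVE's exact implementation.

```python
import numpy as np
from scipy.signal import firwin

M = 16                 # number of subbands
taps = 8 * M           # prototype filter length (assumed)

# Prototype lowpass with cutoff pi/(2M) (firwin's cutoff is normalized to Nyquist).
h = firwin(taps, 1.0 / (2 * M))

# Cosine-modulated analysis filters h_i (standard pseudo-QMF construction).
n = np.arange(taps)
analysis = np.stack([
    2 * h * np.cos((2 * i + 1) * np.pi / (2 * M) * (n - (taps - 1) / 2)
                   + (-1) ** i * np.pi / 4)
    for i in range(M)
])

def pqmf_analysis(x):
    """Filter with each band filter, then critically downsample by M."""
    return np.stack([np.convolve(x, f)[::M] for f in analysis])

def pqmf_synthesis(bands):
    """Upsample each band by M, filter with time-reversed filters, and sum."""
    out = 0.0
    for sub, f in zip(bands, analysis):
        up = np.zeros(len(sub) * M)
        up[::M] = sub
        out = out + np.convolve(up, M * f[::-1])
    return out

x = np.sin(2 * np.pi * 440 / 48000 * np.arange(48000))   # 1 s of 440 Hz at 48 kHz
x_hat = pqmf_synthesis(pqmf_analysis(x))                  # reconstruction, delayed by ~taps samples
```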
Encoder and Decoder Structure
The encoder is a four-layer 1D CNN (filter width 7, LeakyReLU activations, batch normalization) with strides $[4, 4, 4, 2]$, mapping each frame of the multiband signal to a latent Gaussian posterior $q(z \mid x) = \mathcal{N}(\mu(x), \sigma(x))$ with $\dim(z) = 128$. The decoder mirrors this arrangement, using alternating nearest-neighbor upsampling (×2) and a residual stack of dilated convolutions to reconstruct multiband signals. The decoder splits into three output heads:
- Waveform head: convolution with tanh activation, producing the base multiband waveform $\mathbf{w}$.
- Loudness head: convolution with sigmoid activation, generating the amplitude envelope $\mathbf{a}$.
- Noise head: stochastic component $\mathbf{n}$, modeled as filtered white noise.
The outputs are aggregated as $\hat{\mathbf{s}} = \mathbf{w} \odot \mathbf{a} + \mathbf{n}$. Final PQMF synthesis reconstructs the full-band waveform (Caillon et al., 2021, Caspe et al., 14 Mar 2025).
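The three heads and their aggregation can be sketched in PyTorch as follows; channel sizes and kernel widths are illustrative, and the noise branch is simplified to a learned gain on white noise rather than a full noise filter.

```python
import torch
import torch.nn as nn

class DecoderHeads(nn.Module):
    def __init__(self, hidden=128, bands=16, kernel=7):
        super().__init__()
        pad = kernel // 2
        self.wave = nn.Conv1d(hidden, bands, kernel, padding=pad)    # waveform head
        self.loud = nn.Conv1d(hidden, bands, kernel, padding=pad)    # loudness head
        self.noise = nn.Conv1d(hidden, bands, kernel, padding=pad)   # noise head

    def forward(self, h):
        w = torch.tanh(self.wave(h))        # base multiband waveform
        a = torch.sigmoid(self.loud(h))     # amplitude envelope
        # Simplified stochastic branch: white noise shaped by a learned gain
        # (RAVE filters the noise; this stands in for that behavior).
        n = torch.randn_like(w) * torch.sigmoid(self.noise(h))
        return w * a + n                    # aggregated multiband output

heads = DecoderHeads()
s_hat = heads(torch.randn(1, 128, 100))     # (batch, hidden, frames) -> (1, 16, 100)
```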
2. Latent Space and Fidelity Control
RAVE’s latent space has $128$ dimensions; post-training singular value decomposition (SVD) reveals that most of the data variance often resides in a much lower-dimensional manifold ($d \ll 128$). For a target reconstruction fidelity $f$, one can project $z$ onto the first $d_f$ principal directions, chosen so that they explain a fraction $f$ of the total variance, injecting isotropic noise in the unused dimensions. Typical values are small, e.g., $d_f = 24$ at $f = 0.99$ for strings (see the table in Section 4), with comparably compact codes for speech. Adjusting $f$ enables a fidelity–compactness trade-off for applications such as compression or domain transfer (Caillon et al., 2021).
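A sketch of this post-hoc control, assuming a matrix `Z` of latents gathered by encoding a dataset (rows are frames):

```python
import numpy as np

def svd_basis(Z):
    """Principal directions of mean-centered latents and their variance fractions."""
    Zc = Z - Z.mean(axis=0)
    _, s, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Vt, (s ** 2) / np.sum(s ** 2)

Z = np.random.randn(10_000, 128)                    # stand-in for dataset latents
Vt, var = svd_basis(Z)
f = 0.99                                            # target fidelity
d_f = int(np.searchsorted(np.cumsum(var), f)) + 1   # smallest d explaining fraction f

z = np.random.randn(128)                            # one latent frame
coeffs = Vt @ z
coeffs[d_f:] = np.random.randn(128 - d_f)           # isotropic noise in unused dims
z_tilde = Vt.T @ coeffs                             # latent passed to the decoder
```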
3. Training Procedure and Losses
Stage 1: Variational Representation Learning
RAVE is trained to maximize the evidence lower bound (ELBO), i.e., to minimize

$$\mathcal{L}_{\mathrm{VAE}} = \mathbb{E}_{q(z \mid x)}\big[ S(x, \hat{x}) \big] + \beta \, D_{\mathrm{KL}}\big( q(z \mid x) \,\|\, p(z) \big),$$

where $S$ is a multiscale spectral distance computed over several STFT window sizes $\mathcal{N}$:

$$S(x, \hat{x}) = \sum_{n \in \mathcal{N}} \left[ \frac{\| \mathrm{STFT}_n(x) - \mathrm{STFT}_n(\hat{x}) \|_F}{\| \mathrm{STFT}_n(x) \|_F} + \log \| \mathrm{STFT}_n(x) - \mathrm{STFT}_n(\hat{x}) \|_1 \right],$$

and $\beta$ weights the KL term. Representation learning is performed with the Adam optimizer, batch size 8, up to 1.5 million steps, and includes data augmentation such as dequantization, random cropping, and all-pass filtering.
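The spectral distance can be sketched in PyTorch as below; the window set and the $\epsilon$ stabilizer are assumptions, following the DDSP-style combination of linear and log terms.

```python
import torch

def multiscale_spectral_distance(x, y, windows=(2048, 1024, 512, 256, 128), eps=1e-7):
    """Sum of relative linear and log-magnitude STFT errors over several scales."""
    d = 0.0
    for n in windows:
        win = torch.hann_window(n)
        X = torch.stft(x, n_fft=n, hop_length=n // 4, window=win, return_complex=True).abs()
        Y = torch.stft(y, n_fft=n, hop_length=n // 4, window=win, return_complex=True).abs()
        lin = torch.linalg.vector_norm(X - Y) / (torch.linalg.vector_norm(X) + eps)
        log = torch.log((X - Y).abs().sum() + eps)
        d = d + lin + log
    return d

x, y = torch.randn(1, 48000), torch.randn(1, 48000)
print(multiscale_spectral_distance(x, y))
```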
Stage 2: Adversarial Finetuning
With the encoder frozen, only the decoder and discriminator are updated:
- Discriminator: hinge loss
$$\mathcal{L}_D = \mathbb{E}_{x}\big[ \max(0,\, 1 - D(x)) \big] + \mathbb{E}_{\hat{x}}\big[ \max(0,\, 1 + D(\hat{x})) \big].$$
- Generator: adversarial loss $\mathcal{L}_{\mathrm{adv}} = -\,\mathbb{E}_{\hat{x}}[D(\hat{x})]$, plus feature-matching and spectral losses:
$$\mathcal{L}_G = \mathcal{L}_{\mathrm{adv}} + \mathcal{L}_{\mathrm{FM}} + S(x, \hat{x}),$$
where $\mathcal{L}_{\mathrm{FM}}$ is computed by comparing intermediate discriminator activations.
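A compact sketch of these stage-2 objectives in PyTorch, assuming a discriminator `disc` that returns a score and a list of intermediate feature maps; the relative loss weights (set to 1 here) are illustrative.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, x_real, x_fake):
    """Hinge loss on real and (detached) generated audio."""
    real, _ = disc(x_real)
    fake, _ = disc(x_fake.detach())
    return F.relu(1 - real).mean() + F.relu(1 + fake).mean()

def generator_loss(disc, x_real, x_fake, spectral_distance):
    """Adversarial + feature-matching + multiscale spectral terms."""
    fake_score, feats_fake = disc(x_fake)
    _, feats_real = disc(x_real)
    adv = -fake_score.mean()
    fm = sum(F.l1_loss(a, b) for a, b in zip(feats_fake, feats_real))
    return adv + fm + spectral_distance(x_real, x_fake)
```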
4. Performance, Evaluation, and Speed
Computational Benchmarks
RAVE with 16-band decomposition achieves a synthesis throughput of roughly 985 kHz on CPU and 11.7 MHz on GPU, i.e., about 20× and 240× faster than real time at 48 kHz, with a model size of 17.6 million parameters. In comparison:

| Model | CPU Rate | GPU Rate |
|---------------------|----------|----------|
| NSynth (AR) | 18 Hz | 57 Hz |
| SING (non-AR) | 304 kHz | 9.8 MHz |
| RAVE (no multiband) | 38 kHz | 3.7 MHz |
| RAVE (16-band) | 985 kHz | 11.7 MHz |
Audio Quality
Objective multiscale spectral distance on the strings test set as a function of code dimension:

| Fidelity $f$ | Code dimension $d_f$ | Spectral distance increase |
|--------------|----------------------|----------------------------|
| 1.00 | 128 | 0.00 |
| 0.99 | 24 | 0.05 |
| 0.90 | 16 | 0.20 |
| 0.80 | 10 | 0.50 |
Subjective Mean Opinion Scores (MOS, 1–5 scale):

| Model | MOS | 95% CI |
|--------------|------|--------|
| Ground truth | 4.21 | ±0.04 |
| NSynth | 2.68 | ±0.04 |
| SING | 1.15 | ±0.02 |
| RAVE | 3.01 | ±0.05 |
RAVE significantly outperforms NSynth and SING in perceived quality (Caillon et al., 2021).
5. Extensions: Polyphonic, Speech, and Low-Latency Variants
Conditional RAVE for Polyphonic Music
To address failures in reconstructing wide-pitch polyphony (e.g., missing bass in piano), RAVE can be extended with pitch conditioning: frame-wise MIDI note activations are concatenated as one-hot vectors to the multi-band input, and further integrated via a fully connected layer in the decoder. This model optimizes a conditional ELBO of the form

$$\mathcal{L} = \mathbb{E}_{q(z \mid x, c)}\big[ S(x, \hat{x}) \big] + \beta \, D_{\mathrm{KL}}\big( q(z \mid x, c) \,\|\, p(z) \big),$$

where $c$ denotes the frame-wise pitch condition.
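A minimal sketch of the conditioning pathway (PyTorch); tensor shapes, the note-vocabulary size, and the conditioning width are illustrative assumptions.

```python
import torch
import torch.nn as nn

bands, frames, notes = 16, 100, 128
multiband = torch.randn(1, bands, frames)        # PQMF subband input
pitch_roll = torch.zeros(1, notes, frames)       # frame-wise MIDI note activations
pitch_roll[0, 60, :] = 1.0                       # e.g., middle C held throughout

# Concatenate the condition to the encoder input along the channel axis.
encoder_input = torch.cat([multiband, pitch_roll], dim=1)   # (1, bands + notes, frames)

# Decoder-side integration through a fully connected layer over the note axis.
to_cond = nn.Linear(notes, 64)
decoder_cond = to_cond(pitch_roll.transpose(1, 2)).transpose(1, 2)   # (1, 64, frames)
```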
The conditional variant achieves mean MUSHRA listening-test scores of $51.4$ for vanilla RAVE, $49.4$ for a simple CVAE, and $76.7$ for the proposed pitch-conditioned CVAE. The conditioning removes the missing-bass failure mode, improves harmonic reconstruction, and converges more stably (Lee et al., 2022).
RAVE for Speech: S-RAVE
For speech and voice conversion, S-RAVE reuses the RAVE backbone but guides the latent space toward linguistically meaningful representations via HuBERT-based content distillation, and conditions the decoder using FiLM layers modulated by external speaker embeddings. The system is trained with STFT reconstruction, adversarial, and content-distillation losses; causal “cached” convolutions ensure streamability and high throughput. S-RAVE matches DiffVC in intelligibility (WER/CER) and naturalness (MOS), with a substantially faster real-time factor in CPU inference at 48 kHz. Speaker similarity on unseen identities remains a limitation (Bargum et al., 29 Aug 2024).
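FiLM conditioning of the kind described can be sketched as follows (PyTorch); the speaker-embedding and channel dimensions are illustrative.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation of decoder activations by a speaker embedding."""
    def __init__(self, spk_dim=256, channels=128):
        super().__init__()
        self.to_scale = nn.Linear(spk_dim, channels)
        self.to_shift = nn.Linear(spk_dim, channels)

    def forward(self, h, spk):
        # h: (batch, channels, frames); spk: (batch, spk_dim)
        gamma = self.to_scale(spk).unsqueeze(-1)   # per-channel scale
        beta = self.to_shift(spk).unsqueeze(-1)    # per-channel shift
        return gamma * h + beta

film = FiLM()
h_mod = film(torch.randn(2, 128, 50), torch.randn(2, 256))
```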
Low-Latency BRAVE
For interactive scenarios, latency sources in RAVE were systematically reduced:
- Compression ratio reduced: encoder/decoder strides halved, bringing the block size down to 128 frames (6 ms of buffering).
- PQMF filter attenuation reduced from 100 dB to 40 dB, halving group delay.
- Causal-only training; decoder dilations pruned; hidden channels reduced.
- Noise generator removed; streaming ported to a custom C++ backend.
The result, BRAVE, is a 4.9 M parameter model with 10 ms latency, jitter below 3 ms, and real-time CPU performance while preserving timbre-transfer quality and pitch/loudness accuracy (Caspe et al., 14 Mar 2025).
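The causal “cached” convolutions that make such streaming possible (in both S-RAVE and BRAVE) can be sketched as follows; layer sizes are illustrative. The layer retains the last $(k-1) \cdot d$ samples between calls so that successive blocks match one offline pass.

```python
import torch
import torch.nn as nn

class CachedCausalConv1d(nn.Module):
    """Causal convolution that carries its receptive-field history across blocks."""
    def __init__(self, cin, cout, kernel=3, dilation=1):
        super().__init__()
        self.conv = nn.Conv1d(cin, cout, kernel, dilation=dilation)
        self.pad = (kernel - 1) * dilation
        self.cache = None

    def forward(self, x):                      # x: (batch, cin, block)
        if self.cache is None:                 # zero history for the first block
            self.cache = x.new_zeros(x.shape[0], x.shape[1], self.pad)
        x = torch.cat([self.cache, x], dim=-1)
        self.cache = x[..., -self.pad:].detach()
        return self.conv(x)                    # output length == block length

conv = CachedCausalConv1d(1, 8)
blocks = [conv(torch.randn(1, 1, 128)) for _ in range(4)]   # stream four blocks
```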
6. Applications: Timbre Transfer, Compression, and Voice Conversion
- Timbre Transfer: By encoding source audio with one model and decoding it with a decoder trained on a different domain (e.g., speech vs. violin), RAVE achieves zero-shot timbre transfer: pitch, loudness, and envelope are preserved, while the target timbral texture is imposed by the decoder (Caillon et al., 2021); see the sketch after this list.
- Signal Compression: At a total stride of 2048, RAVE achieves 2048× temporal compression of the waveform into latent frames. SVD-based cropping (Section 2) further reduces the latent dimension by 6× with little loss. With a lightweight AR prior on $z$, regeneration remains faster than real-time (Caillon et al., 2021).
- Voice Conversion: S-RAVE summarizes content in a speaker-invariant code, re-synthesizing with target timbre via FiLM modulation; this enables effective cross-speaker voice conversion in high-fidelity, real-time streaming settings (Bargum et al., 29 Aug 2024).
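The timbre-transfer usage pattern reduces to swapping encoder and decoder across trained models. The sketch below uses tiny stand-in modules with hypothetical `encode`/`decode` methods; `rave_speech` and `rave_violin` represent models pretrained on different domains.

```python
import torch
import torch.nn as nn

class TinyRAVE(nn.Module):
    """Stand-in for a pretrained RAVE model exposing encode/decode."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv1d(1, 128, 2048, stride=2048)          # ~2048x compression
        self.dec = nn.ConvTranspose1d(128, 1, 2048, stride=2048)

    def encode(self, x):
        return self.enc(x)

    def decode(self, z):
        return torch.tanh(self.dec(z))

rave_speech, rave_violin = TinyRAVE(), TinyRAVE()   # hypothetical domain models
x = torch.randn(1, 1, 48000)                        # 1 s of source audio (e.g., speech)
z = rave_speech.encode(x)                           # pitch/loudness/envelope captured in z
y = rave_violin.decode(z)                           # target timbre imposed by the decoder
```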
7. Limitations, Open Problems, and Prospects
- Polyphonic conditioning requires accurate pitch labels; generalization to multi-pitch contexts demands improved features or attention-based fusion (Lee et al., 2022).
- Speaker similarity in S-RAVE degrades for unseen speakers, suggesting further work in robust, zero-shot speaker adaptation (Bargum et al., 29 Aug 2024).
- Real-time deployment involves addressing buffering latency, representation delay, and jitter, which can be mitigated via block-size reduction, causal model design, and hardware-aware implementations (Caspe et al., 14 Mar 2025).
RAVE and its derivatives demonstrate that multiband convolutional VAEs supporting adversarial training, post-hoc fidelity–compactness control, and low-latency inference are viable for interactive, controllable, and high-quality neural audio synthesis in both music and speech domains.