PromptReverb: Accurate RIR Synthesis
- PromptReverb is a generative framework that synthesizes acoustically accurate full-band (48 kHz) room impulse responses from natural language and multimodal cues, overcoming band-limited data constraints.
- It employs a two-stage process combining a VAE for upsampling and a conditional diffusion transformer with rectified flow matching to generate latent representations aligned with key acoustic parameters.
- The system achieves high perceptual realism with a mean RT60 error as low as 8.8%, making it effective for applications in VR/AR, game audio, architectural acoustics, and creative media production.
PromptReverb is a generative framework designed to synthesize acoustically accurate room impulse responses (RIRs) from natural language descriptions or other multimodal cues. By bridging the gap between limited band-limited datasets and the demand for full-band (48 kHz) RIR synthesis, the system enables high-fidelity generation with perceptual realism and low mean RT60 error. The framework is built on a two-stage process that first reconstructs full-band impulse responses using a variational autoencoder (VAE) and then generates latent representations conditioned on textual or multimodal prompts via a conditional diffusion transformer trained with rectified flow matching.
1. Core Contributions and Objectives
PromptReverb addresses two critical challenges in RIR synthesis:
- The scarcity of full-band RIR datasets, as most available data are limited to lower sampling rates.
- The lack of models capable of generating acoustically accurate responses from diverse input modalities, particularly natural language.
The method provides a flexible solution by learning to upsample band-limited measurements and by offering a text- (or multimodal-) conditioned generative process. This dual capability results in RIRs with superior perceptual quality and close matching of key acoustic parameters such as RT60.
2. Two-Stage Generative Framework
PromptReverb is structured in two distinct stages:
- VAE-Based Full-Band Upsampling
The first stage employs a variational autoencoder to convert band-limited RIRs (typically below 24 kHz) into full-band (48 kHz) impulse responses.
- The encoder uses residual blocks (ResBlocks) to process 128 mel-band spectrograms extracted from mono impulse responses, outputting a compact latent vector.
- The decoder integrates a Transformer-based preprocessing step followed by ConvNeXt 1D blocks to reconstruct the full-band waveform.
- Training uses a β-VAE objective (with β = 10⁻⁴) combined with adversarial losses from HiFi-GAN discriminators, accounting for mel-spectrogram MSE, RT60 mean absolute error, and feature matching.
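The first-stage objective above can be sketched as a weighted sum of its terms. This is a minimal illustration, not the paper's implementation: the helper names, the `lambda_rt60` weight, and the omission of the adversarial and feature-matching terms (which require the HiFi-GAN discriminators) are all assumptions; only the β = 10⁻⁴ weighting is taken from the text.

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def beta_vae_loss(mel_pred, mel_true, mu, log_var,
                  rt60_pred, rt60_true, beta=1e-4, lambda_rt60=1.0):
    """Composite first-stage objective: mel MSE + beta * KL + RT60 MAE.
    The adversarial and feature-matching terms are omitted in this sketch."""
    recon = np.mean((mel_pred - mel_true) ** 2)    # mel-spectrogram MSE
    kl = kl_diag_gaussian(mu, log_var)             # beta-VAE regularizer
    rt60 = np.mean(np.abs(rt60_pred - rt60_true))  # RT60 mean absolute error
    return recon + beta * kl + lambda_rt60 * rt60
```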
- Conditional Diffusion Transformer via Rectified Flow Matching
In the second stage, the framework generates new RIR latent representations conditioned on natural language descriptions (or additional modalities).
- A transformer-based diffusion model learns a time-dependent velocity field that transports Gaussian noise into a latent representation, following the linear trajectory $z_t = (1 - t)\,z_0 + t\,z_1$, where $z_0 \sim \mathcal{N}(0, I)$ is the noise sample and $z_1$ is the target latent.
- The conditional objective is optimized via rectified flow matching with a pseudo-Huber loss, ensuring the predicted velocity field approximates the actual difference between the target and the initial latent vectors.
- Classifier-free guidance is applied at inference to blend conditional and unconditional predictions, thereby controlling adherence to the input prompt (with a guidance scale, e.g., 6.0).
- The generated latent code is decoded by the pre-trained VAE decoder to produce the final full-band RIR.
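The second-stage inference procedure can be sketched as a numerical integration of the learned velocity field with classifier-free guidance. This is a simplified illustration: the fixed-step Euler integrator stands in for the paper's adaptive solver, and `toy_velocity` is a placeholder for the trained diffusion transformer.

```python
import numpy as np

def sample_latent(velocity_fn, cond, latent_dim, steps=32, guidance=6.0, seed=0):
    """Integrate the velocity field from Gaussian noise (t=0) to a latent (t=1),
    blending conditional and unconditional predictions (classifier-free guidance)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(latent_dim)      # z_0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_cond = velocity_fn(z, t, cond)     # prompt-conditioned velocity
        v_uncond = velocity_fn(z, t, None)   # null-prompt velocity
        v = v_uncond + guidance * (v_cond - v_uncond)
        z = z + dt * v                       # Euler step along the flow
    return z

# Toy velocity field: drifts the latent toward a fixed target when conditioned.
target = np.ones(4)
def toy_velocity(z, t, cond):
    return (target - z) if cond is not None else np.zeros_like(z)
```

In the real system the resulting latent would then be passed through the pre-trained VAE decoder to obtain the 48 kHz waveform.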
3. Methodology and Mathematical Formulation
PromptReverb’s technical methods combine modern deep learning architectures with precise acoustic modeling:
- Variational Autoencoder Formulation:
The encoder maps input mel-spectrograms to a latent space, and the decoder reconstructs signals at 48 kHz using ConvNeXt blocks. The β-VAE loss and adversarial training improve perceptual fidelity.
- Rectified Flow Matching:
The diffusion transformer learns a velocity field satisfying the differential equation

$$\frac{dz_t}{dt} = v_\theta(z_t, t, c),$$

where $c$ represents conditioning from text or multimodal inputs. The training loss is the pseudo-Huber objective

$$\mathcal{L} = \mathbb{E}_{t,\, z_0,\, z_1}\!\left[\sqrt{\left\lVert v_\theta(z_t, t, c) - (z_1 - z_0)\right\rVert^2 + \delta^2} - \delta\right],$$

with $z_t = (1 - t)\,z_0 + t\,z_1$ and $z_0 \sim \mathcal{N}(0, I)$. Adaptive Runge–Kutta solvers with cosine time reparameterization are used for numerical integration.
- Classifier-Free Guidance (CFG):
During inference, the final velocity is computed using

$$\tilde{v} = v_\theta(z_t, t, \varnothing) + s\left(v_\theta(z_t, t, c) - v_\theta(z_t, t, \varnothing)\right),$$

where $s$ is the guidance scale (e.g., 6.0) and $\varnothing$ denotes the null (unconditional) prompt. This enhances the model's sensitivity to the prompt while maintaining diversity in generated outputs.
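A single rectified-flow-matching training step can be sketched numerically as follows. The value of the pseudo-Huber constant `c` here is an illustrative choice, not the paper's, and `velocity_fn` stands in for the diffusion transformer:

```python
import numpy as np

def pseudo_huber(residual, c=1e-3):
    """Pseudo-Huber loss: quadratic near zero, linear for large residuals."""
    return np.sqrt(np.sum(residual**2) + c**2) - c

def rfm_training_loss(velocity_fn, z1, cond, t, rng):
    """One rectified-flow-matching step: interpolate noise and data along the
    linear trajectory, then penalize the gap between the predicted velocity
    and the true displacement (z1 - z0)."""
    z0 = rng.standard_normal(z1.shape)  # z_0 ~ N(0, I)
    zt = (1.0 - t) * z0 + t * z1        # linear trajectory z_t
    v_pred = velocity_fn(zt, t, cond)
    return pseudo_huber(v_pred - (z1 - z0))
```

An oracle that predicts the displacement exactly drives the loss to zero, which is what makes the velocity field a useful regression target.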
4. Empirical Evaluation and Quantitative Results
Objective evaluations of PromptReverb demonstrate its effectiveness:
- RT60 Error Reduction:
PromptReverb XL achieves a mean RT60 error of 8.8%, compared with baselines such as Image2Reverb, whose signed error of –37% indicates a systematic underestimate of reverberation time. Accurate reproduction of reverberation time is a strong indicator of acoustic fidelity.
- VAE Reconstruction Performance:
The VAE stage produces time-domain reconstructions with significantly higher signal-to-noise ratio (SNR) and lower mean squared error (MSE) than traditional phase-recovery methods such as Griffin-Lim, while running over 60× faster at inference.
- Comparative Metrics:
Tables comparing models show that PromptReverb scales favorably as larger diffusion transformer models are employed, and its output dynamic range aligns closely with ground-truth measurements.
The following table summarizes key performance metrics related to RT60 error:
| Model | Mean RT60 Error (%) |
|---|---|
| Image2Reverb | –37.0 |
| PromptReverb S | 43.4 (long) |
| PromptReverb B | 30.2 (long) |
| PromptReverb L | 24.6 (long) |
| PromptReverb XL | 8.8 (long) |
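A signed percentage error of the kind reported above can be computed as follows. The sign convention (estimated minus ground truth, relative to ground truth, so that negative values mean reverberation times that are too short) is an assumption about how these figures were derived:

```python
import numpy as np

def mean_rt60_error_pct(rt60_est, rt60_true):
    """Mean signed RT60 error in percent; negative values indicate that
    the estimated reverberation times are, on average, too short."""
    rt60_est = np.asarray(rt60_est, dtype=float)
    rt60_true = np.asarray(rt60_true, dtype=float)
    return float(np.mean((rt60_est - rt60_true) / rt60_true) * 100.0)
```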
5. Applications and Practical Implications
The advantages of PromptReverb extend to multiple domains:
- Virtual and Augmented Reality (VR/AR):
Real-time generation of spatial audio effects tailored to dynamically changing environments improves immersion without requiring detailed geometric models or exhaustive acoustic parameters as input.
- Game Audio and Multimedia Production:
Designers can quickly prototype auditory environments by describing the desired spatial characteristics (e.g., “cozy jazz club with soft velvet curtains”) without requiring expert-level tuning.
- Architectural Acoustics:
Architects can simulate the acoustic impact of various room materials and configurations based solely on natural language descriptions, facilitating iterative design processes.
- Audio and Music Production:
Producers benefit from the ability to generate specific reverberation effects that conform to nuanced auditory characteristics, enhancing creative sound design.
6. Conclusions
PromptReverb establishes a rigorous, multimodal framework for generating full-band room impulse responses from natural language and other cues. Its two-stage approach—incorporating both VAE-based upsampling for high-fidelity reconstruction and conditioned diffusion via rectified flow matching for prompt-guided latent generation—delivers superior acoustic accuracy and perceptual realism. By achieving a mean RT60 error as low as 8.8% and offering significant speed advantages in inference, PromptReverb not only advances the state of the art in RIR synthesis but also paves the way for practical applications in immersive audio, architectural design, and creative media production.