
Acoustic Wonderland Dataset Overview

Updated 7 August 2025
  • Acoustic Wonderland Dataset is a large-scale multimodal corpus that benchmarks material-controlled acoustic profile generation in realistic indoor scenes.
  • It integrates photorealistic Matterport3D reconstructions with paired data modalities including RGB images, segmentation masks, and binaural RIRs.
  • Dedicated evaluation splits and an accompanying M‑CAPA model support rigorous benchmarking of material-driven acoustic transformations.

The Acoustic Wonderland Dataset is a large-scale, multimodal corpus designed for the development and benchmarking of material-controlled acoustic profile generation methods. Its primary purpose is to support models that can synthesize room impulse responses (RIRs) for arbitrary indoor scenes as a function of user-specified material assignments, enabling accurate simulation of how acoustic properties change in response to modifications such as substituting “brick” for “curtains” or “hardwood” for “carpet.” The dataset is built atop the SoundSpaces 2.0 (SSv2) platform and leverages high-fidelity Matterport3D reconstructions, providing a testbed for challenging generalization scenarios involving realistic geometries, diverse viewpoints, and thousands of possible material configurations (Saad et al., 4 Aug 2025).

1. Dataset Structure and Data Modalities

The Acoustic Wonderland Dataset is grounded in 84 Matterport3D (M3D) environments, each comprising photorealistic 3D mesh reconstructions of real-world indoor spaces. For each scene, the dataset samples $N$ (e.g., 200) random locations. At every sampled position, it uses the SSv2 pipeline to generate the following per-sample data:

  • Egocentric RGB image ($V$): 90° horizontal field-of-view rendering at the receiver’s location.
  • Semantic segmentation mask ($G$): Categorical annotation of each visible surface/object.
  • Material segmentation mask ($\mathcal{M}$): A mapping of object categories to one of 12 canonical material classes (e.g., wood, brick, acoustic tile).
  • Binaural room impulse response ($A$): The simulated RIR at that position under the specified material configuration.

A key aspect is the “material profile.” For each location, $J$ material profiles are produced by randomly sampling and reassigning material classes to different scene components. This process is repeated at scale ($J \approx 100$ profiles per location, i.e., roughly $100 \times N$ samples per scene), so that 84 scenes $\times$ 200 locations $\times$ $\sim$100 profiles yields approximately 1.68 million unique data samples encompassing the full cross-product of spatial locations and material assignments.

Each dataset entry thus consists of paired (view, segmentation, material mask, RIR), supporting explicit control and isolation of material effects on measured acoustics.
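
For concreteness, a single entry can be pictured as a small record holding the four modalities plus the indices needed for pairing. The following minimal sketch uses hypothetical field names and shapes, not the dataset's published schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AcousticSample:
    """One (view, segmentation, material mask, RIR) entry at a sampled receiver location."""
    rgb: np.ndarray            # V: egocentric RGB rendering, e.g. (H, W, 3), 90 deg horizontal FoV
    segmentation: np.ndarray   # G: per-pixel semantic category IDs, (H, W)
    material_mask: np.ndarray  # M: per-pixel material class IDs in {0, ..., 11}, (H, W)
    rir: np.ndarray            # A: binaural room impulse response, (2, T) time-domain samples
    scene_id: str              # Matterport3D scene identifier
    location_id: int           # index of the sampled receiver position within the scene
    profile_id: int            # index of the material profile applied when simulating the RIR
```

Because geometry and viewpoint are fixed for a given (scene_id, location_id), entries sharing those indices differ only in the applied material profile and the resulting RIR.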

2. Material Profile Generation and Pairing Mechanism

Material profiles encode the mapping from object categories (wall, floor, table, bed, sofa, etc.) to canonical materials. The dataset defines 12 material categories and generates 2,673 distinct, plausible profiles by permuting assignments to prominent scene elements. This is operationalized in two core ways:

  • Material assignment: For every random viewpoint in a scene, the simulator can be re-initialized arbitrarily with a new $\mathcal{M}$, yielding a set of co-located observations with identical geometry and view but different materialization.
  • Pairing strategy: Training data for material-aware RIR prediction is constructed by selecting source and target pairs $(V, G, \mathcal{M}_S, A_S) \rightarrow (V, G, \mathcal{M}_T, A_T)$ at each position, with source and target differing only in the material profile (a minimal pairing sketch appears at the end of this section).

The dataset is split along three axes to maximize generalization stress: (a) seen vs. unseen scenes ($\mathbb{S}_s$, $\mathbb{S}_u$), (b) seen vs. unseen material assignments ($\mathbb{P}_s$, $\mathbb{P}_u$), and (c) seen vs. unseen pairwise material transitions. This design enables rigorous evaluation of a model’s ability to interpolate and extrapolate acoustic effects due to arbitrary material changes.
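
As referenced above, source/target pairs can be assembled by grouping co-located samples and selecting two entries with different material profiles. The sketch below is illustrative only: build_material_pairs and pairs_per_location are assumed names, and it reuses the AcousticSample container sketched in Section 1 rather than any released tooling.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

def build_material_pairs(samples: List[AcousticSample],
                         pairs_per_location: int = 4,
                         seed: int = 0) -> List[Tuple[AcousticSample, AcousticSample]]:
    """Group samples by (scene, location) and pair entries that differ only in material profile."""
    rng = random.Random(seed)
    by_location: Dict[Tuple[str, int], List[AcousticSample]] = defaultdict(list)
    for s in samples:
        by_location[(s.scene_id, s.location_id)].append(s)

    pairs = []
    for co_located in by_location.values():
        if len(co_located) < 2:
            continue
        for _ in range(pairs_per_location):
            src, tgt = rng.sample(co_located, 2)   # same view and geometry, different material profile
            if src.profile_id != tgt.profile_id:
                pairs.append((src, tgt))           # predict tgt.rir from src plus the target material mask
    return pairs
```

Split assignment along the three axes can then be made at the level of scene IDs, material profiles, and (source profile, target profile) transitions, keeping held-out items entirely out of the training pairs.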

3. Model Architecture: Material-Controlled Acoustic Profile Anticipation (M‑CAPA)

The core prediction task is, given $(V, G, \mathcal{M}_S, A_S)$ and a user-specified $\mathcal{M}_T$, to predict the binaural RIR $A_T$ as if the materials had been replaced according to $\mathcal{M}_T$. The paper introduces the M‑CAPA architecture, a multimodal encoder–decoder built as follows:

  • Multimodal Scene Encoder $f^E$: Consumes the current observation (RGB $V$, segmentation $G$, and source RIR $A_S$). Each modality is encoded with a four-layer UNet-style convolutional encoder, with the RIR transformed through an STFT and encoded as a binaural spectrogram.
  • Target Material Encoder $f^M$: Encodes the target material mask $\mathcal{M}_T$ via an analogous convolutional encoder.
  • Fusion and Decoding ($f^T$): Scene and material embeddings ($e_m$, $e_t$) are fused by a fusion module $\mathcal{F}$, followed by a deconvolution-based decoder with skip connections (notably from the acoustic encoder).
  • Prediction Equation: The target RIR prediction is formulated as

$$\hat{A}_T = W_T \odot A_S + B_T$$

where $W_T$ is a learned weighting mask, $B_T$ is a learned residual, and $\odot$ denotes pointwise multiplication. This structure allows the model both to modulate salient aspects of the source RIR according to the new materials and to inject entirely new reverberation/damping structure.

  • Loss Function: Composite loss over the time and frequency domains (L1, L2, and an energy decay loss $L_D$) to enforce fidelity in the direct response, the late reverberation tail, and energy decay properties (e.g., RT60); a minimal sketch of the prediction head and loss follows below.
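
The output formulation and composite loss can be sketched as follows, assuming the RIR is handled as a binaural STFT magnitude tensor of shape (batch, channel, freq, time). The loss weights and the exact form of the energy decay term are illustrative assumptions; the paper's precise formulation may differ.

```python
import torch
import torch.nn.functional as F

def predict_target_rir_spec(source_spec: torch.Tensor,
                            weight_mask: torch.Tensor,
                            residual: torch.Tensor) -> torch.Tensor:
    """Output formulation A_hat_T = W_T (*) A_S + B_T, applied pointwise in the STFT domain."""
    return weight_mask * source_spec + residual

def energy_decay_curve_db(spec_mag: torch.Tensor) -> torch.Tensor:
    """Schroeder-style backward integration over time frames; input assumed (batch, ch, freq, time)."""
    energy = spec_mag.pow(2).sum(dim=2)  # collapse frequency -> (batch, ch, time)
    edc = torch.flip(torch.cumsum(torch.flip(energy, dims=[-1]), dim=-1), dims=[-1])
    return 10.0 * torch.log10(edc + 1e-8)

def composite_loss(pred_spec: torch.Tensor, true_spec: torch.Tensor,
                   w1: float = 1.0, w2: float = 1.0, wd: float = 1.0) -> torch.Tensor:
    """L1 + L2 spectrogram terms plus an energy-decay term L_D; the weights here are placeholders."""
    l1 = F.l1_loss(pred_spec, true_spec)
    l2 = F.mse_loss(pred_spec, true_spec)
    ld = F.l1_loss(energy_decay_curve_db(pred_spec), energy_decay_curve_db(true_spec))
    return w1 * l1 + w2 * l2 + wd * ld
```

In a full model, weight_mask and residual would be produced by the decoder $f^T$, so a single forward pass both modulates the source RIR and adds new reverberant structure.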

4. Benchmarking, Evaluation Metrics, and Comparative Performance

The dataset supports rigorous evaluation using a set of metrics tailored to RIR prediction:

  • L1 error: Pointwise absolute error between predicted and ground truth RIRs.
  • STFT error (MSE): Mean-squared error in the spectral domain.
  • RT60 error (RTE): Absolute error in the predicted reverberation decay time, RT60 (see the sketch after this list).
  • CTE (Early-to-late energy ratio error): Captures changes in direct-to-reverberant balance.
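
The RT60 error, for example, can be computed by estimating the reverberation time of the predicted and ground-truth RIRs and taking the absolute difference. The sketch below uses a standard Schroeder backward-integration estimate with a T20-style fit; the sample rate and dB thresholds are assumptions, and the paper's exact RTE definition may differ.

```python
import numpy as np

def rt60_from_rir(rir: np.ndarray, sample_rate: int = 16000) -> float:
    """Estimate RT60 for a single-channel RIR via Schroeder backward integration and a -5/-25 dB fit."""
    energy = rir.astype(np.float64) ** 2
    edc = np.cumsum(energy[::-1])[::-1]                        # Schroeder energy decay curve
    edc_db = 10.0 * np.log10(edc / (edc[0] + 1e-12) + 1e-12)   # normalized to 0 dB at t = 0
    # Fit the decay slope between -5 dB and -25 dB, then extrapolate to -60 dB (T20-style estimate).
    idx = np.where((edc_db <= -5.0) & (edc_db >= -25.0))[0]
    if len(idx) < 2:
        return float("nan")
    t = idx / sample_rate
    slope, intercept = np.polyfit(t, edc_db[idx], deg=1)
    return -60.0 / slope

def rt60_error(pred_rir: np.ndarray, true_rir: np.ndarray, sample_rate: int = 16000) -> float:
    """Absolute RT60 error (RTE); for binaural RIRs, one might average per-channel estimates."""
    return abs(rt60_from_rir(pred_rir, sample_rate) - rt60_from_rir(true_rir, sample_rate))
```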

Several baseline and state-of-the-art models, both material-aware and material-agnostic (Direct mapping, material matchers, Image2Reverb, FAST-RIR++, AV-RIR), are evaluated. The M‑CAPA model, which uses both the source RIR and the RGB view, consistently outperforms competing methods across all metrics and all splits (seen/unseen scenes, materials, and material-pair transitions). Notably, when only a subset of surfaces has its materials changed, M‑CAPA maintains low error, demonstrating sensitivity and specificity to material-driven acoustic effects.

5. Impact and Research Significance

By systematically varying material properties while controlling for geometry and location, the Acoustic Wonderland Dataset presents a unique resource for disentangling and learning the mapping from materials to acoustic outcomes under realistic conditions. Existing datasets either (a) lack wide material coverage with realistic geometry, (b) do not provide paired “before” and “after” RIRs under isolated material changes, or (c) do not allow direct regression over material assignments.

This enables research advances in several directions:

  • Data-driven, material-controlled RIR prediction: Supports research in neural rendering of sound fields where acoustic response must be adapted to user-specified material modifications—relevant for AR/VR, simulation, and creative design.
  • Generalization benchmarking: The dataset’s split design enables explicit quantification of interpolation vs. extrapolation ability for unseen scenes, unseen materials, and unseen transitions.
  • Explicit material reasoning: The inclusion of semantic segmentation and material masks alongside RGB and RIRs allows direct use of multimodal and cross-attentional learning mechanisms.

A plausible implication is that this dataset sets a new empirical standard for flexible, physically informed acoustic modeling in indoor environments.

6. Broader Applications and Future Research Directions

The capabilities afforded by the Acoustic Wonderland Dataset advance several practically important tasks, including:

  • Customizable reverb synthesis and room tuning in AR/VR pipelines, where user-driven modification of acoustic character is required in real time.
  • Material-in-the-loop acoustic simulation for content creation, architecture, and building acoustics, enabling fast, learned estimation of how material changes impact perceived sound.
  • Benchmarking neural architectures for physically-grounded soundfield rendering and transfer learning across real and synthetic scenes.

Ongoing research can leverage the dataset to probe fundamental questions about the disentanglement of geometry and material effects, and to develop new architectures that integrate explicit physical constraints or generalize across broader material vocabularies and real-world soundfield measurements.


In summary, the Acoustic Wonderland Dataset constitutes a comprehensive, systematically constructed resource for the study and evaluation of material-controlled acoustic profile generation. By isolating material effects in complex, multimodal indoor scenes, and by providing rigorous evaluation splits and baseline comparisons, it enables significant progress in the data-driven modeling of material-dependent indoor acoustics (Saad et al., 4 Aug 2025).
