SPUR-Set: Spatial Audio QA Benchmark

Updated 11 November 2025
  • SPUR-Set is a spatial audio question–answering benchmark that provides controlled supervision for fine-grained 3D spatial reasoning in audio-language models.
  • It combines real FOA recordings with synthetic simulations to train spatial adapters and enhance models’ capability in complex auditory scene analysis.
  • Empirical findings show significant gains in spatial consistency, reasoning depth, and reduced error rates when integrating SPUR-Set into model training.

SPUR-Set is a spatial audio question–answering (QA) benchmark designed to provide controlled supervision for fine-grained spatial reasoning in large audio-language models (LALMs). Developed within the SPUR framework, it supports systematic training and quantitative evaluation of models across six complementary spatial reasoning skills, with the goal of equipping LALMs, which have traditionally been limited to monaural inputs, with authentic 3D spatial perception and auditory scene understanding.

1. Definition and Motivation

SPUR-Set is constructed to facilitate both the teaching and rigorous testing of LALMs on spatialized audio reasoning. It offers fine-grained question–answer pairs centered on spatial relationships found in complex auditory scenes, ensuring coverage across relative direction (azimuth), elevation, distance, overlap/interference, group clustering, navigation, and scene-rotation tasks. The dataset explicitly supervises small plug-in spatial adapters, enabling them to imbue frozen LALMs with genuine spatial understanding.

A central premise motivating SPUR-Set is the gap between current LALMs—strong in general audio reasoning but lacking spatial capacities—and the demands of real-world acoustic scene analysis such as source localization, conflict resolution in overlapping environments, and navigation or interaction decisions based on 3D spatial auditory cues.

2. Dataset Composition and Source Material

SPUR-Set comprises both real and simulated First-Order Ambisonics (FOA) recordings:

  • Real FOA recordings: Multi-event scenes featuring both speech and non-speech events. Data sources include:
    • STARSS23 (Sony-TAu Realistic Spatial Soundscapes 2023)
    • TAU-NIGENS Spatial Sound Events 2021
    • L3DAS23 (Tasks 1–2)
  • Controlled simulations: 10,000 synthetic FOA mixtures created by convolving dry isolated sources with parameterized room impulse responses, yielding precise control over source placement and room characteristics.
  • Sample statistics:
    • Training: 12,000 QA pairs (2,000 per skill)
    • Test: 6,000 QA pairs (1,000 per skill)
    • Azimuth angles (θ): uniformly sampled in [–180°, 180°]
    • Elevation (φ): uniformly sampled in [–90°, 90°]
    • Distance: 0.5 m – 5 m

The dataset is evenly balanced across the four core spatial reasoning dimensions—direction, elevation, distance/proximity, and overlap/interference—operationalized by six well-defined skill domains.
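As an illustration of the sampling ranges listed above for the controlled simulations, the following is a minimal sketch of how one source placement could be drawn. The uniform ranges for azimuth, elevation, and distance come from the dataset statistics; the class name, function name, and seed handling are assumptions made for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class SourcePlacement:
    """Ground-truth placement of one simulated source (angles in degrees, distance in meters)."""
    azimuth_deg: float    # theta, uniform in [-180, 180]
    elevation_deg: float  # phi, uniform in [-90, 90]
    distance_m: float     # radial distance, uniform in [0.5, 5.0]

def sample_source_placement(rng: random.Random) -> SourcePlacement:
    """Draw one placement using the ranges reported for SPUR-Set's simulated mixtures."""
    return SourcePlacement(
        azimuth_deg=rng.uniform(-180.0, 180.0),
        elevation_deg=rng.uniform(-90.0, 90.0),
        distance_m=rng.uniform(0.5, 5.0),
    )

if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_source_placement(rng))
```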

3. Annotation Pipeline and QA Generation

Each audio sample undergoes a six-stage annotation and QA generation protocol:

  1. Source selection: Real FOA data or synthetic mix.
  2. Transcription: Speech detected and transcribed using Whisper, with precise time alignment.
  3. Extraction of spatial metadata: Event class, azimuth (θ), elevation (φ), and distance (d) taken from STARSS23 annotations or simulation ground-truth.
  4. Base captioning: General non-spatial scene description generated by Audio Flamingo 3.
  5. Spatial caption refinement: LLM (GPT-5) injects spatial locatives and modifiers (“front-left,” “slightly above,” “near,” etc.).
  6. Skill-targeted QA generation: LLM prompted to craft open-ended multi-step reasoning QAs, corresponding to each of the six spatial skill definitions.
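The six stages above can be read as a linear pipeline. The sketch below is only a schematic of that data flow: the stage functions are hypothetical stubs standing in for the tools the pipeline names (Whisper, Audio Flamingo 3, GPT-5), not those systems' actual APIs, and the skill identifiers are shorthand labels rather than official names.

```python
from dataclasses import dataclass, field

@dataclass
class FOASample:
    """One FOA clip, real (e.g., STARSS23) or simulated; field names are illustrative."""
    audio: object                                        # FOA waveform, however it is loaded
    metadata: list[dict] = field(default_factory=list)   # ground-truth or STARSS23-style labels

# Placeholder stages; a real pipeline would call Whisper (stage 2),
# Audio Flamingo 3 (stage 4), and GPT-5 (stages 5-6).
def transcribe_speech(audio) -> list[dict]:
    return []               # time-aligned speech segments

def extract_spatial_metadata(sample: FOASample) -> list[dict]:
    return sample.metadata  # (class, azimuth, elevation, distance) per event

def caption_scene(audio) -> str:
    return ""               # non-spatial base caption

def refine_caption_spatially(caption: str, transcript: list[dict], events: list[dict]) -> str:
    return caption          # caption with spatial locatives injected ("front-left", "near", ...)

def generate_skill_qas(caption: str, events: list[dict], skill: str) -> list[dict]:
    return []               # open-ended, multi-step QA pairs for one skill

SKILLS = [
    "spatial_priority", "proximity_decision", "directional_conflict",
    "group_dynamics", "navigation_awareness", "scene_reconfiguration",
]

def annotate_sample(sample: FOASample) -> list[dict]:
    """Run one selected clip (stage 1) through stages 2-6 of the protocol."""
    transcript = transcribe_speech(sample.audio)                                   # stage 2
    events = extract_spatial_metadata(sample)                                      # stage 3
    base_caption = caption_scene(sample.audio)                                     # stage 4
    spatial_caption = refine_caption_spatially(base_caption, transcript, events)   # stage 5
    qa_pairs: list[dict] = []
    for skill in SKILLS:                                                           # stage 6
        qa_pairs.extend(generate_skill_qas(spatial_caption, events, skill))
    return qa_pairs
```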

Each QA instance is a natural-language pair:

  • Question: Free-form text, explicitly referencing spatial relationships or inference requirements.
  • Answer: Concise spatial description, e.g., “back-left,” “approximately 3.9 m away,” or a specific event class (“laughter”).
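For concreteness, a single QA instance might be serialized roughly as follows; the field names and identifier format are assumptions for illustration, not the dataset's published schema, while the question and answer text is taken from the examples in Section 6.

```python
# Hypothetical serialization of one SPUR-Set QA instance; field names are
# illustrative, not the dataset's published schema.
qa_instance = {
    "clip_id": "starss23_mix_0042",   # assumed identifier format
    "skill": "directional_conflict",
    "question": ("Despite keyboard clatter and rustling, from which direction "
                 "does the speech maintain the most discernible thread?"),
    "answer": "Back-left.",
}
```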

4. Formal Metadata Structure

SPUR-Set follows the STARSS-style annotation convention. Each labeled event i is represented as

\text{Event}_i = \left(c_i,\, t_i^{\mathrm{start}},\, t_i^{\mathrm{end}},\, \theta_i,\, \phi_i,\, d_i\right)

where:

  • c_i — Event class (e.g., “speech,” “keyboard”)
  • t_i^{\mathrm{start}}, t_i^{\mathrm{end}} — Onset/offset timestamps
  • \theta_i — Azimuth (degrees, [–180°, 180°])
  • \phi_i — Elevation (degrees, [–90°, 90°])
  • d_i — Radial distance (meters)

Auxiliary supervision for sound event localization and detection (SELD) is included, enabling joint prediction of (c_i, \theta_i, \phi_i) to stabilize learned spatial priors during model fine-tuning.
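A minimal sketch of this metadata structure and of the auxiliary SELD target derived from it is shown below; the class and field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class SpatialEvent:
    """One labeled event in STARSS-style metadata (field names assumed for illustration)."""
    event_class: str     # c_i, e.g. "speech", "keyboard"
    t_start: float       # onset time in seconds
    t_end: float         # offset time in seconds
    azimuth_deg: float   # theta_i in [-180, 180]
    elevation_deg: float # phi_i in [-90, 90]
    distance_m: float    # d_i, radial distance in meters

def seld_target(event: SpatialEvent) -> tuple[str, float, float]:
    """Auxiliary SELD supervision: the joint (class, azimuth, elevation) target."""
    return (event.event_class, event.azimuth_deg, event.elevation_deg)
```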

5. Dataset Splits and Evaluation Protocols

SPUR-Set is partitioned into training and test sets per spatial skill, each with non-overlapping clips. For real FOA recordings, published SELD splits are respected.

Evaluation metrics:

  • LLM-as-judge QA scoring:
    • Spatial Consistency (scale: 1–10)
    • Reasoning Depth (scale: 1–10)
    • Relevance (scale: 1–10)
    • Error Rate (% of QAs with hallucinations, omissions, contradictions)
  • SELD metrics (auxiliary):
    • Mean localization error (in degrees)
    • Localization recall (at threshold)
    • F-score
    • Error rate

This evaluation framework enables direct, quantitative comparison of mono-audio LALMs and those equipped with spatial adapters trained or fine-tuned using SPUR-Set.
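For the auxiliary SELD metrics, mean localization error is conventionally the angular (great-circle) distance between predicted and reference directions of arrival, averaged over matched events. The sketch below assumes that convention and uses illustrative function names; it is not the benchmark's reference implementation.

```python
import math

def angular_error_deg(az_ref: float, el_ref: float, az_est: float, el_est: float) -> float:
    """Great-circle angle (degrees) between two directions given as azimuth/elevation in degrees."""
    az_r, el_r, az_e, el_e = map(math.radians, (az_ref, el_ref, az_est, el_est))
    # Dot product of the two unit direction vectors, clamped for numerical safety.
    dot = (math.sin(el_r) * math.sin(el_e)
           + math.cos(el_r) * math.cos(el_e) * math.cos(az_r - az_e))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))

def mean_localization_error(pairs: list[tuple[tuple[float, float], tuple[float, float]]]) -> float:
    """Average angular error over (reference, estimate) pairs of (azimuth, elevation) in degrees."""
    errors = [angular_error_deg(ar, er, ae, ee) for (ar, er), (ae, ee) in pairs]
    return sum(errors) / len(errors) if errors else 0.0
```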

6. Spatial Skills and Example QA Items

SPUR-Set balances six principal spatial reasoning competencies:

  • Spatial Priority/Source Intent Inference
    • Q: “During a segment where one vocalization holds a central distant presence, another closer voice interjects briefly with amusement. Which type of vocalization provides that interjection?”
    • A: “An instance of laughter.”
  • Proximity-Based Decision Making
    • Q: “Which sound is likely to be more affected by the knocking that occurs later?”
    • A: “The water tap.”
  • Directional Conflict Resolution
    • Q: “Despite keyboard clatter and rustling, from which direction does the speech maintain the most discernible thread?”
    • A: “Back-left.”
  • Group Dynamics/Clustering
    • Q: “What relationship can you infer between the telephone ringing and the sounds that follow?”
    • A: “The ringing initiates overlapping reactions, indicating a response from other sources.”
  • Navigation/Environmental Awareness
    • Q: “Where should one turn to engage the lively atmosphere without being overwhelmed by background noise?”
    • A: “Towards the back and slightly below.”
  • Scene Reconfiguration/Mental Rotation
    • Q: “If you turn to face the speaking voice directly, where would the other opening/closing sound appear?”
    • A: “Front-right.”

Sound Event Localization can also be directly queried, e.g.: Q: “Where does the continuous background music appear to be located, and approximately how far away?” A: “Front, about 3.9 m away.”

The balanced distribution of QAs by skill allows the isolation of specific spatial capacities during fine-tuning and ablation.

7. Empirical Findings and Significance

Experimental results indicate that fine-tuning LALMs (e.g., Audio Flamingo 3, Qwen-2.5-Omni) on SPUR-Set yields notable improvements in spatial QA performance:

  • Average LLM-judge QA scores increase to ~7.3/10 (vs. ~4.3–4.8/10 for mono-audio baselines).
  • Per-skill gains of +2–3 points are observed for Spatial Priority, Proximity, Conflict Resolution, Scene Reconfiguration, and Environmental Awareness.
  • Enhanced LLM-judge metrics:
    • Spatial Consistency: ~4.9 → ~7.6
    • Reasoning Depth: ~3.6 → ~5.7
    • Relevance: ~5.8 → ~8.2
    • Error Rate: ~50% → ~22%
  • Ablation studies confirm that genuine FOA channels, rotation-aware spatial features, and a mix of real and simulated data are all essential for robust, multi-step spatial reasoning.

These findings demonstrate that SPUR-Set supplies granular spatial supervision necessary for upgrading LALMs from monaural to 3D spatial competence, while general audio-language capacities are preserved. A plausible implication is that the methodologies embodied in SPUR-Set could be leveraged in other domains where spatialized or multimodal reasoning is required.
