Speech-to-Spatial: Mapping Audio to Space

Updated 4 July 2026

Speech-to-Spatial is a research paradigm that maps auditory signals to spatial structures, integrating geometry, acoustics, and room dynamics.
It employs techniques like latent space clustering, spatial cue extraction, and multichannel filtering to improve ASR and sound source localization.
Researchers balance trade-offs between preserving spatial fidelity and achieving robust performance amid complex, reverberant acoustic environments.

Speech-to-Spatial denotes a family of research problems in which speech or speech-derived signals are mapped to explicitly spatial structure rather than treated as purely linguistic or monaural content. In current usage, that structure may be the geometry of a learned representation space, physical source direction and distance, room-acoustic parameters, spatially informed ASR embeddings, multichannel filters that preserve inter-channel cues, scene-aware synthesis controls, or object- and place-grounded actions in augmented reality and robotics (Riera et al., 2023, Sarabia et al., 2023, Shao et al., 25 Jan 2026, Kim et al., 3 Feb 2026). The unifying premise is that speech carries information not only about phonetic and semantic content, but also about where sound originates, how it propagates, how it should be rendered, and how verbal references should be grounded in space.

1. Scope and problem formulations

The term covers several distinct but related formulations. Some studies ask how speech is already organized in latent spaces; others infer spatial or room-acoustic attributes from audio; others inject spatial cues directly into recognition, enhancement, coding, or synthesis pipelines; and a broader multimodal strand grounds spoken references into embodied action or spatial-temporal motion (Riera et al., 2023, Sarabia et al., 2023, Tang et al., 2024, Liu et al., 31 Jan 2025).

Formulation	Representative output	Example papers
Representation-space organization	phone/speaker clustering, CKA structure	(Riera et al., 2023)
Audio-to-physical-space inference	localization, distance, DRR, T30	(Sarabia et al., 2023, Tang et al., 2024)
Spatially informed recognition	one-stage ASR embeddings, target-aware features	(Shao et al., 25 Jan 2026, Shao, 2023)
Spatial cue preservation and reconstruction	binaural or multichannel enhanced output, beamforming, coding	(Han et al., 2020, Togami et al., 2024, Xu et al., 2023)
Scene-aware generation and grounding	immersive TTS, spatial VC, AR referent grounding, navigation	(Zhang et al., 2024, Seki et al., 2024, Kim et al., 3 Feb 2026, Taniguchi et al., 2020)

A common misconception is that Speech-to-Spatial is identical to source localization. The literature is materially broader. Spatial LibriSpeech, for example, treats source localization, source distance estimation, third-octave-band DRR estimation, and third-octave-band T30 estimation as coequal speech-to-spatial tasks, and explicitly includes labels for source position, speaking direction, room acoustics and geometry (Sarabia et al., 2023). Likewise, spatial ASR work treats spatial information not as an end in itself but as an internal representation that improves transcription or target-speaker selection (Shao et al., 25 Jan 2026).

2. Geometry of speech representations

A representation-centric formulation appears in "Phone and speaker spatial organization in self-supervised speech representations" (Riera et al., 2023). The paper asks not merely whether phone or speaker information is recoverable from a representation, but how that information is arranged in the embedding space itself. It analyzes Mockingjay, DeCoAR2, HuBERT Base, WavLM Base+, wav2vec 2.0 Base, data2vec, and classical features including MFCCs, mel-spectrograms, and Kaldi filter banks, using L2Arctic and TIMIT. Each phone instance is treated as a sample, and frame-level representations are averaged over the phone.

Two model-free tools are central. The first is linear CKA, defined as

$\mathrm{CKA}(X,Y)=\mathrm{corr}\left(\mathrm{vec}(XX^T),\mathrm{vec}(YY^T)\right),$

which compares pairwise self-similarity structure across spaces with different dimensionalities. The second is a multivariate Wilcoxon–Mann–Whitney-style statistic summarized by

$\mathrm{AvgU} = \frac{1}{N}\sum_x U_x,$

where larger values indicate that same-class points are closer than different-class points. In this formulation, high phone $\mathrm{AvgU}$ means phone tokens cluster well, and high speaker $\mathrm{AvgU}$ means speaker tokens cluster well (Riera et al., 2023).
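Both tools can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's code: `linear_cka` follows the correlation-of-Gram-matrices definition above after column-centering, and `avg_u` assumes that $U_x$ is the fraction of (same-class, different-class) neighbour pairs for which the same-class point is closer to $x$, which is one natural reading of a Mann–Whitney-style statistic. The function names and the tie handling are assumptions.

```python
import numpy as np

def linear_cka(X, Y):
    """Correlation between vectorized Gram matrices of two spaces.

    X: (n, d1) and Y: (n, d2) hold the same n samples in two representations.
    Columns are centered first, so the Gram matrices are doubly centered.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    Yc = Y - Y.mean(axis=0, keepdims=True)
    kx = (Xc @ Xc.T).ravel()
    ky = (Yc @ Yc.T).ravel()
    return float(np.corrcoef(kx, ky)[0, 1])

def avg_u(X, labels):
    """Mean over points of P(same-class distance < different-class distance).

    Assumed per-point statistic; ties count one half, as in Mann-Whitney U.
    """
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        diff = labels != labels[i]
        if not same.any() or not diff.any():
            continue                         # statistic undefined for this point
        d_same = D[i, same][:, None]
        d_diff = D[i, diff][None, :]
        u = (d_same < d_diff).mean() + 0.5 * (d_same == d_diff).mean()
        scores.append(u)
    return float(np.mean(scores))
```

Under this reading, a value near 1 means each point's same-class neighbours are almost always closer than its different-class neighbours, and a value near 0.5 means the labels are unrelated to the geometry.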

The findings establish a genuinely spatial view of self-supervised speech models. F0 and spectral centroid, which are more related to speaker identity, are preserved more strongly in early layers for WavLM, HuBERT, wav2vec 2.0, and data2vec, then gradually lost in deeper layers; DeCoAR2 and Mockingjay instead retain similar or higher CKA in later layers. For F1 and F2, which are more tied to vowel or phone identity, CKA is generally higher, with HuBERT, WavLM, wav2vec 2.0, and DeCoAR2 peaking in layers 2–4. Phone $\mathrm{AvgU}$ is generally higher than speaker $\mathrm{AvgU}$, and the paper identifies a trade-off: if a representation clusters speakers, phones tend to be spread across those speaker clusters; if it clusters phones, samples from different speakers overlap more (Riera et al., 2023).

This line of work shifts the question from content availability to representational topology. A plausible implication is that downstream success depends not only on whether information is encoded, but on whether the geometry of the raw representation already aligns with the target task. The reported correlations with SUPERB outcomes support that interpretation: phone $\mathrm{AvgU}$ versus phone recognition gives $r = 0.84$, $p = 0.018$, and speaker $\mathrm{AvgU}$ versus speaker identification gives $r = 0.75$, $p = 0.05$ (Riera et al., 2023).

3. Spatial features, datasets, and physical attribute inference

Speech-to-Spatial research depends heavily on feature design and data resources that expose source and room attributes. "Spatial LibriSpeech: An Augmented Dataset for Spatial Audio Learning" (Sarabia et al., 2023) is a central benchmark in this regard. It provides over 650 hours of spatialized speech built from LibriSpeech, with 19-channel microphone-array audio, first-order ambisonics, optional distractor noise, and labels for azimuth, elevation, distance, speaking direction, voice directivity identifier, room volume, surface area, floor area, C50, DRR, EDT, T20, and T30. It is generated from 200k+ simulated acoustic conditions across 8k+ synthetic rooms.

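The benchmark formalizes several speech-to-spatial targets. For 3D source localization, the error is the angular distance between the true source direction $\mathbf{u}$ and the predicted direction $\hat{\mathbf{u}}$,

$d(\mathbf{u},\hat{\mathbf{u}}) = \arccos\left(\frac{\mathbf{u}\cdot\hat{\mathbf{u}}}{\lVert\mathbf{u}\rVert\,\lVert\hat{\mathbf{u}}\rVert}\right).$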
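The angular distance between a true and a predicted source direction is the angle between the two corresponding unit vectors. The sketch below assumes directions are given as azimuth and elevation in degrees, with azimuth measured in the horizontal plane and elevation from it; the benchmark's own axis convention may differ, and the function names are illustrative.

```python
import numpy as np

def unit_vector(azimuth_deg, elevation_deg):
    """Convert azimuth/elevation in degrees to Cartesian unit vectors."""
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    return np.stack(
        [np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)], axis=-1
    )

def angular_distance_deg(az_true, el_true, az_pred, el_pred):
    """Angle in degrees between true and predicted directions."""
    u = unit_vector(az_true, el_true)
    v = unit_vector(az_pred, el_pred)
    cos_angle = np.clip(np.sum(u * v, axis=-1), -1.0, 1.0)  # guard rounding
    return np.rad2deg(np.arccos(cos_angle))

def median_angular_error_deg(az_true, el_true, az_pred, el_pred):
    """Median of the per-example angular distances over a test set."""
    return float(np.median(angular_distance_deg(az_true, el_true, az_pred, el_pred)))
```

Because the metric is computed on the sphere, it does not suffer from azimuth wrap-around at ±180° or from the shrinking of azimuth differences near the poles.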

T30 and DRR are each represented as 20-dimensional third-octave-band vectors, and the paper reports median absolute error and Pearson correlation across frequency bins. Using 4-channel FOA converted into active and reactive components, with two parallel branches of 3D convolutional layers followed by a 3-layer MLP, the reported best test-set results are a median absolute error of 6.60° for 3D source localization, 0.43 m for distance estimation, 2.74 dB for DRR estimation, and 90.66 ms for T30 estimation (Sarabia et al., 2023).
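The active and reactive components of first-order ambisonics are commonly obtained as the real and imaginary parts of the product between the conjugated omnidirectional channel and the three dipole channels. The following is a minimal sketch of that conversion on STFT-domain arrays. The channel names `W`, `X`, `Y`, `Z`, the absence of any normalization, and the sign convention (vector pointing toward the source) are assumptions here and not necessarily what the benchmark model uses.

```python
import numpy as np

def foa_intensity(W, X, Y, Z):
    """Split FOA STFTs into active and reactive intensity components.

    W, X, Y, Z: complex arrays of shape (F, T).
    Returns two real arrays of shape (3, F, T).
    """
    dipoles = np.stack([X, Y, Z])
    cross = np.conj(W)[None] * dipoles
    return cross.real, cross.imag

def doa_from_active(active):
    """Pool the active intensity over all bins and return (azimuth, elevation) in degrees."""
    v = active.sum(axis=(1, 2))
    azimuth = np.degrees(np.arctan2(v[1], v[0]))
    elevation = np.degrees(np.arctan2(v[2], np.hypot(v[0], v[1])))
    return float(azimuth), float(elevation)
```

For a single plane wave the reactive part vanishes and the pooled active vector points along the encoded direction; reverberation and competing sources increase the reactive share, which is one reason to feed both components to a network.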

Spatial features in later work become increasingly target-aware. "Challenges and Insights: Exploring 3D Spatial Features and Complex Networks on the MISP Dataset" (Shao, 2023) defines a spatial feature as the similarity between target-dependent phase difference (TPD) and observed inter-channel phase difference (IPD),

$\mathrm{SF}(t,f) = \sum_{(i,j)} \cos\left(\mathrm{TPD}^{(i,j)}(\mathbf{p},f) - \mathrm{IPD}^{(i,j)}(t,f)\right),$

with the sum running over microphone pairs $(i,j)$ and $\mathbf{p} = (\theta,\phi,d)$ extending the target description to azimuth, elevation, and distance. The paper’s key negative result is equally important: on MISP, reverberation badly distorts IPD, so the spatial feature can become unreliable even when the geometric model is more expressive than azimuth-only alternatives (Shao, 2023).
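A cosine-similarity feature of this kind can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes free-field propagation at 343 m/s, a target given as a Cartesian position (which encodes azimuth, elevation, and distance jointly), known microphone coordinates, and averaging over pairs. All names are hypothetical.

```python
import numpy as np
from itertools import combinations

SPEED_OF_SOUND = 343.0  # m/s, assumed

def target_phase_difference(target_xyz, mic_xyz, freqs_hz):
    """Phase difference each mic pair would see for a source at target_xyz.

    Returns an array of shape (P, F) and the list of P index pairs.
    """
    delays = np.linalg.norm(mic_xyz - target_xyz, axis=1) / SPEED_OF_SOUND
    pairs = list(combinations(range(len(mic_xyz)), 2))
    tpd = np.array(
        [2.0 * np.pi * freqs_hz * (delays[j] - delays[i]) for i, j in pairs]
    )
    return tpd, pairs

def spatial_feature(stft, target_xyz, mic_xyz, freqs_hz):
    """Mean over mic pairs of cos(IPD - TPD); stft has shape (M, F, T)."""
    tpd, pairs = target_phase_difference(target_xyz, mic_xyz, freqs_hz)
    feature = np.zeros(stft.shape[1:])
    for p, (i, j) in enumerate(pairs):
        ipd = np.angle(stft[i]) - np.angle(stft[j])   # observed phase difference
        feature += np.cos(ipd - tpd[p][:, None])
    return feature / len(pairs)
```

In an anechoic simulation the feature equals 1 in every bin dominated by the target and drops for other positions. Reflections add phase terms that the free-field model does not predict, which is the failure mode the MISP study reports.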

A related but operationally different feature appears in "SpatialEmb" (Shao et al., 25 Jan 2026). There the spatial cue is derived from a target speaker solo segment used as a proxy for the target speaker’s room impulse response or spatial signature. The paper defines an RIR-convolved phase from that segment, and the spatial feature averages pairwise cosine phase differences across microphones. The paper states that this makes the spatial feature behave like a T-F mask indicating target-speaker dominance (Shao et al., 25 Jan 2026). This suggests a recurring design pattern in the field: successful features usually encode not generic geometry alone, but a similarity between observed multichannel structure and a target-conditioned spatial hypothesis.

4. Recognition and reasoning with spatial cues

One major direction uses spatial information directly inside recognition or general-purpose reasoning models rather than relegating it to front-end separation. "SpatialEmb: Extract and Encode Spatial Information for 1-Stage Multi-channel Multi-speaker ASR on Arbitrary Microphone Arrays" (Shao et al., 25 Jan 2026) makes this point explicitly. Instead of the usual pipeline of separation or beamforming followed by single-channel ASR, SpatialEmb introduces a lightweight embedding module before a Conformer-RNNT encoder. The overall pipeline is multichannel overlapped speech plus target solo segment, spectral and spatial feature extraction, feature fusion, SpatialEmb, standard Conformer, and pruned RNNT loss. No explicit separation output is supervised.

The paper studies both fixed and arbitrary microphone topologies. For fixed arrays it concatenates spectral and spatial features; for arbitrary arrays it compares spectral squeezing with spatial expansion and favors the latter. The preferred arbitrary-topology mechanism is DAC, a parameter-free simplification of TAC.

On AliMeeting, the best model trained with 105 hours Train-Ali-far achieves 17.04% and 20.32% character error rates on the Eval and Test sets, establishing a new state-of-the-art result with the same training data (Shao et al., 25 Jan 2026).

Large-model reasoning over spatial audio extends this logic. "Can LLMs Understand Spatial Audio?" (Tang et al., 2024) combines a Whisper-large-v3 encoder, intensity vectors (IVs) extracted from FOA audio, a window-level Q-Former, Vicuna-7B-v1.5, and LoRA. Tasks are posed as question-answering: localization asks for azimuth or elevation, far-field speech recognition asks for transcription, and localisation-informed speech extraction asks for transcription of the speech coming from a specified direction. With IVs fused before the Q-Former, the model lowers localization MAE on Spatial LibriSpeech relative to the prior benchmark, while spatial cues also improve far-field recognition from 9.0 WER to 7.6 and make direction-selective extraction feasible (Tang et al., 2024).

The MISP results show the limits of this paradigm in highly reverberant real data. Using a 12-layer Conformer encoder with a pruned RNN-T backend and 40-bin log Mel filterbanks, the 3D spatial feature improves far-field audio plus video from 73.25% CER to 64.37% CER, but the paper emphasizes that the gain is not as large as expected because the assumed match between theoretical phase delay and observed IPD breaks down under reflection-dominated conditions (Shao, 2023). A recurring lesson is therefore that one-stage spatially informed recognition is viable, but only to the extent that the feature remains physically stable under real propagation.

5. Enhancement, separation, and coding with spatial cue preservation

A second major direction seeks not merely to infer space, but to preserve or manipulate it during enhancement, separation, or compression. In multichannel extraction, "Enhanced Neural Beamformer with Spatial Information for Target Speech Extraction" (Guo et al., 2023) combines a UNet-TCN pre-separation module with a neural beamforming module that uses covariance matrices, IPD, angle features, and multi-head cross-attention. The angle feature is target-aware, and the final input stacks magnitude and the spatial features along the channel dimension. The paper argues that beamforming is inherently a spatial filtering problem and reports consistent gains from both UNet-TCN and spatial cross-attention over IRM-MVDR, GRNNBF, and SARNN (Guo et al., 2023).

Hybrid classical-neural systems make a related point from the opposite direction. "Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial Clustering Masks" (Ni et al., 2020) uses MESSL for unsupervised spatial clustering, then trains bidirectional LSTMs to clean the masks before MVDR covariance estimation. On CHiME-3, the combined MESSL+LSTM system achieves PESQ 2.80 and WER 10.2, compared with PESQ 2.19 and WER 13.6 for BeamformIt, illustrating the logic summarized in the paper as better masks leading to better covariances and hence a better MVDR filter (Ni et al., 2020).
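The chain from masks to covariances to the MVDR filter can be sketched as follows. The sketch assumes the time-frequency masks are already available (the MESSL clustering and LSTM cleaning stages are not reproduced) and uses the common reference-channel MVDR form $\mathbf{w} = \Phi_n^{-1}\Phi_s\mathbf{u} / \mathrm{tr}(\Phi_n^{-1}\Phi_s)$, which may differ in detail from the paper's filter. The function names and the diagonal loading are assumptions.

```python
import numpy as np

def masked_covariance(Y, mask, eps=1e-10):
    """Mask-weighted spatial covariance per frequency.

    Y: complex STFT of shape (M, F, T); mask: real array (F, T) in [0, 1].
    Returns an array of shape (F, M, M).
    """
    num = np.einsum("ft,mft,nft->fmn", mask, Y, Y.conj())
    return num / (mask.sum(axis=1)[:, None, None] + eps)

def mvdr_weights(phi_s, phi_n, ref=0, loading=1e-6):
    """Reference-channel MVDR weights of shape (F, M)."""
    M = phi_n.shape[-1]
    scale = np.trace(phi_n, axis1=-2, axis2=-1).real[:, None, None] / M
    phi_n = phi_n + loading * scale * np.eye(M)      # diagonal loading
    ratio = np.linalg.solve(phi_n, phi_s)            # Phi_n^{-1} Phi_s
    trace = np.trace(ratio, axis1=-2, axis2=-1)
    return ratio[..., ref] / trace[:, None]

def apply_beamformer(w, Y):
    """Filter-and-sum output w^H y for every bin; returns shape (F, T)."""
    return np.einsum("fm,mft->ft", w.conj(), Y)
```

A speech mask that leaks noise biases `phi_s`, and a noise mask that leaks speech biases `phi_n` and causes target cancellation, so any improvement in mask quality propagates directly into the filter.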

End-to-end cue preservation is particularly visible in binaural and stereo systems. "Real-time binaural speech separation with preserved spatial cues" (Han et al., 2020) extends TasNet to a true MIMO binaural separator trained against HRIR-rendered targets and evaluated on interaural time difference (ITD) and interaural level difference (ILD) errors in addition to separation quality. Its strongest model, parallel encoder plus mask-and-sum, reports an SNRi of 15.6 dB on anechoic spatialized WSJ0-2mix while remaining causal with minimum latency below 5 ms (Han et al., 2020). "Real-time Stereo Speech Enhancement with Spatial-Cue Preservation based on Dual-Path Structure" (Togami et al., 2024) uses two source-specific paths, adaptive DSBF steering-vector updates, and a pretrained monaural PercepNet with source-specific common-band gain, so that each source is enhanced in a source-specific spatial context and then remapped into stereo before summation (Togami et al., 2024).
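Whether a binaural or stereo output has kept its spatial cues is commonly checked by comparing interaural time and level differences between the processed signal and the reference. The sketch below estimates both broadband cues for one signal pair. It assumes a simple cross-correlation peak for ITD and an energy ratio for ILD; published evaluations may use band-wise or more robust estimators, and the function name and sign conventions are choices made here.

```python
import numpy as np

def itd_ild(left, right, fs, max_lag_s=1e-3):
    """Broadband ITD (seconds) and ILD (dB) of a two-channel signal.

    ITD > 0 means the right channel lags the left channel.
    ILD > 0 means the left channel carries more energy.
    """
    max_lag = int(round(max_lag_s * fs))
    xcorr = np.correlate(right, left, mode="full")
    lags = np.arange(-(len(left) - 1), len(right))
    keep = np.abs(lags) <= max_lag               # restrict to plausible delays
    itd = lags[keep][np.argmax(xcorr[keep])] / fs
    ild = 10.0 * np.log10(np.sum(left ** 2) / np.sum(right ** 2))
    return float(itd), float(ild)
```

Cue-preservation errors then follow as the absolute differences between the values measured on the separated output and those measured on the clean spatialized target.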

Compression introduces the same constraint. "SpatialCodec: Neural Spatial Speech Coding" (Xu et al., 2023) separates the problem into a reference-channel neural sub-band codec and a spatial branch that codes relative inter-channel structure through spatial covariance matrices and complex ratio filters. The paper’s beamspace-based spatial similarity metric evaluates whether reconstructed multichannel audio preserves the directional energy pattern, and the reported result is that SpatialCodec at 12 kbps significantly outperforms much higher bitrate baselines, including 96 kbps OPUS12, in spatial metrics (Xu et al., 2023).

Complex-valued spatial filtering offers yet another formulation. "Complex-valued Spatial Autoencoders for Multichannel Speech Enhancement" (Halimeh et al., 2021) estimates a complex-valued mask $M_m(t,f)$ for each microphone signal $X_m(t,f)$ and reconstructs the enhanced signal through

$\hat{S}(t,f) = \sum_{m} M_m(t,f)\,X_m(t,f).$

Because the masks manipulate both amplitude and phase, the method can align phases across microphones and implement a learned spatio-spectral filter-and-sum structure. On synthetic data with one desired speech source, one noise source, and one music source, COSPA reports 7.5 dB $\Delta$SINR, 5.3 dB SDR, 0.23 $\Delta$PESQ, and 0.09 $\Delta$STOI, with a learned beampattern that the paper characterizes as physically plausible spatial selectivity (Halimeh et al., 2021).
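The reason complex masks can act as a spatial filter is visible in a toy example. The sketch below applies per-microphone masks and sums over channels; in the demonstration the masks are set by hand to the conjugate of a known per-channel phase shift, whereas in the paper they are predicted by a network. The array shapes and names are assumptions.

```python
import numpy as np

def filter_and_sum(masks, X):
    """Apply one complex mask per microphone and sum over microphones.

    masks, X: complex arrays of shape (M, F, T). Returns shape (F, T).
    """
    return np.sum(masks * X, axis=0)

def phase_aligning_masks(phase_shifts, n_frames):
    """Masks that undo known per-channel phase shifts of shape (M, F)."""
    M = phase_shifts.shape[0]
    masks = np.exp(1j * phase_shifts) / M
    return np.repeat(masks[:, :, None], n_frames, axis=2)
```

With `X[m] = s * exp(-1j * phase_shifts[m])`, the aligned sum returns `s` exactly, while real-valued masks of the same magnitude leave the channels out of phase and attenuate the target. That phase degree of freedom is what turns masking into beamforming.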

6. Synthesis, communication, and embodied grounding

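On the generative side, Speech-to-Spatial includes systems that preserve or impose spatial context during synthesis. "Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals" (Seki et al., 2024) defines an ideal output in which only the target voice is converted while the same room transfer functions and non-target signals are preserved. Writing the multichannel observation as $\mathbf{x}(t,f) = \mathbf{a}(f)\,s(t,f) + \mathbf{n}(t,f)$, with $s$ the target voice, $\mathbf{a}$ its room transfer functions, and $\mathbf{n}$ the non-target signals, the ideal output is

$\mathbf{y}(t,f) = \mathbf{a}(f)\,\mathrm{VC}(s)(t,f) + \mathbf{n}(t,f).$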

Its baseline pipeline of GC-IVA-based separation, DDSP-SVC, and spatial remixing identifies a central trade-off: inverse remixing better preserves spatial information, while steering-vector remixing better preserves naturalness. The paper treats this as the fundamental difficulty of spatial VC (Seki et al., 2024).

Scene-aware speech generation appears in "I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception" (Zhang et al., 2024). According to the abstract, the model introduces a scene prompt encoder that integrates visual scene prompts directly into the synthesis pipeline and a reverberation classification and refinement technique that adjusts the synthesized mel-spectrogram so that the reverberation condition matches the scene. The stated goal is high-quality scene and spatial matching without compromising speech naturalness, particularly for gaming and virtual reality (Zhang et al., 2024).

A closely related communication problem is explicit spatial scene reconstruction. "Directional MCLP Analysis and Reconstruction for Spatial Speech Communication" (Chetupalli et al., 2021) defines spatial speech communication as reconstruction of spoken signal along with the relative speaker position in the enclosure. At the transmitter, directional MCLP separates direct and diffuse components, SRP-PHAT estimates DoA at distributed nodes, the source position is obtained from intersecting DoA directions, and the node nearest the source is selected for transmission. At the receiver, VBAP renders the direct component over a four-loudspeaker setup while the diffuse component is decorrelated and reproduced equally from all loudspeakers; binaural playback is obtained by HRIR convolution of the loudspeaker signals (Chetupalli et al., 2021).
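The receiver-side panning step can be illustrated with two-dimensional VBAP, which expresses the source direction as a non-negative combination of the two adjacent loudspeaker directions and normalizes the gains to unit power. The square layout at ±45° and ±135° is an assumption made for this sketch; the text above only states that four loudspeakers are used.

```python
import numpy as np

# Assumed loudspeaker azimuths in degrees (square layout, hypothetical).
SPEAKER_AZ_DEG = np.array([45.0, 135.0, -135.0, -45.0])

def vbap_2d(source_az_deg, speaker_az_deg=SPEAKER_AZ_DEG):
    """Return one gain per loudspeaker for a source at the given azimuth."""
    spk = np.deg2rad(speaker_az_deg)
    L = np.stack([np.cos(spk), np.sin(spk)], axis=1)          # (N, 2) unit vectors
    src = np.deg2rad(source_az_deg)
    p = np.array([np.cos(src), np.sin(src)])
    gains = np.zeros(len(spk))
    order = np.argsort(np.mod(speaker_az_deg, 360.0))         # walk around the circle
    for a, b in zip(order, np.roll(order, -1)):
        g = np.linalg.solve(L[[a, b]].T, p)                   # p = g_a*l_a + g_b*l_b
        if np.all(g >= -1e-9):                                # source lies inside this pair
            g = np.clip(g, 0.0, None)
            gains[[a, b]] = g / np.linalg.norm(g)             # unit-power normalization
            return gains
    raise ValueError("no loudspeaker pair encloses the source direction")
```

Only the direct component would be panned this way; the diffuse component is, as described above, decorrelated and sent equally to all loudspeakers.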

The concept also extends beyond acoustic rendering into embodied interfaces. "From Speech-to-Spatial: Grounding Utterances on A Live Shared View with Augmented Reality" (Kim et al., 3 Feb 2026) treats speech-only remote-assistance utterances as referent disambiguation problems over an object-centric relational graph. The system characterizes recurring patterns—Direct Attribute, Relational, Remembrance, and Chained—parses utterances with Whisper and GPT-4.1, filters visible objects, ranks the top five candidate nodes, and renders a persistent in-situ AR arrow plus text guidance. In a user study with 18 participants and 1,296 total trials, the Summary condition reduces completion time relative to Audio for both Locate and Move tasks, improves Move accuracy, and lowers cognitive workload (Kim et al., 3 Feb 2026).

A broader robotic analogue appears in "Spatial Concept-Based Navigation with Human Speech Instructions via Probabilistic Inference on Bayesian Generative Model" (Taniguchi et al., 2020). There, place names are grounded not in a single coordinate but in learned spatial concepts, and planning is formulated as maximizing the posterior over trajectories under speech instruction:

$\hat{x}_{1:T} = \arg\max_{x_{1:T}}\, p(x_{1:T} \mid S, \Theta),$

where $x_{1:T}$ is the trajectory, $S$ the speech instruction, and $\Theta$ the learned model parameters.

This suggests that, in embodied settings, Speech-to-Spatial may refer not only to audio spatialization but also to speech-conditioned inference over referents, places, and action trajectories (Taniguchi et al., 2020).

7. Recurrent challenges and research directions

Several technical tensions recur across the literature. First, spatial information is multifaceted. It includes speaker identity structure in latent spaces, source direction, distance, room acoustics, and interactive referential structure, so methods optimized for one notion of “space” do not automatically generalize to another (Riera et al., 2023, Sarabia et al., 2023).

Second, spatial robustness is strongly limited by propagation effects. The MISP study shows that reverberation can corrupt IPD enough to break the match between target-dependent phase models and observations, while Spatial VC reports that both inverse and steering-based remixing degrade as reverberation increases (Shao, 2023, Seki et al., 2024). This suggests that robust Speech-to-Spatial systems require representations whose physical assumptions remain stable under reflection, scattering, and topology mismatch.

Third, preserving spatial fidelity is often in tension with other desiderata. In representation learning, phone clustering and speaker clustering trade off against one another (Riera et al., 2023). In spatial VC, naturalness and spatial fidelity trade off across remixing strategies (Seki et al., 2024). In AR grounding, concise summaries can improve efficiency while potentially omitting detail, whereas full transcriptions may better preserve nuance (Kim et al., 3 Feb 2026).

Fourth, direct end-to-end integration is increasingly preferred over long preprocessing pipelines, but it is not uniformly superior. SpatialEmb argues that separation followed by single-channel ASR is inefficient and suffers error propagation, whereas the MISP results show that stronger inductive bias can still outperform more complex raw complex-input alternatives in reverberant settings (Shao et al., 25 Jan 2026, Shao, 2023). A plausible implication is that future progress will depend less on removing structure and more on choosing the right physically informed structure.

Finally, the field is moving toward systems that are simultaneously spatially aware, task-aware, and modality-aware. LLM-based spatial audio reasoning, scene-aware immersive TTS, spatially preserved enhancement, AR referent grounding, and speech-conditioned navigation all point to the same broader agenda: speech should be modeled as a signal embedded in physical, geometric, and interactive space rather than as isolated content (Tang et al., 2024, Zhang et al., 2024, Kim et al., 3 Feb 2026).