
SpeechMapper: Dual Speech Processing

Updated 29 January 2026
  • SpeechMapper is a dual-framework system that combines spatial speech recognition mapping with speech-to-LLM embedding alignment for high-fidelity speech processing.
  • The spatial mapper uses simulation-based tools—including acoustic rendering, hearing device modeling, and listener simulation—to generate detailed spatial maps of speech intelligibility.
  • The embedding mapper employs a two-stage training process with a frozen speech encoder and Transformer-based projector blocks to achieve scalable, efficient speech-to-text integration.

SpeechMapper encompasses two distinct frameworks at the intersection of speech processing, auditory neuroscience, and machine learning. The first, introduced by Kollmeier et al. (2021), provides interactive spatial maps of speech recognition performance in complex acoustic scenes. The second, established by Wang et al. (2026), proposes an efficient method for speech-to-text embedding alignment for LLMs, aimed at computationally scalable and robust speech-LLM integration. This article presents an integrated, technical account of both SpeechMapper lineages, covering their core architectures, mathematical foundations, practical workflows, empirical findings, and current limitations.

1. Modular System Architectures

Spatial Speech Recognition Mapper

SpeechMapper for spatial speech recognition is a simulation-based toolchain for modeling and visualizing how environmental acoustics, signal processing (e.g., hearing aids), and listener attributes jointly shape speech intelligibility. Its modular pipeline comprises:

  • Acoustic Rendering Model (TASCAR): Simulates a virtual environment (e.g., living room with spatialized TV, appliances) and outputs binaural impulse responses for discrete source-receiver-angle tuples. It generates (i) binaural noise recordings for maskers and (ii) binaural head-related impulse responses (HRIR) to convolve clean speech.
  • Hearing Device Model (openMHA): Processes simulated speech plus noise, incorporating gain prescription (e.g., NAL-NL2) and multiband dynamic compression, thereby altering signal-to-noise ratio (SNR) and spectral cues.
  • Listener Model (FADE with KAIN): Extracts log-Mel representations with simulated hearing-loss thresholds and stochastic uncertainty, computes binaural features, and predicts word-correct rates via a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) automatic speech recognition (ASR) back end.

This workflow produces dense grids of spatially resolved speech reception thresholds (SRT50, the SNR at which listeners achieve 50% word-correct recognition), supporting real-time introspection and interactive visualization (Schädler, 2021).
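The HRIR convolution step in the rendering stage can be sketched as follows. This is an illustrative helper under assumed array shapes, not part of the TASCAR API:

```python
import numpy as np

def render_binaural(speech, hrir_left, hrir_right):
    """Render a mono speech signal at a given source position by
    convolving it with the binaural HRIR pair for that position:
    s_c(t) = h_c(t) * s(t), for ears c in {L, R}."""
    return (np.convolve(speech, hrir_left),
            np.convolve(speech, hrir_right))

# Toy check: a unit-impulse HRIR passes the signal through unchanged,
# while a scaled impulse attenuates one ear (an interaural level cue).
speech = np.array([1.0, -0.5, 0.25])
left, right = render_binaural(speech, np.array([1.0]), np.array([0.5]))
```

In the actual toolchain, the HRIRs come from the TASCAR renderer for each discrete source-receiver-angle tuple; here they are stand-in impulses.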

Speech-to-LLM Embedding Mapper

The SpeechMapper architecture for LLM integration transforms spoken input into LLM-compatible embeddings via two-stage training:

  • Frozen SFM Encoder: Input speech is embedded by a frozen speech foundation model (SFM), SeamlessM4T-v2-large, yielding frame-level feature sequences.
  • Projector Blocks: Stacked, two-stage blocks (each comprising 1D convolutional down-sampling, six Transformer encoder layers, and an FC expansion) produce high-dimensional vectors matching the LLM embedding space (4096 dimensions).
  • Embedding Injection: During inference, SpeechMapper embeddings are concatenated with standard LLM token embeddings and directly injected into frozen LLMs such as Llama-3.1-8B-Instruct, enabling seamless speech-to-text mapping (Mohapatra et al., 28 Jan 2026).
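The projector's shape bookkeeping can be sketched at a high level. This NumPy toy replaces the six Transformer encoder layers with an identity placeholder, and the specific dimensions (1024-d input features, stride-2 down-sampling, kernel width 2) are illustrative assumptions, not the published configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_downsample(x, kernel, stride=2):
    """Strided 1D convolution over time: (T, d) -> (T', d),
    reducing the frame rate before the Transformer layers."""
    T, _ = x.shape
    k = len(kernel)
    out = [sum(kernel[j] * x[t + j] for j in range(k))
           for t in range(0, T - k + 1, stride)]
    return np.stack(out)

def projector_block(x, d_llm=4096):
    """One SpeechMapper-style projector block (shape sketch only):
    conv down-sampling, a stand-in for the six Transformer encoder
    layers (identity here), then an FC expansion to the LLM width."""
    x = conv1d_downsample(x, kernel=np.array([0.5, 0.5]))
    x = x  # placeholder for the 6 Transformer encoder layers
    W = rng.standard_normal((x.shape[1], d_llm)) / np.sqrt(x.shape[1])
    return x @ W  # (T', 4096) -- matches the LLM embedding space

frames = rng.standard_normal((100, 1024))  # frame-level SFM features
out = projector_block(frames)
```

The key invariant is that the output time axis is shorter than the input (cheaper LLM prefixes) while the feature axis is expanded to the 4096-d Llama embedding width.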

2. Mathematical and Computational Foundations

Virtual Acoustic Simulation

  • Convolutional Rendering: Clean speech $s(t)$ is rendered binaurally via convolution with HRIRs: $s_c(t) = h_c(t) * s(t)$ for channels $c \in \{L, R\}$.
  • Ear-Channel SNR: Defined as $SNR_c = 10\log_{10}\left(\int |s_c|^2 / \int |n_c|^2\right)$, with better-ear listening using $SNR_{BE} = \max\{SNR_L, SNR_R\}$.
  • Hearing Device Transfer Function: For band $i$, $L_{out,i} = \alpha_i L_{in} + \beta_i$, with gains set according to prescribed targets $G_i = f_{NAL-NL2}(\theta_i)$.
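These definitions translate directly into code. The following sketch assumes discrete-time signals and dB levels; the function names are hypothetical helpers, not part of openMHA or TASCAR:

```python
import numpy as np

def ear_snr_db(s, n):
    """Ear-channel SNR: 10*log10( sum|s_c|^2 / sum|n_c|^2 ), the
    discrete counterpart of the integral definition."""
    return 10.0 * np.log10(np.sum(s**2) / np.sum(n**2))

def better_ear_snr_db(sL, nL, sR, nR):
    """Better-ear listening: SNR_BE = max(SNR_L, SNR_R)."""
    return max(ear_snr_db(sL, nL), ear_snr_db(sR, nR))

def band_output_level(L_in, alpha, beta):
    """Hearing-device input/output rule per band i (levels in dB):
    L_out,i = alpha_i * L_in + beta_i."""
    return alpha * L_in + beta

# Toy check: a left ear with 10x the noise-relative signal energy
# sits at +10 dB; if the right ear sits at 0 dB, the better ear wins.
s = np.ones(100)
snr_be = better_ear_snr_db(s, s / np.sqrt(10.0), s, s)
```

A compression ratio below 1:1 corresponds to `alpha < 1`, which is how multiband dynamic compression reshapes the per-band SNR.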

ASR-Derived Intelligibility Prediction

  • Feature Modification: For Mel-band $k$, frame $n$: $X'_c(k,n) = \max\{X_c(k,n), T_k\} + \epsilon(k,n)$, with $\epsilon \sim \mathcal{N}(0, \sigma_k^2)$, modeling the audibility floor and decision uncertainty of simulated hearing loss.
  • Psychometric Function: Word-correct rate is modeled as $P(L) = 1/[1 + \exp(-(L - L_{50})/s)]$; $L_{50}$ is interpolated such that $P(L_{50}) = 0.5$.
  • Principal Metrics: SRT50 is the key output, mapped onto spatial grids for comprehensive visualization.
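A minimal sketch of the psychometric function and the SRT50 interpolation, assuming word-correct rates measured on a monotone SNR grid (function names are illustrative):

```python
import numpy as np

def word_correct(snr_db, srt50, slope):
    """Logistic psychometric function:
    P(L) = 1 / (1 + exp(-(L - L50)/s)), with P(L50) = 0.5."""
    return 1.0 / (1.0 + np.exp(-(snr_db - srt50) / slope))

def interpolate_srt50(snrs, scores):
    """Recover L50 by linear interpolation of word-correct rates
    onto the 50% point, as done when filling the SRT50 grids.
    Assumes `scores` increases monotonically with `snrs`."""
    return float(np.interp(0.5, scores, snrs))

# Toy round trip: simulate scores from a known SRT50 of -2 dB and
# recover it from the sampled psychometric curve.
snrs = np.linspace(-10, 10, 41)
scores = word_correct(snrs, srt50=-2.0, slope=1.5)
```

In the full pipeline the scores come from the GMM-HMM ASR back end rather than a closed-form curve, but the 50%-crossing logic is the same.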

Speech-to-LLM Embedding Matching

  • Pretraining Loss (Stage 1): Combines mean-squared error (MSE) over target word embeddings and pad tokens,

$L_{MSE} = \alpha \cdot MSE_{word} + (10-\alpha) \cdot MSE_{pad},$

with a subtracted cosine-similarity term,

$L_{stage1} = L_{MSE} - \gamma L_{cosine},$

for $\alpha=5$, $\gamma=100$.

  • Instruction Tuning Loss (Stage 2): The task-agnostic variant uses $L_{CE}$ alone or a hybrid $L_{CE} + \sigma L_{MSE}$ with $\sigma=0.9$ (Eq. 3); task-specific IT uses $L_{CE}$ only.
  • Embedding Injection: Output vectors $f(x) = Wx + b$, with $W \in \mathbb{R}^{4096 \times 4096}$, are appended directly to the LLM input sequence.
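The Stage-1 objective can be sketched as follows. Batching, padding masks, and reduction details are assumptions not specified in the text; only the weighting structure is taken from the equations above:

```python
import numpy as np

def mse(a, b):
    """Mean-squared error between predicted and target embeddings."""
    return float(np.mean((a - b) ** 2))

def cosine_sim(a, b):
    """Mean cosine similarity between corresponding row vectors."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float(np.mean(num / den))

def stage1_loss(pred_w, tgt_w, pred_p, tgt_p, alpha=5.0, gamma=100.0):
    """Stage-1 objective: L_MSE = alpha*MSE_word + (10-alpha)*MSE_pad,
    with the cosine-similarity term subtracted: L = L_MSE - gamma*L_cos."""
    l_mse = alpha * mse(pred_w, tgt_w) + (10.0 - alpha) * mse(pred_p, tgt_p)
    return l_mse - gamma * cosine_sim(pred_w, tgt_w)

# Sanity check: a perfect prediction drives the MSE terms to zero and
# the cosine similarity to 1, so the loss approaches -gamma.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4096))
p = np.zeros((4, 4096))
loss = stage1_loss(w, w, p, p)  # ≈ -100.0
```

The subtraction rewards directional agreement with the target embeddings on top of the magnitude-sensitive MSE terms.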

3. Data Generation and Visualization Workflows

Parameter Sweeps and Assembly

  • Discrete sampling of head azimuth ($\theta \in \{-90^\circ, -45^\circ, 0^\circ, 45^\circ, 90^\circ\}$), 48 talker positions on a 0.5 m grid, 4 masking states (TV and door on/off), and 3 listener profiles yields $5 \times 48 \times 4 \times 3 = 2880$ SRT predictions per condition.
  • Nearest-neighbor interpolation visualizes precomputed matrices $M(i,j)$ of SRT50 values on a spatial map, preserving the original sampling resolution. Color-binned heatmaps (12 bins, 3.3 dB per bin) enable intuitive assessment of the "communication horizon" under varied scenes, hearing impairments, and device states.
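The sweep arithmetic and the 12-bin color mapping can be checked with a small sketch. The map floor `lo` is a hypothetical parameter, since the text specifies only the bin count and bin width:

```python
import numpy as np

# Sweep size: 5 head azimuths x 48 talker positions x 4 masker
# states x 3 listener profiles = 2880 SRT predictions per condition.
n_conditions = 5 * 48 * 4 * 3

def color_bin(srt_db, lo=-20.0, n_bins=12, width=3.3):
    """Map an SRT50 value (dB) to one of 12 color bins of 3.3 dB
    each; `lo` is an assumed map floor, and out-of-range values
    are clamped to the edge bins."""
    idx = int((srt_db - lo) // width)
    return int(np.clip(idx, 0, n_bins - 1))
```

Binning SRT50 before display keeps the heatmap legible while preserving ≈3 dB resolution, enough to see a ≈6 dB aided-listening benefit as a two-bin shift.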

Interactive Interface and Deployment

  • User may re-orient head (select azimuth), toggle maskers, or switch listener profiles; GUI loads and displays context-matched SRT maps.
  • GNU/Octave-based implementation leverages callback-driven refresh; alternative web-based implementations could employ JavaScript/WebGL or HTML5 Canvas, enabling GPU-accelerated map switching and real-time user interaction.

4. Empirical Performance, Applications, and Use Cases

SpeechMapper (Speech-LLM Integration)

  • Speech Translation (ST): On EuroParlST (en→{es, fr, de, it}) and CoVoST2, zero-shot evaluation of CE+MSE models trained only on ASR data achieves COMET scores approaching those of specialist models, despite never seeing translation data. Task-specific IT matches or slightly outperforms in-domain models.
  • Spoken Question Answering (SQA): On SpokenSQuAD and LibriSQA, task-agnostic CE+MSE closes significant accuracy gaps (up to 13 points) compared to baselines, matching or exceeding specialist models after only 1K adaptation steps.
  • Efficiency: Stage 1 pretraining (on 4×V100-32GB, 2M steps) and Stage 2 (on 1×A100-80GB, 1.5h IT) reduce compute cost by ≈80% relative to multitask full-LLM instruction tuning baselines.

Spatial SRT Mapping

  • Clinical and Architectural Guidance: Enables audiologists, engineers, and architects to explore impacts of sound sources, head orientation, and assistive devices, facilitating optimal space and device design.
  • Example Scenarios: Maskers (TV, door) induce spatially localized SRT elevation; unaided impaired listeners exhibit uniformly elevated SRTs. Aided listeners gain ≈6 dB of benefit yet still fall short of normal-hearing listeners.
  • Device Design: Engineers may overlay beamformer patterns; clinicians may visualize device fitting effects in ecologically valid contexts.

5. Limitations and Prospective Developments

Speech-to-LLM Mapper

  • Named Entity Mapping: Embedding-space mapping confuses or omits unseen entity substrings.
  • Alignment-Free Training: Causes synonym swaps, pronoun shifts, and repetition.
  • Metric Limitations: Word error rate (WER) and character error rate (CER) can overpenalize correct paraphrasing by LLMs.
  • Future Directions: Incorporate alignment mechanisms (e.g., lightweight ASR CTC loss), synthetic entity inventories, or alternative paraphrase-robust evaluation metrics (Mohapatra et al., 28 Jan 2026).

Spatial Speech Recognition Mapper

  • Prediction Model Accuracy: Suitability for serious applications remains subject to further validation.
  • Ecological Validity: Simulations provide detailed hypotheses but lack real-world, user-centered outcome validation at scale.

6. Relationship to Other Work

  • For anatomical and biomechanical mapping, "Speech Map" should not be confused with SpeechMapper. The earlier "Speech Map" statistical multimodal atlas computes 4D tongue kinematics from MRI using groupwise registration, incompressible registration algorithms, and PCA for motion decomposition (Woo et al., 2017). That framework is distinct from the architectural and embedding-centric designs described above.
  • Both lines exemplify the trend toward high-dimensional, quantitatively grounded representations for speech: simulation-driven intelligibility landscapes and embedding alignment for robust, generalizable audio-LLMs.
