Acoustic-Semantic Gap in SLLMs: Overview
- The acoustic–semantic gap in SLLMs is the performance and representational disparity between text LLMs and speech-based models, driven by long, paralinguistically complex speech token sequences that carry far less explicit semantic content than concise text tokens.
- Scaling analyses reveal that SLLMs require up to 500× more compute than text LLMs to match semantic benchmarks and suffer drops of up to 20 percentage points on semantic tasks, driven by variable sequence lengths and paralinguistic complexity.
- Advanced architectures like HASRD, SAC, and XY-Tokenizer use dual-stream and disentanglement techniques to separate semantic and acoustic features, mitigating the performance gap.
The acoustic–semantic gap in spoken language LLMs (SLLMs) denotes a persistent performance and representational disparity between models processing text and those handling speech. Text-based LLMs routinely achieve high semantic coherence and linguistic generalization, while SLLMs struggle to reliably model and generate semantically rich, contextually appropriate outputs. The gap is attributed to modality-specific constraints: speech tokens are typically longer sequences carrying substantial paralinguistic or phonetic information but much less explicit semantics; duration and prosody add variability and complexity not present in text; and paralinguistic cues often overwhelm cross-entropy objectives focused on lexical content. Recent work has systematically defined, measured, and proposed remedies for these limitations, leading to a deeper understanding of how the acoustic–semantic gap emerges and how it may be mitigated.
1. Formalization and Measurement of the Acoustic–Semantic Gap
Speech language modeling tasks are centered on maximizing the log-likelihood of token sequences, $\sum_{t}\log p(x_t \mid x_{<t})$, whether the tokens are text, phonemes, or speech-derived cluster indices (Wang et al., 22 Dec 2024). Semantic coherence is typically measured via zero-shot accuracy on benchmarks such as Topic-Story Cloze, where a random guess yields 50% and strong semantic performance approaches 74%. The gap is operationalized by relative performance drops as modalities transition from text to phones to speech along several axes: lexical, syntactic, and semantic tasks, plus free generation assessed by perplexity.
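To make this operationalization concrete, the sketch below writes the shared next-token log-likelihood and expresses the gap as the fraction of above-chance accuracy lost when moving from text to speech tokens. The numbers are illustrative only; nothing here reproduces code or results from the cited papers.

```python
# Minimal sketch: the shared LM objective and a simple operationalization of the gap.
import numpy as np

def sequence_log_likelihood(token_probs: list[float]) -> float:
    """Sum of log p(x_t | x_<t) over a token sequence (text, phones, or speech units)."""
    return float(np.sum(np.log(token_probs)))

def relative_gap(acc_text: float, acc_speech: float, chance: float = 0.5) -> float:
    """Fraction of the above-chance accuracy margin lost when switching modalities."""
    return (acc_text - acc_speech) / (acc_text - chance)

# Illustrative numbers: Topic-Story Cloze chance = 50%, a strong text model ~74%.
print(relative_gap(acc_text=0.74, acc_speech=0.62))  # 0.5: half the above-chance margin lost
```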
Scaling analyses model task accuracy as a power law in compute $C$, $\mathrm{Acc}(C) \propto C^{\alpha}$, with the exponent $\alpha$ quantifying scaling efficiency. SLLMs require up to 500× more compute than text LLMs to match semantic benchmarks, reflecting a severe gap in scaling properties (Cuervo et al., 31 Mar 2024). Furthermore, ablation studies quantify the loss of semantic information as input representations shift from text BPE to speech-derived tokens (HuBERT), with semantic accuracy dropping as much as 20 percentage points and perplexity tripling (Wang et al., 22 Dec 2024).
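The compute argument can be made concrete with a small log-log fit. The sketch below uses placeholder accuracy curves and exponents (not the measurements of Cuervo et al.) to show how $\alpha$ is estimated and how the required-compute multiplier follows from inverting the power law.

```python
# Hedged sketch: fit Acc(C) = a * C**alpha in log-log space, then estimate how much extra
# compute a speech LM needs to match a text LM at a target accuracy. Placeholder values only.
import numpy as np

def fit_power_law(compute: np.ndarray, acc: np.ndarray) -> tuple[float, float]:
    """Return (a, alpha) for Acc = a * C**alpha via least squares on log-log values."""
    alpha, log_a = np.polyfit(np.log(compute), np.log(acc), deg=1)
    return float(np.exp(log_a)), float(alpha)

def compute_to_reach(target_acc: float, a: float, alpha: float) -> float:
    """Invert Acc = a * C**alpha to get the compute needed for a target accuracy."""
    return (target_acc / a) ** (1.0 / alpha)

# Placeholder curves: the text LM scales faster (larger exponent) than the speech LM.
C = np.array([1e18, 1e19, 1e20, 1e21])
a_text, alpha_text = fit_power_law(C, 0.02 * C**0.08)
a_speech, alpha_speech = fit_power_law(C, 0.02 * C**0.07)
ratio = compute_to_reach(0.7, a_speech, alpha_speech) / compute_to_reach(0.7, a_text, alpha_text)
print(f"compute multiplier to match accuracy 0.7: ~{ratio:.0f}x")  # a few hundred for these toy exponents
```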
2. Causes and Modality-Evolving Analysis
The modality-evolving framework (Wang et al., 22 Dec 2024) isolates three key factors:
A. Phonetic vs. Semantic Information Content: Transition from text-based BPEs to phone-based BPEs reveals only a minor degradation (semantic accuracy drops ~4%). Phonetic units still track subword structure, but lack full lexical meaning.
B. Sequence Length (Duration Variability): Moving from raw phones (10 tokens/s) to repeated phones at frame rates of 50 tokens/s results in major semantic and syntactic degradation. Models exposed to long, redundant sequences fail to efficiently aggregate semantic units, and task accuracy plunges by 12% or more (a toy illustration of this redundancy follows this list).
C. Paralinguistic Complexity: Introducing speech tokens that preserve prosody and non-lexical variation (e.g., HuBERT clusters at 50 Hz) compounds the issue—lexical task accuracy collapses (40% drop), syntactic accuracy drops by 13%, semantic by almost 10%, and generation perplexity increases by well over 100%. Early-layer analysis shows speech-token models cannot form robust lexical representations, stalling higher-level syntactic and semantic inference.
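As noted under factor B above, frame-rate token streams repeat the same unit across adjacent frames; a simple run-length deduplication shows how much of the sequence length is pure redundancy. The token IDs below are synthetic stand-ins, not real HuBERT cluster indices.

```python
# Hedged sketch of factor B: collapse consecutive repeats of a 50 Hz-style unit stream
# toward something closer to the ~10 units/s phone rate.
from itertools import groupby

def deduplicate(frame_tokens: list[int]) -> list[int]:
    """Run-length deduplication: keep one token per run of identical frame-level units."""
    return [unit for unit, _ in groupby(frame_tokens)]

frame_tokens = [7, 7, 7, 7, 12, 12, 12, 5, 5, 5, 5, 5, 31, 31]  # synthetic frame-rate stream
units = deduplicate(frame_tokens)
print(len(frame_tokens), "->", len(units))  # 14 -> 4: same lexical content, far shorter sequence
```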
3. Manifestations in Large Audio LLMs
Recent diagnostic benchmarks such as LISTEN (Chen et al., 12 Oct 2025) empirically validate the lexical dominance of modern SLLMs. Across Text-only, Audio-only, and Text+Audio conditions for emotion understanding, models consistently fall back on textual content:
- In Neutral-Text, text-only accuracy is 85–97%, audio-only collapses to 16–35%, and adding audio yields no gain.
- For Emotion-Mismatched, paralinguistic accuracy hovers at chance levels; Text+Audio fails to resolve fine-grained emotional cues.
- In fully paralinguistic settings, most models (including Gemini 2.5 Pro) score only 16–23%, barely above uniform guess.
Confusion matrices reveal that SLLMs “transcribe” but do not “listen”: strong emotional prosody in audio is routinely ignored when lexical cues are neutral or absent, underscoring a large acoustic–semantic gap. The authors recommend explicit speech-emotion recognition modules and tuned fine-grained objectives to penalize overreliance on transcripts.
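A minimal harness for this kind of diagnosis scores each model separately per input condition; the sketch below is a generic illustration with assumed record fields, not the LISTEN benchmark code.

```python
# Hedged sketch: per-condition accuracy (Text-only vs. Audio-only vs. Text+Audio) to expose
# reliance on the transcript. Records and labels here are invented examples.
from collections import Counter

def accuracy_by_condition(records: list[dict]) -> dict[str, float]:
    """records: {'condition': str, 'prediction': str, 'label': str} per evaluated item."""
    hits, totals = Counter(), Counter()
    for r in records:
        totals[r["condition"]] += 1
        hits[r["condition"]] += int(r["prediction"] == r["label"])
    return {c: hits[c] / totals[c] for c in totals}

records = [
    {"condition": "text_only", "prediction": "neutral", "label": "neutral"},
    {"condition": "audio_only", "prediction": "neutral", "label": "angry"},
    {"condition": "text+audio", "prediction": "neutral", "label": "angry"},
]
print(accuracy_by_condition(records))  # audio-dependent conditions collapse if prosody is ignored
```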
4. Architectural Solutions: Disentanglement and Codec Design
Novel frameworks such as HASRD (Hussein et al., 1 Jun 2025), SAC (Chen et al., 19 Oct 2025), and XY-Tokenizer (Gong et al., 29 Jun 2025) employ explicit disentanglement between semantic and acoustic information:
| Architecture | Semantic–Acoustic Separation | Key Disentanglement Mechanism |
|---|---|---|
| HASRD | Hierarchical codebooks | First codebook for semantics; residual RVQ for acoustics |
| SAC | Dual-stream quantization | Frozen semantic encoder, acoustic encoder trained for waveform fidelity |
| XY-Tokenizer | Dual-channel encoder/decoder | Whisper-based semantic tower, Vocos-based acoustic tower, shared RVQ |
HASRD dedicates the first codebook to maximizing ASR performance, and all residual codebooks to acoustic fidelity, achieving WER reductions of 44–66% and doubling bitrate efficiency (Hussein et al., 1 Jun 2025). SAC freezes the semantic stream and separately quantizes acoustics, ensuring each pathway is solely responsible for its respective content; semantic-only decoding retains intelligibility but loses speaker identity, and acoustic-only decoding produces timbre but no semantics (Chen et al., 19 Oct 2025). XY-Tokenizer leverages strong ASR cross-entropy supervision, staged multi-task learning, and minimal parameter sharing for state-of-the-art balance in both WER and speaker similarity (Gong et al., 29 Jun 2025).
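The common thread of these designs can be sketched as a two-stage quantizer: a first codebook targets a semantic embedding, and residual codebooks absorb whatever remains, which in a trained codec would be acoustic detail. The code below uses random placeholder codebooks and frame embeddings; it mirrors the general residual-VQ idea, not any of the published architectures.

```python
# Hedged sketch of semantic/acoustic disentanglement via a first "semantic" codebook plus
# residual quantization stages. All codebooks and frames are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
DIM, K = 16, 32
semantic_codebook = rng.normal(size=(K, DIM))
acoustic_codebooks = [rng.normal(size=(K, DIM)) for _ in range(3)]  # residual stages

def nearest(codebook: np.ndarray, x: np.ndarray) -> tuple[int, np.ndarray]:
    """Index and vector of the nearest codebook entry to x."""
    idx = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))
    return idx, codebook[idx]

def encode(frame: np.ndarray) -> tuple[int, list[int]]:
    """Return (semantic index, residual acoustic indices) for one frame embedding."""
    sem_idx, sem_vec = nearest(semantic_codebook, frame)
    residual, ac_indices = frame - sem_vec, []
    for cb in acoustic_codebooks:            # residual VQ over the leftover signal
        idx, vec = nearest(cb, residual)
        ac_indices.append(idx)
        residual = residual - vec
    return sem_idx, ac_indices

frame = rng.normal(size=DIM)                 # stand-in for an encoder frame embedding
print(encode(frame))                         # one semantic token + acoustic residual tokens
```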
5. Scaling Laws, Data, and Tokenization Strategies
Text-based LLMs and SLLMs exhibit divergent scaling exponents on semantic tasks such as Story Cloze, with the speech models scaling more slowly and therefore requiring far more compute to reach comparable accuracy (Cuervo et al., 31 Mar 2024). Synthetic semantic data (the sTinyStories corpus) boosts semantic scores by 5–8% but does not erase the gap (Cuervo et al., 31 Mar 2024). Tokenization granularity plays a complex role: coarse subword compression slightly improves upstream loss scaling but hinders downstream semantic scaling.
Progressive Down-Sampling (PDS) offers a practical bridge, compressing acoustic features into text-like units aligned with subword sequences, yielding up to 0.75% WER reduction and 1.5× decoding speedups while facilitating sharper attention and contextual encoding (Xu et al., 2023).
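The down-sampling idea can be approximated with plain strided average pooling over time; the sketch below is a stand-in for the learned compression blocks of Xu et al., intended only to show how the frame rate is brought toward subword rate.

```python
# Hedged sketch: staged temporal pooling that shortens a high-rate acoustic feature
# sequence toward the length of its subword transcript. Not the PDS architecture itself.
import numpy as np

def downsample_stage(feats: np.ndarray, stride: int = 2) -> np.ndarray:
    """Average-pool along time with the given stride ((frames, dim) -> (frames/stride, dim))."""
    T = (feats.shape[0] // stride) * stride
    return feats[:T].reshape(-1, stride, feats.shape[1]).mean(axis=1)

feats = np.random.randn(1000, 80)            # ~10 s of 100 Hz filterbank-like features
for _ in range(3):                           # three 2x stages: 100 Hz -> 12.5 Hz
    feats = downsample_stage(feats)
print(feats.shape)                           # (125, 80): much closer to subword-length units
```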
6. Probing, Benchmarks, and Layer-wise Analysis
Layer-wise minimal pair probing establishes that Transformer SLLMs robustly encode grammatical and syntactic structure (accuracy 90–95%), but lag significantly for conceptual/semantic features (60%), with a persistent gap of 30 percentage points (He et al., 19 Sep 2025). This holds across self-supervised, ASR, AudioLLM, and codec architectures. Temporal analyses show syntactic information peaks before the critical word onset, while semantic cues remain flatter and less recoverable.
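A layer-wise probing loop typically pools each layer's hidden states for the members of a minimal pair and trains a light classifier per layer. The sketch below uses synthetic activations (with a label signal that grows with depth) rather than real model outputs, and the logistic-regression probe is an assumed choice, not necessarily the one used by He et al.

```python
# Hedged sketch of layer-wise minimal-pair probing with synthetic per-layer features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def probe_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
    """Held-out accuracy of a linear probe trained on one layer's representations."""
    Xtr, Xte, ytr, yte = train_test_split(features, labels, test_size=0.3, random_state=0)
    return LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)

n, d, layers = 400, 64, 12
labels = rng.integers(0, 2, size=n)          # acceptable vs. unacceptable member of each pair
for layer in range(layers):
    # Synthetic activations: the label signal gets stronger in deeper layers.
    signal = labels[:, None] * (layer / layers)
    feats = rng.normal(size=(n, d)) + np.pad(signal, ((0, 0), (0, d - 1)))
    print(f"layer {layer:2d}: probe accuracy = {probe_accuracy(feats, labels):.2f}")
```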
Human–model comparisons (SAGI Benchmark) highlight that humans maintain near-ceiling accuracy for paralinguistic and semantic tasks, but SLLMs underperform sharply in pitch, volume, and non-textual perception tasks (Bu et al., 17 Oct 2024). Whisper's encoder, as an example, loses emotional, gender, and ambient distinctions in its latent space, as evidenced by high representation similarity (up to 0.95 for 5–30 s utterances with varied paralinguistics).
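This similarity analysis can be reproduced in spirit by mean-pooling encoder states for utterances that differ only in paralinguistics and measuring cosine similarity; the embeddings below are random stand-ins, not Whisper activations.

```python
# Hedged sketch: cosine similarity of pooled embeddings for two renditions of the same words.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
neutral = rng.normal(size=512)                   # pooled embedding of a neutral reading (simulated)
angry = neutral + 0.05 * rng.normal(size=512)    # same words, different prosody (simulated)
print(cosine(neutral, angry))                    # near 1.0: paralinguistics barely shift the vector
```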
7. Pathways to Closing the Gap
Remedies supported by empirical and architectural evidence include:
- Adoption of variable-length, low-frame-rate speech tokenization akin to Phone-BPE, optimizing semantic density without sequence-length explosion (Wang et al., 22 Dec 2024); a toy sketch of this follows the list.
- Incorporation of explicit lexical-level semantic supervision: time-aligned word boundaries and semantic unit labels (Wang et al., 22 Dec 2024).
- Dual-stream disentanglement in codecs and encoders, preserving dedicated pathways for semantic and acoustic objectives (Chen et al., 19 Oct 2025, Hussein et al., 1 Jun 2025).
- Joint modeling approaches (e.g., Flow-SLM’s conditional flow-matching of continuous acoustic frames with semantic tokens) (Chou et al., 12 Aug 2025).
- Enhanced training data, including paralinguistic and specialist acoustic corpora; stronger multimodal backbones with combined text–audio co-training (Bu et al., 17 Oct 2024).
- Layer-specific semantic supervision and inclusion of concept-level tasks in pretraining objectives (He et al., 19 Sep 2025).
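As flagged in the first item of the list, a toy version of variable-length unit tokenization deduplicates frame-level units and then applies a few BPE-style merges so frequent unit pairs become single longer tokens. This illustrates the general unit-BPE idea, not the exact procedure of Wang et al.

```python
# Hedged sketch: run-length deduplication followed by greedy pair merges over unit sequences.
from collections import Counter
from itertools import groupby

def dedup(seq):
    """Collapse consecutive repeats of the same frame-level unit."""
    return [u for u, _ in groupby(seq)]

def merge_pair(seq, a, b):
    """Replace every adjacent occurrence of (a, b) in seq with a single merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
            out.append((a, b)); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

def bpe_merges(corpus, num_merges=2):
    """Greedily merge the most frequent adjacent unit pair, num_merges times."""
    for _ in range(num_merges):
        pairs = Counter(p for seq in corpus for p in zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        corpus = [merge_pair(seq, a, b) for seq in corpus]
    return corpus

frames = [[7, 7, 12, 12, 5, 5, 31], [7, 12, 12, 5, 31, 31]]     # synthetic frame-rate streams
print(bpe_merges([dedup(s) for s in frames]))                   # shorter, denser unit sequences
```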
Persistent limitations include the inability of SLLMs to integrate paralinguistic, emotional, and abstract acoustic knowledge, brittle instruction-following, and modest semantic performance even at considerable compute and data scale.
8. Summary
The acoustic–semantic gap in SLLMs is a multi-factorial phenomenon rooted in the structural and statistical properties of speech representations. It reflects the difficulty of extracting abstract semantic meaning from long, paralinguistically complex, frame-based input tokens, especially when compared to concise, symbolically encoded text. The gap manifests as a substantial loss in semantic coherence, efficiency, and generalization, visible in scaling laws, benchmarking, probing, and architectural ablations. Mitigation requires advances in speech tokenization, semantic supervision, codec design, joint modeling, and specialized pretraining. The state of the art demonstrates steady progress, but the gap remains a central challenge in building robust and generalizable spoken LLMs (Wang et al., 22 Dec 2024, Chen et al., 12 Oct 2025, Hussein et al., 1 Jun 2025, Chen et al., 19 Oct 2025, Cuervo et al., 31 Mar 2024, Gong et al., 29 Jun 2025, Xu et al., 2023, Bu et al., 17 Oct 2024, He et al., 19 Sep 2025, Chou et al., 12 Aug 2025).