Listening-while-Speaking Language Model (LSLM)

Updated 2 December 2025
  • LSLM is a speech-driven neural system designed for continuous, interactive spoken communication by interleaving listening and speaking in real time.
  • It employs multi-stream architectures and chain-of-thought reasoning to optimize timing, interruption handling, and latency–accuracy trade-offs.
  • Empirical evaluations reveal rapid response times, robust interruption detection, and enhanced performance under noisy and dynamic conditions.

A Listening-while-Speaking LLM (LSLM) is a speech-driven neural system designed for full-duplex, real-time spoken interaction, integrating both continual perception of user audio and immediate response generation. Unlike turn-based dialogue systems, LSLMs fuse listening and speaking by interleaving input streams, controlling output timing, and supporting interruption, which yields human-like responsiveness and robust interaction even under challenging reasoning or latency constraints. Recent frameworks implement LSLMs using sophisticated multi-stream architectures, chain-of-thought reasoning triggers, preference-driven optimization strategies, and explicit mechanisms for policy adaptation, interruption detection, and streaming fusion.

1. Fundamental Architectures in Listening-while-Speaking LMs

LSLMs employ multi-stream or unified autoregressive designs to process input and output concurrently. For speech reasoning, a canonical architecture uses streams of user audio tokens ($A^U_t$), system audio tokens ($A^S_t$), and system text tokens ($T^S_t$), all tightly time-aligned. At each timestep, the streams are updated as follows:

  • $A^U_t,\ A^S_t,\ T^S_t \xrightarrow{\text{Temporal Transformer}} T^S_{t+1}$
  • $T^S_{t+1} \xrightarrow{\text{Depth Transformer}} A^S_{t+1}$

The model maximizes $p(A^S_{t+1}, T^S_{t+1} \mid A^S_{\leq t}, T^S_{\leq t}, A^U_{\leq t})$, trained via negative log-likelihood over all streams. Text tokens are padded and interleaved such that the audio and monologue channels remain fully revisable mid-utterance.
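
As a concrete illustration, the sketch below runs one decoding step over the three streams; `temporal_tf` and `depth_tf` are hypothetical stand-ins for the Temporal and Depth Transformers, and greedy selection is used only for simplicity.

```python
import torch

def decode_step(user_audio, sys_audio, sys_text, temporal_tf, depth_tf):
    """One time-aligned decoding step (sketch, not the published code).

    user_audio, sys_audio, sys_text: [B, t] token-id tensors for the
    streams A^U, A^S, T^S up to the current step t.
    temporal_tf: hypothetical module predicting the next text token
                 from all three streams.
    depth_tf:    hypothetical module predicting the next audio token
                 conditioned on the updated text stream.
    """
    # Temporal Transformer: (A^U_{<=t}, A^S_{<=t}, T^S_{<=t}) -> T^S_{t+1}
    text_logits = temporal_tf(user_audio, sys_audio, sys_text)        # [B, V_text]
    next_text = text_logits.argmax(dim=-1, keepdim=True)              # greedy, for illustration

    # Depth Transformer: T^S_{t+1} -> A^S_{t+1}
    audio_logits = depth_tf(torch.cat([sys_text, next_text], dim=1))  # [B, V_audio]
    next_audio = audio_logits.argmax(dim=-1, keepdim=True)

    # Append to the system streams; the user stream advances as audio arrives.
    return (torch.cat([sys_audio, next_audio], dim=1),
            torch.cat([sys_text, next_text], dim=1))
```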

Alternative approaches include speech-to-speech LLMs with implicit chain-of-thought (ICoT) internalization that gradually drop explicit transcription steps in training, as described for A-T-A systems, compressing ASR reasoning into latent model states (Yuen et al., 25 Sep 2024). End-to-end designs may also fuse streaming self-supervised encoders for live audio with decoder-only TTS blocks and integrate both channels at multiple points (early/middle/late fusion) for robust interruption handling (Ma et al., 5 Aug 2024). Modular full-duplex systems coordinate LLMs with neural finite state machines (FSM), streaming ASR, and TTS, presenting interaction as next-token autoregression on a serialized tape (Wang et al., 29 May 2024).

2. Reasoning and Timing: Chain-of-Thought and Question Completeness

Complex spoken reasoning in LSLMs leverages chain-of-thought (CoT) methodologies. Systems are fine-tuned on triplets $(Q^A,\ R^T,\ A^A)$, where the transcribed question $Q^T$ precedes the reasoning $R^T$ (bracketed by <start_cot> / <end_cot>), which in turn precedes the answer $A^T$. Standard next-token cross-entropy is applied:

L_{\rm SFT} = -\sum_t \log \pi_\theta(u_t \mid u_{<t})

yielding substantial accuracy boosts (2.4× baseline on reasoning tasks, e.g., ARC-E from 30.2% to 77.7%) (Shih et al., 8 Oct 2025).
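
A minimal sketch of this objective, assuming the (question, reasoning, answer) triplet has already been serialized into a single token sequence and the logits are pre-shifted so that position $t$ scores token $u_t$; the padding id is illustrative.

```python
import torch.nn.functional as F

def sft_loss(logits, targets, pad_id=0):
    """Next-token cross-entropy L_SFT over a serialized (Q, R, A) sequence.

    logits:  [B, T, V] outputs pi_theta(. | u_{<t}), already shifted so
             position t scores token u_t.
    targets: [B, T]    token ids (question, <start_cot> reasoning
             <end_cot>, answer); pad_id positions are ignored.
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,
    )
```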

To reduce latency, LSLMs implement semantic triggers for early reasoning via a question-completeness score $\zeta(p)$:

\zeta(p) = 1 - \frac{D_{\mathrm{KL}}[X_N \,\|\, X_p]}{D_{\mathrm{KL}}[X_N \,\|\, X_0]}

where $X_p$ denotes the distribution over reasoning and answer given only the first $p$ words of the question. A threshold $\theta$ sets the inflection point, enabling reasoning to begin before the spoken query ends. Entropy proxies serve as simpler alternatives but are less robust. This mechanism yields fine-grained latency–accuracy trade-offs and traces out convex Pareto frontiers.
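
A minimal sketch of this trigger, assuming the three distributions $X_N$, $X_p$, $X_0$ are available as discrete probability tables (plain dictionaries here); the threshold value is illustrative.

```python
import math

def kl(p, q, eps=1e-12):
    """D_KL(p || q) for discrete distributions given as {outcome: prob} dicts."""
    return sum(pi * math.log(pi / max(q.get(k, 0.0), eps))
               for k, pi in p.items() if pi > 0)

def completeness(x_full, x_prefix, x_empty):
    """zeta(p) = 1 - D_KL[X_N || X_p] / D_KL[X_N || X_0] (sketch)."""
    return 1.0 - kl(x_full, x_prefix) / kl(x_full, x_empty)

# Reasoning is triggered early once completeness(...) crosses a threshold
# theta, e.g. theta = 0.75 as in the trade-offs reported in Section 5.
```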

3. Policy-Making and Simultaneous Generation

LSLMs incorporate explicit policy-makers to optimize when to emit responses. In simultaneous generation settings, LLM-driven frameworks like LSG prompt the LLM to decide an action in $\{\mathrm{READ}, \mathrm{WRITE}\}$ at each time step. The core policy improvement relies on comparing the KL divergence between current and baseline next-token distributions:

\Delta_{\mathrm{KL}} = D_{\mathrm{KL}}(p_{\mathrm{cur}} \,\|\, p_{\mathrm{base}})

where writing is triggered if $\Delta_{\mathrm{KL}} > \delta$ or if model confidence exceeds $\alpha$. This approach achieves state-of-the-art latency–quality trade-offs and does not require offline policy-module training (Guo et al., 1 Jan 2025).
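
A hedged sketch of the decision rule; `delta` and `alpha` are illustrative thresholds, and the two logit vectors are assumed to come from the same decoder run with and without the newest source chunk.

```python
import torch
import torch.nn.functional as F

def decide_action(cur_logits, base_logits, delta=0.1, alpha=0.9):
    """READ/WRITE policy step (sketch): write when the next-token
    distribution has shifted enough from the baseline, or when the
    model is already confident.

    cur_logits, base_logits: [V] next-token logits with / without the
    newest source chunk; delta and alpha are illustrative thresholds.
    """
    cur = F.log_softmax(cur_logits, dim=-1)
    base = F.log_softmax(base_logits, dim=-1)
    # Delta_KL = D_KL(p_cur || p_base) = sum_x p_cur(x) [log p_cur(x) - log p_base(x)]
    delta_kl = torch.sum(cur.exp() * (cur - base))
    confidence = cur.exp().max()
    return "WRITE" if (delta_kl > delta or confidence > alpha) else "READ"
```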

Full-duplex LSLMs implement FSM-driven control tokens for responsive behaviors: [S.SPEAK], [C.SPEAK], [S.LISTEN], [C.LISTEN]. Each step involves maximizing either control-token or content-token probabilities,

x_t^* = \arg\max_x P(x \mid x_{<t}, a_{\leq t})

with the transition function $\delta(s, c)$ explicitly formalized.
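
The loop below illustrates this control scheme in plain Python; the transition table is a hypothetical subset, not the full formalization from the cited system.

```python
# Hypothetical subset of the transition function delta(state, control token).
TRANSITIONS = {
    ("LISTEN", "[S.SPEAK]"):  "SPEAK",   # start speaking
    ("SPEAK",  "[C.SPEAK]"):  "SPEAK",   # continue speaking
    ("SPEAK",  "[S.LISTEN]"): "LISTEN",  # stop and listen
    ("LISTEN", "[C.LISTEN]"): "LISTEN",  # keep listening
}

def fsm_step(state, probs, vocab):
    """One step: pick x_t* = argmax_x P(x | x_<t, a_<=t); control tokens
    drive the FSM, content tokens are emitted as-is.

    probs: next-token probabilities aligned index-by-index with vocab.
    """
    token = vocab[max(range(len(probs)), key=probs.__getitem__)]
    next_state = TRANSITIONS.get((state, token), state)
    return next_state, token
```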

4. Streaming Fusion and Interruption Handling

LSLMs realize simultaneous listening and speaking via streaming fusion. Middle fusion, which injects listening embeddings into each Transformer layer, shows the best performance, preserving speech synthesis (WER near baseline) while providing rapid, precise interruption detection (precision/recall/F1 often $\geq 97\%$ under noise) (Ma et al., 5 Aug 2024). Early fusion corrupts generation quality, while late fusion yields less robust interruption boundaries.
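
A simplified sketch of middle fusion in a single decoder layer, assuming one time-aligned listening embedding per output position; the module names and dimensions are illustrative rather than taken from the cited system.

```python
import torch.nn as nn

class MiddleFusionLayer(nn.Module):
    """Decoder layer with middle fusion (sketch): the streaming listening
    embedding is projected and added inside every layer, rather than only
    at the input (early fusion) or only at the output (late fusion)."""

    def __init__(self, d_model, n_heads, d_listen):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_listen, d_model)   # maps listener features in
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h, listen_emb, attn_mask=None):
        # Inject the time-aligned listening channel at this layer.
        h = h + self.proj(listen_emb)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask)
        h = self.norm1(h + a)
        return self.norm2(h + self.ff(h))
```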

Interruption handling is enacted via special vocabulary tokens (e.g., IRQ), with the loss function:

\mathcal{L}_{\mathrm{LS}}(\theta) = \begin{cases} -\sum_{t=1}^{t_{\mathrm{IRQ}}} \log P_\theta(r^q_t \mid R^q_{1:t-1}, S^p_{1:t-1}, C) & \text{(with interruption)} \\ -\sum_{t=1}^{T_{\mathrm{EOS}}} \log P_\theta(r^q_t \mid R^q_{1:t-1}, S^p_{1:t-1}, C) & \text{(no interruption)} \end{cases}

Interruptions are detected within 0.5 s and model output ceases accordingly.
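
A minimal sketch of this truncated objective for a single sequence; `irq_id` and `eos_id` are hypothetical ids for the IRQ and end-of-sequence tokens.

```python
import torch.nn.functional as F

def interruption_loss(logits, targets, irq_id, eos_id):
    """L_LS (sketch): next-token NLL over the speaking stream, truncated
    at the IRQ token when an interruption occurs, otherwise run to EOS.

    logits:  [T, V] per-step distributions P_theta(r^q_t | ...)
    targets: [T]    reference speaking-stream tokens r^q_t
    """
    ids = targets.tolist()
    end = ids.index(irq_id) + 1 if irq_id in ids else ids.index(eos_id) + 1
    return F.cross_entropy(logits[:end], targets[:end])
```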

5. Optimization Strategies and Accuracy–Latency Trade-Offs

Direct Preference Optimization (DPO) extends LSLM fine-tuning to push out the accuracy–latency Pareto frontier. LSLMs sample contrastive pairs, preferring shorter or more accurate reasoning traces:

L_{\rm DPO}(\pi_\theta; \pi_{\rm ref}) = -\,\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \left[ \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\rm ref}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\rm ref}(y_l \mid x)} \right] \right) \right]

Adding an NLL regularization on the preferred path stabilizes training.
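
A sketch of the objective on summed sequence log-probabilities, with the optional NLL regularizer on the preferred path folded in; all argument names and default weights are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, nll_weight=0.0):
    """DPO on (preferred y_w, dispreferred y_l) trace pairs (sketch).

    logp_* / ref_logp_*: summed log-probs log pi(y|x) of each trace under
    the trained policy and the frozen reference, shape [batch].
    nll_weight > 0 adds the stabilizing NLL term on the preferred path.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    loss = -F.logsigmoid(margin).mean()
    if nll_weight > 0:
        loss = loss + nll_weight * (-logp_w).mean()
    return loss
```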

Empirical findings show:

  • CoT fine-tuning: $\sim$2–3× gain in reasoning accuracy
  • Early reasoning (QC-based, $\theta = 0.75$): 75% latency reduction at a 14% drop in absolute accuracy
  • DPO for early CoT: restores 3–4% accuracy with minimal latency cost
  • Length-based DPO: shrinks reasoning traces by $\sim$70% (e.g., 50 → 15 tokens) with no accuracy degradation (Shih et al., 8 Oct 2025)

6. Evaluation, Benchmarks, and Performance Metrics

LSLMs are evaluated under scenarios including ARC-E, ARC-C, SIQA, PIQA, GSM8K, LibriSpeech, and multi-agent social deduction environments:

  • Response latency: sub-second FTED, with $>50\%$ of responses under 500 ms (Wang et al., 29 May 2024).
  • Interruption precision: LSLM achieves an 8% absolute gain over commercial models.
  • Duplexing robustness: models sustain WER close to vanilla TTS even under heavy noise, with precise turn-taking (F1 up to 98% in controlled settings) (Ma et al., 5 Aug 2024).
  • Multi-agent games: listening-while-speaking agents double win rates compared to policy-only RL baselines, producing human-like grounded discussions and accurate hidden-state inference (Sarkar et al., 9 Feb 2025).

7. Limitations, Open Questions, and Future Directions

Current limitations include:

  • Model generalization under real-world accents and background noise.
  • Turn-taking beyond stop events, overlapping talkers, and multi-modal interruptions.
  • Computational overhead at inference and scaling to very large LLMs (quantization and sparse prediction are plausible remedies) (Deng et al., 21 Dec 2024).

Open directions include integrating domain-specific adapters for speech or dialogue, extending LSLM capabilities to continuous multi-turn, cross-lingual, and multimodal (e.g., audio-visual) settings, and refining policy and preference learning for richer simultaneous interaction. Empirical extension to real recordings, low-resource scenarios, and rapid resetting for dynamic dialogue segmentation remain essential avenues.


LSLMs represent the foundational shift toward continuous, responsive, and reasoning-capable spoken language systems, leveraging multi-stream architectures, explicit reasoning triggers, streaming fusion for interruption resilience, and advanced optimization techniques to balance accuracy and latency for real-world deployment (Shih et al., 8 Oct 2025, Yuen et al., 25 Sep 2024, Ma et al., 5 Aug 2024, Wang et al., 29 May 2024, Guo et al., 1 Jan 2025, Deng et al., 21 Dec 2024, Sarkar et al., 9 Feb 2025, Novitasari et al., 2020).
