Listening-while-Speaking Language Model (LSLM)
- LSLM is a speech-driven neural system designed for continuous, interactive spoken communication by interleaving listening and speaking in real time.
- It employs multi-stream architectures and chain-of-thought reasoning to optimize timing, interruption handling, and latency–accuracy trade-offs.
- Empirical evaluations reveal rapid response times, robust interruption detection, and enhanced performance under noisy and dynamic conditions.
A Listening-while-Speaking LLM (LSLM) is a speech-driven neural system designed for full-duplex, real-time spoken interaction, integrating continual perception of user audio with immediate response generation. Unlike turn-based dialogue systems, LSLMs fuse listening and speaking by interleaving input streams, controlling output timing, and supporting interruption, which yields human-like responsiveness and robust interaction even under demanding reasoning loads or tight latency constraints. Recent frameworks implement LSLMs with multi-stream architectures, chain-of-thought reasoning triggers, preference-driven optimization strategies, and explicit mechanisms for policy adaptation, interruption detection, and streaming fusion.
1. Fundamental Architectures in Listening-while-Speaking LMs
LSLMs employ multi-stream or unified autoregressive designs to process input and output concurrently. For speech reasoning, a canonical architecture maintains three tightly time-aligned streams: user audio tokens $a^{u}$, system audio tokens $a^{s}$, and system text tokens $t^{s}$. At each timestep $\tau$, the model conditions on everything observed so far and jointly predicts the next system tokens,
$$p_\theta\!\left(a^{s}_{\tau},\, t^{s}_{\tau} \mid a^{u}_{\le\tau},\, a^{s}_{<\tau},\, t^{s}_{<\tau}\right),$$
and training minimizes the negative log-likelihood of this factorization over all streams. Text tokens are padded and interleaved such that the audio and monologue channels remain fully revisable mid-utterance.
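To make the time alignment concrete, the following minimal Python sketch advances all three streams by one position per step; `DummyModel`, the token ids, and the padding convention are illustrative assumptions rather than details of a published implementation.

```python
from dataclasses import dataclass, field

PAD = -1  # text-stream padding id, used when no new monologue token is emitted

@dataclass
class Streams:
    user_audio: list = field(default_factory=list)  # a^u: incoming user audio tokens
    sys_audio: list = field(default_factory=list)   # a^s: outgoing system audio tokens
    sys_text: list = field(default_factory=list)    # t^s: inner-monologue text tokens

class DummyModel:
    def predict(self, user_audio, sys_audio, sys_text):
        # Stand-in for p(a^s_t, t^s_t | all streams so far); returns token ids.
        return 0, None

def step(model, streams: Streams, new_user_token: int) -> None:
    """Advance all three time-aligned streams by exactly one position."""
    streams.user_audio.append(new_user_token)
    a_next, t_next = model.predict(streams.user_audio,
                                   streams.sys_audio,
                                   streams.sys_text)
    streams.sys_audio.append(a_next)
    streams.sys_text.append(PAD if t_next is None else t_next)

streams = Streams()
for tok in [11, 12, 13]:  # simulated incoming user audio tokens
    step(DummyModel(), streams, tok)
```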
Alternative approaches include speech-to-speech LLMs with implicit chain-of-thought (ICoT) internalization that gradually drop explicit transcription steps in training, as described for A-T-A systems, compressing ASR reasoning into latent model states (Yuen et al., 25 Sep 2024). End-to-end designs may also fuse streaming self-supervised encoders for live audio with decoder-only TTS blocks and integrate both channels at multiple points (early/middle/late fusion) for robust interruption handling (Ma et al., 5 Aug 2024). Modular full-duplex systems coordinate LLMs with neural finite state machines (FSM), streaming ASR, and TTS, presenting interaction as next-token autoregression on a serialized tape (Wang et al., 29 May 2024).
2. Reasoning and Timing: Chain-of-Thought and Question Completeness
Complex spoken reasoning in LSLMs leverages chain-of-thought (CoT) methodologies. Systems are fine-tuned on triplets $(q, c, a)$, where $q$ (the transcribed question) precedes $c$ (the reasoning trace, bracketed by <start_cot> / <end_cot>) and $a$ (the answer). Standard next-token cross-entropy is applied over the serialized sequence $y = [q;\ \text{<start\_cot>};\ c;\ \text{<end\_cot>};\ a]$:
$$\mathcal{L}_{\text{CoT}} = -\sum_{i} \log p_\theta\!\left(y_i \mid y_{<i}\right),$$
yielding substantial accuracy boosts (2.4× baseline on reasoning tasks, e.g., ARC-E from 30.2% to 77.7%) (Shih et al., 8 Oct 2025).
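As an illustration of this training setup, the sketch below serializes a $(q, c, a)$ triplet with the CoT markers and applies next-token cross-entropy; the random token ids stand in for a real tokenizer, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

START_COT, END_COT = "<start_cot>", "<end_cot>"

def serialize(question: str, reasoning: str, answer: str) -> str:
    """Concatenate q, c, a into one training sequence with CoT markers."""
    return f"{question} {START_COT} {reasoning} {END_COT} {answer}"

def cot_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy over the serialized sequence.
    logits: (seq_len, vocab); token_ids: (seq_len,)."""
    return F.cross_entropy(logits[:-1], token_ids[1:])

# Toy usage: random ids stand in for a tokenized serialize(q, c, a) output.
ids = torch.randint(0, 1000, (32,))
logits = torch.randn(32, 1000)
loss = cot_loss(logits, ids)
```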
To reduce latency, LSLMs implement semantic triggers for early reasoning via a question-completeness score $QC_k$, derived from $p_\theta(c, a \mid q_{1:k})$, the distribution over reasoning and answer given only the first $k$ words of the question. A threshold $\tau$ on $QC_k$ sets the inflection point, enabling reasoning to begin before the spoken query has finished. Entropy proxies serve as simpler alternatives but are less robust. This mechanism yields fine-grained latency–accuracy trade-offs and traces out convex Pareto frontiers.
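A minimal sketch of the trigger logic, assuming a completeness scorer `qc(prefix)` in $[0, 1]$ (a hypothetical callable standing in for the learned estimator) and a tunable threshold:

```python
def maybe_start_reasoning(words, qc, tau: float = 0.9) -> int:
    """Return the earliest prefix length at which reasoning may begin."""
    for k in range(1, len(words) + 1):
        if qc(words[:k]) >= tau:   # question judged complete enough
            return k               # trigger CoT on words[:k], before speech ends
    return len(words)              # fall back to waiting for the full question

def toy_qc(prefix):
    """Hypothetical scorer: a trailing '?' marks a completed question."""
    return 1.0 if prefix and prefix[-1].endswith("?") else 0.3

k = maybe_start_reasoning("what is two plus two ?".split(), toy_qc)  # k == 6
```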
3. Policy-Making and Simultaneous Generation
LSLMs incorporate explicit policy-makers to decide when to emit responses. In simultaneous generation settings, LLM-driven frameworks such as LSG prompt the LLM to choose a read/write action at each time step. The core policy compares the KL divergence between the current next-token distribution (conditioned on the partial input) and a baseline next-token distribution: writing is triggered when the divergence indicates the partial input already suffices, or when model confidence exceeds a threshold. This approach achieves state-of-the-art latency–quality trade-offs and requires no offline policy-module training (Guo et al., 1 Jan 2025).
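A sketch of this decision rule under stated assumptions: the comparison direction and the threshold names `delta` and `gamma` are illustrative, not the paper's exact formulation.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two next-token distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def decide(p_current, p_baseline, delta=0.05, gamma=0.9):
    """WRITE if the partial-input distribution already matches the baseline
    (small KL) or the model is confident enough; otherwise READ more input."""
    if kl(p_current, p_baseline) < delta or max(p_current) > gamma:
        return "WRITE"
    return "READ"

action = decide([0.7, 0.2, 0.1], [0.72, 0.18, 0.1])  # -> "WRITE": distributions agree
```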
Full-duplex LSLMs implement FSM-driven control tokens for responsive behavior: [S.SPEAK], [C.SPEAK], [S.LISTEN], [C.LISTEN]. Each step maximizes either control-token or content-token probabilities, with the FSM transition function explicitly formalized (Wang et al., 29 May 2024).
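A toy rendering of the control loop: the four control tokens come from the text above, while the transition table itself is an illustrative assumption.

```python
# Control tokens follow the paper; this particular transition table is illustrative.
CONTROL_TOKENS = {"[S.SPEAK]", "[C.SPEAK]", "[S.LISTEN]", "[C.LISTEN]"}

TRANSITIONS = {
    ("LISTEN", "[S.SPEAK]"): "SPEAK",    # start speaking
    ("SPEAK", "[C.SPEAK]"): "SPEAK",     # continue speaking
    ("SPEAK", "[S.LISTEN]"): "LISTEN",   # yield the floor
    ("LISTEN", "[C.LISTEN]"): "LISTEN",  # keep listening
}

def transition(state: str, token: str) -> str:
    """delta(state, token): control tokens may switch state; content tokens keep it."""
    if token in CONTROL_TOKENS:
        return TRANSITIONS.get((state, token), state)
    return state

state = "LISTEN"
for tok in ["[C.LISTEN]", "[S.SPEAK]", "hello", "[S.LISTEN]"]:
    state = transition(state, tok)  # LISTEN -> LISTEN -> SPEAK -> SPEAK -> LISTEN
```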
4. Streaming Fusion and Interruption Handling
LSLMs realize simultaneous listening and speaking via streaming fusion. Middle fusion, which injects listening embeddings into each Transformer layer, performs best: it preserves speech synthesis quality (WER near baseline) while providing rapid, precise interruption detection, with high precision/recall/F1 even under noise (Ma et al., 5 Aug 2024). Early fusion corrupts generation quality, while late fusion yields less robust interruption boundaries.
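The sketch below shows one plausible wiring of middle fusion, injecting a projected listening embedding into a decoder layer ahead of self-attention; the dimensions, additive injection, and omitted causal mask are simplifying assumptions, not the exact architecture of Ma et al.

```python
import torch
import torch.nn as nn

class MiddleFusionLayer(nn.Module):
    """Decoder layer with the listening channel injected before self-attention.
    Additive injection and dimensions are assumptions; causal masking omitted."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.listen_proj = nn.Linear(d_model, d_model)  # project listener embeddings
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, listen_emb: torch.Tensor) -> torch.Tensor:
        h = x + self.listen_proj(listen_emb)   # middle fusion at this layer
        h = self.norm1(h + self.self_attn(h, h, h, need_weights=False)[0])
        return self.norm2(h + self.ffn(h))

layer = MiddleFusionLayer()
x = torch.randn(1, 16, 512)       # speaking-channel hidden states
listen = torch.randn(1, 16, 512)  # streaming listening embeddings
y = layer(x, listen)
```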
Interruption handling is enacted via special vocabulary tokens (e.g., an IRQ token), trained with the standard token-level loss so that the model learns to emit IRQ at interruption points. Interruptions are detected within 0.5 s, and model output ceases accordingly.
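A hedged sketch of the inference-side behavior, halting generation as soon as the model emits the IRQ token; the token id, `model.step` interface, and loop structure are hypothetical.

```python
IRQ_ID = 3  # hypothetical vocabulary id for the interruption token

def speak_until_interrupted(model, listen_stream, max_steps: int = 1000):
    """Generate speech tokens, stopping as soon as IRQ is emitted."""
    out = []
    for user_chunk in listen_stream:   # streaming user-audio features
        tok = model.step(user_chunk)   # one fused listen-while-speak step
        if tok == IRQ_ID:              # interruption detected
            break                      # cease TTS output immediately
        out.append(tok)
        if len(out) >= max_steps:
            break
    return out
```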
5. Optimization Strategies and Accuracy–Latency Trade-Offs
Direct Preference Optimization (DPO) extends LSLM fine-tuning to push out the accuracy–latency Pareto frontier. LSLMs sample contrastive pairs, preferring the shorter or more accurate trace $y_w$ over the dispreferred trace $y_l$:
$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right].$$
Adding an NLL regularizer on the preferred path stabilizes training.
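A compact sketch of this objective, assuming sequence-level log-probabilities are precomputed for the preferred ($w$) and dispreferred ($l$) traces under the policy and reference models; the `nll_coef` weight is an illustrative knob.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
             beta: float = 0.1, nll_coef: float = 0.1) -> torch.Tensor:
    """DPO over (preferred, dispreferred) trace pairs, plus the NLL
    regularizer on the preferred path described in the text."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    pref = -F.logsigmoid(margin).mean()   # standard DPO preference term
    nll = -logp_w.mean()                  # stabilizing NLL on preferred traces
    return pref + nll_coef * nll

# Toy usage: batch of 4 sequence log-probs under policy and reference models.
lw, ll = torch.randn(4), torch.randn(4)
loss = dpo_loss(lw, ll, lw.detach() - 0.1, ll.detach() + 0.1)
```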
Empirical findings show:
- CoT fine-tuning: 2–3× gains in reasoning accuracy
- Early reasoning (QC-based): reduces latency at the cost of a drop in absolute accuracy
- DPO for early CoT: restores accuracy at minimal latency cost
- Length-based DPO: shrinks reasoning traces by roughly 70% in token count with no accuracy degradation (Shih et al., 8 Oct 2025)
6. Evaluation, Benchmarks, and Performance Metrics
LSLMs are evaluated across benchmarks and scenarios including ARC-E, ARC-C, SIQA, PIQA, GSM8K, LibriSpeech, and multi-agent social deduction environments:
- Response latency: sub-second FTED, with many responses under 500 ms (Wang et al., 29 May 2024).
- Interruption precision: LSLM achieves a clear absolute gain over commercial models.
- Duplexing robustness: models sustain WER close to vanilla TTS even under heavy noise, with precise turn-taking (high F1 in controlled settings; see the metric sketch after this list) (Ma et al., 5 Aug 2024).
- Multi-agent games: listening-while-speaking agents double win rates compared to policy-only RL baselines, producing human-like grounded discussions and accurate hidden-state inference (Sarkar et al., 9 Feb 2025).
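For concreteness, the sketch below scores interruption detection by matching predicted events to reference events within a tolerance window; the 0.5 s window mirrors the detection latency reported earlier, and the greedy matching scheme is an assumption.

```python
def interruption_prf1(pred_times, ref_times, tol: float = 0.5):
    """Precision/recall/F1 with greedy matching inside a +/- tol window (seconds)."""
    ref = sorted(ref_times)
    matched, tp = set(), 0
    for t in sorted(pred_times):
        for i, r in enumerate(ref):
            if i not in matched and abs(t - r) <= tol:
                matched.add(i)
                tp += 1
                break
    p = tp / len(pred_times) if pred_times else 0.0
    r = tp / len(ref_times) if ref_times else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = interruption_prf1([1.2, 4.9], [1.0, 5.0, 8.0])  # (1.0, 0.667, 0.8)
```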
7. Limitations, Open Questions, and Future Directions
Current limitations include:
- Model generalization under real-world accents and background noise.
- Turn-taking beyond stop events, overlapping talkers, and multi-modal interruptions.
- Computational overhead at inference and scaling to very large LLMs (quantization and sparse prediction are plausible remedies) (Deng et al., 21 Dec 2024).
Open directions include integrating domain-specific adapters for speech or dialogue, extending LSLM capabilities to continuous multi-turn, cross-lingual, and multimodal (e.g., audio-visual) settings, and refining policy and preference learning for richer simultaneous interaction. Empirical extension to real recordings, low-resource scenarios, and rapid resetting for dynamic dialogue segmentation remain essential avenues.
LSLMs represent a foundational shift toward continuous, responsive, and reasoning-capable spoken language systems, leveraging multi-stream architectures, explicit reasoning triggers, streaming fusion for interruption resilience, and advanced optimization techniques to balance accuracy and latency in real-world deployment (Shih et al., 8 Oct 2025, Yuen et al., 25 Sep 2024, Ma et al., 5 Aug 2024, Wang et al., 29 May 2024, Guo et al., 1 Jan 2025, Deng et al., 21 Dec 2024, Sarkar et al., 9 Feb 2025, Novitasari et al., 2020).