Listening-while-Speaking Language Model (LSLM)
- LSLM is a speech-driven neural system designed for continuous, interactive spoken communication by interleaving listening and speaking in real time.
- It employs multi-stream architectures and chain-of-thought reasoning to optimize timing, interruption handling, and latency–accuracy trade-offs.
- Empirical evaluations reveal rapid response times, robust interruption detection, and enhanced performance under noisy and dynamic conditions.
A Listening-while-Speaking LLM (LSLM) is a speech-driven neural system designed for full-duplex, real-time spoken interaction, integrating continual perception of user audio with immediate response generation. Unlike turn-based dialogue systems, LSLMs fuse listening and speaking by interleaving input streams, controlling output timing, and supporting interruption, which yields human-like responsiveness and robust interaction even under demanding reasoning loads or tight latency constraints. Recent frameworks implement LSLMs with multi-stream architectures, chain-of-thought reasoning triggers, preference-driven optimization strategies, and explicit mechanisms for policy adaptation, interruption detection, and streaming fusion.
1. Fundamental Architectures in Listening-while-Speaking LMs
LSLMs employ multi-stream or unified autoregressive designs to process input and output concurrently. For speech reasoning, a canonical architecture maintains three tightly time-aligned streams: user audio tokens $a^{u}$, system audio tokens $a^{s}$, and system text tokens $t^{s}$. At each timestep $\tau$, the model conditions on everything observed so far and jointly predicts the next system tokens,
$$p_\theta\!\left(a^{s}_{\tau},\, t^{s}_{\tau} \mid a^{u}_{\le\tau},\, a^{s}_{<\tau},\, t^{s}_{<\tau}\right),$$
and training minimizes the negative log-likelihood of this factorization over all streams. Text tokens are padded and interleaved such that the audio and monologue channels remain fully revisable mid-utterance.
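To make the time alignment concrete, the following minimal Python sketch advances all three streams by one position per step; `DummyModel`, the token ids, and the padding convention are illustrative assumptions rather than details of a published implementation.

```python
from dataclasses import dataclass, field

PAD = -1  # text-stream padding id, used when no new monologue token is emitted

@dataclass
class Streams:
    user_audio: list = field(default_factory=list)  # a^u: incoming user audio tokens
    sys_audio: list = field(default_factory=list)   # a^s: outgoing system audio tokens
    sys_text: list = field(default_factory=list)    # t^s: inner-monologue text tokens

class DummyModel:
    def predict(self, user_audio, sys_audio, sys_text):
        # Stand-in for p(a^s_t, t^s_t | all streams so far); returns token ids.
        return 0, None

def step(model, streams: Streams, new_user_token: int) -> None:
    """Advance all three time-aligned streams by exactly one position."""
    streams.user_audio.append(new_user_token)
    a_next, t_next = model.predict(streams.user_audio,
                                   streams.sys_audio,
                                   streams.sys_text)
    streams.sys_audio.append(a_next)
    streams.sys_text.append(PAD if t_next is None else t_next)

streams = Streams()
for tok in [11, 12, 13]:  # simulated incoming user audio tokens
    step(DummyModel(), streams, tok)
```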
Alternative approaches include speech-to-speech LLMs with implicit chain-of-thought (ICoT) internalization that gradually drop explicit transcription steps in training, as described for A-T-A systems, compressing ASR reasoning into latent model states (Yuen et al., 25 Sep 2024). End-to-end designs may also fuse streaming self-supervised encoders for live audio with decoder-only TTS blocks and integrate both channels at multiple points (early/middle/late fusion) for robust interruption handling (Ma et al., 5 Aug 2024). Modular full-duplex systems coordinate LLMs with neural finite state machines (FSM), streaming ASR, and TTS, presenting interaction as next-token autoregression on a serialized tape (Wang et al., 29 May 2024).
2. Reasoning and Timing: Chain-of-Thought and Question Completeness
Complex spoken reasoning in LSLMs leverages chain-of-thought (CoT) methodologies. Systems are fine-tuned on triplets $(q, c, a)$, where $q$ (the transcribed question) precedes $c$ (the reasoning trace, bracketed by <start_cot> / <end_cot>) and $a$ (the answer). Standard next-token cross-entropy is applied over the serialized sequence $y = [q;\ \text{<start\_cot>};\ c;\ \text{<end\_cot>};\ a]$:
$$\mathcal{L}_{\text{CoT}} = -\sum_{i} \log p_\theta\!\left(y_i \mid y_{<i}\right),$$
yielding substantial accuracy boosts (2.4× baseline on reasoning tasks, e.g., ARC-E from 30.2% to 77.7%) (Shih et al., 8 Oct 2025).
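As an illustration of this training setup, the sketch below serializes a $(q, c, a)$ triplet with the CoT markers and applies next-token cross-entropy; the random token ids stand in for a real tokenizer, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

START_COT, END_COT = "<start_cot>", "<end_cot>"

def serialize(question: str, reasoning: str, answer: str) -> str:
    """Concatenate q, c, a into one training sequence with CoT markers."""
    return f"{question} {START_COT} {reasoning} {END_COT} {answer}"

def cot_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy over the serialized sequence.
    logits: (seq_len, vocab); token_ids: (seq_len,)."""
    return F.cross_entropy(logits[:-1], token_ids[1:])

# Toy usage: random ids stand in for a tokenized serialize(q, c, a) output.
ids = torch.randint(0, 1000, (32,))
logits = torch.randn(32, 1000)
loss = cot_loss(logits, ids)
```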
To reduce latency, LSLMs implement semantic triggers for early reasoning via a question-completeness score $QC_k$, derived from $p_\theta(c, a \mid q_{1:k})$, the distribution over reasoning and answer given only the first $k$ words of the question. A threshold $\tau$ on $QC_k$ sets the inflection point, enabling reasoning to begin before the spoken query has finished. Entropy proxies serve as simpler alternatives but are less robust. This mechanism yields fine-grained latency–accuracy trade-offs and traces out convex Pareto frontiers.
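A minimal sketch of the trigger logic, assuming a completeness scorer `qc(prefix)` in $[0, 1]$ (a hypothetical callable standing in for the learned estimator) and a tunable threshold:

```python
def maybe_start_reasoning(words, qc, tau: float = 0.9) -> int:
    """Return the earliest prefix length at which reasoning may begin."""
    for k in range(1, len(words) + 1):
        if qc(words[:k]) >= tau:   # question judged complete enough
            return k               # trigger CoT on words[:k], before speech ends
    return len(words)              # fall back to waiting for the full question

def toy_qc(prefix):
    """Hypothetical scorer: a trailing '?' marks a completed question."""
    return 1.0 if prefix and prefix[-1].endswith("?") else 0.3

k = maybe_start_reasoning("what is two plus two ?".split(), toy_qc)  # k == 6
```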
3. Policy-Making and Simultaneous Generation
LSLMs incorporate explicit policy-makers to decide when to emit responses. In simultaneous generation settings, LLM-driven frameworks such as LSG prompt the LLM to choose a read/write action at each time step. The core policy compares the KL divergence between the current next-token distribution (conditioned on the partial input) and a baseline next-token distribution: writing is triggered when the divergence indicates the partial input already suffices, or when model confidence exceeds a threshold. This approach achieves state-of-the-art latency–quality trade-offs and requires no offline policy-module training (Guo et al., 1 Jan 2025).
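A sketch of this decision rule under stated assumptions: the comparison direction and the threshold names `delta` and `gamma` are illustrative, not the paper's exact formulation.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two next-token distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def decide(p_current, p_baseline, delta=0.05, gamma=0.9):
    """WRITE if the partial-input distribution already matches the baseline
    (small KL) or the model is confident enough; otherwise READ more input."""
    if kl(p_current, p_baseline) < delta or max(p_current) > gamma:
        return "WRITE"
    return "READ"

action = decide([0.7, 0.2, 0.1], [0.72, 0.18, 0.1])  # -> "WRITE": distributions agree
```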
Full-duplex LSLMs implement FSM-driven control tokens for responsive behavior: [S.SPEAK], [C.SPEAK], [S.LISTEN], [C.LISTEN]. Each step maximizes either control-token or content-token probabilities, with the FSM transition function explicitly formalized (Wang et al., 29 May 2024).
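A toy rendering of the control loop: the four control tokens come from the text above, while the transition table itself is an illustrative assumption.

```python
# Control tokens follow the paper; this particular transition table is illustrative.
CONTROL_TOKENS = {"[S.SPEAK]", "[C.SPEAK]", "[S.LISTEN]", "[C.LISTEN]"}

TRANSITIONS = {
    ("LISTEN", "[S.SPEAK]"): "SPEAK",    # start speaking
    ("SPEAK", "[C.SPEAK]"): "SPEAK",     # continue speaking
    ("SPEAK", "[S.LISTEN]"): "LISTEN",   # yield the floor
    ("LISTEN", "[C.LISTEN]"): "LISTEN",  # keep listening
}

def transition(state: str, token: str) -> str:
    """delta(state, token): control tokens may switch state; content tokens keep it."""
    if token in CONTROL_TOKENS:
        return TRANSITIONS.get((state, token), state)
    return state

state = "LISTEN"
for tok in ["[C.LISTEN]", "[S.SPEAK]", "hello", "[S.LISTEN]"]:
    state = transition(state, tok)  # LISTEN -> LISTEN -> SPEAK -> SPEAK -> LISTEN
```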
4. Streaming Fusion and Interruption Handling
LSLMs realize simultaneous listening and speaking via streaming fusion. Middle fusion, which injects listening embeddings into each Transformer layer, performs best: it preserves speech synthesis quality (WER near baseline) while providing rapid, precise interruption detection, with high precision/recall/F1 even under noise (Ma et al., 5 Aug 2024). Early fusion corrupts generation quality, while late fusion yields less robust interruption boundaries.
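The sketch below shows one plausible wiring of middle fusion, injecting a projected listening embedding into a decoder layer ahead of self-attention; the dimensions, additive injection, and omitted causal mask are simplifying assumptions, not the exact architecture of Ma et al.

```python
import torch
import torch.nn as nn

class MiddleFusionLayer(nn.Module):
    """Decoder layer with the listening channel injected before self-attention.
    Additive injection and dimensions are assumptions; causal masking omitted."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.listen_proj = nn.Linear(d_model, d_model)  # project listener embeddings
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, listen_emb: torch.Tensor) -> torch.Tensor:
        h = x + self.listen_proj(listen_emb)   # middle fusion at this layer
        h = self.norm1(h + self.self_attn(h, h, h, need_weights=False)[0])
        return self.norm2(h + self.ffn(h))

layer = MiddleFusionLayer()
x = torch.randn(1, 16, 512)       # speaking-channel hidden states
listen = torch.randn(1, 16, 512)  # streaming listening embeddings
y = layer(x, listen)
```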
Interruption handling is enacted via special vocabulary tokens (e.g., an IRQ token), trained with the standard token-level loss so that the model learns to emit IRQ at interruption points. Interruptions are detected within 0.5 s, and model output ceases accordingly.
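A hedged sketch of the inference-side behavior, halting generation as soon as the model emits the IRQ token; the token id, `model.step` interface, and loop structure are hypothetical.

```python
IRQ_ID = 3  # hypothetical vocabulary id for the interruption token

def speak_until_interrupted(model, listen_stream, max_steps: int = 1000):
    """Generate speech tokens, stopping as soon as IRQ is emitted."""
    out = []
    for user_chunk in listen_stream:   # streaming user-audio features
        tok = model.step(user_chunk)   # one fused listen-while-speak step
        if tok == IRQ_ID:              # interruption detected
            break                      # cease TTS output immediately
        out.append(tok)
        if len(out) >= max_steps:
            break
    return out
```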
5. Optimization Strategies and Accuracy–Latency Trade-Offs
Direct Preference Optimization (DPO) extends LSLM fine-tuning to push out the accuracy–latency Pareto frontier. LSLMs sample contrastive pairs, preferring the shorter or more accurate trace $y_w$ over the dispreferred trace $y_l$:
$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right].$$
Adding an NLL regularizer on the preferred path stabilizes training.
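A compact sketch of this objective, assuming sequence-level log-probabilities are precomputed for the preferred ($w$) and dispreferred ($l$) traces under the policy and reference models; the `nll_coef` weight is an illustrative knob.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
             beta: float = 0.1, nll_coef: float = 0.1) -> torch.Tensor:
    """DPO over (preferred, dispreferred) trace pairs, plus the NLL
    regularizer on the preferred path described in the text."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    pref = -F.logsigmoid(margin).mean()   # standard DPO preference term
    nll = -logp_w.mean()                  # stabilizing NLL on preferred traces
    return pref + nll_coef * nll

# Toy usage: batch of 4 sequence log-probs under policy and reference models.
lw, ll = torch.randn(4), torch.randn(4)
loss = dpo_loss(lw, ll, lw.detach() - 0.1, ll.detach() + 0.1)
```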
Empirical findings show:
- CoT fine-tuning: 2–3× gains in reasoning accuracy
- Early reasoning (QC-based): reduces latency at the cost of a drop in absolute accuracy
- DPO for early CoT: restores accuracy at minimal latency cost
- Length-based DPO: shrinks reasoning traces by roughly 70% in token count with no accuracy degradation (Shih et al., 8 Oct 2025)
6. Evaluation, Benchmarks, and Performance Metrics
LSLMs are evaluated across benchmarks and scenarios including ARC-E, ARC-C, SIQA, PIQA, GSM8K, LibriSpeech, and multi-agent social deduction environments:
- Response latency: sub-second FTED, with many responses under 500 ms (Wang et al., 29 May 2024).
- Interruption precision: LSLM achieves a clear absolute gain over commercial models.
- Duplexing robustness: models sustain WER close to vanilla TTS even under heavy noise, with precise turn-taking (high F1 in controlled settings; see the metric sketch after this list) (Ma et al., 5 Aug 2024).
- Multi-agent games: listening-while-speaking agents double win rates compared to policy-only RL baselines, producing human-like grounded discussions and accurate hidden-state inference (Sarkar et al., 9 Feb 2025).
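For concreteness, the sketch below scores interruption detection by matching predicted events to reference events within a tolerance window; the 0.5 s window mirrors the detection latency reported earlier, and the greedy matching scheme is an assumption.

```python
def interruption_prf1(pred_times, ref_times, tol: float = 0.5):
    """Precision/recall/F1 with greedy matching inside a +/- tol window (seconds)."""
    ref = sorted(ref_times)
    matched, tp = set(), 0
    for t in sorted(pred_times):
        for i, r in enumerate(ref):
            if i not in matched and abs(t - r) <= tol:
                matched.add(i)
                tp += 1
                break
    p = tp / len(pred_times) if pred_times else 0.0
    r = tp / len(ref_times) if ref_times else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = interruption_prf1([1.2, 4.9], [1.0, 5.0, 8.0])  # (1.0, 0.667, 0.8)
```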
7. Limitations, Open Questions, and Future Directions
Current limitations include:
- Model generalization under real-world accents and background noise.
- Turn-taking beyond stop events, overlapping talkers, and multi-modal interruptions.
- Computational overhead at inference and scaling to very large LLMs (quantization and sparse prediction are plausible remedies) (Deng et al., 21 Dec 2024).
Open directions include integrating domain-specific adapters for speech or dialogue, extending LSLM capabilities to continuous multi-turn, cross-lingual, and multimodal (e.g., audio-visual) settings, and refining policy and preference learning for richer simultaneous interaction. Empirical extension to real recordings, low-resource scenarios, and rapid resetting for dynamic dialogue segmentation remain essential avenues.
LSLMs represent a foundational shift toward continuous, responsive, and reasoning-capable spoken language systems, leveraging multi-stream architectures, explicit reasoning triggers, streaming fusion for interruption resilience, and advanced optimization techniques to balance accuracy and latency in real-world deployment (Shih et al., 8 Oct 2025, Yuen et al., 25 Sep 2024, Ma et al., 5 Aug 2024, Wang et al., 29 May 2024, Guo et al., 1 Jan 2025, Deng et al., 21 Dec 2024, Sarkar et al., 9 Feb 2025, Novitasari et al., 2020).