DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action

Published 20 May 2026 in eess.AS | (2605.20755v1)

Abstract: Recent advances in spoken dialogue LLMs have shifted from turn-based to full-duplex designs, where the model continuously listens to the user while generating responses. However, existing duplex backbones still lack a native channel for in-conversation planning and tool calling, leaving real-time agentic behaviour either tied to turn boundaries or relegated to an external cascade. We propose DuplexSLA, a native full-duplex Speech-Language-Action foundation model that decodes assistant audio together with a structured action stream on a shared 160 ms chunk timeline. DuplexSLA is built on a dual-stream three-channel formulation: a continuous user audio channel, a discrete assistant audio channel, and a rate-limited textual action channel, all decoded jointly by a single backbone, so that listening, speaking, planning, and tool calling unfold on one shared clock. Two capabilities define the model: (1) semantic-driven turn-taking control, where interruption, pause, and backchannel are handled inside the same backbone instead of by an external semantic VAD; and (2) in-conversation planning and tool calling, where planning text and structured tool calls are emitted on the action channel without halting assistant audio, so that multi-action and backchannel-triggered tool use are interleaved with ongoing speech. To evaluate these capabilities together, we further construct DuplexSLA-Bench, a duplex benchmark covering pause, interrupt, and backchannel turn-taking together with three styles of in-conversation tool calling. Our project page, interactive demos, and the DuplexSLA-Bench evaluation suite are publicly available at https://github.com/hyzhang24/DuplexSLA.

Authors (16)

Summary

The paper introduces a unified full-duplex model that embeds listening, speaking, planning, and tool invocation into a time-synchronized three-channel backbone.
It demonstrates significant latency improvements with sub-second tool-call and turn-taking performance, outperforming traditional ASR+LLM cascades.
The native action channel enables real-time planning and multi-action tool calls, paving the way for richer, multimodal spoken dialogue systems.

DuplexSLA: A Full-Duplex Foundation Model for Synchronized Speech, Language, and Action

Introduction

DuplexSLA is a native full-duplex speech-language-action (SLA) foundation model. It addresses the limitations of turn-based spoken dialogue systems and existing duplex backbones by integrating synchronized speech, semantic turn-taking, and real-time tool calling. Unlike traditional ASR+LLM+TTS cascades or models dependent on external semantic VADs, DuplexSLA embeds listening, speaking, planning, and tool invocation into a unified, time-aligned backbone. This lets the model exhibit agentic speech behaviors (natural interruptions, backchanneling, and interleaved tool execution) on a strict conversational clock with minimal latency.

Model Formulation and Architecture

The core innovation is a dual-stream three-channel setup with fixed 160 ms conversational chunks:

User Channel: Causally encoded user audio (2×80 ms features per chunk).
Assistant Channel: Discrete TA4 assistant speech (text anchor plus 4 discrete audio tokens per chunk at 40 ms stride).
Action Channel: Rate-limited (≤10 tokens per chunk) textual stream carrying delayed transcripts, turn-taking labels (interrupt, backchannel, response), free-form planning, and JSON-style tool calls.

The LLM backbone autoregressively generates the assistant TA4 and the action channel in lockstep, while the user audio is processed causally and not predicted by the model. The action channel is a critical design element: it externalizes interaction control and tool use into a time-aligned textual lane, decoupled from the assistant speech. All behavioral logic is embedded natively—there is no recourse to external VAD or rule-based controllers.
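The per-chunk budgets implied by the three channels can be checked with a little arithmetic. The sketch below is only illustrative: the `Chunk` class, its field names, and the use of strings as token stand-ins are assumptions made here, not structures taken from the paper. Only the numeric constants (160 ms chunks, 80 ms user features, 40 ms assistant stride, at most 10 action tokens) come from the text above.

```python
from dataclasses import dataclass, field
from typing import List

# Constants stated in the text above.
CHUNK_MS = 160             # duration of one conversational chunk
USER_FEATURE_MS = 80       # each user audio feature covers 80 ms
ASSISTANT_STRIDE_MS = 40   # stride of one discrete assistant audio token
MAX_ACTION_TOKENS = 10     # rate limit of the action channel per chunk

# Derived budgets.
USER_FEATURES_PER_CHUNK = CHUNK_MS // USER_FEATURE_MS       # 2
AUDIO_TOKENS_PER_CHUNK = CHUNK_MS // ASSISTANT_STRIDE_MS    # 4 (the "A4" in TA4)
CHUNKS_PER_SECOND = 1000 / CHUNK_MS                         # 6.25


@dataclass
class Chunk:
    """Hypothetical container for one 160 ms step (names are illustrative)."""
    user_features: List[str]          # observed input, never predicted
    text_anchor: str                  # the "T" in TA4
    audio_tokens: List[str]           # the four "A" tokens
    action_tokens: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the per-chunk budgets described in the text.
        if len(self.user_features) != USER_FEATURES_PER_CHUNK:
            raise ValueError("expected 2 user features per chunk")
        if len(self.audio_tokens) != AUDIO_TOKENS_PER_CHUNK:
            raise ValueError("expected 4 assistant audio tokens per chunk")
        if len(self.action_tokens) > MAX_ACTION_TOKENS:
            raise ValueError("action channel exceeds 10 tokens per chunk")
```

Under these numbers the action channel can carry at most 10 × 6.25 = 62.5 tokens per second, which is why longer plans or tool calls have to be spread over several chunks.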

Chunks are serialized for the LLM as:

<|user_audio_begin|> UU <|user_audio_end|> <|assistant_audio_begin|> TAAAA <|assistant_audio_end|> (action text) <|action_end|>

This scheme ensures strict temporal synchronization between agent audio, textual actions, and external functions, enabling real-time agentic behavior.
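A minimal sketch of this serialization, assuming each user feature and assistant token can be represented by a placeholder string. The function name `serialize_chunk`, the placeholder strings, and the space-separated joining are assumptions for illustration; only the delimiter tokens and their order follow the template above.

```python
from typing import Sequence


def serialize_chunk(
    user_feats: Sequence[str],
    text_anchor: str,
    audio_tokens: Sequence[str],
    action_text: str = "",
) -> str:
    """Lay out one chunk in the order: user audio, assistant TA4, action text."""
    if len(user_feats) != 2:
        raise ValueError("a chunk carries exactly 2 user audio features")
    if len(audio_tokens) != 4:
        raise ValueError("a chunk carries exactly 4 assistant audio tokens")

    parts = [
        "<|user_audio_begin|>", *user_feats, "<|user_audio_end|>",
        "<|assistant_audio_begin|>", text_anchor, *audio_tokens,
        "<|assistant_audio_end|>",
    ]
    if action_text:                  # the action lane may be empty in a chunk
        parts.append(action_text)
    parts.append("<|action_end|>")   # every chunk is closed by the action delimiter
    return " ".join(parts)


example = serialize_chunk(["U", "U"], "T", ["A", "A", "A", "A"], "(action text)")
```

Because every chunk ends with the same delimiter, a consumer can split the stream back into 160 ms steps without any side information.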

Training Data and Recipe

DuplexSLA requires a custom data pipeline because standard dialogue corpora do not match the dual-stream three-channel structure. The data construction pipeline has three steps:

Annotation: An LLM semantically annotates raw dialogues with tool-call objects, rationale fragments, and turn-taking events, each aligned to chunk-indexed semantic triggers.
Synthesis: User and assistant utterances are force-aligned and voice-cloned, merged to reflect chunk-level semantics.
Supervision: Assistant TA4 and action streams are supervised directly; the user audio stream provides only observed inputs.
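To make "chunk-indexed" concrete, the sketch below maps an event timestamp to a chunk index and spreads an action's tokens over consecutive chunks so that no chunk exceeds the 10-token limit. The helper names `chunk_index` and `schedule_action`, and the idea of starting emission in the chunk that contains the trigger, are assumptions for illustration; the paper's actual alignment procedure is not described in this summary.

```python
from typing import Dict, List, Sequence

CHUNK_MS = 160          # chunk duration from the text
ACTION_BUDGET = 10      # action-channel limit per chunk from the text


def chunk_index(time_ms: float) -> int:
    """Index of the 160 ms chunk that contains the given timestamp."""
    if time_ms < 0:
        raise ValueError("timestamps must be non-negative")
    return int(time_ms // CHUNK_MS)


def schedule_action(
    tokens: Sequence[str], trigger_ms: float, budget: int = ACTION_BUDGET
) -> Dict[int, List[str]]:
    """Spread action tokens over consecutive chunks, starting at the trigger chunk."""
    start = chunk_index(trigger_ms)
    return {
        start + i: list(tokens[offset:offset + budget])
        for i, offset in enumerate(range(0, len(tokens), budget))
    }


# A 23-token tool call triggered 1.0 s into the dialogue occupies three chunks.
plan = schedule_action([f"tok{i}" for i in range(23)], trigger_ms=1000)
```

In this example the trigger falls in chunk 6 (1000 // 160), and the call finishes in chunk 8, that is, about 3 × 160 ms = 480 ms after emission starts.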

Training is staged:

Continued Pretraining (CPT): 500k hours (320k duplex dialogues, 2×90k ASR, 1.92M text samples) introduces the backbone to dual-stream three-channel serialization, with explicit time alignment and silence handling.
Capability-Oriented Post-Training: 50k hours specializing in interaction control (pause, interrupt, backchannel) and three tool-call styles; this stage fine-tunes the model for sub-second interaction behaviors and time-critical agentic actions.

Dual-side ASR is critical for enforcing time alignment between speech and action emission, a prerequisite for low-latency tool use.

Full-Duplex Turn-Taking and Tool Calling

DuplexSLA's fundamental distinction is the integration of semantic turn-taking decisions and in-conversation tool planning into the same decoding loop as assistant speech, which yields two distinctive behaviors:

Semantic-Driven Turn-Taking: Pauses, interruptions, and backchannels are managed from the model's internal representation, not by heuristics or external VAD postprocessing. The model emits control labels on the action channel, which alter the assistant speech flow at chunk-level (160 ms) granularity.
Native In-Conversation Planning and Tool Calls: Planning text and structured tool calls are emitted on the action channel synchronously, without halting or modifying the assistant speech stream. Backchannel-triggered and multi-action tool calls are naturally interleaved, annotated with explicit timestamps.

This coordinated emission scheme avoids the latency penalty of turn-based or cascade architectures, which must wait for a turn boundary or an extra pipeline stage before acting.
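On the consuming side, a client has to reassemble tool calls that arrive a few tokens at a time while audio keeps playing. The sketch below shows one way to do this, assuming tool calls appear as JSON objects embedded in the action text and that any surrounding text is planning text that can be skipped. The class `ActionStreamParser` and the `name`/`arguments` keys in the example are hypothetical; the summary only says tool calls are "JSON-style".

```python
import json
from typing import Any, List


class ActionStreamParser:
    """Accumulates per-chunk action text and yields completed JSON tool calls."""

    def __init__(self) -> None:
        self._buf = ""
        self._decoder = json.JSONDecoder()

    def feed(self, chunk_text: str) -> List[Any]:
        """Add one chunk of action text; return tool calls completed by it."""
        self._buf += chunk_text
        calls: List[Any] = []
        while True:
            start = self._buf.find("{")
            if start == -1:
                self._buf = ""                    # only non-JSON text left, drop it
                break
            try:
                obj, end = self._decoder.raw_decode(self._buf, start)
            except json.JSONDecodeError:
                self._buf = self._buf[start:]     # incomplete call, wait for more
                break
            calls.append(obj)
            self._buf = self._buf[end:]           # continue after the parsed object
        return calls


parser = ActionStreamParser()
first = parser.feed('plan: check weather {"name": "get_w')
second = parser.feed('eather", "arguments": {"city": "Paris"}} done')
```

The first call returns nothing because the JSON object is still open; the second returns the completed call. A production parser would also need to handle braces inside planning text and malformed objects, which this sketch does not.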

Evaluation and Numerical Results

Evaluation is conducted on DuplexSLA-Bench—a dedicated 2,100-case duplex benchmark—covering four turn-taking scenarios (normal, pause, interrupt, backchannel) and three tool-call styles (single, multi, backchannel-action).

Tool-Call Performance

Latency: DuplexSLA achieves average tool-call delays of 0.64s (single), 0.68s (multi), and 0.57s (backchannel-action)—~4× faster than the ASR+LLM cascade baseline (delays of 2.33s–4.71s).
Accuracy: DuplexSLA matches or closely trails cascade accuracy (85.6–96.0%), with the largest drop in the more difficult multi-action setting.
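The summary does not say which cascade delay corresponds to which tool-call style, so the speedup can only be bounded from the figures quoted above. The following sketch computes those bounds; the variable names are illustrative.

```python
# Average tool-call delays quoted above, in seconds.
duplex_delays = {"single": 0.64, "multi": 0.68, "backchannel-action": 0.57}
cascade_min, cascade_max = 2.33, 4.71   # range reported for the ASR+LLM cascade

# Most conservative pairing: fastest cascade case vs. slowest DuplexSLA case.
speedup_lower = cascade_min / max(duplex_delays.values())
# Most favorable pairing: slowest cascade case vs. fastest DuplexSLA case.
speedup_upper = cascade_max / min(duplex_delays.values())

print(f"speedup between {speedup_lower:.2f}x and {speedup_upper:.2f}x")
```

The bounds come out at roughly 3.4× and 8.3×, which is consistent with the "~4×" figure as a conservative headline number.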

Turn-Taking

Backchannel Handling: DuplexSLA attains 98.33% correct backchannel responses, substantially outperforming all baselines (≤40%).
Latency: Sub-second reaction in all scenarios (0.27s–0.40s), consistently lower than commercial (Gemini, GPT-Realtime) and open-source (MiniCPM-o, PersonaPlex) competitors (typically 0.6–1.7s).
Pause/Interrupt: 93–99% accuracy across scenarios, validating robust chunk-level semantic control.

Baselines either introduce significant delay or degrade on scenarios (such as pause/backchannel) not natively represented in their architectures.

Implications and Future Directions

DuplexSLA demonstrates that agentic, real-time spoken dialogue with precise tool use and nuanced turn-taking is achievable within a single backbone, given appropriate data and serialization. The explicit action channel is critical—not just for tool-use speed but for modeling the intertwined timing of natural conversation. The approach opens several future directions:

Richer Action/Planning Channels: Extending the action lane to complex sequential workflows, more expressive agentic planning, and richer multimodal or task-specific APIs.
Open-Domain/Device Integration: Seamless control of broader on-device and cloud-based tool ecosystems, with real-time synchronization between speech and action.
Multilinguality and Emotional Intelligence: Time-synchronized affective and multilingual capabilities to match the complexity of human spoken interaction.
Fine Latency Control and Deployment Agility: Architectural decoupling of the per-chunk token budget enables practical tuning for diverse hardware and user experience constraints.
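The last point can be made concrete with a back-of-the-envelope throughput estimate: to stay real-time, the backbone must finish generating each chunk's tokens within 160 ms. The sketch below assumes that one text anchor, four audio tokens, the action tokens, and three delimiters (assistant audio begin, assistant audio end, action end) are generated per chunk. The delimiter count and the function name are assumptions, not figures from the paper.

```python
def required_tokens_per_second(
    action_budget: int = 10,        # action-channel limit per chunk (from the text)
    audio_tokens: int = 4,          # assistant audio tokens per chunk (from the text)
    text_anchors: int = 1,          # the "T" in TA4 (from the text)
    generated_delimiters: int = 3,  # assumption: delimiters the model itself emits
    chunk_ms: int = 160,            # chunk duration (from the text)
) -> float:
    """Worst-case decode rate needed to keep up with the conversational clock."""
    per_chunk = action_budget + audio_tokens + text_anchors + generated_delimiters
    return per_chunk * 1000 / chunk_ms


full_budget = required_tokens_per_second()                  # action lane saturated
speech_only = required_tokens_per_second(action_budget=0)   # action lane idle
```

Under these assumptions a saturated action lane requires about 112.5 generated tokens per second, against 50 when the action lane is idle, so lowering the per-chunk action budget is a direct lever for slower hardware.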

Conclusion

DuplexSLA establishes an integrated framework for full-duplex speech-language-action, unifying semantic turn-taking, structured planning, and tool use on a strict conversational clock. It achieves state-of-the-art sub-second latency across nuanced turn-taking and tool-call scenarios, outperforming both pipeline and alternative duplex architectures in responsiveness and fluency. The model makes a compelling case for explicit, time-synchronized action channels as the foundation for future agentic spoken dialogue systems (2605.20755).
