Duplex Interaction Alignment

Updated 16 June 2026

Duplex Interaction Alignment is a paradigm that synchronizes continuous input and output streams, enabling real-time, natural dialogue in AI systems.
It employs dual-stream transformers, token interleaving, and control heads to optimize turn-taking, minimize latency, and manage proactive behaviors.
Empirical evaluations of systems such as SALM-Duplex and MiniCPM-o 4.5 highlight its effectiveness in bridging the gap between traditional turn-based interaction and fluid duplex communication.

Duplex interaction alignment refers to the set of modeling, data, and optimization strategies that enable artificial agents—primarily large language or multimodal models—to process, synchronize, and align simultaneous input and output streams in real-time dyadic or multi-party communication settings. This paradigm is distinguished from conventional turn-based systems by the ability to continuously ingest user input (speech, text, or multimodal cues) while generating uninterrupted agent output, yielding natural turn-taking, low-latency response to interruptions, real-time barge-in handling, and, in advanced systems, proactive or tool-triggering behaviors. This article surveys the principles, architectures, objective functions, empirical protocols, and evaluation techniques underpinning duplex interaction alignment across speech, multimodal, and communication system domains.

1. Foundations and Motivation

Traditional turn-based human-computer interaction protocols—where input and output phases alternate with rigid boundaries—limit responsiveness, prevent user interruptions during output generation, and diverge from natural human conversation dynamics. Duplex interaction alignment addresses these shortcomings by enabling synchronous "listen-and-speak" models that achieve fluid, human-like engagement.

Contemporary systems instantiate this paradigm across speech-to-speech language modeling (e.g., SALM-Duplex (Hu et al., 21 May 2025)), multi-channel multimodal frameworks (e.g., MiniCPM-o 4.5 (Cui et al., 30 Apr 2026), DuplexOmni (Huang et al., 8 Jun 2026)), and multi-agent environments in communications (e.g., MIMO two-way interference alignment (Fouladgar et al., 2015)). Key challenges include aligning input and output streams that differ in sequence length and semantics; ensuring timely, semantically tight agent responses (pause, barge-in, backchannel); minimizing latency; and maintaining model fluency and dialogue safety.

Duplex alignment unifies approaches from autoregressive token synchronization, preference-based objective shaping, explicitly supervised temporal coupling, and RL-based behavioral refinement, all aimed at optimizing both content-level and interaction-level metrics over continuous, streaming exchanges (Wu et al., 26 Jun 2025, He et al., 17 May 2026, Ohashi et al., 9 Jun 2026).

2. Architectural Mechanisms for Duplex Alignment

2.1 Dual-Stream and Multi-Channel Backbones

Modern duplex models ingest two or more continuous streams:

User input: Causal embeddings or tokens generated from streaming encoders (e.g., FastConformer-CTC in SALM-Duplex (Hu et al., 21 May 2025), Whisper-derived ASR in MiniCPM-o 4.5 (Cui et al., 30 Apr 2026)).
Agent output: Discrete tokens (text, audio codec codes, or multimodal representations) generated via neural audio codecs (e.g., NanoCodec), TTS heads, or multi-modal decoders.
Additional channels: Action or tool channels for in-conversation planning or real-time tool invocation (e.g., action channel in DuplexSLA (Zhang et al., 20 May 2026)).

Fusion typically occurs once per fixed-duration "chunk" or frame (e.g., 80, 160, or 1000 ms), via embedding summation and positional encoding. Time alignment is preserved both by hard serialization and by explicit boundary tokens for each stream.
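The per-chunk fusion by embedding summation described above can be illustrated with a minimal sketch. It assumes both streams have already been encoded into vectors of the same dimension, one per chunk; the function names and the toy two-dimensional embeddings are hypothetical and do not come from any of the cited systems.

```python
from typing import List

Vector = List[float]


def fuse_chunk(user_emb: Vector, agent_emb: Vector, pos_emb: Vector) -> Vector:
    """Sum the user, agent, and positional embeddings for one time chunk."""
    if not (len(user_emb) == len(agent_emb) == len(pos_emb)):
        raise ValueError("all embeddings must share the same dimension")
    return [u + a + p for u, a, p in zip(user_emb, agent_emb, pos_emb)]


def fuse_streams(user: List[Vector], agent: List[Vector], pos: List[Vector]) -> List[Vector]:
    """Fuse two time-aligned streams chunk by chunk, preserving chunk order."""
    if not (len(user) == len(agent) == len(pos)):
        raise ValueError("streams must cover the same number of chunks")
    return [fuse_chunk(u, a, p) for u, a, p in zip(user, agent, pos)]


# Toy example: two chunks, embedding dimension 2 (illustrative values only).
user = [[1.0, 0.0], [0.5, 0.5]]
agent = [[0.0, 1.0], [0.0, 0.0]]
pos = [[0.25, 0.25], [0.5, 0.5]]
fused = fuse_streams(user, agent, pos)
```

Because the sum is taken chunk by chunk, the fused sequence has one vector per chunk and inherits the time order of both inputs.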

2.2 Self-Attention and Fusion

A common design is to unify the time-aligned input and output embeddings into a single transformer backbone, allowing the model's self-attention layers to mix information across user, agent, and (optionally) planning channels. Alignment is driven by the attention mechanism itself, with no bespoke cross-modal block required (e.g., channel fusion in SALM-Duplex, positional and modal adapters in MiniCPM-o 4.5).

More elaborate multimodal designs may employ modular attention mechanisms to decouple inner-modal refinement from inter-modal interaction. One example is the "correct-after-align" duplex attention of MODA (Zhang et al., 7 Jul 2025), which uses Gram-based basis vectors for explicit cross-modal mapping, adaptive masking to calibrate information flow, and decoupled self- and cross-attention mechanisms.

2.3 Control Heads and Proactive Generation

Duplex-capable systems often add lightweight control heads that predict "when to respond" or "when to speak" separately from "what to say." These may be simple softmax classifiers over special tokens (e.g., “CONTINUE_SPEECH”, “STOP_FOR_USER”, and “BACKCHANNEL” in MinMo (Chen et al., 10 Jan 2025)) or binary/structured control tokens that gate the output channel (MiniCPM-o 4.5, DuplexOmni). This design supports proactive behaviors and enables rapid adaptation to user interruption, barge-in, or salient event detection.
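A softmax control head of this kind can be sketched in a few lines. The token names follow the MinMo example above; the logits are made-up inputs standing in for the output of a hypothetical linear layer on the backbone's hidden state, and the function names are illustrative only.

```python
import math
from typing import Dict, List, Tuple

CONTROL_TOKENS = ["CONTINUE_SPEECH", "STOP_FOR_USER", "BACKCHANNEL"]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def control_decision(logits: List[float]) -> Tuple[str, Dict[str, float]]:
    """Return the most likely control token and the full distribution."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CONTROL_TOKENS[best], dict(zip(CONTROL_TOKENS, probs))


# Hypothetical logits for one chunk in which the user starts talking.
token, probs = control_decision([0.1, 2.0, -1.0])
```

Keeping this decision separate from content generation lets the timing behavior be supervised or tuned without retraining what the model says.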

3. Objective Functions and Data Alignment

3.1 Multi-Channel Losses and Weighted Objectives

Supervised training typically employs per-channel next-token cross-entropy losses. For example, SALM-Duplex minimizes

$\mathcal{L} = \alpha\,\mathcal{L}_\text{text} + \beta\,\mathcal{L}_\text{speech}$

with empirically chosen weights ($\alpha = 3$, $\beta = 1$), prioritizing text prediction to enhance reasoning without degrading codec generation (Hu et al., 21 May 2025).
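As a worked illustration of the weighted objective, the sketch below computes $\alpha\,\mathcal{L}_\text{text} + \beta\,\mathcal{L}_\text{speech}$ from per-step predicted distributions. The weights are those quoted above; the probability tables, function names, and the use of a simple mean over steps are assumptions made for the example.

```python
import math
from typing import List


def cross_entropy(probs_per_step: List[List[float]], targets: List[int]) -> float:
    """Mean negative log-likelihood of the target ids."""
    losses = [-math.log(p[t]) for p, t in zip(probs_per_step, targets)]
    return sum(losses) / len(losses)


def duplex_loss(text_probs, text_targets, speech_probs, speech_targets,
                alpha: float = 3.0, beta: float = 1.0) -> float:
    """Weighted sum of the text-channel and speech-channel losses."""
    l_text = cross_entropy(text_probs, text_targets)
    l_speech = cross_entropy(speech_probs, speech_targets)
    return alpha * l_text + beta * l_speech


# One step per channel, toy distributions over a two-token vocabulary.
loss = duplex_loss([[0.5, 0.5]], [0], [[0.25, 0.75]], [0])
```

With these toy inputs the text term is $\ln 2$ and the speech term is $\ln 4$, so the total is $3\ln 2 + 2\ln 2 = 5\ln 2$.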

Auxiliary tasks (e.g., CTC-aligned ASR losses) further enhance channel-specific supervision—DuplexSLA employs chunk-snapped CTC for tight audio/text correspondence (Zhang et al., 20 May 2026). Silence masking and data augmentation (e.g., cut-off on user interruption, silence spans on pause) simulate conversational phenomena, teaching the model when to emit speech tokens or remain idle.
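The cut-off augmentation for user interruptions can be sketched as follows. This is only one plausible way to build such training targets: the silence symbol, the `reaction_frames` parameter, and the function name are hypothetical and are not taken from the cited systems.

```python
from typing import List

SILENCE = "<sil>"  # hypothetical placeholder for an idle agent frame


def apply_barge_in(agent_frames: List[str], interrupt_frame: int,
                   reaction_frames: int = 1) -> List[str]:
    """Cut the agent output shortly after the user interrupts.

    Frames up to `interrupt_frame + reaction_frames` are kept; the rest are
    replaced by silence so the target sequence keeps its original length.
    """
    cutoff = min(len(agent_frames), interrupt_frame + reaction_frames)
    return agent_frames[:cutoff] + [SILENCE] * (len(agent_frames) - cutoff)


# The user barges in at frame 2; the agent is allowed one more frame.
augmented = apply_barge_in(["a", "b", "c", "d", "e"], interrupt_frame=2)
```

Keeping the sequence length fixed preserves frame alignment with the user stream, which is what lets the model learn to stay idle rather than simply stop generating.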

3.2 Frame- and Sentence-Level Alignment

Many systems rely on turn-level or frame-level alignment rather than word-level forced alignment. For instance, MinMo's duplex phase aligns at chunk or control-token granularity, requiring only sentence- or event-level timestamps (Chen et al., 10 Jan 2025). SCoT introduces CTC-forced alignment of every input and output frame, segmenting continuous interaction into blocks where explicit Chain-of-Thought targets are computed per block (Arora et al., 2 Oct 2025). In contrast, FLM-Audio aligns "natural" monologue sentences with the audio stream, handling lead/lag via a dual-format training regimen without word-level timestamp supervision (Yao et al., 2 Sep 2025).

3.3 Preference Feedback and RL Post-Training

Alignment is further refined by optimizing directly for behavioral preferences using datasets of human or AI-annotated preference pairs. DPO or RL-based post-training uses axis-specific rewards for core interactive behaviors—pause handling, turn-taking, backchannel, and interruption—jointly with LLM-based semantic quality scores (Wu et al., 26 Jun 2025, Ohashi et al., 9 Jun 2026). Reward shaping penalizes ill-timed takeovers or silence and reinforces prompt, context-appropriate responses, yielding improvements in both timing and dialogue quality.
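For reference, the standard DPO objective on a single preference pair can be written compactly. The sketch below shows the generic loss, $-\log\sigma(\beta\,[(\log\pi(y_w) - \log\pi_\text{ref}(y_w)) - (\log\pi(y_l) - \log\pi_\text{ref}(y_l))])$, and is not the specific reward design of any cited system; the log-probabilities and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: the loss equals log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy prefers the well-timed response more than the reference does.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In a duplex setting the chosen and rejected responses would differ in timing behavior (for example, yielding promptly versus talking over the user) as well as in content.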

4. Empirical Evaluation and Benchmarking

4.1 Real-Time, Multi-Axis Task Suites

Comprehensive duplex benchmarks include Omni-DuplexEval (He et al., 17 May 2026), DuplexSLA-Bench (Zhang et al., 20 May 2026), and Full-Duplex-Bench (Ohashi et al., 9 Jun 2026). Tasks test real-time description (live event narration, object counting, fine-grained action segmentation), proactive reminders and tool triggers, interruption response, pause detection, and backchannel behavior. Evaluation metrics span:

Content quality and consistency (LLM-as-a-Judge, content scores)
Temporal alignment (precision/recall, token-level or event-aligned latency)
Proactive detection and response rates
Human vs. model performance gaps

Scores are aggregated over scenario types and further analyzed for win rates, latency, and human alignment.
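Temporal precision, recall, and latency can be computed by matching predicted response times to reference event times within a tolerance window. The sketch below is a generic illustration, assuming greedy nearest-neighbor matching and a 0.5 s tolerance; the benchmarks named above may define matching and tolerances differently.

```python
from typing import List, Optional, Tuple


def match_events(predicted: List[float], reference: List[float],
                 tolerance: float = 0.5) -> Tuple[float, float, Optional[float]]:
    """Greedily match each reference time to the nearest unused prediction.

    Returns (precision, recall, mean signed latency in seconds).
    """
    unused = sorted(predicted)
    latencies = []
    for ref in sorted(reference):
        candidates = [p for p in unused if abs(p - ref) <= tolerance]
        if candidates:
            best = min(candidates, key=lambda p: abs(p - ref))
            unused.remove(best)
            latencies.append(best - ref)
    tp = len(latencies)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    mean_latency = sum(latencies) / tp if tp else None
    return precision, recall, mean_latency


# Three model responses (seconds) against two annotated events.
precision, recall, mean_latency = match_events([1.25, 4.0, 9.0], [1.0, 4.5])
```

A negative latency here means the matched response came before the annotated event time, which can be desirable for anticipatory behaviors and undesirable for interruptions.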

Table: Representative Findings

Model/Task	Duplex RTD Score	PR Success	Turn Latency	Speech Quality (UTMOS)
SALM-Duplex	83.0% (barge-in)	—	0.52 s	4.3
MiniCPM-o 4.5	59.1 (RTD)	20.0	0.59 s	—
DuplexSLA	≥93.3% (turn-take)	—	≤0.4 s	—
Human Duplex	70.8	92.8	—	—

4.2 Synchronization and Internal Alignment

Analysis using Centered Kernel Alignment (CKA) reveals tight representational synchronization between speaker and listener models during full-duplex exchange, peaking at near-zero lag and degrading with noise or aggressive decoding bias (Riera et al., 19 May 2026). Causal LSTM probes verify that turn-taking cues are encoded in the model's internal states, supporting anticipation of speech boundaries up to ∼1 s in advance. These metrics enable quantitative diagnostics of interaction-level alignment, complementing surface behavioral evaluation.
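A lag-dependent CKA analysis can be sketched with the linear variant of CKA, $\lVert X^\top Y\rVert_F^2 / (\lVert X^\top X\rVert_F\,\lVert Y^\top Y\rVert_F)$ on column-centered activations. Whether the cited analysis uses a linear or another kernel is not stated here, so the linear form, the function names, and the toy activations are assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are time frames, columns are features


def center(m: Matrix) -> Matrix:
    """Subtract the per-column mean from every row."""
    n, d = len(m), len(m[0])
    means = [sum(row[j] for row in m) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in m]


def cross_sq_norm(a: Matrix, b: Matrix) -> float:
    """Squared Frobenius norm of a^T b."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += dot * dot
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA between two activation matrices with matching rows."""
    xc, yc = center(x), center(y)
    denom = math.sqrt(cross_sq_norm(xc, xc) * cross_sq_norm(yc, yc))
    return cross_sq_norm(xc, yc) / denom


def lagged_cka(speaker: Matrix, listener: Matrix, lag: int) -> float:
    """CKA between speaker frames and listener frames delayed by `lag` >= 0."""
    if lag == 0:
        return linear_cka(speaker, listener)
    return linear_cka(speaker[:-lag], listener[lag:])


# Toy activations: 4 frames, 2 features.
x = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
same = lagged_cka(x, x, 0)
```

Sweeping `lag` over a range and plotting the result gives the kind of synchronization curve whose peak location is discussed above.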

5. Innovations in Duplex Alignment Methodologies

5.1 Streaming and Time-Chunked Inference

Duplex models frequently partition the time axis into fixed-size "chunks" (e.g., 160 ms in DuplexSLA (Zhang et al., 20 May 2026); 480 ms in DuplexOmni (Huang et al., 8 Jun 2026)). Within each chunk, all modalities are serialized, and prediction proceeds autoregressively—enforcing strict causal dependence and enabling zero-latency switchovers.
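The chunked, strictly causal inference loop can be sketched as below. The policy function is a toy stand-in for a model forward pass, and the stream labels and silence symbol are hypothetical; the point is only that the output for chunk t is computed from chunks up to and including t.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (stream name, chunk content)


def run_duplex(user_chunks: List[str],
               step_fn: Callable[[History], str]) -> List[str]:
    """Consume one user chunk per step and emit one agent chunk per step."""
    history: History = []
    outputs = []
    for chunk in user_chunks:
        history.append(("user", chunk))
        out = step_fn(history)  # sees only past and current chunks
        history.append(("agent", out))
        outputs.append(out)
    return outputs


def toy_policy(history: History) -> str:
    """Stay silent while the current user chunk is non-empty, else speak."""
    _, last_user = history[-1]
    return "<sil>" if last_user else "speak"


# Empty strings stand for chunks in which the user is silent.
outputs = run_duplex(["hi", "", "wait", ""], toy_policy)
```

Because a decision is made at every chunk, the agent can change between speaking and listening at any chunk boundary without waiting for an end-of-turn signal.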

5.2 Modular and Asynchronous Collaboration

Separation of interaction and thinking layers, as in DuplexOmni, allows real-time interaction (real-time factor, RTF, below 1) via parallel scheduling, separate context queues, and preemption through explicit control tokens (Huang et al., 8 Jun 2026). This enables deep reasoning to proceed asynchronously and prevents blocking or deadlock during interaction.
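The general idea of running a slow reasoning task alongside a fast interaction loop, with preemption on a control token, can be illustrated with `asyncio`. This is a loose sketch of the pattern, not DuplexOmni's scheduler: the `<interrupt>` token, the filler output, and all function names and delays are hypothetical.

```python
import asyncio
from typing import List


async def think(question: str, delay: float) -> str:
    """Stand-in for slow, deep reasoning."""
    await asyncio.sleep(delay)
    return f"answer to {question}"


async def interact(chunks: List[str], question: str,
                   think_delay: float = 0.05, chunk_s: float = 0.01) -> List[str]:
    """Emit one output per chunk while reasoning runs in the background."""
    task = asyncio.create_task(think(question, think_delay))
    log = []
    for chunk in chunks:
        if chunk == "<interrupt>":  # hypothetical preemption token
            task.cancel()
            log.append("preempted")
            break
        await asyncio.sleep(chunk_s)  # one chunk of real-time interaction
        log.append(task.result() if task.done() else "filler")
    try:
        await task  # let the reasoning task finish or observe its cancellation
    except asyncio.CancelledError:
        pass
    return log


# Reasoning is much slower than the chunk rate, then the user interrupts.
log = asyncio.run(interact(["a", "b", "<interrupt>"], "q", think_delay=5.0))
```

The interaction loop never awaits the reasoning result directly inside the chunk loop, so a slow reasoning step cannot stall per-chunk output.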

5.3 Token Interleaving and RoPE-Based Synchronization

DyaPlex achieves multi-modal alignment by "dyadic token interleaving": it serializes both the agent's and the partner's motion/audio tokens into a single stream and applies frame-aligned rotary positional encoding across modalities. This biases cross-attention toward the diagonal, that is, toward concurrent frames, implicitly enforcing tight temporal coupling without additional supervision (Nagano et al., 2 Jun 2026).
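Interleaving with frame-aligned positions can be sketched as follows. The ordering of partner before agent within a frame, the tuple layout, and the function names are assumptions for illustration; the angle formula is the standard RoPE schedule, $\text{pos}\cdot\text{base}^{-2i/d}$, with the frame index used as the position.

```python
from typing import List, Tuple

Token = Tuple[str, int, str]  # (stream, frame index, token)


def interleave(agent_tokens: List[str], partner_tokens: List[str]) -> List[Token]:
    """Serialize two frame-aligned streams into one sequence."""
    if len(agent_tokens) != len(partner_tokens):
        raise ValueError("streams must cover the same number of frames")
    seq: List[Token] = []
    for frame, (a, p) in enumerate(zip(agent_tokens, partner_tokens)):
        seq.append(("partner", frame, p))
        seq.append(("agent", frame, a))
    return seq


def rope_angles(frame: int, dim: int, base: float = 10000.0) -> List[float]:
    """Rotary angles for one position; `dim` is the (even) head dimension."""
    return [frame * base ** (-2 * i / dim) for i in range(dim // 2)]


seq = interleave(["a0", "a1"], ["p0", "p1"])
frames = [frame for _, frame, _ in seq]
```

Tokens from the same frame share a position, so their relative rotary offset is zero regardless of where they fall in the serialized sequence.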

5.4 Minimal Overhead Full-Duplexization

The DUO channel-division-multiplexing strategy demonstrates that off-the-shelf LLMs can acquire full-duplex behavior via gated dual-channel decoding, relying on two state tokens (<1> for switching, <2> for continuing), disjoint post-prefix attention masks, and minimal additional data or parameterization (Xu et al., 2024).
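The role of the two state tokens can be pictured as a two-state gate. The sketch below only illustrates how a switch token and a continue token could drive channel selection; the channel names `listen` and `speak`, the starting state, and the function name are hypothetical and are not taken from the DUO description.

```python
from typing import List

SWITCH, CONTINUE = "<1>", "<2>"


def track_channel(state_tokens: List[str], start: str = "listen") -> List[str]:
    """Return the active channel after each state token."""
    channel = start
    trace = []
    for tok in state_tokens:
        if tok == SWITCH:
            channel = "speak" if channel == "listen" else "listen"
        elif tok != CONTINUE:
            raise ValueError(f"unknown state token: {tok}")
        trace.append(channel)
    return trace


# Continue listening, switch to speaking, keep speaking, switch back.
trace = track_channel(["<2>", "<1>", "<2>", "<1>"])
```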

6. Challenges, Open Problems, and Future Directions

Despite progress, current duplex models exhibit significant performance gaps versus human real-time interaction, especially in proactive reminder and structured reasoning tasks (He et al., 17 May 2026). Ongoing challenges include:

Achieving content coverage parity while avoiding excessive silence (models often refrain from speaking for half to two-thirds of the live stream).
Generalizing from offline, synthetic, or monadic datasets to genuine duplex or dyadic human interaction—with full prosodic, overlapping, and cross-modal cues (Yao et al., 2 Sep 2025).
Efficient scaling to longer, multi-turn conversations and richer, tool-mediated multi-agent environments (Huang et al., 8 Jun 2026).
Developing robust metrics and internal diagnostics for alignment quality beyond response fidelity, incorporating synchronization, anticipation, and multimodal grounding (Riera et al., 19 May 2026).

Research directions advocated include RL or integrated planning policies for simultaneous "when/what" optimization, design of explicit event memory, chunk-level and frame-level action serialization, richer evaluation frameworks for human alignment, and extending duplex alignment to edge deployments and physically embodied agents (Cui et al., 30 Apr 2026, He et al., 17 May 2026, Huang et al., 8 Jun 2026).

7. Applications and Cross-Domain Generalization

Duplex interaction alignment is relevant across spoken dialogue systems, multimodal LLMs, haptic/robotic agents, and wireless communication protocols (interference alignment in full/half-duplex channels (Fouladgar et al., 2015, Wu et al., 2023)). Its principles—time-synchronized cross-stream serialization, architectural fusion for low-latency behavior, reward-shaping for interactive fluency, and explicit chunk-wise supervision—form a general recipe for modeling coordinated, real-time multi-channel environments where input-output synchrony, low latency, and interactional naturalness are critical.

Key Citations: