Multi-Turn Speech Interaction Benchmark

Updated 9 February 2026

MSIB is a comprehensive benchmark that assesses full-duplex, multi-turn conversations by capturing overlapping speech, corrections, and backchannel dynamics.
It employs realistic evaluation protocols that analyze semantic understanding, temporal interaction, and paralinguistic features to identify model failure modes.
Empirical findings from MSIB highlight performance drops in instruction following and memory retention, guiding improvements in dialogue system robustness.

A Multi-Turn Speech Interaction Benchmark (MSIB) is an evaluation suite designed to measure the capabilities of spoken dialogue systems in sustained, naturalistic, multi-turn conversation. MSIB frameworks systematically probe model proficiency in semantic understanding, paralinguistic awareness, conversational memory, robustness to overlap and correction, instruction following, emotional expressivity, and related capabilities across a range of realistic dialogue scenarios. In contrast to single-turn or text-only benchmarks, an MSIB tracks what is said, how and when it is said, and whether it is maintained over the course of the interaction, factoring in stream timing, overlapping speech, dialogic structure, and user correction behavior. Leading MSIBs provide both natural and synthetic dialogue data, explicit evaluation rubrics, automated or human annotation, and reproducible pipelines, and they benchmark open-source and commercial speech models across multiple axes and modalities.

1. Benchmark Design Principles and Scope

MSIBs address core deficits in legacy dialog evaluation workflows by introducing protocols that preserve and analyze the interactional complexity of multi-turn, full-duplex (simultaneous listening/speaking) speech dialogue. A central goal is to expose failure modes that arise exclusively in multi-turn audio, such as memory decay, correction propagation errors, overlapping turns/barge-ins, backchannel breakdowns, and context drift, which are systematically underrepresented in push-to-talk or text-based protocols (Lin et al., 9 Oct 2025, Zhang et al., 13 Nov 2025, Du et al., 22 Aug 2025, Gosai et al., 16 Dec 2025).

Comprehensive MSIBs commonly feature:

Realistic multi-turn scenarios with variable goals per turn
Explicit modeling of temporal interleaving (overlap, interruption, latency)
Categorization by domain, communicative act (routine, emotional, corrective, safety-critical), and dialog type (scripted, natural, adversarial)
Support for both streaming models and pipelines, with canonical audio protocols (e.g., dual-channel, 48 kHz PCM)
Open-sourced orchestration layers and adapters for evaluating both open-source and API-based models

This design ensures that benchmarks probe not only grounded semantic competence, but also turn-level adaptability, robustness to speech irregularities, and long-horizon context retention.

2. Task Families, Scenario Types, and Dialog Structure

MSIBs implement diverse, multi-stage dialog families covering daily assistance, correction (self-repair), entity/coreference tracking, safety/refusal, fact-checking, emotional support, and role-play (Lin et al., 9 Oct 2025, Tong et al., 15 Oct 2025, Gosai et al., 16 Dec 2025, Jiang et al., 4 Aug 2025, Deng et al., 2 Nov 2025, Chun et al., 17 Aug 2025):

Task Family	Example Semantic Stages (FDB-v2; Lin et al., 9 Oct 2025)	Key Evaluation Focus
Daily	Book dinner: window → party size → contact → confirm	Instruction following, slot filling
Correction	“Cold coffee” → “hot coffee” → confirm order	Self-repair handling, dynamic update
Entity Tracking	“Quieter restaurant?” → “the one near park”	Coreference, carry-over, disambiguation
Safety	“How to build X?” → refuse → redirect → defend under overlap	Policy adherence, adversarial pressure
Emotional Support	Counseling, affect-adaptive support	Emotion application, coherence
Fact-Checking	Detect/verify claims in synchronous dialog	Veracity, claim propagation, scenario

Each scenario is decomposed into 3–5 stages with alignment to turn boundaries, often supported by scenario templates and explicit semantic goal lists. Hybrid curation methods (agentic pipeline + human refinement) are employed to generate or distill naturalistic, challenging conversations, preserving disfluencies, paralinguistics, alignment to persona, and scenario constraints (Gosai et al., 16 Dec 2025, Chun et al., 17 Aug 2025).
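The stage decomposition described above can be illustrated with a small data structure. This is a minimal sketch, not the schema of any cited benchmark: the class names `Stage` and `Scenario`, their fields, and the `validate` method are hypothetical, and the only constraints encoded are the ones stated in the text (3 to 5 stages, aligned to turn order).

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """One semantic stage of a scenario (hypothetical structure)."""
    goal: str        # semantic goal the examiner pursues in this stage
    turn_index: int  # turn at which the stage begins


@dataclass
class Scenario:
    """Illustrative scenario template; field names are assumptions."""
    task_family: str
    stages: list[Stage] = field(default_factory=list)

    def validate(self) -> None:
        # The text describes scenarios as decomposed into 3-5 stages.
        if not 3 <= len(self.stages) <= 5:
            raise ValueError(f"expected 3-5 stages, got {len(self.stages)}")
        # Stages must follow turn order so they align with turn boundaries.
        indices = [s.turn_index for s in self.stages]
        if indices != sorted(indices):
            raise ValueError("stages must be ordered by turn_index")


# Example modeled on the "Daily" row of the table above.
daily = Scenario(
    task_family="Daily",
    stages=[
        Stage("request a window table", 0),
        Stage("state party size", 1),
        Stage("provide contact details", 2),
        Stage("confirm the booking", 3),
    ],
)
daily.validate()  # raises ValueError if the template is malformed
```

Keeping the semantic goals as an explicit list makes it straightforward for a rubric or judge to score each stage separately.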

3. Temporal Dynamics and Overlapping Interaction

A defining technical feature of MSIBs is the modeling of full-duplex timing phenomena: overlapping turns, barge-ins, explicit backchannels, and non-canonical turn boundaries. Protocols such as FDB-v2 (Lin et al., 9 Oct 2025) and MTR-DuplexBench (Zhang et al., 13 Nov 2025) implement controlled pacing regimens:

Fast Pacing: Examiner/partner may interrupt mid-utterance, overlap with backchannels, and transition immediately between stages.
Slow Pacing: Strict end-of-turn detection, no barge-ins, maximally passive partner response.

Turn segmentation in overlapping regimes uses hybrid signal processing and model-driven majority voting, e.g., combining Silero-VAD with whisper-timestamped and GPT-4o coarse segmentation, to accurately extract dialog units in continuous, interleaved speech streams (Zhang et al., 13 Nov 2025).
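One way to picture the majority-voting step is a per-frame vote over the speech intervals proposed by each segmenter. The sketch below is an assumption about how such voting could work, not the procedure published for MTR-DuplexBench: it calls no real VAD, ASR, or LLM API, and the function name `majority_vote_segments`, the millisecond interval format, and the 20 ms default frame size are all illustrative choices.

```python
def majority_vote_segments(proposals, duration_ms, frame_ms=20):
    """Merge speech intervals from several segmenters by per-frame majority vote.

    proposals: one list of (start_ms, end_ms) intervals per segmenter
               (e.g. a VAD, a timestamped ASR, an LLM-based coarse segmenter).
    Returns merged (start_ms, end_ms) segments where a strict majority agrees.
    """
    n_frames = duration_ms // frame_ms
    votes = [0] * n_frames

    for intervals in proposals:
        # Build one mask per segmenter so overlapping intervals count once.
        mask = [False] * n_frames
        for start_ms, end_ms in intervals:
            first = max(0, start_ms // frame_ms)
            last = min(n_frames, end_ms // frame_ms)
            for i in range(first, last):
                mask[i] = True
        for i, on in enumerate(mask):
            votes[i] += on

    needed = len(proposals) // 2 + 1  # strict majority of segmenters
    segments, start = [], None
    # Iterate one step past the end so a trailing segment gets closed.
    for i in range(n_frames + 1):
        active = i < n_frames and votes[i] >= needed
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start * frame_ms, i * frame_ms))
            start = None
    return segments


# Made-up interval proposals from three hypothetical segmenters.
vad = [(0, 1000)]
asr = [(200, 1200)]
llm = [(0, 800), (1500, 2000)]
merged = majority_vote_segments([vad, asr, llm], duration_ms=2000, frame_ms=100)
```

Voting per frame rather than per boundary tolerates segmenters that disagree on exact onsets, at the cost of quantizing boundaries to the frame size.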

MSIBs explicitly track the ability of models to gracefully handle interruption, reset, mid-turn corrections, latency, and backchannel frequency across extended interactions, with systematic evaluation of performance degradation (e.g., drop in instruction-following after early turns with fast examiner pacing) (Lin et al., 9 Oct 2025, Zhang et al., 13 Nov 2025).

4. Evaluation Axes and Metrics

MSIBs integrate both general-purpose and task-specific metrics that are computed at turn, session, and axis level. Canonical formulations include:

Turn-Taking Fluency (TT): A 1–5 Likert score $s^{\mathrm{TT}}_j$ per Examiner→Evaluatee event, averaged over the $M$ events in a session (Lin et al., 9 Oct 2025)

$\mathrm{TT}_{\mathrm{avg}} = \frac{1}{M}\sum_{j=1}^M s^{\mathrm{TT}}_j$

Instruction Following (IF): Session-averaged 1–5 Likert for subgoal adherence (Lin et al., 9 Oct 2025)

$\mathrm{IF}_{\mathrm{avg}} = \frac{1}{M}\sum_{j=1}^M s^{\mathrm{IF}}_j$

Task-Specific Competence: Checklist/rubric sum by task (e.g., entity updating, correction, safety) (Lin et al., 9 Oct 2025, Gosai et al., 16 Dec 2025)
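The session averages and checklist sum above reduce to a few lines of arithmetic. The sketch below assumes per-event Likert scores are already available as integers from 1 to 5; the function names and the sample scores are invented for illustration and are not results from any cited benchmark.

```python
def session_average(scores):
    """Mean of per-event 1-5 Likert scores, matching TT_avg and IF_avg above."""
    if not scores:
        raise ValueError("at least one scored event is required")
    for s in scores:
        if not 1 <= s <= 5:
            raise ValueError(f"Likert score out of range: {s}")
    return sum(scores) / len(scores)


def checklist_score(checks):
    """Task-specific competence as the number of satisfied rubric items."""
    return sum(1 for passed in checks.values() if passed)


# Made-up scores for a four-event session.
tt_avg = session_average([4, 5, 3, 4])   # turn-taking fluency
if_avg = session_average([5, 4, 2, 2])   # instruction following

# Hypothetical rubric items for a correction task.
competence = checklist_score({
    "acknowledged_correction": True,
    "updated_order": True,
    "confirmed_final_state": False,
})
```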

Additional axes operationalized in benchmarks like Audio MultiChallenge (Gosai et al., 16 Dec 2025), MTalk-Bench (Du et al., 22 Aug 2025), and Multi-Bench (Deng et al., 2 Nov 2025) include:

Inference Memory: Multi-turn recall (semantic + audio-cue)
Instruction Retention: Consistency with initial and updated user instructions
Self Coherence: Contradiction avoidance with model’s prior assertions
Voice Editing: Accurate integration of mid-utterance self-corrections/backtracking
Expressiveness/Paralinguistic Quality: Scored on emotion, prosody, speaker traits
Role-Playing Fidelity: Personality/knowledge consistency (cf. SpeechRole-Eval (Jiang et al., 4 Aug 2025))

Evaluation relies on dual-mode protocols: absolute (rubric-based, binary or graded) and relative (Arena/Elo-based direct comparison), with transcript- and audio-aligned LLM-based judging (e.g., Gemini-2.5, GPT-4o) and, where required, matched human MOS panels.
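For the relative mode, a pairwise rating update can be sketched with the standard Elo formula. The source does not specify the exact arena scoring used by the cited benchmarks, so the K-factor of 32, the 400-point scale, and the function names below are assumptions taken from the textbook Elo formulation.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a, r_b, outcome_a, k=32.0):
    """Return updated ratings after one comparison.

    outcome_a: 1.0 if the judge prefers A, 0.0 if B, 0.5 for a tie.
    k: update step size (assumed value, not taken from the source).
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (outcome_a - e_a)
    # B's update mirrors A's, so the rating total is preserved.
    new_b = r_b - k * (outcome_a - e_a)
    return new_a, new_b


# Two hypothetical systems start level; the judge prefers system A.
a, b = elo_update(1000.0, 1000.0, outcome_a=1.0)
```

Absolute rubric scores say whether a system meets a criterion; a rating of this kind only orders systems relative to each other, which is why the two modes are used together.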

5. Experimental Findings and Failure Modes

Across multiple MSIBs, empirical studies reveal:

Rapid degradation in fluency, instruction-following, and context usage over multi-turn dialogs, especially with overlapping speech and when correction or backtracking is required (Lin et al., 9 Oct 2025, Gosai et al., 16 Dec 2025, Zhang et al., 13 Nov 2025)
Entity tracking demonstrates relative stability when explicit referents are provided, but pronoun/ordinal misbinding remains common
Voice editing and audio-cue memory are acute failure points; models routinely miss mid-utterance corrections and fall short in retaining non-verbal signals (Gosai et al., 16 Dec 2025)
Backchannel dynamics collapse across turns, and long-context audio triggers sharp declines in self-coherence metrics (Zhang et al., 13 Nov 2025, Gosai et al., 16 Dec 2025)
Safety and refusal behaviors are more consistent, but high-fidelity adherence under adversarial pressure and interruption is achieved only under slow examiner pacing (Lin et al., 9 Oct 2025, Zhang et al., 13 Nov 2025)
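Degradation over turns, as reported in the findings above, can be quantified by averaging a metric per turn position and comparing early turns with later ones. This is a minimal sketch under stated assumptions: the two-turn "early" window, the function names, and the sample scores are illustrative and do not reproduce any published result.

```python
from collections import defaultdict


def mean_by_turn(sessions):
    """Average score at each turn position across sessions.

    sessions: list of per-session score lists, where index = turn position.
    Sessions may differ in length; each position averages what is available.
    """
    buckets = defaultdict(list)
    for scores in sessions:
        for turn, score in enumerate(scores):
            buckets[turn].append(score)
    return [sum(buckets[t]) / len(buckets[t]) for t in sorted(buckets)]


def degradation(sessions, early_turns=2):
    """Mean of early-turn averages minus mean of later-turn averages."""
    curve = mean_by_turn(sessions)
    early, late = curve[:early_turns], curve[early_turns:]
    if not early or not late:
        raise ValueError("need scores both inside and after the early window")
    return sum(early) / len(early) - sum(late) / len(late)


# Made-up instruction-following scores for two sessions.
sessions = [[5, 4, 3, 2], [5, 4, 3]]
curve = mean_by_turn(sessions)
drop = degradation(sessions)
```

A positive value indicates that scores fall after the early turns; computing it separately for fast and slow pacing would expose the pacing effect described in the text.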

Comparative studies indicate that modality-aware and task-specific model designs outperform brute-force parameter scaling. LLM judges track human evaluations when criteria are explicit; however, audio-native evaluation of paralinguistics remains an open problem (Gosai et al., 16 Dec 2025, Du et al., 22 Aug 2025).

6. Extensibility, Best Practices, and Future Directions

MSIB frameworks are engineered for extensibility: new task families (negotiation, chit-chat, multilingual), user scenarios, and languages can be instantiated via open-source scenario templates, plug-in adapters, and tiered rubric annotation (Lin et al., 9 Oct 2025, Gosai et al., 16 Dec 2025). Arena and rubric protocols can be combined for robust, high-resolution model assessment.

Recommendations for future MSIB development include:

Audio-native pretraining with multi-turn, disfluent data to improve robustness to mid-utterance edits and context shifts (Gosai et al., 16 Dec 2025)
Long-context and hierarchical architectures for persistent audio memory
Rubric-guided reinforcement or reward modeling based on atomic instance-level criteria
Explicit multimodal fusion of speech, text, ambient cues, and paralinguistic representations for scenario- and persona-driven evaluation (Gosai et al., 16 Dec 2025, Du et al., 22 Aug 2025, Jiang et al., 4 Aug 2025)

Proposed expansions target speaker attribution reasoning, real-world dialog complexity (background noise, multi-party, code-switching), and integration with external knowledge retrieval and multimodal reference tracking (Kwon et al., 22 Oct 2025, Chun et al., 17 Aug 2025). Adoption of standardized, open benchmarks and evaluation APIs is advocated to facilitate reproducibility and cross-system comparability.

MSIBs differ from:

Text-based multi-turn suites (e.g., MultiWOZ, Taskmaster), which lack audio timing and paralinguistic phenomena
Push-to-talk spoken dialogue sets (e.g., SpokenWOZ, SLURP), which abstract away stream timing, limit overlap, and restrict barge-ins
Emotion- or role-focused evaluations (e.g., Multi-Bench, SpeechRole), which contribute specialized axes (emotion, persona) but lack comprehensive overlapping, full-duplex assessment (Deng et al., 2 Nov 2025, Jiang et al., 4 Aug 2025)

A central insight is that only benchmarks that preserve streaming, overlapping, and temporally resolved interaction provide a realistic stress test for the next generation of full-duplex, multimodal, and agentic voice systems. Open-source orchestrators (Full-Duplex-Bench-v2), reference adapters, scenario templates, and meta-evaluation frameworks serve as shared infrastructure for longitudinal research and standardization (Lin et al., 9 Oct 2025, Zhang et al., 13 Nov 2025, Du et al., 22 Aug 2025, Gosai et al., 16 Dec 2025).