MTR-DuplexBench Evaluation
- MTR-DuplexBench is a comprehensive benchmark that segments overlapping multi-turn dialogues to evaluate full-duplex speech language models with high granularity.
- It employs automated algorithms and statistical methods to isolate turn-level performance, ensuring objective evaluation of dialogue quality, latency, and instruction following.
- For wireless systems, it offers simulation recommendations to accurately model non-reciprocal antenna patterns in FDD massive MIMO, enhancing channel realism.
MTR-DuplexBench is a benchmark designed for comprehensive, multi-round evaluation of Full-Duplex Speech LLMs (FD-SLMs), and, in a separate context, provides recommendations and statistical procedures for simulating realistic antenna duplex pattern non-reciprocity in FDD massive MIMO research. In speech AI, MTR-DuplexBench segments continuous, overlapping, two-channel real-time dialogues into discrete user turns and rigorously quantifies dialogue quality, conversational dynamics, instruction following, and safety at each turn. For wireless communications, it informs simulation frameworks for handset-specific duplex pattern divergence, enabling accurate modeling of non-reciprocal spatial channels. Both uses emphasize multi-dimensional, turn-by-turn analysis, statistical realism, and automated, objective scoring protocols (Zhang et al., 13 Nov 2025, Eggers et al., 2020).
1. Motivation and Objectives
Full-Duplex Speech LLMs are characterized by the ability to decode and synthesize overlapping user and assistant speech streams in real time. Unlike half-duplex models, FD-SLMs participate in speech phenomena such as backchannels, interruptions, and overlapping utterances. Prior evaluation benchmarks generally focus on single-round, non-overlapping dialogue, failing to capture the complexity and instability introduced by multi-turn, full-duplex interaction (Zhang et al., 13 Nov 2025).
The key challenges for evaluation are:
- Blurred turn boundaries: Continuous full-duplex audio lacks clear demarcation between user turns.
- Context inconsistency: In multi-turn settings, once a model's response diverges from ground truth, subsequent user inputs become misaligned with the new model context, reducing evaluation fidelity.
MTR-DuplexBench addresses these by:
- Formally segmenting continuous dual-channel audio into discrete “user turns.”
- Ensuring each evaluation window uses ground-truth context up to the current turn, then only scoring newly generated audio, thereby isolating per-turn model performance.
In wireless systems research, modeling non-reciprocity in user equipment (UE) antenna patterns over FDD duplex bands is essential for realistic FDD massive MIMO simulation. MTR-DuplexBench’s recommendations include sampling and injecting empirically measured pattern divergence statistics (power, polarization, front-back gain, complex correlation) into spatial channel models (Eggers et al., 2020).
2. Segmentation and Formal Definitions
The segmentation protocol in MTR-DuplexBench is designed to robustly parse overlapping, continuous dialogue into discrete, evaluable turns. Given dual-channel audio with separate user and assistant streams, initial voice activity detection (VAD) and Whisper transcription are applied to both channels.
A four-step algorithm generates the segmentation:
- VAD segment extraction for both channels.
- Six parallel GPT-4o calls to propose candidate turn boundaries.
- Clustering and majority voting of boundary proposals with overlap ≥30% to stabilize boundary selection.
- Merging of any remaining overlapping turns, producing the final ordered set of user turns (steps three and four are sketched after this list).
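A minimal sketch of the clustering and merging steps, assuming each GPT-4o call returns candidate user-turn intervals in seconds; the majority threshold of four of six calls and all helper names are illustrative assumptions, not the paper's released implementation.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_s, end_s) of a candidate user turn


def overlap_ratio(a: Interval, b: Interval) -> float:
    """Fraction of the shorter interval covered by the intersection."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    return inter / shorter if shorter > 0 else 0.0


def vote_boundaries(proposals: List[List[Interval]],
                    min_overlap: float = 0.30,
                    min_votes: int = 4) -> List[Interval]:
    """Cluster candidate turns from several GPT-4o runs and keep majority clusters.

    proposals: one list of candidate intervals per GPT-4o call (six calls in the paper).
    A candidate joins a cluster if it overlaps its representative by >= min_overlap;
    clusters supported by at least min_votes candidates survive as their mean span.
    """
    clusters: List[List[Interval]] = []
    for run in proposals:
        for cand in run:
            for cluster in clusters:
                if overlap_ratio(cand, cluster[0]) >= min_overlap:
                    cluster.append(cand)
                    break
            else:
                clusters.append([cand])

    voted = []
    for cluster in clusters:
        if len(cluster) >= min_votes:
            starts, ends = zip(*cluster)
            voted.append((sum(starts) / len(starts), sum(ends) / len(ends)))

    # Final pass: merge any remaining overlapping turns into one ordered set.
    voted.sort()
    merged: List[Interval] = []
    for iv in voted:
        if merged and iv[0] < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], iv[1]))
        else:
            merged.append(iv)
    return merged
```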
An assistant response window is defined for each turn; within it, all previous assistant audio is replaced by ground truth, and only the newly generated model output for the current turn is evaluated.
This segmentation rigorously isolates turn-level performance, crucial in multi-round, overlapping evaluation where contextual drift otherwise impedes interpretability (Zhang et al., 13 Nov 2025).
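The ground-truth replacement can be pictured as follows; a minimal sketch assuming per-turn audio is stored as NumPy arrays, with all function and argument names chosen for illustration.

```python
import numpy as np


def build_eval_context(user_turns, gt_assistant_turns, model_turn_audio, turn_idx):
    """Assemble the two-channel context scored for turn `turn_idx`.

    user_turns / gt_assistant_turns: lists of 1-D arrays, one per turn and channel.
    model_turn_audio: the model's newly generated assistant audio for this turn.
    All assistant audio before `turn_idx` is replaced by ground truth, so only
    the current turn's output is actually evaluated.
    """
    user_ctx = np.concatenate(user_turns[: turn_idx + 1])
    assistant_ctx = np.concatenate(gt_assistant_turns[:turn_idx] + [model_turn_audio])
    return user_ctx, assistant_ctx
```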
3. Dataset Construction and Turn Statistics
MTR-DuplexBench draws on four curated data sources, covering the four principal evaluation dimensions (see Table 1):
| Dimension | Source | #Samples | Rounds | Type | Metric |
|---|---|---|---|---|---|
| Dialogue Quality | Candor (natural) | 200 | 1–11 | Natural | GPT-score |
| Conversational Features | GPT-4o + CosyVoice2 | 200 | 10 | Synthetic | Success/Latency |
| Instruction Following | Llama Question (OpenAudioBench) | 300 | 10 | Synthetic | Binary GPT-score |
| Safety | AdvBench | 520 | 10 | Synthetic | Refusal Rate |
Each dimension is constructed around dialogues of up to 10–11 rounds, with synthesized or natural speech input depending on the task focus. Candor contributes real 120 s recordings that are split with the segmentation algorithm. Synthetic dialogues are generated with GPT-4o and synthesized with CosyVoice2 to enable controlled stress-testing of conversational phenomena (multi-feature overlap, interruptions, backchanneling). The instruction and safety datasets re-sequence spoken queries to sustain multi-turn evaluation (Zhang et al., 13 Nov 2025).
4. Evaluation Metrics and Protocol
Turn-by-turn evaluation is defined along four axes, with automated pipeline execution for reproducibility and scale:
Dialogue Quality (Q):
GPT-4o, prompted with sentence-aligned transcriptions of each round (via Whisper-large-v3 and stable-ts), assigns a per-round quality score; Q is averaged over all turns and dialogues.
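The quality-scoring step reduces to one judged call per round; a minimal sketch assuming the OpenAI Python client, with a placeholder rubric prompt rather than the benchmark's released prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def score_round(transcript: str, rubric: str) -> float:
    """Ask GPT-4o to rate one round's sentence-aligned transcript.

    `rubric` carries the scoring instructions (scale, criteria); both the rubric
    and the bare float parsing here are illustrative placeholders.
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": transcript},
        ],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())
```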
Conversational Dynamics (D):
For each conversational feature (e.g., smooth turn-taking, interruption handling, backchanneling), a per-round binary success indicator is scored and averaged into a feature-wise success rate.
Additional metrics include:
- Latency: time from the end of user speech to the first assistant token.
- Backchannel frequency: number of backchannels detected per round.
Instruction Following (I):
A binary GPT-4o judgment of whether the assistant adheres to the spoken instruction in each round; I is the fraction of rounds judged compliant.
Safety (S):
Refusal rate on harmful/adversarial prompts, i.e., the fraction of adversarial rounds in which the assistant refuses to comply.
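A minimal sketch of how the four axes could be aggregated from per-round records; the record field names are illustrative assumptions, not the benchmark's data schema.

```python
from statistics import mean


def aggregate(rounds):
    """Aggregate per-round records into the four benchmark axes.

    Each record is assumed to hold: 'quality' (GPT score), 'feature_success' (0/1),
    'latency_s' (end of user speech to first assistant token), 'backchannels' (count),
    'followed' (0/1 instruction adherence), and 'refused' (0/1 on adversarial rounds,
    None otherwise). All field names are placeholders.
    """
    adversarial = [r for r in rounds if r["refused"] is not None]
    return {
        "Q": mean(r["quality"] for r in rounds),
        "D_success": mean(r["feature_success"] for r in rounds),
        "latency": mean(r["latency_s"] for r in rounds),
        "backchannel_freq": mean(r["backchannels"] for r in rounds),
        "I": mean(r["followed"] for r in rounds),
        "S": mean(r["refused"] for r in adversarial) if adversarial else None,
    }
```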
The pipeline is fully automated for segmentation, transcription, audio alignment, GPT-4o-based scoring, and extraction of latency, backchannel, and success statistics. No statistical significance tests are reported; trend analysis is emphasized via performance curves over dialogue rounds (Zhang et al., 13 Nov 2025).
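Tying the stages together, a hedged sketch of the per-dialogue loop; the model, transcription, and judging interfaces are assumptions, not the released code.

```python
def evaluate_dialogue(model, user_turns, gt_assistant_turns, transcribe, judge):
    """Per-dialogue evaluation loop (illustrative orchestration of the pipeline stages).

    For each turn, the model responds given ground-truth history, the new audio is
    transcribed and aligned, and the GPT-4o judges score only that turn.
    """
    records = []
    for i, user_audio in enumerate(user_turns):
        history = list(zip(user_turns[:i], gt_assistant_turns[:i]))  # ground-truth context
        model_audio = model.respond(history, user_audio)             # assumed model API
        transcript = transcribe(user_audio, model_audio)             # Whisper + stable-ts alignment
        records.append(judge(transcript, model_audio))               # per-turn Q/D/I/S fields
    return aggregate(records)                                        # aggregate() from the sketch above
```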
5. Results: Model Performance and Multi-Round Degradation
Applying MTR-DuplexBench to the “Moshi” end-to-end FD-SLM reveals:
- Dialogue Quality: Moshi attains only modest GPT-scores on Candor, indicating participation in naturalistic speech but low semantic coherence.
- Conversational Features: Feature-wise success rates steadily decline over successive turns (e.g., smooth turn-taking: 57% → 48.6%). Latencies grow substantially over extended dialogues. Backchannel frequency also drops, evidencing diminished conversational engagement. Background-speech scenarios yield the highest degradation.
- Instruction Following: Task compliance I falls from approximately 68% to 42% over 10 rounds in both smooth and interrupted conditions.
- Safety: Refusal rate remains robust (around 90%) across all rounds, indicating resilient harm-avoidance even under stress.
Multi-feature combinations stress the model nonlinearly: some features interact to accelerate overall degradation rather than imposing strictly additive burdens. Key failure modes include context drift (semantic misalignment due to non-ground-truth context), latency accumulation (increasing response time), and reduced backchannel/overlap management (Zhang et al., 13 Nov 2025).
6. Recommendations and Extensions
The analysis of MTR-DuplexBench highlights persistent limitations in current FD-SLMs, particularly in sustaining high-quality, responsive, instruction-following conversation over many turns. The following priorities are identified for future FD-SLM research:
- Incorporate multi-turn latency as a primary evaluation metric.
- Design training schemes and architectures that explicitly maintain backchanneling and overlap handling consistency.
- Develop more robust context and planning modules to mitigate context drift and ensure multi-round coherence.
- Expand evaluation breadth to include multilingual data, varied real-world conditions, and alternative FD-SLM architectures (e.g., cascaded versus end-to-end).
The planned public release of segmentation code, data splits, and scoring prompts is intended to catalyze broader adoption and improvement of robust, human-like FD-SLMs (Zhang et al., 13 Nov 2025).
7. Application to Antenna Duplex Pattern Modeling in FDD Massive MIMO
In a distinct domain, MTR-DuplexBench informs procedures for rigorous simulation of antenna duplex pattern non-reciprocity in FDD massive MIMO user equipment:
- Empirical divergence in scalar and complex antenna patterns is measured across commercial handsets and frequency bands, with statistics (median, percentiles) for metrics such as power, polarization, and front-back gain divergence and complex pattern correlation, parametrized by distributional fits (Gaussian, Beta, Pareto).
- Simulators should sample pattern divergence from these distributions, perturb reference gain/patterns for uplink and downlink accordingly, and model frequency-separation dependence in correlation.
- Proposed workflows include random selection of handset class and band, followed by injection of pattern divergence via either scalar perturbation of gain maps or principal-component mixing toward a target complex correlation, reflecting true non-reciprocity in synthetic channel realizations (the scalar-perturbation variant is sketched after this list).
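A sketch of the scalar-perturbation variant, assuming divergence statistics have already been fitted per handset class and band; the distribution parameters, key names, and jitter term are placeholders, not the measured values from (Eggers et al., 2020).

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder fitted power-divergence statistics (dB) per handset class and band;
# real simulators would load the measured per-handset distributional fits here.
DIVERGENCE_FITS = {
    ("mid_tier", "band_3"): {"dist": "gaussian", "mean_db": 0.0, "std_db": 2.5},
}


def sample_power_divergence(handset: str, band: str) -> float:
    """Draw one duplex power-divergence realization from the fitted distribution."""
    fit = DIVERGENCE_FITS[(handset, band)]
    if fit["dist"] == "gaussian":
        return rng.normal(fit["mean_db"], fit["std_db"])
    raise ValueError(f"unsupported fit: {fit['dist']}")


def perturb_downlink_pattern(uplink_gain_db: np.ndarray, handset: str, band: str) -> np.ndarray:
    """Derive a non-reciprocal downlink gain map by perturbing the uplink map.

    uplink_gain_db: gain pattern in dB over the angular grid. A per-realization
    scalar offset plus mild angular jitter (illustrative) models duplex pattern
    divergence before the map is fed to the spatial channel model.
    """
    offset = sample_power_divergence(handset, band)
    jitter = rng.normal(0.0, 0.5, size=uplink_gain_db.shape)
    return uplink_gain_db + offset + jitter
```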
This detailed modelling is essential for credible FDD performance assessments, especially in the presence of severe pattern mismatch typical in some commercial handsets (Eggers et al., 2020).