Full-Duplex-Bench-v2 Evaluation Framework
- Full-Duplex-Bench-v2 is a multi-turn, streaming-native evaluation framework that assesses simultaneous speaking and listening in dialogue systems.
- It integrates an automated examiner, real-time orchestrator, and model adapters to ensure consistent, low-latency audio handling and turn management.
- The framework employs detailed metrics for turn-taking fluency, correction processing, and safety compliance, enabling reproducible system diagnostics.
Full-Duplex-Bench-v2 (FDB-v2) is a multi-turn, streaming-native evaluation framework developed to systematically assess full-duplex spoken dialogue systems under realistic, interactive, and multi-step conversational conditions. Unlike prior turn-based or offline benchmarks, FDB‑v2 centers on the simultaneous speaking and listening capabilities intrinsic to full-duplex agents, with a focus on conversational fluency, real-time correction, entity tracking, safety compliance, and robust handling of overlapping speech. The platform’s architecture integrates an automated examiner, a streaming orchestrator enabling low-latency bidirectional audio, and a model-agnostic interface that accommodates both commercial APIs and open-source systems (Lin et al., 9 Oct 2025).
1. Framework Architecture and Streaming Protocol
FDB‑v2’s architecture is organized around three tightly integrated components:
- Automated Examiner: A spoken LLM (e.g., gpt-realtime) that drives the evaluation session. The Examiner issues spoken instructions, manages staged sub-goals, initiates mid-turn interruptions, and dynamically adapts pacing strategies. Two pacing modes are implemented—Fast (allowing intervention during the Evaluatee's utterance) and Slow (waiting for clear turn endpoints)—to probe different aspects of conversational dynamics.
- Streaming-Oriented Orchestrator: Facilitates real-time audio routing through two persistent WebRTC channels (one per agent). The audio interface uses a strict chunked wire format: 48 kHz, 16-bit, mono PCM, segmented into 10 ms frames (48,000 samples/s × 0.010 s × 2 bytes/sample = 960 bytes each), imposing precise, uniform timing for both input and output streams.
- Model Adapters: New models are “wrapped” with lightweight adapters that convert between their native audio formats and the canonical format for seamless integration. This ensures extensibility and low-latency interaction regardless of model backend (a minimal adapter sketch follows this list).
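As an illustration of the adapter layer, the following minimal sketch converts a hypothetical backend's float32 mono output into canonical 10 ms PCM frames. The function name and the naive linear resampler are illustrative assumptions; a real adapter would buffer audio across calls and use a higher-quality resampler.

```python
import numpy as np

CANON_RATE = 48_000                              # canonical sample rate (Hz)
FRAME_MS = 10                                    # canonical frame duration
FRAME_BYTES = CANON_RATE * FRAME_MS // 1000 * 2  # 480 samples x 2 bytes = 960

def to_canonical_frames(samples: np.ndarray, native_rate: int):
    """Convert native float32 mono audio to canonical 10 ms PCM frames.

    `samples` is assumed to be float32 in [-1, 1]. This is a sketch:
    a production adapter would buffer the remainder across calls.
    """
    # Naive linear-interpolation resample to 48 kHz (illustrative only).
    n_out = int(len(samples) * CANON_RATE / native_rate)
    t_out = np.linspace(0, len(samples) - 1, n_out)
    resampled = np.interp(t_out, np.arange(len(samples)), samples)

    # Quantize to 16-bit little-endian PCM.
    pcm16 = (np.clip(resampled, -1.0, 1.0) * 32767).astype("<i2").tobytes()

    # Slice into strict 960-byte (10 ms) frames; return the leftover bytes.
    n_frames = len(pcm16) // FRAME_BYTES
    frames = [pcm16[i * FRAME_BYTES:(i + 1) * FRAME_BYTES] for i in range(n_frames)]
    remainder = pcm16[n_frames * FRAME_BYTES:]
    return frames, remainder
```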
By leveraging a standardized, open-sourced streaming protocol, FDB‑v2 guarantees reproducible, application-agnostic, and synchronously timed audio delivery and reception.
2. Task Family Coverage and Scenario Design
FDB‑v2’s evaluation is structured around four task families, each mapped to a class of real-world conversational challenges:
| Task Family | Core Scenario | Evaluation Focus |
|---|---|---|
| Daily | Scheduling, ordering, reservations | Multi-turn fluency, natural goal progression |
| Correction | Self-repairs, re-specification mid-utterance | Correction detection, semantic state updating |
| Entity Tracking | Anaphora/pronoun and ordinal/categorical reference | Entity carry-over, cross-turn consistency |
| Safety | Hazardous/sensitive requests (privacy, legal, health) | Policy adherence, safe refusals, redirection |
Each multi-turn session is structured into stepwise subgoals, which the Examiner enforces. The Examiner may proactively interrupt, elicit clarification, or challenge the Evaluatee’s state tracking (e.g., changing reservation details mid-conversation). This staged approach systematically stresses the system’s memory, correction handling, and turn-coordination capabilities.
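To make the staged-subgoal design concrete, here is a hypothetical session script for the Correction family. The stage fields, trigger names, and scenario content are illustrative assumptions, not FDB‑v2's actual configuration format.

```python
# Hypothetical staged-goal script for a Correction-family session.
# Field names and trigger types are illustrative, not FDB-v2's real schema.
reservation_session = [
    {"stage": 1,
     "examiner_goal": "Book a table for two at 7 pm Friday",
     "expects": ["confirmation of party size, day, and time"]},
    {"stage": 2,
     "examiner_goal": "Interrupt mid-utterance: 'Actually, make it Saturday'",
     "trigger": "mid_turn_interruption",   # Fast pacing: barge in before turn end
     "expects": ["acknowledgment", "updated day = Saturday"]},
    {"stage": 3,
     "examiner_goal": "Ask the Evaluatee to restate the full reservation",
     "expects": ["party of two", "Saturday", "7 pm"]},  # probes corrected state
]
```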
3. Evaluation Metrics and Scoring
FDB‑v2 employs a composite metric suite that captures temporal, semantic, and task-specific system performance:
- Turn-Taking Fluency (TT): Each turn is labeled and rated (1–5) on naturalness of handoffs, overlap management, and latency. The rating reflects absence of awkward delays, poorly timed responses, or excessive interruptions.
- Instruction Following (IF): Per-turn rating (1–5) on accurate execution of Examiner strategies: confirmation, context carry-over, and stage-specific goal fulfillment.
- Task-Specific Competence: For Correction, Entity Tracking, and Safety families, a global 1–5 score quantifies the end-to-end effectiveness on correction integration, persistent entity state, or policy compliance.
Transcripts generated by a streaming ASR model (e.g., Parakeet-TDT) are post-processed and scored by an LLM judge against a JSON schema that yields structured breakdowns by turn and overall task. Evaluator prompts encode explicit criteria for each task family, ensuring scores are calibrated and interpretable.
Canonical wire format and evaluation schema:
- Audio: 48 kHz, 16-bit, mono PCM, 10 ms frames (960 bytes each)
- LLM-judge output schema (schematic):

```
{
  "Turn-taking event and score": [
    [{start_time}, {end_time}]: [{TT score}, {IF score}],
    ...
  ],
  "Task-specific score": {global score}
}
```
This modular structure supports comprehensive, per-turn quantitative analysis and aligns the benchmark with both research and deployment diagnostics.
4. Empirical Findings and Failure Modes
Systematic application of FDB‑v2 across commercial and open-source full-duplex dialogue models has revealed several characteristic failure modes:
- Simultaneous Speech Confusion: Models frequently lose coherence under overlapping speech; symptoms include hesitation, double responses, or lost synchronization, especially in Fast pacing (forced mid-turn handoff).
- Correction Processing: Many systems fail to accurately detect corrections or to propagate corrected state in subsequent turns. This can yield “semantic drift” or persistent errors despite repeated user restatements.
- Entity Tracking Instabilities: Models frequently misresolve pronouns or lose referent attributes across turns, particularly when multiple or indirect references are used (e.g., “the blue one near the park” may be conflated or lost).
- Multi-Turn Temporal Degradation: Performance on both instruction following and turn-taking fluency may degrade over the course of longer dialogues, due to error accumulation or failure to recover from mid-sequence disruptions.
- Safety and Policy Compliance: Under overlapping or rapid-fire queries about prohibited topics, models are often slow to disengage or produce inconsistent rejections.
These insights delineate the specific aspects of full-duplex interaction, such as overlap resolution and cross-turn memory, that require further system-level innovation.
5. Extensibility, Integration, and Community Adoption
The FDB‑v2 protocol is expressly designed for extensibility:
- Model Integration: Any dialogue model, whether commercial or OSS, can be assessed provided it implements the canonical audio adapter. This supports benchmarking of emerging models on consistent criteria.
- Task Expansion: New task families can be introduced by scripting additional staged goal sequences in the Examiner; the evaluation logic (prompts, labeling, scoring) is modular and open to extension (a hypothetical registry sketch follows this list).
- Deployment Context Flexibility: Because audio routing is streaming-native and not file-based, FDB‑v2 naturally supports real-time, low-latency evaluation in both laboratory and field settings.
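One plausible shape for this extension point is a small registry keyed by family name, as in the sketch below; the function names and fields are assumptions for illustration, not FDB‑v2's actual API.

```python
# Hypothetical task-family registry; names and signatures are illustrative.
TASK_FAMILIES: dict[str, dict] = {}

def register_task_family(name: str, stages: list[dict], judge_prompt: str):
    """Attach a staged-goal script and a scoring rubric under a family name."""
    TASK_FAMILIES[name] = {"stages": stages, "judge_prompt": judge_prompt}

register_task_family(
    "entity_tracking_extended",  # example family name, not from the paper
    stages=[{"stage": 1,
             "examiner_goal": "Introduce three items, then refer to 'the second one'"}],
    judge_prompt="Rate 1-5 how consistently the Evaluatee resolves references across turns.",
)
```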
This architecture makes FDB‑v2 suitable both for rapid prototyping of dialogue agents and for systematic, reproducible cross-model comparison in research and industrial settings.
6. Standardized Protocol and Technical Details
FDB‑v2 relies on a standardized, open-sourced streaming protocol:
- Persistent WebRTC Channels: Bidirectional, low-latency audio transmission.
- Canonical Audio Specification: 48 kHz, 16-bit, mono PCM in strict 10 ms (960-byte) frames (a frame-validation sketch follows this list).
- Adapters: Wrappers convert native model output to the canonical format, achieving uniformity regardless of model-specific audio handling.
- ASR and JSON LLM-Judge Interface: Streaming ASR keeps the transcript continuously aligned with the audio timeline, enabling robust per-turn and cross-turn evaluation, while JSON-based LLM judging supports automated, explainable scoring.
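As a sanity check on wire-format conformance, a harness might validate frame size and inter-arrival cadence as sketched below; the 2 ms jitter tolerance is an assumed value, not one specified by FDB‑v2.

```python
FRAME_BYTES = 960      # 10 ms of 48 kHz, 16-bit, mono PCM
FRAME_PERIOD = 0.010   # seconds between frames
TOLERANCE = 0.002      # jitter budget (assumed, not specified by FDB-v2)

def check_stream(frames_with_arrival_times):
    """Yield (index, problem) for frames violating size or timing.

    `frames_with_arrival_times` is an iterable of (bytes, arrival_time)
    pairs, e.g., collected from the orchestrator's receive loop.
    """
    prev_t = None
    for i, (frame, t) in enumerate(frames_with_arrival_times):
        if len(frame) != FRAME_BYTES:
            yield i, f"bad size {len(frame)} B (expected {FRAME_BYTES} B)"
        if prev_t is not None:
            drift = abs((t - prev_t) - FRAME_PERIOD)
            if drift > TOLERANCE:
                yield i, f"cadence off by {drift * 1000:.1f} ms"
        prev_t = t
```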
The technical design enables reproducible measurement and precise diagnostic feedback, facilitating both fair benchmarking and targeted system-level improvement.
7. Conclusion
Full-Duplex-Bench-v2 is a comprehensive, extensible, and streaming-aware evaluation framework for multi-turn, full-duplex speech dialogue systems. Its core contribution is a reproducible, open infrastructure for scoring real-time instructional, correctional, and entity tracking competencies, with fine-grained temporal metrics and robust scenario diversity. As full-duplex dialogue agents move toward widespread deployment, FDB‑v2 establishes a rigorous empirical basis for tracking progress and uncovering persistent failure modes in the domain of open-ended, human-like conversational AI (Lin et al., 9 Oct 2025).