ClarifyMT-Bench Evaluation Suite
- ClarifyMT-Bench is a benchmark suite that evaluates multi-turn clarification in LLMs using a five-dimensional ambiguity taxonomy and simulated user personas.
- It employs a hybrid LLM–human pipeline to generate 6,120 controlled multi-turn dialogues, assessing when models should ask clarifying questions versus answering directly.
- The associated ClarifyAgent framework decomposes clarification into perception, forecasting, tracking, and planning, yielding significant robustness gains across ambiguity types.
ClarifyMT-Bench is a benchmark suite designed for systematic evaluation of multi-turn clarification behavior in conversational LLMs. Unlike existing clarification datasets that focus on single-turn exchanges or assume cooperative user input, ClarifyMT-Bench grounds its scenario design in a five-dimensional taxonomy of ambiguity and simulates six distinct user personas characterizing realistic and adversarial interaction patterns. The corpus comprises 6,120 multi-turn dialogues generated via a hybrid LLM–human pipeline, enabling controlled assessment of when LLMs should ask clarifying questions, when they should provide answers, and how ambiguity is navigated across realistic dialogue flows. The benchmark uncovers a pervasive under-clarification bias—models tend to answer prematurely and degrade in performance with increasing dialogue depth. To address these limitations, the associated ClarifyAgent framework decomposes clarification into modular processes of perception, forecasting, tracking, and planning, yielding substantial robustness gains across ambiguity types and user persona behaviors (Luo et al., 24 Dec 2025).
1. Ambiguity Taxonomy and Slot-Based Dialogue Modeling
ClarifyMT-Bench introduces a principled five-dimensional ambiguity taxonomy to characterize sources of uncertainty in multi-turn, open-domain LLM interactions:
- Linguistic Ambiguity: Unclear surface forms, including lexical (e.g., polysemy), syntactic (multiple parses), and semantic ambiguities.
- Intent Ambiguity: Under-specified, conflicting, or ambiguous user goals or requested scopes.
- Contextual Ambiguity: Vagueness in referents, time, location, or other discourse context elements.
- Epistemic Ambiguity: Gaps in shared knowledge, unfamiliar references, and value-laden requests.
- Interactional Ambiguity: Noisy, factually-wrong, contradictory, or off-topic user input across turns.
Dialogue state is grounded via slot-filling notation. For a set of required slots $S = \{s_1, \dots, s_K\}$, the model state at turn $t$ is $\Sigma_t = \{(s_i, v_i^t)\}_{i=1}^{K}$, where each value $v_i^t$ is filled, unfilled, or in conflict. The optimal agent action maximizes slot completion with minimal conflict, subject to the stopping rule: choose Answer once all required slots are filled without conflict (or the persona is Refusal), and Clarify otherwise.
This layered approach allows evaluation and synthesis methods to be precisely targeted at each ambiguity subtype and dialogue state.
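The slot-based state and stopping rule described above can be sketched minimally as follows. The slot names, the conflict handling, and the completion test are illustrative assumptions, not the paper's exact notation.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of slot-based dialogue state. Slot names and the
# stopping rule are illustrative stand-ins for the paper's notation.

@dataclass
class DialogueState:
    required: set                 # slots that must be filled before answering
    filled: dict = field(default_factory=dict)
    conflicts: set = field(default_factory=set)

    def update(self, slot, value):
        # A new value that disagrees with an earlier one marks a conflict.
        if slot in self.filled and self.filled[slot] != value:
            self.conflicts.add(slot)
        self.filled[slot] = value

    def should_answer(self):
        # Stopping rule: answer once every required slot is filled
        # without unresolved conflict.
        return self.required <= set(self.filled) and not self.conflicts

state = DialogueState(required={"date", "location"})
state.update("date", "2025-01-10")
assert not state.should_answer()   # "location" still missing
state.update("location", "Berlin")
assert state.should_answer()
```

A contradictory persona would trigger `conflicts`, keeping the agent in clarification mode until the conflict is resolved.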
2. Simulated User Personas and Dialogue Construction
The benchmark defines six canonical user personas to probe LLM robustness against realistic and challenging input patterns:
- Precise: Fully resolves the slot with exact information in one turn.
- Partial–Vague: Supplies hints but remains underspecified.
- Off–Focus: Responds with tangential or unrelated information.
- Contradictory: Provides internally or cross-turn conflicting responses.
- Factually–Wrong: Offers plausible yet incorrect content.
- Refusal: Declines to provide specifics or insists the system decide.
Dialogue corpus construction proceeds in two stages. In Stage 1, ambiguous single-turn queries and candidate clarifiers are produced by diverse LLM sources (GPT-4.1, GPT-5, DeepSeek-V3, CLAMBER) and deduplicated via semantic similarity and human annotation. Stage 2 simulates multi-turn dialogues for each ambiguity–persona pair, using LLM generation constrained by persona and ambiguity subtype, with expert validation showing substantial inter-annotator agreement (Cohen's $\kappa$). The final corpus comprises 6,120 dialogues averaging 2.67 turns per dialogue, balanced by persona and model source.
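Stage 1's deduplication step can be illustrated with a toy pass. The paper deduplicates by semantic similarity (plus human annotation); here a simple string-similarity ratio stands in for an embedding-based score, and the 0.9 threshold is an assumed value.

```python
from difflib import SequenceMatcher

# Toy deduplication pass over generated queries. SequenceMatcher's ratio
# is a stand-in for a semantic-similarity score; 0.9 is an assumed cutoff.

def dedupe(queries, threshold=0.9):
    kept = []
    for q in queries:
        # Keep a query only if it is sufficiently dissimilar to all kept ones.
        if all(SequenceMatcher(None, q, k).ratio() < threshold for k in kept):
            kept.append(q)
    return kept

qs = ["What time does it open?",
      "What time does it open ?",
      "Which branch do you mean?"]
print(dedupe(qs))   # the near-duplicate second query is dropped
```

In the benchmark itself this filtering is followed by human annotation, so borderline pairs are resolved by annotators rather than a fixed threshold.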
3. Evaluation Protocols and Empirical Outcomes
ClarifyMT-Bench supports two core evaluation axes:
- Ask-vs-Answer Decision Accuracy: At each turn $t$, models must predict the correct action $a_t \in \{\text{Clarify}, \text{Answer}\}$ against the reference label $a_t^{*}$. The primary metric is overall accuracy, $\mathrm{Acc} = \frac{1}{T}\sum_{t=1}^{T}\mathbb{1}[a_t = a_t^{*}]$, reported alongside Under-Clarify and Over-Clarify counts.
- Clarifying Question Quality: For sampled ambiguity subtypes, relevance and helpfulness are rated by a GPT-4.1 judge on a 0–5 scale, verified against human ratings via Pearson correlation.
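The decision metrics above can be computed as in the sketch below. The label strings and the exact definitions of the two error counts (Under-Clarify = answered when a clarifying question was expected, Over-Clarify = the reverse) are assumptions based on the descriptions here.

```python
# Sketch of the ask-vs-answer metrics: overall decision accuracy plus
# Under-Clarify and Over-Clarify error counts. Labels are illustrative.

def decision_metrics(predicted, gold):
    acc = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
    under = sum(p == "Answer" and g == "Clarify"
                for p, g in zip(predicted, gold))   # answered too early
    over = sum(p == "Clarify" and g == "Answer"
               for p, g in zip(predicted, gold))    # asked unnecessarily
    return acc, under, over

pred = ["Answer", "Answer", "Clarify", "Answer"]
gold = ["Clarify", "Answer", "Clarify", "Clarify"]
print(decision_metrics(pred, gold))   # (0.5, 2, 0)
```

The dominance of `under` over `over` in such counts is exactly the under-clarification bias the benchmark reports.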
Principal findings:
- Under-Clarification Bias: All tested models favor premature answering (insufficient clarification).
- Dialogue Depth Degradation: Accuracy falls from 80–90% (first turn) to 45–65% (second) and 20–80% (third), depending on model and persona.
- Persona Effects: Best accuracy on Off–Focus, lowest on Refusal; GPT-4.1 ranges from 40.7% (Refusal) to 89.1% (Off–Focus).
- Model Scale & Alignment: Larger, alignment-focused models (Gemini-2.5, Claude-4.5) outperform smaller LLMs; chain-of-thought prompting alone is insufficient for robust clarification.
Explicit, fact-seeking ambiguity subtypes yield higher clarifier scores than open-ended intent or value ambiguities.
4. ClarifyAgent: Modular Agentic Clarification Framework
ClarifyAgent reframes clarification via four explicit modules, each mapping to a distinct functional component:
- Perception (Perceiver): Extracts slot values, detects inconsistencies, and categorizes the ambiguity present in the current user input.
- Forecasting (Forecaster): Infers user persona, estimates cooperativeness.
- Tracking (Tracker): Maintains finite-state slot progression, determines Required Slot Completion (RSC).
- Planning (Planner): Executes reasoning over the FSM state and inferred persona, choosing Clarify (a targeted question) or Answer (when RSC is met or the persona is Refusal).
The main ClarifyAgent loop, distilled from the paper, is:
```python
for t in dialogue:
    slots = Perceiver.extract(dialogue_prefix)        # Perception: slot values
    persona = Forecaster.infer(dialogue_prefix)       # Forecasting: user persona
    Tracker.update(slots)                             # Tracking: slot FSM state
    target = None                                     # slot to ask about, if any
    if Tracker.RSC_met() or persona == Refusal:
        action = "Answer"
    else:
        action = "Clarify"
        target = Planner.select_slot(slots, persona)  # Planning: pick target slot
    response = Output.generate(action, target, dialogue_prefix)
    dialogue_prefix.append(response)
```
By making slot state and user behavior explicit per turn, ClarifyAgent mitigates under-clarification and prevents unproductive clarification loops with adversarial input.
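The loop can be made concrete with trivial, keyword-based stand-ins for the LLM-driven modules. Everything here (the required-slot set, the keyword rules, the persona trigger phrase) is a hypothetical toy, meant only to show the control flow.

```python
# Toy instantiation of the ClarifyAgent loop. The real Perceiver, Forecaster,
# and Planner are LLM-driven; these keyword versions are stand-ins.

REQUIRED = {"date", "location"}

def perceive(prefix):
    # Trivial Perceiver: slot extraction by keyword matching.
    slots = {}
    for turn in prefix:
        if "tomorrow" in turn:
            slots["date"] = "tomorrow"
        if "Berlin" in turn:
            slots["location"] = "Berlin"
    return slots

def forecast(prefix):
    # Trivial Forecaster: detect a Refusal persona from the last turn.
    return "Refusal" if prefix and "you decide" in prefix[-1] else "Cooperative"

def plan(slots):
    # Trivial Planner: target the first missing required slot.
    missing = REQUIRED - set(slots)
    return sorted(missing)[0] if missing else None

def step(prefix):
    slots = perceive(prefix)
    if REQUIRED <= set(slots) or forecast(prefix) == "Refusal":
        return ("Answer", None)
    return ("Clarify", plan(slots))

print(step(["Book a table for tomorrow."]))                # ('Clarify', 'location')
print(step(["Book a table for tomorrow.", "In Berlin."]))  # ('Answer', None)
```

Note how the Refusal check short-circuits to answering, which is what prevents the unproductive clarification loops mentioned above.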
5. Quantitative Gains and Ablation Insights
ClarifyAgent was evaluated (zero-shot, prompting-only) on open-source backbones (Llama-3.1-8B-It, Qwen-2.5-7B-It), compared to strong baseline strategies (Majority Voting, CoT, Intent-Sim, AT-CoT). All approaches used five inference passes per decision. Performance results:
- Llama-3.1-8B-It: Baseline 71.2%, AT-CoT 73.0%, ClarifyAgent 88.4% (+15.4 absolute)
- Qwen-2.5-7B-It: Baseline 57.9%, CoT 66.4%, ClarifyAgent 88.0% (+21.6 absolute)
Improvements are statistically significant, especially for noisy personas (Partial–Vague, Off–Focus, Contradictory, Factually–Wrong). Ablation analysis shows that removing any single module (Perceiver, Forecaster, Tracker, Planner) both lowers average accuracy and increases variance, underscoring that each component is necessary for balanced, robust clarification.
6. Implications for LLM Dialogue and Future Extensions
ClarifyMT-Bench establishes a reproducible, interpretable experimental bedrock for examining how LLMs manage ambiguity and clarification in multi-turn, open-domain settings. Key implications:
- Slot-Based State Modeling: Explicit slot-value tracking and persona inference are essential for robust ask-vs-answer decisions, particularly in the face of uncooperative or adversarial user input.
- Agentic Modularity: Separating dialogue perception, forecasting, tracking, and planning enables interpretable model behavior and facilitates incorporation of external resources (e.g., knowledge bases, multimodal input streams).
- System Design: Dynamic uncertainty tracking and user modeling should be integral to human–LLM systems, preventing premature answers and infinite clarification cycles.
A plausible implication is that future work in conversational AI will need to pair deeper modeling of interactional uncertainty with agentic reasoning structures akin to ClarifyAgent to achieve reliable, contextually appropriate clarification under real-world conditions (Luo et al., 24 Dec 2025).