ClarifyAgent: Modular Multi-Turn Clarification
- ClarifyAgent is a modular system that decomposes the clarification process into perception, forecasting, tracking, and planning.
- It mitigates under-clarification biases by dynamically selecting between clarifying questions and direct answers in multi-turn dialogues.
- Evaluation on ClarifyMT-Bench demonstrates significant accuracy improvements over baseline LLM prompting under noisy, adversarial conditions.
ClarifyAgent is a modular, agentic reasoning system for multi-turn clarification in conversational AI, designed to robustly resolve ambiguities in open-domain interactions. It addresses systematic under-clarification biases present in standard LLMs when faced with vague, incomplete, or adversarial user responses. ClarifyAgent operates as a structured agent that decomposes the clarification process into distinct modules—perception, persona forecasting, slot tracking, and planning—executed via sequential LLM prompting without additional network training. Its development is tightly integrated with ClarifyMT-Bench, a large-scale benchmark facilitating reproducible evaluation of clarification behaviors across diverse ambiguity types and simulated user personas (Luo et al., 24 Dec 2025).
1. Motivation and Problem Landscape
ClarifyAgent addresses persistent failure modes in deployed conversational assistants where users provide under-specified, contradictory, or misleading input over multiple dialogue turns. Prior LLM-based systems (GPT-4.1, Gemini, Claude, etc.) exhibit significant "under-clarification bias": they answer prematurely rather than eliciting necessary clarifications, with marked performance degradation as dialogue depth increases (accuracy drops from ~90% to ~55–80% between turn 1 and turn 3 on ClarifyMT-Bench). This problem is exacerbated under non-cooperative user personas, whereas existing benchmarks measure only single-turn, cooperative cases and lack mechanisms to quantify when agents should ask versus answer (Luo et al., 24 Dec 2025).
ClarifyMT-Bench provides 6,120 multi-turn synthetic dialogues structured around a five-dimensional ambiguity taxonomy (Linguistic, Intent, Contextual, Epistemic, Interactional) and six cooperative-to-adversarial simulated user personas. This enables detailed assessment of decision accuracy, robustness to interactional noise, and clarifying-question quality (Tables 3–5, Figures 2–4).
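To make the benchmark structure concrete, the sketch below shows one way a single ClarifyMT-Bench dialogue record could be represented; the class and field names are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    speaker: str       # "user" or "assistant"
    text: str
    gold_action: str   # "Clarify" or "Answer" (ground-truth decision, assistant turns only)

@dataclass
class ClarifyDialogue:
    dialogue_id: str
    ambiguity_type: str  # Linguistic, Intent, Contextual, Epistemic, or Interactional
    persona: str         # Precise, Partial-Vague, Off-Focus, Contradictory, Factually-Wrong, or Refusal
    turns: List[Turn] = field(default_factory=list)
```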
2. Modular Architecture and Reasoning Loop
ClarifyAgent is defined as a perception–reasoning–action agent comprising four core modules:
| Module | Function | Output |
|---|---|---|
| Perceiver | Extracts slot candidates, labels slot status | Slot status in {unfilled, filled, conflict} |
| Forecaster | Infers user persona | Persona in {Precise, Partial–Vague, Off–Focus, Contradictory, Factually–Wrong, Refusal} |
| Tracker | Maintains finite-state slot memory | Updated slot-state memory $f_t$ |
| Planner | Integrates slot state and persona, chooses action | Action in {Clarify, Answer} |
The operational loop for each dialogue turn (a minimal code sketch follows the list):
- Perceiver processes the current user utterance to extract and label ambiguity-relevant slots.
- Tracker updates internal slot state memory (unfilled, filled, conflict) for all slots based on incoming and prior data.
- Forecaster classifies the user into a behavioral persona (Precise, Partial–Vague, Off–Focus, Contradictory, Factually–Wrong, Refusal) via LLM-prompted classification.
- Planner integrates slot status and persona to select the optimal action: issue a clarifying question (Clarify) or provide an answer (Answer).
- Output module formats and emits either a targeted clarifying question or the final answer.
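A minimal Python sketch of this per-turn loop is shown below. The helper names (`call_perceiver`, `call_forecaster`, `call_planner`, the two generators) are hypothetical prompt wrappers around the backbone LLM and do not come from the paper; this is an orchestration sketch, not the reference implementation.

```python
# Illustrative per-turn loop for ClarifyAgent.
# call_perceiver / call_forecaster / call_planner and the generate_* helpers
# are assumed prompt wrappers around a backbone LLM (hypothetical names).

def clarify_agent_turn(dialogue_history, slot_state, user_utterance):
    # 1. Perceiver: extract ambiguity-relevant slots from the new utterance.
    perceived = call_perceiver(dialogue_history, user_utterance)

    # 2. Tracker: update finite-state slot memory (unfilled / filled / conflict).
    for slot, value in perceived.items():
        slot_state.update(slot, value)

    # 3. Forecaster: classify the user's behavioral persona.
    persona = call_forecaster(dialogue_history, user_utterance)

    # 4. Planner: decide between Clarify and Answer given slots and persona.
    action = call_planner(slot_state, persona)

    # 5. Output: emit a targeted clarifying question or the final answer.
    if action == "Clarify":
        return generate_clarifying_question(slot_state, persona)
    return generate_answer(dialogue_history, slot_state)
```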
Pseudocode for slot status tracking:
```
for each slot s_i observed in the current turn:
    if the new value for s_i matches the previously filled value  ⇒  no change
    else if it contradicts the previous value                     ⇒  f_t(s_i) = conflict
    else if it supplies new information                           ⇒  f_t(s_i) = filled
```
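As a concrete (assumed) realization of this update rule, the snippet below implements the tracker as a mapping from slots to (status, value) pairs; the class and method names are illustrative, not from the paper.

```python
# Illustrative tracker implementing the slot-update rule above.
# Statuses follow the paper's finite-state memory: unfilled / filled / conflict.

UNFILLED, FILLED, CONFLICT = "unfilled", "filled", "conflict"

class SlotTracker:
    def __init__(self, slots):
        # Every ambiguity-relevant slot starts unfilled with no value.
        self.state = {s: (UNFILLED, None) for s in slots}

    def update(self, slot, new_value):
        status, value = self.state[slot]
        if new_value is None:
            return  # no information supplied for this slot
        if status == FILLED and new_value == value:
            return  # matches previously filled value -> no change
        if status == FILLED and new_value != value:
            self.state[slot] = (CONFLICT, value)   # contradicts previous value
        else:
            self.state[slot] = (FILLED, new_value) # supplies new information
```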
3. Formalization and Decision Policy
Let $S = \{s_1, \ldots, s_n\}$ be all ambiguity-relevant slots, with $S_u \subseteq S$ the required slots whose values remain unknown. The agent maintains a state function $f_t : S \to \{\text{unfilled}, \text{filled}, \text{conflict}\}$. The policy for selecting the action $a_t \in \{\text{Clarify}, \text{Answer}\}$ is

$$a_t = \pi(f_t, p_t) = \begin{cases} \text{Clarify}, & \text{if } \exists\, s \in S_u \text{ with } f_t(s) \neq \text{filled},\\ \text{Answer}, & \text{otherwise}, \end{cases}$$

where the persona estimate $p_t$ modulates how the Planner resolves borderline cases.
Forecasting is realized as $p_t = \mathrm{Forecaster}(h_t)$, with $h_t$ the dialogue history up to turn $t$, via zero/few-shot LLM prompts. Planning aims to maximize correct action selection versus the ground-truth action $a_t^{*}$, minimizing false-clarification (over-clarification) and missed-clarification (under-clarification) errors.
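A minimal, non-LLM stand-in for this decision rule might look as follows; `slot_state` here is a plain mapping from slot name to status, and the function signature is an illustrative assumption rather than the paper's prompt-based Planner, which additionally conditions on the persona.

```python
def plan(slot_state, required_slots, persona):
    # slot_state: mapping slot -> status in {"unfilled", "filled", "conflict"}.
    # persona is accepted for parity with the paper's Planner, which also
    # conditions on it; this simplified rule ignores it.
    for slot in required_slots:
        if slot_state.get(slot, "unfilled") != "filled":
            return "Clarify"   # some required slot is unfilled or in conflict
    return "Answer"
```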
4. Implementation Details
All modules operate atop backbone LLMs (Llama-3.1-8B-It, Qwen-2.5-7B-It) in greedy decoding mode (temperature $=0$), served via vLLM or closed-source APIs. Each reasoning component is realized by a tailored zero/few-shot prompt consuming the normalized, structured dialogue state (JSON-style input). No model fine-tuning is performed; all behaviors leverage the pretrained LLM's internal reasoning. Simulated user personas are generated by LLM prompting and cross-validated by human annotators, with inter-annotator agreement measured by Cohen's $\kappa$.
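As an illustration of this serving setup, a greedy-decoding call through vLLM might look like the following; the prompt text is a placeholder for one of the module prompts, which are not reproduced here, and only the model identifier is taken from standard usage.

```python
from vllm import LLM, SamplingParams

# Greedy decoding: temperature 0 makes module outputs deterministic.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=256)

# Placeholder Perceiver-style prompt over a JSON-encoded dialogue state.
prompt = (
    "You are the Perceiver module. Given the dialogue state below, "
    "list each ambiguity-relevant slot and label it unfilled, filled, or conflict.\n"
    '{"turns": [{"speaker": "user", "text": "Book me a table for tonight."}]}'
)

outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```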
5. Evaluation on ClarifyMT-Bench
Performance is measured on all 6,120 multi-turn dialogues (average 2.67 turns per dialogue), with the primary metric being turn-level decision accuracy. Secondary metrics include under-clarification bias (incorrect premature answers), over-clarification bias (excessive questioning), persona-specific breakdown, ambiguity-subtype analysis, and dialogue-depth robustness. Question quality is assessed by both LLM-as-Judge (GPT-4.1 scores, scale 0–5) and human evaluation, with reported correlation between the two assessments (Figures 3–4).
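The sketch below shows one way these turn-level decision metrics could be computed from predicted and gold actions; the function and normalization are assumptions for illustration, not the benchmark's released tooling.

```python
def decision_metrics(pred_actions, gold_actions):
    """Turn-level decision accuracy plus under-/over-clarification rates.
    Under-clarification: answered when a clarifying question was required.
    Over-clarification: asked a question when a direct answer was expected."""
    assert len(pred_actions) == len(gold_actions) and gold_actions
    n = len(gold_actions)
    correct = sum(p == g for p, g in zip(pred_actions, gold_actions))
    under = sum(p == "Answer" and g == "Clarify" for p, g in zip(pred_actions, gold_actions))
    over = sum(p == "Clarify" and g == "Answer" for p, g in zip(pred_actions, gold_actions))
    return {
        "decision_accuracy": correct / n,
        "under_clarification_rate": under / n,
        "over_clarification_rate": over / n,
    }
```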
Key baseline findings:
| Model | Turn 1 Accuracy | Turn 3 Accuracy |
|---|---|---|
| GPT-4.1 | ~90% | ~80% |
| Gemini-2.5 | ~80–90% | ~55–80% |
ClarifyAgent achieves major gains over prompting baselines:
| Backbone | Prompt-only | ClarifyAgent | Improvement |
|---|---|---|---|
| Llama-3.1-8B | 71.2% | 88.4% | +17 pp |
| Qwen-2.5-7B | 57.9% | 88.0% | +30 pp |
Improvement is especially strong on noisy/adversarial personas (Partial–Vague, Off–Focus, Contradictory, Factually–Wrong). Each turn requires five LLM forward passes (roughly one per module), matching sample-based baselines in computational cost.
Module ablation effects:
| Removed Module | Accuracy Drop | Variance Increase |
|---|---|---|
| Perceiver | 5–10 pp | ~2× |
| Forecaster | 5–10 pp | ~2× |
| Planner | ~5 pp | forced over-/under-clarification |
6. Key Insights, Comparative Advantages, and Limitations
ClarifyAgent’s explicit decomposition (perception, forecasting, slot tracking, planning) results in high robustness under ambiguous and adversarial dialogue regimes. Persona inference modulates when the agent should interrupt with clarifying questions versus answer directly, reducing both unnecessary interaction and premature answers in noisy contexts. The explicit finite-state machine tracker prevents hallucination of missing information by maintaining a clear ambiguity state. The agentic modular reasoning framework is agnostic to backbone model—gains are observed on both open-source and closed-source LLMs.
Limitations remain: the fixed slot taxonomy restricts granularity for highly dynamic domains; module policies are based on prompt design and not end-to-end learned representations; multi-modal context and adaptive stopping could further improve clarification fidelity.
7. Future Directions and Research Implications
ClarifyAgent opens new research frontiers for clarificatory agent design in real-world settings. Potential extension areas include:
- Dynamic slot taxonomy expansion for novel ambiguity categories
- Learning module-level policies end-to-end using RL from human feedback
- Incorporating multimodal sensory data for richer contextual disambiguation
- Empirical studies of long-horizon clarification in real user interaction
- Benchmarking clarificatory skill as a standard for conversational agent evaluation
The systematic agentic approach demonstrated by ClarifyAgent establishes a reproducible baseline for when LLMs should ask versus answer and how agents should navigate ambiguity in real-world, multi-turn human-LLM interactions (Luo et al., 24 Dec 2025).
References
- "ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational LLMs" (Luo et al., 24 Dec 2025)