ClarifyAgent: Modular Multi-Turn Clarification
- ClarifyAgent is a modular system that decomposes the clarification process into perception, forecasting, tracking, and planning.
- It mitigates under-clarification biases by dynamically selecting between clarifying questions and direct answers in multi-turn dialogues.
- Evaluation on ClarifyMT-Bench demonstrates significant accuracy improvements over baseline LLM prompting under noisy, adversarial conditions.
ClarifyAgent is a modular, agentic reasoning system for multi-turn clarification in conversational AI, designed to robustly resolve ambiguities in open-domain interactions. It addresses systematic under-clarification biases present in standard LLMs when faced with vague, incomplete, or adversarial user responses. ClarifyAgent operates as a structured agent that decomposes the clarification process into distinct modules—perception, persona forecasting, slot tracking, and planning—executed via sequential LLM prompting without additional network training. Its development is tightly integrated with ClarifyMT-Bench, a large-scale benchmark facilitating reproducible evaluation of clarification behaviors across diverse ambiguity types and simulated user personas (Luo et al., 24 Dec 2025).
1. Motivation and Problem Landscape
ClarifyAgent addresses persistent failure modes in deployed conversational assistants where users provide under-specified, contradictory, or misleading input over multiple dialogue turns. Prior LLM-based systems (GPT-4.1, Gemini, Claude, etc.) exhibit significant "under-clarification bias": they answer prematurely rather than eliciting necessary clarifications, with marked performance degradation as dialogue depth increases (accuracy drops from ~90% to ~55–80% between turn 1 and turn 3 on ClarifyMT-Bench). This problem is exacerbated under non-cooperative user personas, whereas existing benchmarks measure only single-turn, cooperative cases and lack mechanisms to quantify when agents should ask versus answer (Luo et al., 24 Dec 2025).
ClarifyMT-Bench provides 6,120 multi-turn synthetic dialogues structured around a five-dimensional ambiguity taxonomy (Linguistic, Intent, Contextual, Epistemic, Interactional) and six cooperative-to-adversarial simulated user personas. This enables detailed assessment of decision accuracy, robustness to interactional noise, and clarifying-question quality (Tables 3–5, Figures 2–4).
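To make the benchmark structure concrete, the sketch below shows one way a single ClarifyMT-Bench dialogue record could be represented; the class and field names are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    speaker: str       # "user" or "assistant"
    text: str
    gold_action: str   # "Clarify" or "Answer" (ground-truth decision, assistant turns only)

@dataclass
class ClarifyDialogue:
    dialogue_id: str
    ambiguity_type: str  # Linguistic, Intent, Contextual, Epistemic, or Interactional
    persona: str         # Precise, Partial-Vague, Off-Focus, Contradictory, Factually-Wrong, or Refusal
    turns: List[Turn] = field(default_factory=list)
```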
2. Modular Architecture and Reasoning Loop
ClarifyAgent is defined as a perception–reasoning–action agent comprising four core modules:
| Module | Function | Output |
|---|---|---|
| Perceiver | Extracts slot candidates, labels slot status | Slot status in {unfilled, filled, conflict} |
| Forecaster | Infers user persona | Persona in {Precise, Partial–Vague, Off–Focus, Contradictory, Factually–Wrong, Refusal} |
| Tracker | Maintains finite-state slot memory | Updated slot-state memory $f_t$ |
| Planner | Integrates slot state and persona, chooses action | Action in {Clarify, Answer} |
The operational loop for each dialogue turn (a minimal code sketch follows the list):
- Perceiver processes the current user utterance to extract and label ambiguity-relevant slots.
- Tracker updates internal slot state memory (unfilled, filled, conflict) for all slots based on incoming and prior data.
- Forecaster classifies the user into a behavioral persona (Precise, Partial–Vague, Off–Focus, Contradictory, Factually–Wrong, Refusal) via LLM-prompted classification.
- Planner integrates slot status and persona to select the optimal action: issue a clarifying question (Clarify) or provide an answer (Answer).
- Output module formats and emits either a targeted clarifying question or the final answer.
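A minimal Python sketch of this per-turn loop is shown below. The helper names (`call_perceiver`, `call_forecaster`, `call_planner`, the two generators) are hypothetical prompt wrappers around the backbone LLM and do not come from the paper; this is an orchestration sketch, not the reference implementation.

```python
# Illustrative per-turn loop for ClarifyAgent.
# call_perceiver / call_forecaster / call_planner and the generate_* helpers
# are assumed prompt wrappers around a backbone LLM (hypothetical names).

def clarify_agent_turn(dialogue_history, slot_state, user_utterance):
    # 1. Perceiver: extract ambiguity-relevant slots from the new utterance.
    perceived = call_perceiver(dialogue_history, user_utterance)

    # 2. Tracker: update finite-state slot memory (unfilled / filled / conflict).
    for slot, value in perceived.items():
        slot_state.update(slot, value)

    # 3. Forecaster: classify the user's behavioral persona.
    persona = call_forecaster(dialogue_history, user_utterance)

    # 4. Planner: decide between Clarify and Answer given slots and persona.
    action = call_planner(slot_state, persona)

    # 5. Output: emit a targeted clarifying question or the final answer.
    if action == "Clarify":
        return generate_clarifying_question(slot_state, persona)
    return generate_answer(dialogue_history, slot_state)
```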
Pseudocode for slot status tracking:
```
for each slot s_i observed in the current turn:
    if the new value for s_i matches the previously filled value  ⇒  no change
    else if it contradicts the previous value                     ⇒  f_t(s_i) = conflict
    else if it supplies new information                           ⇒  f_t(s_i) = filled
```
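As a concrete (assumed) realization of this update rule, the snippet below implements the tracker as a mapping from slots to (status, value) pairs; the class and method names are illustrative, not from the paper.

```python
# Illustrative tracker implementing the slot-update rule above.
# Statuses follow the paper's finite-state memory: unfilled / filled / conflict.

UNFILLED, FILLED, CONFLICT = "unfilled", "filled", "conflict"

class SlotTracker:
    def __init__(self, slots):
        # Every ambiguity-relevant slot starts unfilled with no value.
        self.state = {s: (UNFILLED, None) for s in slots}

    def update(self, slot, new_value):
        status, value = self.state[slot]
        if new_value is None:
            return  # no information supplied for this slot
        if status == FILLED and new_value == value:
            return  # matches previously filled value -> no change
        if status == FILLED and new_value != value:
            self.state[slot] = (CONFLICT, value)   # contradicts previous value
        else:
            self.state[slot] = (FILLED, new_value) # supplies new information
```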
3. Formalization and Decision Policy
Let $S = \{s_1, \ldots, s_n\}$ be all ambiguity-relevant slots, with $S_u \subseteq S$ the required slots whose values remain unknown. The agent maintains a state function $f_t : S \to \{\text{unfilled}, \text{filled}, \text{conflict}\}$. The policy for selecting the action $a_t \in \{\text{Clarify}, \text{Answer}\}$ is

$$a_t = \pi(f_t, p_t) = \begin{cases} \text{Clarify}, & \text{if } \exists\, s \in S_u \text{ with } f_t(s) \neq \text{filled},\\ \text{Answer}, & \text{otherwise}, \end{cases}$$

where the persona estimate $p_t$ modulates how the Planner resolves borderline cases.
Forecasting is realized as $p_t = \mathrm{Forecaster}(h_t)$, with $h_t$ the dialogue history up to turn $t$, via zero/few-shot LLM prompts. Planning aims to maximize correct action selection versus the ground-truth action $a_t^{*}$, minimizing false-clarification (over-clarification) and missed-clarification (under-clarification) errors.
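A minimal, non-LLM stand-in for this decision rule might look as follows; `slot_state` here is a plain mapping from slot name to status, and the function signature is an illustrative assumption rather than the paper's prompt-based Planner, which additionally conditions on the persona.

```python
def plan(slot_state, required_slots, persona):
    # slot_state: mapping slot -> status in {"unfilled", "filled", "conflict"}.
    # persona is accepted for parity with the paper's Planner, which also
    # conditions on it; this simplified rule ignores it.
    for slot in required_slots:
        if slot_state.get(slot, "unfilled") != "filled":
            return "Clarify"   # some required slot is unfilled or in conflict
    return "Answer"
```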
4. Implementation Details
All modules operate atop backbone LLMs (Llama-3.1-8B-It, Qwen-2.5-7B-It) in greedy decoding mode (temperature $=0$), served via vLLM or closed-source APIs. Each reasoning component is realized by a tailored zero/few-shot prompt consuming the normalized, structured dialogue state (JSON-style input). No model fine-tuning is performed; all behaviors leverage the pretrained LLM's internal reasoning. Simulated user personas are generated by LLM prompting and cross-validated by human annotators, with inter-annotator agreement measured by Cohen's $\kappa$.
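As an illustration of this serving setup, a greedy-decoding call through vLLM might look like the following; the prompt text is a placeholder for one of the module prompts, which are not reproduced here, and only the model identifier is taken from standard usage.

```python
from vllm import LLM, SamplingParams

# Greedy decoding: temperature 0 makes module outputs deterministic.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=256)

# Placeholder Perceiver-style prompt over a JSON-encoded dialogue state.
prompt = (
    "You are the Perceiver module. Given the dialogue state below, "
    "list each ambiguity-relevant slot and label it unfilled, filled, or conflict.\n"
    '{"turns": [{"speaker": "user", "text": "Book me a table for tonight."}]}'
)

outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```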
5. Evaluation on ClarifyMT-Bench
Performance is measured on all 6,120 multi-turn dialogues (average 2.67 turns per dialogue), with the primary metric being turn-level decision accuracy. Secondary metrics include under-clarification bias (incorrect premature answers), over-clarification bias (excessive questioning), persona-specific breakdown, ambiguity-subtype analysis, and dialogue-depth robustness. Question quality is assessed by both LLM-as-Judge (GPT-4.1 scores, scale 0–5) and human evaluation, with reported correlation between the two assessments (Figures 3–4).
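The sketch below shows one way these turn-level decision metrics could be computed from predicted and gold actions; the function and normalization are assumptions for illustration, not the benchmark's released tooling.

```python
def decision_metrics(pred_actions, gold_actions):
    """Turn-level decision accuracy plus under-/over-clarification rates.
    Under-clarification: answered when a clarifying question was required.
    Over-clarification: asked a question when a direct answer was expected."""
    assert len(pred_actions) == len(gold_actions) and gold_actions
    n = len(gold_actions)
    correct = sum(p == g for p, g in zip(pred_actions, gold_actions))
    under = sum(p == "Answer" and g == "Clarify" for p, g in zip(pred_actions, gold_actions))
    over = sum(p == "Clarify" and g == "Answer" for p, g in zip(pred_actions, gold_actions))
    return {
        "decision_accuracy": correct / n,
        "under_clarification_rate": under / n,
        "over_clarification_rate": over / n,
    }
```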
Key baseline findings:
| Model | Turn 1 Accuracy | Turn 3 Accuracy |
|---|---|---|
| GPT-4.1 | ~90% | ~80% |
| Gemini-2.5 | ~80–90% | ~55–80% |
ClarifyAgent achieves major gains over prompting baselines:
| Backbone | Prompt-only | ClarifyAgent | Improvement |
|---|---|---|---|
| Llama-3.1-8B | 71.2% | 88.4% | +17 pp |
| Qwen-2.5-7B | 57.9% | 88.0% | +30 pp |
Improvement is especially strong on noisy/adversarial personas (Partial–Vague, Off–Focus, Contradictory, Factually–Wrong). Each turn requires five LLM forward passes (roughly one per module), matching sample-based baselines in computational cost.
Module ablation effects:
| Removed Module | Accuracy Drop | Variance Increase |
|---|---|---|
| Perceiver | 5–10 pp | ~2× |
| Forecaster | 5–10 pp | ~2× |
| Planner | ~5 pp | forced over-/under-clarification |
6. Key Insights, Comparative Advantages, and Limitations
ClarifyAgent’s explicit decomposition (perception, forecasting, slot tracking, planning) results in high robustness under ambiguous and adversarial dialogue regimes. Persona inference modulates when the agent should interrupt with clarifying questions versus answer directly, reducing both unnecessary interaction and premature answers in noisy contexts. The explicit finite-state machine tracker prevents hallucination of missing information by maintaining a clear ambiguity state. The agentic modular reasoning framework is agnostic to backbone model—gains are observed on both open-source and closed-source LLMs.
Limitations remain: the fixed slot taxonomy restricts granularity for highly dynamic domains; module policies are based on prompt design and not end-to-end learned representations; multi-modal context and adaptive stopping could further improve clarification fidelity.
7. Future Directions and Research Implications
ClarifyAgent opens new research frontiers for clarificatory agent design in real-world settings. Potential extension areas include:
- Dynamic slot taxonomy expansion for novel ambiguity categories
- Learning module-level policies end-to-end using RL from human feedback
- Incorporating multimodal sensory data for richer contextual disambiguation
- Empirical studies of long-horizon clarification in real user interaction
- Benchmarking clarificatory skill as a standard for conversational agent evaluation
The systematic agentic approach demonstrated by ClarifyAgent establishes a reproducible baseline for when LLMs should ask versus answer and how agents should navigate ambiguity in real-world, multi-turn human-LLM interactions (Luo et al., 24 Dec 2025).
References
- "ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational LLMs" (Luo et al., 24 Dec 2025)