TOD-ProcBench: Process-Oriented Dialogue Benchmark
- TOD-ProcBench is a benchmark that assesses large language models’ ability to follow complex, multi-level if–then process instructions in task-oriented dialogues.
- It employs both hierarchical and flattened instruction formats to evaluate models on instruction retrieval, violation detection, and compliant response generation.
- The benchmark leverages a large, multilingual collection of instruction–conversation pairs to rigorously test LLMs' reasoning and adherence to nuanced conditional workflows.
TOD-ProcBench is a public benchmark developed to systematically evaluate large language models' (LLMs) abilities to understand, retrieve, and precisely execute multi-level, fine-grained natural language process instructions in multi-turn task-oriented dialogues (TODs) (Ghazarian et al., 20 Nov 2025). Unlike previous TOD evaluation schemes that simplify instructions into slot–intent API schemas, TOD-ProcBench focuses on the intricate, conditional workflows that real-world service agents must follow. Process instructions are encoded as hierarchical or flattened "If–Then" statements, attaching explicit conditions to each conversational action. Benchmark tasks span instruction retrieval, violation detection, and compliant response generation, offering a comprehensive lens on the true capabilities and limitations of modern LLMs in complex, instruction-following TOD scenarios.
1. Construction and Scope of TOD-ProcBench
TOD-ProcBench is built on the ABCD dataset [Chen et al., 2021], extracting 55 user intents and aggregating approximately 146 human–human dialogues per intent. For each intent, one instruction document is generated in English using one-shot LLM prompting and subsequent human refinement. Human quality control yielded 769 high-quality instruction–conversation pairs from 1,004 tested, achieving 82% annotation “accuracy” and only 15% judged as “missing” content after majority vote. Format and multilingual variants extend coverage: the benchmark offers three text formats (nested If–Then, flattened If–Then, and flattened JSON) and translations into seven languages (EN, AR, ZH, FR, DE, HI, ES), resulting in approximately 16,137 conversation–instruction examples.
Each instruction document in its most expressive form employs up to four-level nested If–Then logic (e.g., for “return-wrong-size” intents), supporting the formulation of nuanced, practical workflows. LLM-generated instructions are validated both by human annotation and by LLM judges for format correctness and translation fidelity (98.3–99.8%).
2. Process Instruction Model and Formalism
TOD-ProcBench formalizes instruction documents as sets of multi-level “Condition–Action” statements:
- Let $\mathcal{C}$ be the set of atomic conditions and $\mathcal{A}$ the set of atomic actions; each instruction document $I \in \mathcal{I}$ (where $\mathcal{I}$ is the set of documents) is a collection $\{(c_j, a_j)\}_j$, with $c_j \in \mathcal{C}$ and $a_j \in \mathcal{A}$.
- Hierarchies manifest when an action itself triggers further If–Then branches.
- Instruction composition employs logical operations such as And, Or, Chain, Selection, and Nesting. For flattening, the intermediate conditions along a nested path are conjoined, yielding one-level statement sets $\{(c_{j_1} \wedge \cdots \wedge c_{j_m}, a_j)\}$ where each $c_{j_i} \in \mathcal{C}$ (see the sketch below).
- Three representational formats are provided:
  - f₁: Nested If–Then (human-readable)
  - f₂: Flattened If–Then (conjunctive, one-level)
  - f₃: Flattened JSON (structured tuples)
This explicit modeling enables the systematic probing of LLM capacity for multi-constraint, conditional action planning within realistic dialogues.
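To make the flattening concrete, the following minimal sketch converts a small nested If–Then structure into flattened conjunctive statements (f₂) and JSON tuples (f₃). The dict-based schema and the example conditions are hypothetical illustrations, not the benchmark's actual data format.

```python
import json

# Hypothetical nested If-Then node: a condition, an optional atomic action,
# and child branches that only apply when the condition holds.
nested_instruction = {
    "condition": "customer wants to return an item",
    "action": None,
    "children": [
        {"condition": "item is the wrong size", "action": "offer a size exchange", "children": []},
        {"condition": "item is damaged", "action": "issue a full refund", "children": []},
    ],
}

def flatten(node, inherited=()):
    """Emit (conjoined conditions, action) pairs by walking the nested structure (format f2)."""
    conditions = inherited + (node["condition"],)
    statements = [(conditions, node["action"])] if node["action"] else []
    for child in node.get("children", []):
        statements.extend(flatten(child, conditions))
    return statements

flat = flatten(nested_instruction)
for conds, action in flat:
    print(f"If {' and '.join(conds)}, then {action}.")  # flattened If-Then (f2)

# Flattened JSON tuples (f3)
print(json.dumps([{"conditions": list(c), "action": a} for c, a in flat], indent=2))
```

The same traversal could be reused to measure nesting depth or to enumerate the atomic conditions and actions referenced by a document.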
3. Task Suite and Evaluation Metrics
TOD-ProcBench defines three principal tasks, each exploiting instruction–conversation pairs and employing chain-of-thought (CoT) prompting:
3.1 Task 1: Instruction Retrieval and Next-Action Prediction
Given a partial conversation history $h$ and the full instruction document $I$, models must:
- Retrieve the top-$k$ relevant instruction statements $\{(c_j, a_j)\} \subseteq I$.
- Predict the next single action $a \in \mathcal{A}$.
Metrics:
- Instruction retrieval accuracy@$k$: fraction of examples whose gold statement appears among the top-$k$ retrieved, $\mathrm{Acc@}k = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\big[(c_i^{*}, a_i^{*}) \in \hat{S}_i^{(k)}\big]$, where $\hat{S}_i^{(k)}$ is the retrieved set for example $i$
- Next-action accuracy: $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\big[\hat{a}_i = a_i^{*}\big]$
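A minimal sketch of how these two Task 1 metrics could be computed from model outputs, assuming each example records a gold statement ID, a ranked list of retrieved statement IDs, and gold/predicted actions (the field names are hypothetical):

```python
from typing import Dict, List

def retrieval_accuracy_at_k(examples: List[Dict], k: int = 3) -> float:
    """Fraction of examples whose gold statement appears among the top-k retrieved statements."""
    hits = sum(ex["gold_statement_id"] in ex["retrieved_ids"][:k] for ex in examples)
    return hits / len(examples)

def next_action_accuracy(examples: List[Dict]) -> float:
    """Fraction of examples whose predicted next action exactly matches the gold action."""
    correct = sum(ex["predicted_action"] == ex["gold_action"] for ex in examples)
    return correct / len(examples)

# Two toy examples with hypothetical field names.
examples = [
    {"gold_statement_id": "s3", "retrieved_ids": ["s1", "s3", "s7"],
     "predicted_action": "verify-identity", "gold_action": "verify-identity"},
    {"gold_statement_id": "s5", "retrieved_ids": ["s2", "s4", "s6"],
     "predicted_action": "offer-refund", "gold_action": "ask-for-membership"},
]
print(retrieval_accuracy_at_k(examples, k=3))  # 0.5
print(next_action_accuracy(examples))          # 0.5
```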
3.2 Task 2: Instruction-Violation Detection
Given a conversational history $h$, instruction $I$, and candidate agent response $r$, models must classify $r$ as compliant (0) or violating (1). Violation examples are synthesized by perturbing gold instructions via parameter mismatches and action substitutions, then rewriting responses to obey the altered instructions and verifying that the rewritten responses are not entailed by the original instructions.
Classifiers:
- Direct compliance (given $h$, $I$, and $r$ as input)
- Instruction-entailment (recover the relevant statement from $(h, r)$, then check entailment against the original instruction $I$)
Metric: binary classification accuracy.
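As an illustration (not the benchmark's exact prompt template), the direct-compliance classifier can be framed as a single binary prompt over $h$, $I$, and $r$, scored with plain binary accuracy:

```python
def build_direct_compliance_prompt(history: str, instruction: str, response: str) -> str:
    """Assemble a direct-compliance classification prompt (wording is illustrative,
    not the benchmark's exact template)."""
    return (
        "You are given a process instruction, a conversation history, and a candidate "
        "agent response.\n\n"
        f"Instruction:\n{instruction}\n\n"
        f"Conversation so far:\n{history}\n\n"
        f"Candidate agent response:\n{response}\n\n"
        "Think step by step, then answer with a single label: "
        "0 if the response complies with the instruction, 1 if it violates it."
    )

def binary_accuracy(predictions: list, labels: list) -> float:
    """Binary classification accuracy over compliant (0) / violating (1) labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(binary_accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```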
3.3 Task 3: Conditional Generation of Compliant Agent Responses
Given history $h$ and instruction $I$, generate the next agent response $r$ that is fully compliant. Prompts require step decomposition, explicit next-action selection, and concise, non-hallucinated output.
Metrics:
- Compliance rate: fraction of outputs judged compliant by a Claude 3.7 judge.
- BLEU/ROUGE (optional; the benchmark focuses on compliance).
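The compliance rate can be sketched with a pluggable judge callable; in the benchmark this role is played by a Claude 3.7 judge, while the stub below is only a toy heuristic for illustration:

```python
from typing import Callable, Dict, List

def compliance_rate(examples: List[Dict], judge: Callable[[str, str, str], bool]) -> float:
    """Fraction of generated responses that the judge labels compliant."""
    verdicts = [judge(ex["history"], ex["instruction"], ex["generated_response"])
                for ex in examples]
    return sum(verdicts) / len(verdicts)

def stub_judge(history: str, instruction: str, response: str) -> bool:
    """Toy heuristic stand-in for an LLM judge: flag responses that promise a refund
    when the instruction never licenses one."""
    return "refund" not in response.lower() or "refund" in instruction.lower()

examples = [
    {"history": "...", "instruction": "If the item is damaged, then issue a refund.",
     "generated_response": "I'm sorry about that; I'll issue a refund for the damaged item."},
    {"history": "...", "instruction": "If the item is the wrong size, then offer an exchange.",
     "generated_response": "I'll process a refund right away."},
]
print(compliance_rate(examples, stub_judge))  # 0.5
```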
4. Experimental Protocol and Model Benchmarks
LLMs are evaluated in few-shot, in-context CoT settings (3 exemplars) without fine-tuning. Models include Qwen3-14B, Llama 3.3-70B, Gemma 3-27B-IT, Claude 3.5-Sonnet-V1/V2, and Claude 3.7-Sonnet. Format (f₁/f₂/f₃) and language ablations isolate the effects of structural and linguistic variation.
Aggregate Results Table
| Task | Best Model (Accuracy/Rate) | Format Sensitivity | Language Sensitivity |
|---|---|---|---|
| Task 1 | Claude 3.7: 0.395/0.431 | f₁ marginally best | All 7 lang. within ±0.02 acc |
| Task 2 | Claude 3.7: 0.761 (direct), 0.730 (entailment) | f₁ best | Little variation |
| Task 3 | Claude 3.7: 95–97% | Format-dependent; smaller models struggle | Larger LLMs robust; Llama 3.3 varies (14–57%) |
Performance on instruction retrieval and violation detection remains well below 0.5 for most models, while best-case conditional-generation compliance reaches the high-90% range only in top-tier models. A plausible implication is that surface-level output compliance is easier to achieve via CoT prompting than authentic reasoning over complex instruction logic.
5. Qualitative Findings and Error Characterization
Fine-grained analysis reveals characteristic LLM errors, such as confusing closely related conditions (e.g., “ask-for-membership” versus “verify-identity”) or failing to respect deeply nested action hierarchies. Figure 1 in the original paper demonstrates that simple workflow-based scoring can yield high false-compliance rates, while TOD-ProcBench's hierarchical instructions robustly identify nuanced violations. Qualitative ablations further show that smaller models often bypass format constraints or hallucinate actions not supported by instructions.
Some models, especially in non-English conversations or with non-default format variants, underperform due to poor structure adherence or limited grounding in condition–action dependencies.
6. Implications for Evaluation, Domain Generalization, and Future Research
TOD-ProcBench exposes significant gaps in current LLM instruction-following within the multi-turn dialogue setting. High compliance in conditional generation does not correlate with equally strong retrieval or violation-detection performance, indicating limits in tracking long-range, compositional instruction chains, a critical property for robust TOD deployment in real-world environments. Minimal performance variation across formats and languages suggests that future improvements must primarily address the models' reasoning and grounding abilities rather than superficial formatting or translation robustness.
The benchmark’s formalism and open-source release under the Llama 3.3 Community License facilitate extensibility to new domains and the adoption of interactive, process-driven evaluation pipelines. Future directions outlined include the extension to codified programmatic instruction formats, integration of robust statistical testing, and exploration of retrieval-augmented and end-to-end finetuning strategies for enhanced instruction grounding (Ghazarian et al., 20 Nov 2025).
7. Relation to Interactive TOD Evaluation and Benchmark Evolution
The procedural rigor of TOD-ProcBench aligns with recent movements in interactive TOD evaluation, such as frameworks employing user simulators and closed-loop metrics to address policy mismatch and static evaluation limitations (Cheng et al., 2022). By focusing on condition–action hierarchies rather than traditional slot-filling, TOD-ProcBench positions itself as a natural progression from schema-guided or black-box interactive approaches, enabling highly reproducible, extensible, and fine-grained assessment of instruction adherence in TOD systems.
Collectively, TOD-ProcBench provides a foundation for future research on complex instruction-following, facilitating detailed error analysis, robust cross-system comparisons, and methodologically sound progress in task-oriented dialogue modeling.