τ²-Bench: Dual-Control Conversational AI
- τ²-Bench is a dual-control conversational AI benchmark that models decentralized environments where an agent and a user simulator independently operate.
- It formalizes Telecom support as a Dec-POMDP with disjoint tool access, enabling precise state inference, task verification, and varied solution steps.
- The benchmark’s compositional task generation and ablation studies reveal substantial performance drops when agents shift from single to dual control, isolating communication and coordination as key challenges for future systems.
τ²-Bench defines a dual-control conversational AI benchmark in which both an AI agent and a user (represented by a simulator) share and independently manipulate a partially observed, dynamic environment. This structure addresses the limitations of prior single-control benchmarks, where only the agent could operate tools and the user played a passive role. τ²-Bench models domains such as Telecom technical support as decentralized partially observable Markov decision processes (Dec-POMDPs) with two players, providing precise control, rigorous verifiability, and compositional task diversity suitable for comprehensive agent evaluation (Barres et al., 9 Jun 2025).
1. Decentralized Dual-Control Environment Formalism
τ²-Bench operationalizes the environment as a Dec-POMDP defined by the tuple $\langle I, S, \{A_i\}, T, R, \{\Omega_i\}, O, \gamma \rangle$, with $I = \{\text{agent}, \text{user}\}$ for the agent and user roles. The state space $S = DB_A \times DB_U \times H$ is the product of the internal, “private” databases ($DB_A$ for the agent, $DB_U$ for the user) and an interaction history log ($H$), capturing all previous messages, tool invocations, and observations.
Each action space $A_i$ is a disjoint union of natural-language message actions and tool-call APIs. For the agent, these tools (e.g., get_customer_by_id, enable services) modify or query the CRM or system state, whereas for the user, tools reflect phone-mocking capabilities such as toggling airplane mode or querying network status. Only one player acts per turn. The transition function $T$ produces the next state and player-specific observations (tool feedback, messages). The global reward $R$ assigns $1$ if all user-facing assertions are satisfied upon episode termination and $0$ otherwise. The discount $\gamma$ is set to $1$, framing each conversation as a finite-horizon episode.
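A minimal Python sketch of this episode loop, assuming strict turn alternation; State, run_episode, and the action/policy signatures are illustrative assumptions, not the benchmark's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class State:
    """Global Dec-POMDP state: two private DBs plus the shared history log."""
    agent_db: dict                               # agent-private CRM/service state
    user_db: dict                                # user-private device state
    history: list = field(default_factory=list)  # messages, tool calls, observations

# Illustrative signatures: an action maps state -> (observation, done);
# a policy maps the shared history -> the next action for its player.
Action = Callable[[State], tuple]
Policy = Callable[[list], Action]

def run_episode(state: State, agent: Policy, user: Policy,
                assertions: list, max_turns: int = 40) -> float:
    """Players alternate turns; the global reward is 1 iff every final-state
    assertion holds at termination (finite horizon, gamma = 1)."""
    players = [("agent", agent), ("user", user)]
    for t in range(max_turns):
        role, policy = players[t % 2]
        action = policy(state.history)   # emit a message or invoke a tool
        obs, done = action(state)        # transition: may mutate the actor's own DB
        state.history.append((role, obs))
        if done:
            break
    return 1.0 if all(check(state) for check in assertions) else 0.0
```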
2. Agent and User Tooling Paradigms
The benchmark strictly partitions tool access between the agent and the user. Agent tools are administrative APIs on CRM or telecom services, enabling state changes relevant to support workflows. User tools simulate actions a real user could perform on their device (e.g., toggling settings, retrieving status). Each player, on their action turn, may issue a message or invoke a tool call operating on their respective private database and immediately receive the resulting observation. This model necessitates robust state inference and coordination: agents must not only execute direct interventions but also elicit and guide user actions as required by the distributed control setting.
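A minimal sketch of this partition, assuming simple dict-backed registries; get_customer_by_id and toggle_airplane_mode appear in the text above, while the registry layout and the invoke helper are hypothetical:

```python
def get_customer_by_id(state, customer_id):
    """Agent tool: query the CRM side of the state."""
    return state.agent_db["customers"].get(customer_id)

def toggle_airplane_mode(state):
    """User tool: flip a device setting and report the new value."""
    state.user_db["airplane_mode"] = not state.user_db["airplane_mode"]
    return state.user_db["airplane_mode"]

AGENT_TOOLS = {"get_customer_by_id": get_customer_by_id}
USER_TOOLS = {"toggle_airplane_mode": toggle_airplane_mode}

def invoke(role: str, name: str, state, *args):
    """Each player may only call tools in its own registry; the resulting
    observation is returned to that player alone."""
    registry = AGENT_TOOLS if role == "agent" else USER_TOOLS
    if name not in registry:
        raise PermissionError(f"{role!r} has no access to tool {name!r}")
    return registry[name](state, *args)
```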
3. Compositional Task Suite Generation
Task creation in τ²-Bench is grounded in programmatic composition of atomic "subtasks." Subtasks are grouped, with each group consisting of mutually exclusive or alternative primitives, formalized as ordered sequences of three components:
- Init: functions that initialize the environment (e.g., set initial device or service state),
- Solution: permitted solution steps (agent/user tool calls required for task resolution),
- Assertion: assertions checked on the final DB state for correctness.
Tasks are generated by taking Cartesian products across all subtask groups and concatenating their respective initialization, solution, and assertion stages. Only sequences passing a deterministic verification step are retained. To ensure coverage and difficulty control, tasks are balanced by the number of subtasks or solution steps. In the Telecom domain, 15 subtask groups yield 2,285 possible combinations; 114 are sampled to ensure distributed complexity across tasks comprising 2–9 subtasks. The table below summarizes the three components, and a minimal generation sketch follows it.
| Component | Description | Example |
|---|---|---|
| Init | Set up world state | set_airplane_mode(True) |
| Solution | Solution step (agent/user tool call) | toggle_airplane_mode |
| Assertion | Final-state assertion for reward | assert_service_status("connected") |
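A minimal sketch of the compositional generation described above, assuming a simple dict encoding of subtasks; the second group and the verify stub are hypothetical placeholders:

```python
import itertools

# Illustrative subtask encoding: each group lists mutually exclusive
# alternatives, and each alternative bundles (init, solution, assert) stages.
groups = [
    [{"init": ["set_airplane_mode(True)"],
      "solution": ["toggle_airplane_mode"],
      "assert": ["assert_service_status('connected')"]}],
    [{"init": ["set_data_roaming(False)"],          # hypothetical second group
      "solution": ["enable_data_roaming"],
      "assert": ["assert_data_connected()"]}],
]

def verify(task) -> bool:
    """Placeholder for the deterministic check: replay the solution steps
    against the initialized state and confirm every assertion passes."""
    return True

def compose_tasks(groups):
    """Cartesian product across subtask groups, concatenating each stage."""
    for combo in itertools.product(*groups):
        task = {stage: [s for sub in combo for s in sub[stage]]
                for stage in ("init", "solution", "assert")}
        if verify(task):            # keep only deterministically verified tasks
            yield task

print(len(list(compose_tasks(groups))))  # 1 x 1 = 1 combination here
```

Each group above holds a single alternative for brevity; with the 15 Telecom groups the product grows to the 2,285 combinations cited earlier.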
4. User Simulator Characteristics and Constraints
The user simulator is an LLM-driven function-calling agent constrained by the following design principles:
- Only user-DB API tools are accessible; all responses are in human-readable form.
- The simulator is prompted to proceed stepwise through the task sequence, without anticipatory planning.
- Tool invocations by the simulator are permitted only upon explicit agent request or when needed to construct a reply.
- All decisions are grounded in the present state.
This tightly constrained design reduces behavioral variability. In the Telecom domain, the user simulator showed a 16% total error rate (6% of it critical), a substantial accuracy improvement over comparable user simulators in prior single-control environments, which yielded 40–47% error rates.
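A sketch of the tool-gating constraint, assuming a simple message-history format; in the benchmark the LLM decides via function calling, so the literal string matching below is only a crude stand-in:

```python
def user_turn(history: list, llm_reply, user_tools: dict) -> dict:
    """Gate the simulated user's tool use: invoke a tool only when the agent's
    last message explicitly names it; otherwise reply in natural language."""
    last_agent_msg = next(m["text"] for m in reversed(history)
                          if m["role"] == "agent")
    for name, tool in user_tools.items():
        if name in last_agent_msg:                  # explicit agent request
            return {"type": "tool_call", "name": name, "result": tool()}
    # No request: answer from the present state, with no anticipatory planning.
    return {"type": "message", "text": llm_reply(history)}
```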
5. Evaluation Protocol and Metrics
τ²-Bench extends the pass^k metric family to dual-control evaluation. For each task, $k$ independent temperature-0 runs are conducted.
A task is deemed solved in a given run when all final-state assertions return true (a minimal estimator sketch follows the list below). Granular performance breakdowns are also computed, including:
- Action matching: whether required calls occurred,
- Status assertions: correctness of intermediate user/system status,
- Natural-language assertions: accuracy of state communication.
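These run-level outcomes feed the pass^k estimate. A minimal estimator under the standard combinatorial formulation (assumed here; the paper's exact estimator may differ in detail):

```python
from math import comb

def pass_hat_k(results: list, k: int) -> float:
    """For each task with c successful runs out of n, C(c, k) / C(n, k) is the
    probability that k runs sampled without replacement all succeed; the
    benchmark-level pass^k averages this over tasks."""
    return sum(comb(c, k) / comb(n, k) for c, n in results) / len(results)

# Two tasks, four temperature-0 runs each: solved 4/4 and 2/4 respectively.
print(pass_hat_k([(4, 4), (2, 4)], k=2))  # (1.0 + 1/6) / 2 ≈ 0.583
```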
Three ablation settings permit detailed analysis:
- Base (dual-control): both players operate their own tools,
- Solo: the agent controls all tools, with the user stubbed out, isolating pure reasoning/tool use,
- GT-plan: the agent is provided a complete correct solution plan and must only coordinate execution.
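One way to encode these three settings as a configuration switch; the helper and field names are hypothetical, not the benchmark's API:

```python
from enum import Enum

class Ablation(Enum):
    BASE = "base"        # dual control: each player keeps its own tools
    SOLO = "solo"        # agent absorbs the user's tools; user is stubbed out
    GT_PLAN = "gt-plan"  # agent receives the full correct plan up front

def configure(mode: Ablation, agent_tools: dict, user_tools: dict, plan=None):
    """Map an ablation setting to an episode configuration (illustrative)."""
    if mode is Ablation.SOLO:
        return {"agent_tools": {**agent_tools, **user_tools},
                "user_simulator": False, "plan": None}
    return {"agent_tools": agent_tools, "user_simulator": True,
            "plan": plan if mode is Ablation.GT_PLAN else None}
```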
Experimental results demonstrate significant performance degradation when agents transition from single-control to dual-control settings: for example, GPT-4.1's pass rate drops from approximately 74% (retail/airline) to 34% (Telecom). The solo-to-base transition alone costs GPT-4.1 18% in pass rate and o4-mini 25%. These results isolate the increased demands of effective communication and decentralized control.
6. Domain Coverage, Complexity Control, and Implications
The procedural generation of diverse, verifiable tasks enables systematic exploration across domain variations and solution complexities. The benchmark quantifies not only pure reasoning competency but also the additional challenges introduced by collaborative user guidance. The compositional approach guarantees task correctness and coverage, while ablations distinguish failures by causal mechanism—reasoning errors (missed or incorrect tool calls) versus communication/coordination errors (failure to instruct or synchronize user actions).
A plausible implication is that methods excelling in single-control environments may not generalize to collaborative problem-solving domains without architectural or training adaptation to handle the nonlinear interaction of decentralized tool use, state inference, and instruction. τ²-Bench thus provides a rigorously controlled environment for advancing research in multi-agent conversational scenarios with real-world analogs in technical support and collaborative human-computer interaction (Barres et al., 9 Jun 2025).