τ²-Bench: Dual-Control Conversational AI
- τ²-Bench is a dual-control conversational AI benchmark that models decentralized environments where an agent and a user simulator independently operate.
- It formalizes Telecom support as a Dec-POMDP with disjoint tool access, enabling precise state inference, task verification, and varied solution steps.
- The benchmark’s compositional task generation and ablation studies reveal substantial performance drops when agents shift from single to dual control, isolating communication and coordination as key challenges for future systems.
τ²-Bench defines a dual-control conversational AI benchmark in which both an AI agent and a user (represented by a simulator) share and independently manipulate a partially observed, dynamic environment. This structure addresses the limitations of prior single-control benchmarks, where only the agent could operate tools and the user played a passive role. τ²-Bench models domains such as Telecom technical support as decentralized partially observable Markov decision processes (Dec-POMDPs) with two players, providing precise control, rigorous verifiability, and compositional task diversity suitable for comprehensive agent evaluation (Barres et al., 9 Jun 2025).
1. Decentralized Dual-Control Environment Formalism
τ²-Bench operationalizes the environment as a Dec-POMDP defined by the tuple $\langle I, S, \{A_i\}, T, R, \{\Omega_i\}, O, \gamma \rangle$, with $I = \{\text{agent}, \text{user}\}$ for the agent and user roles. The state space $S = DB_A \times DB_U \times H$ is the product of the internal, “private” databases ($DB_A$ for the agent, $DB_U$ for the user) and an interaction history log ($H$), capturing all previous messages, tool invocations, and observations.
Each action space $A_i$ is a disjoint union of natural-language message actions and tool-call APIs. For the agent, these tools (e.g., get_customer_by_id, enable services) modify or query the CRM or system state, whereas for the user, tools reflect phone-mocking capabilities such as toggling airplane mode or querying network status. Only one player acts per turn. The transition function $T$ produces the next state and player-specific observations (tool feedback, messages). The global reward $R$ assigns $1$ if all user-facing assertions are satisfied upon episode termination and $0$ otherwise. The discount $\gamma$ is set to $1$, framing each conversation as a finite-horizon episode.
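A minimal Python sketch of this episode loop, assuming strict turn alternation; State, run_episode, and the action/policy signatures are illustrative assumptions, not the benchmark's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class State:
    """Global Dec-POMDP state: two private DBs plus the shared history log."""
    agent_db: dict                               # agent-private CRM/service state
    user_db: dict                                # user-private device state
    history: list = field(default_factory=list)  # messages, tool calls, observations

# Illustrative signatures: an action maps state -> (observation, done);
# a policy maps the shared history -> the next action for its player.
Action = Callable[[State], tuple]
Policy = Callable[[list], Action]

def run_episode(state: State, agent: Policy, user: Policy,
                assertions: list, max_turns: int = 40) -> float:
    """Players alternate turns; the global reward is 1 iff every final-state
    assertion holds at termination (finite horizon, gamma = 1)."""
    players = [("agent", agent), ("user", user)]
    for t in range(max_turns):
        role, policy = players[t % 2]
        action = policy(state.history)   # emit a message or invoke a tool
        obs, done = action(state)        # transition: may mutate the actor's own DB
        state.history.append((role, obs))
        if done:
            break
    return 1.0 if all(check(state) for check in assertions) else 0.0
```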
2. Agent and User Tooling Paradigms
The benchmark strictly partitions tool access between the agent and the user. Agent tools are administrative APIs on CRM or telecom services, enabling state changes relevant to support workflows. User tools simulate actions a real user could perform on their device (e.g., toggling settings, retrieving status). Each player, on their action turn, may issue a message or invoke a tool call operating on their respective private database and immediately receive the resulting observation. This model necessitates robust state inference and coordination: agents must not only execute direct interventions but also elicit and guide user actions as required by the distributed control setting.
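A minimal sketch of this partition, assuming simple dict-backed registries; get_customer_by_id and toggle_airplane_mode appear in the text above, while the registry layout and the invoke helper are hypothetical:

```python
def get_customer_by_id(state, customer_id):
    """Agent tool: query the CRM side of the state."""
    return state.agent_db["customers"].get(customer_id)

def toggle_airplane_mode(state):
    """User tool: flip a device setting and report the new value."""
    state.user_db["airplane_mode"] = not state.user_db["airplane_mode"]
    return state.user_db["airplane_mode"]

AGENT_TOOLS = {"get_customer_by_id": get_customer_by_id}
USER_TOOLS = {"toggle_airplane_mode": toggle_airplane_mode}

def invoke(role: str, name: str, state, *args):
    """Each player may only call tools in its own registry; the resulting
    observation is returned to that player alone."""
    registry = AGENT_TOOLS if role == "agent" else USER_TOOLS
    if name not in registry:
        raise PermissionError(f"{role!r} has no access to tool {name!r}")
    return registry[name](state, *args)
```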
3. Compositional Task Suite Generation
Task creation in τ²-Bench is grounded in programmatic composition of atomic "subtasks." Subtasks are grouped, with each group consisting of mutually exclusive or alternative primitives, formalized as ordered sequences of three components:
- Init: functions that initialize the environment (e.g., set initial device or service state),
- Solution: permitted solution steps (agent/user tool calls required for task resolution),
- Assertion: assertions checked on the final DB state for correctness.
Tasks are generated by taking Cartesian products across all subtask groups and concatenating their respective initialization, solution, and assertion stages. Only sequences passing a deterministic verification step are retained. To ensure coverage and difficulty control, tasks are balanced by the number of subtasks or solution steps. In the Telecom domain, 15 subtask groups yield 2,285 possible combinations; 114 are sampled to ensure distributed complexity across tasks comprising 2–9 subtasks. The table below summarizes the three components, and a minimal generation sketch follows it.
| Component | Description | Example |
|---|---|---|
| Init | Set up world state | set_airplane_mode(True) |
| Solution | Solution step (agent/user tool call) | toggle_airplane_mode |
| Assertion | Final-state assertion for reward | assert_service_status("connected") |
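A minimal sketch of the compositional generation described above, assuming a simple dict encoding of subtasks; the second group and the verify stub are hypothetical placeholders:

```python
import itertools

# Illustrative subtask encoding: each group lists mutually exclusive
# alternatives, and each alternative bundles (init, solution, assert) stages.
groups = [
    [{"init": ["set_airplane_mode(True)"],
      "solution": ["toggle_airplane_mode"],
      "assert": ["assert_service_status('connected')"]}],
    [{"init": ["set_data_roaming(False)"],          # hypothetical second group
      "solution": ["enable_data_roaming"],
      "assert": ["assert_data_connected()"]}],
]

def verify(task) -> bool:
    """Placeholder for the deterministic check: replay the solution steps
    against the initialized state and confirm every assertion passes."""
    return True

def compose_tasks(groups):
    """Cartesian product across subtask groups, concatenating each stage."""
    for combo in itertools.product(*groups):
        task = {stage: [s for sub in combo for s in sub[stage]]
                for stage in ("init", "solution", "assert")}
        if verify(task):            # keep only deterministically verified tasks
            yield task

print(len(list(compose_tasks(groups))))  # 1 x 1 = 1 combination here
```

Each group above holds a single alternative for brevity; with the 15 Telecom groups the product grows to the 2,285 combinations cited earlier.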
4. User Simulator Characteristics and Constraints
The user simulator is an LLM-driven function-calling agent constrained by the following design principles:
- Only user-DB API tools are accessible; all responses are in human-readable form.
- The simulator is prompted to proceed stepwise through the task sequence, without anticipatory planning.
- Tool invocations by the simulator are permitted only upon explicit agent request or when needed to construct a reply.
- All decisions are grounded in the present state.
This tightly constrained design reduces behavioral variability. In the Telecom domain, the user simulator showed a 16% total error rate (6% of it critical), a substantial accuracy improvement over comparable user simulators in prior single-control environments, which yielded 40–47% error rates.
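A sketch of the tool-gating constraint, assuming a simple message-history format; in the benchmark the LLM decides via function calling, so the literal string matching below is only a crude stand-in:

```python
def user_turn(history: list, llm_reply, user_tools: dict) -> dict:
    """Gate the simulated user's tool use: invoke a tool only when the agent's
    last message explicitly names it; otherwise reply in natural language."""
    last_agent_msg = next(m["text"] for m in reversed(history)
                          if m["role"] == "agent")
    for name, tool in user_tools.items():
        if name in last_agent_msg:                  # explicit agent request
            return {"type": "tool_call", "name": name, "result": tool()}
    # No request: answer from the present state, with no anticipatory planning.
    return {"type": "message", "text": llm_reply(history)}
```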
5. Evaluation Protocol and Metrics
τ²-Bench extends the pass^k metric family to dual-control evaluation. For each task, $k$ independent temperature-0 runs are conducted.
A task is deemed solved in a given run when all final-state assertions return true (a minimal estimator sketch follows the list below). Granular performance breakdowns are also computed, including:
- Action matching: whether required calls occurred,
- Status assertions: correctness of intermediate user/system status,
- Natural-language assertions: accuracy of state communication.
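These run-level outcomes feed the pass^k estimate. A minimal estimator under the standard combinatorial formulation (assumed here; the paper's exact estimator may differ in detail):

```python
from math import comb

def pass_hat_k(results: list, k: int) -> float:
    """For each task with c successful runs out of n, C(c, k) / C(n, k) is the
    probability that k runs sampled without replacement all succeed; the
    benchmark-level pass^k averages this over tasks."""
    return sum(comb(c, k) / comb(n, k) for c, n in results) / len(results)

# Two tasks, four temperature-0 runs each: solved 4/4 and 2/4 respectively.
print(pass_hat_k([(4, 4), (2, 4)], k=2))  # (1.0 + 1/6) / 2 ≈ 0.583
```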
Three ablation settings permit detailed analysis:
- Base (dual-control): both players operate their own tools,
- Solo: the agent controls all tools, with the user stubbed out, isolating pure reasoning/tool use,
- GT-plan: the agent is provided a complete correct solution plan and must only coordinate execution.
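One way to encode these three settings as a configuration switch; the helper and field names are hypothetical, not the benchmark's API:

```python
from enum import Enum

class Ablation(Enum):
    BASE = "base"        # dual control: each player keeps its own tools
    SOLO = "solo"        # agent absorbs the user's tools; user is stubbed out
    GT_PLAN = "gt-plan"  # agent receives the full correct plan up front

def configure(mode: Ablation, agent_tools: dict, user_tools: dict, plan=None):
    """Map an ablation setting to an episode configuration (illustrative)."""
    if mode is Ablation.SOLO:
        return {"agent_tools": {**agent_tools, **user_tools},
                "user_simulator": False, "plan": None}
    return {"agent_tools": agent_tools, "user_simulator": True,
            "plan": plan if mode is Ablation.GT_PLAN else None}
```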
Experimental results demonstrate significant performance degradation when agents transition from single-control to dual-control settings: for example, GPT-4.1's pass rate drops from approximately 74% (retail/airline) to 34% (Telecom). The solo-to-base transition alone costs GPT-4.1 18% in pass rate and o4-mini 25%. These results isolate the increased demands of effective communication and decentralized control.
6. Domain Coverage, Complexity Control, and Implications
The procedural generation of diverse, verifiable tasks enables systematic exploration across domain variations and solution complexities. The benchmark quantifies not only pure reasoning competency but also the additional challenges introduced by collaborative user guidance. The compositional approach guarantees task correctness and coverage, while ablations distinguish failures by causal mechanism—reasoning errors (missed or incorrect tool calls) versus communication/coordination errors (failure to instruct or synchronize user actions).
A plausible implication is that methods excelling in single-control environments may not generalize to collaborative problem-solving domains without architectural or training adaptation to handle the nonlinear interaction of decentralized tool use, state inference, and instruction. τ²-Bench thus provides a rigorously controlled environment for advancing research in multi-agent conversational scenarios with real-world analogs in technical support and collaborative human-computer interaction (Barres et al., 9 Jun 2025).