ToolACE-MT: Efficient Multi-Turn Dialogue Generation
- ToolACE-MT is a non-autoregressive iterative framework that constructs multi-turn agentic dialogue data through complete dialogue skeleton initialization followed by mask-and-fill refinement.
- It improves both efficiency and quality over traditional MAS methods, reducing API calls from 275k to 188k and raising offline verification pass rates from 61.1% to 72.3%.
- The framework’s three-stage process—initialization, iterative refinement, and offline verification—ensures coherent, efficient data generation for tool-augmented LLMs.
ToolACE-MT is a non-autoregressive iterative generation framework for constructing high-quality multi-turn agentic dialogue data, particularly for tool-augmented LLM scenarios. The framework enables efficient generation of complex multi-step, multi-turn conversational trajectories, supporting function calling, agentic planning, and dynamic user-agent exchanges. ToolACE-MT proceeds through a pipeline that includes a coarse-grained dialogue skeleton initialization, iterative refinement via mask-and-fill operations, and a hybrid offline verification step, yielding superior data quality and efficiency compared to conventional multi-agent simulation (MAS) approaches (Zeng et al., 18 Aug 2025).
1. Motivation and Comparative Advantages
Traditional simulation-based agentic dialogue data generation—typically the Multi-Agent Simulation (MAS) paradigm—grows conversations stepwise via autoregressive LLM calls. MAS is characterized by high computational overhead, as each agent turn (user, assistant, tool) triggers a separate LLM invocation. This leads to high latency and inflated API usage, constraining scalability for large datasets or real-time simulation. MAS also suffers from implicit task complexity (inability to prescribe subtask or turn structure in advance) and limited global context, resulting in inconsistent tool usage and factual drift during generation.
In contrast, ToolACE-MT employs a non-autoregressive, full-trajectory generation approach. By generating a structurally complete conversation skeleton in a single pass, ToolACE-MT can globally enforce task decomposition, control the number of subtasks, and explicitly introduce or suppress complexity independent of underlying model stochasticity. Quantitatively, ToolACE-MT reduced API calls from 275,000 (MAS) to 188,000 for 8,000 samples and improved the pass rate in offline verification from 61.1% (MAS) to 72.3% [(Zeng et al., 18 Aug 2025), §4.3.1].
2. Framework Architecture and Methodology
ToolACE-MT comprises three sequential stages:
- Coarse-Grained Initialization: Samples metadata such as the number of subtasks N, the tools per subtask, and the steps per subtask. The pipeline then synthesizes an initial conversational skeleton D = (t_1, t_2, ..., t_T), where t_1 is the initial user message and subsequent turns are assistant/tool responses, follow-up user messages, or tool outputs. This stage guarantees that all required workflow steps and tool calls are present and appropriately structured.
- Iterative Refinement: Alternates between complexity injection (introducing phenomena such as clarification requests, tool failures, or non-function turns) and "reasonability refinement" (masking and regenerating dialogue turns to enhance coherence and realism). Mask-and-fill operations are performed with explicit logging to prevent redundant modifications. Each refinement pass samples a set M of non-adjacent turns, masks them, and then either replaces them with LLM-generated alternatives or retains the higher-quality existing content, as determined by a judger LLM. This stage can be interpreted as locally optimizing the masked likelihood P(t_M | t_{\M}), i.e., the probability of the masked turns given the unmasked context, although ToolACE-MT is not end-to-end trainable as a neural NAT model [(Zeng et al., 18 Aug 2025), §3.4].
- Offline Verification: Applies rule-based checks (e.g., JSON or [func_name(params)] syntax, tool executability, repetition, detection of hallucinated parameters) and model-based semantic checks (decomposing logical consistency into sub-questions answered by specialized LLM "experts"). Failure to pass any sub-check results in the rejection of the generated trajectory.
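The three stages above can be sketched as a minimal orchestration loop. The Python below is purely illustrative: `init_skeleton`, `refine`, and `verify` are deterministic stand-ins for the LLM-driven steps described in the paper, not its implementation.

```python
import random

def init_skeleton(num_subtasks, rng):
    """Stage 1: coarse-grained initialization -- build a complete
    alternating user/assistant skeleton covering every subtask."""
    skeleton = []
    for i in range(num_subtasks):
        skeleton.append(("user", f"subtask-{i} request"))
        skeleton.append(("assistant", f"subtask-{i} tool call + reply"))
    return skeleton

def refine(skeleton, passes, rng):
    """Stage 2: iterative refinement -- pick turns and regenerate them
    (here a trivial stand-in for the LLM mask-and-fill step)."""
    for _ in range(passes):
        idx = rng.randrange(len(skeleton))
        role, text = skeleton[idx]
        skeleton[idx] = (role, text + " [refined]")
    return skeleton

def verify(skeleton):
    """Stage 3: offline verification -- reject trajectories that break
    the strict alternating turn structure."""
    roles = [role for role, _ in skeleton]
    return all(r1 != r2 for r1, r2 in zip(roles, roles[1:]))

def generate_trajectory(num_subtasks=3, passes=2, seed=0):
    rng = random.Random(seed)
    skeleton = init_skeleton(num_subtasks, rng)
    skeleton = refine(skeleton, passes, rng)
    return skeleton if verify(skeleton) else None
```

Because the whole skeleton exists before refinement begins, the loop makes only local, bounded edits, which is the source of the API-call savings over turn-by-turn MAS generation.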
3. Detailed Pipeline and Generation Process
The initialization step first constructs a POMDP-formalized task decomposition:
- S: state set;
- U = {u_1, ..., u_N}: user subtasks, with the subtask count N sampled from a preset range;
- A: assistant actions (tool calls and natural-language replies);
- O: observations (user messages or tool outputs).
For each subtask u_i, a trajectory τ_i is synthesized and then concatenated across all subtasks to yield the full dialogue D = τ_1 ⊕ ... ⊕ τ_N. Alternation of user/assistant turns is strictly enforced. This initial structure ensures coverage of all substantive action chains before local details are filled in.
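A minimal sketch of this initialization step, with illustrative metadata ranges (the paper's actual sampling distributions are not reproduced here); `sample_metadata` and `build_skeleton` are hypothetical helper names:

```python
import random

def sample_metadata(rng, max_subtasks=4, max_tools=3, max_steps=3):
    """Sample skeleton metadata: number of subtasks, plus tools and
    steps per subtask (ranges are illustrative, not the paper's)."""
    n = rng.randint(1, max_subtasks)
    return [{"tools": rng.randint(1, max_tools),
             "steps": rng.randint(1, max_steps)} for _ in range(n)]

def build_skeleton(metadata):
    """Concatenate per-subtask trajectories into one dialogue,
    enforcing strict user/assistant alternation."""
    dialogue = []
    for i, sub in enumerate(metadata):
        dialogue.append({"role": "user", "content": f"<subtask {i} request>"})
        for step in range(sub["steps"]):
            dialogue.append({"role": "assistant",
                             "content": f"<subtask {i}, step {step}: "
                                        f"{sub['tools']} tool call(s)>"})
            if step < sub["steps"] - 1:
                dialogue.append({"role": "user",
                                 "content": f"<subtask {i} follow-up>"})
    return dialogue
```

Every subtask trajectory begins with a user turn and ends with an assistant turn, so concatenation preserves alternation across subtask boundaries.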
Iterative refinement begins with complexity injection. For a chosen injection type (e.g., clarification), a turn at a sampled index is masked and replaced via an LLM-invoked mask-and-extend operation. The model may also inject additional turns to simulate realistic human-AI interaction phenomena. Successive refinement selects further non-adjacent sets of turns, masks and fills them, and uses an LLM-based judger to adopt the most coherent candidate.
The refinement schedule alternates between complexity and reasonability passes according to fixed parameters (e.g., 5 passes reasonability, 1–3 injections). Probabilities for turn selection are adaptively halved upon selection to ensure turn coverage.
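The adaptive turn-selection schedule (non-adjacent sampling with probability halving) can be sketched as follows; the function name and the exclusion of neighbouring indices are illustrative assumptions consistent with the description above:

```python
import random

def select_masks(num_turns, weights, k, rng):
    """Select up to k non-adjacent turn indices with probability
    proportional to `weights`; halve the weight of each selected turn
    so later passes favour turns not yet refined."""
    selected = []
    candidates = list(range(num_turns))
    while candidates and len(selected) < k:
        total = sum(weights[i] for i in candidates)
        r = rng.uniform(0, total)
        acc = 0.0
        for i in candidates:
            acc += weights[i]
            if r <= acc:
                choice = i
                break
        selected.append(choice)
        weights[choice] /= 2.0  # adaptive halving for turn coverage
        # drop the choice and its neighbours to keep selections non-adjacent
        candidates = [i for i in candidates if abs(i - choice) > 1]
    return sorted(selected)
```

Halving the selection weight after each pick biases subsequent refinement passes toward turns that have not yet been touched, which is the coverage mechanism the schedule relies on.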
Offline verification both checks for syntactic and semantic acceptability and decomposes trajectory-wide questions into atomic subproblems, answered by secondary LLMs. Only dialogues passing majority checks are retained for training.
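The rule-based layer of this verification can be sketched against a hypothetical tool registry; the `[func_name(params)]` syntax check and hallucinated-parameter detection mirror the checks named above, while the schema contents are invented for illustration:

```python
import re

# Hypothetical tool registry mapping tool names to allowed parameters.
TOOL_SCHEMA = {"get_weather": {"city", "unit"},
               "book_flight": {"origin", "dest", "date"}}

CALL_RE = re.compile(r"^\[(\w+)\((.*)\)\]$")

def check_tool_call(text):
    """Rule-based checks on a single [func_name(params)] string:
    syntax, tool existence, and hallucinated-parameter detection."""
    m = CALL_RE.match(text.strip())
    if not m:
        return False, "bad syntax"
    name, params = m.group(1), m.group(2)
    if name not in TOOL_SCHEMA:
        return False, f"unknown tool: {name}"
    for kv in filter(None, (p.strip() for p in params.split(","))):
        key = kv.split("=", 1)[0].strip()
        if key not in TOOL_SCHEMA[name]:
            return False, f"hallucinated parameter: {key}"
    return True, "ok"
```

In the full pipeline, a trajectory failing any such sub-check (or any model-based semantic sub-question) is rejected outright.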
4. Experimental Results and Comparative Performance
ToolACE-MT was empirically evaluated against standard MAS on a series of benchmarks, including BFCL-v3 (Berkeley Function Calling Leaderboard), τ-Bench (retail and airline domains), and ACEBench. Training employed LLaMA-3.1-8B-Inst with LoRA adaptation, and both generation pipelines used GPT-4o-2024-11-20 as the generation model. 8,000 samples were generated per method.
Key metrics are summarized in the table:
| Benchmark | Metric | ToolACE-MT | MAS | LLaMA-3.1-8B (raw) |
|---|---|---|---|---|
| BFCL-v3 | Multi-Turn Acc | 40.25% | 31.38% | 9.25% |
| BFCL-v3 | Single-Turn | 84.94% | 80.29% | - |
| ACEBench | Multi-Turn | 51.0% | 48.0% | - |
| ACEBench | Agent PA | 34.0% | 15.0% | - |
| τ-Bench | Pass@1 | 20.6% | 15.9% | - |
| Offline Pass Rate | % Verified | 72.3% | 61.1% | - |
| API Calls | Count | 188k | 275k | - |
Ablation studies confirmed that removing offline verification (-7.75 percentage points, multi-turn accuracy) or iterative refinement (-19.37 pp) substantially degrades quality. This suggests both stages are synergistically critical for achieving high-quality data [(Zeng et al., 18 Aug 2025), §4.2.1, 4.2.2].
Scaling studies further demonstrate that ToolACE-MT enables improved multi-turn performance for models of various backbone sizes (3B, 7B), and that iterative refinement plus verification together are necessary to close the gap to MAS-generated data as model size increases.
5. Data Efficiency, Task Completion, and Practical Gains
ToolACE-MT demonstrates markedly superior data and task efficiency. Across 8,000 instances, it realized a 32% reduction in API calls, with a concomitantly higher trajectory verification pass rate. On τ-Bench, models fine-tuned with ToolACE-MT data completed tasks in an average of 13.7 turns, versus 15.4 for MAS-trained models [(Zeng et al., 18 Aug 2025), §4.3.2].
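The headline efficiency figure follows directly from the reported call counts:

```python
# Reported API-call counts for generating 8,000 samples.
mas_calls, tmt_calls = 275_000, 188_000
reduction = (mas_calls - tmt_calls) / mas_calls
print(f"API-call reduction: {reduction:.1%}")  # about 31.6%, i.e. roughly 32%
```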
A plausible implication is that the reduced length and higher concentration of meaningful actions in ToolACE-MT data leads to more concise agentic completion policies in downstream LLMs, supporting improved real-world utility.
6. Generalization, Limitations, and Future Directions
ToolACE-MT's pipeline is adaptable to any tool-augmented LLM scenario demanding multi-turn, multi-step interactions, including robotic control, API orchestration, and scientific workflow simulation. Complexity injection templates can be reparameterized for domain-specific concerns, while offline verification may be extended with external validators such as type checkers or domain simulators.
Identified limitations include:
- Reliance on strong LLMs for data generation; pass rates degrade with weaker models [(Zeng et al., 18 Aug 2025), Table 4].
- Absence of a formal, end-to-end differentiable learning objective; currently, the iterative stages are procedural rather than trainable.
- Hand-crafted verification sub-questions and thresholds potentially miss corner-case inconsistencies.
Future directions proposed are the automation of verification parameters (via human-in-the-loop or reinforcement learning), integration of learned non-autoregressive (NAT) decoders for joint skeleton and refinement learning, and curriculum-based progressive data generation strategies [(Zeng et al., 18 Aug 2025), §7].
7. Significance and Implications
ToolACE-MT establishes a new paradigm for agentic data construction by integrating global, skeleton-based initialization, controlled complexity injection, iterative refinement, and rigorous verification. This combination enables efficient, high-quality, and generalizable data generation for training tool-augmented LLMs, advancing simulation and deployment for agentic reasoning tasks. The empirical results demonstrate both improved model performance and reduced resource consumption, suggesting broad relevance for the development of scalable AI agents requiring dynamic, tool-using conversational abilities (Zeng et al., 18 Aug 2025).