ToolACE-MT: Efficient Multi-Turn Dialogue Generation

Updated 8 February 2026
  • ToolACE-MT is a non-autoregressive iterative framework that constructs multi-turn agentic dialogue data via a complete dialogue skeleton initialization and mask-and-fill refinement.
  • It improves efficiency over traditional MAS methods, reducing API calls from 275k to 188k while raising the offline verification pass rate from 61.1% to 72.3%.
  • The framework’s three-stage process—initialization, iterative refinement, and offline verification—ensures coherent, efficient data generation for tool-augmented LLMs.

ToolACE-MT is a non-autoregressive iterative generation framework for constructing high-quality multi-turn agentic dialogue data, particularly for tool-augmented LLM scenarios. The framework enables efficient generation of complex multi-step, multi-turn conversational trajectories, supporting function calling, agentic planning, and dynamic user-agent exchanges. ToolACE-MT proceeds through a pipeline that includes a coarse-grained dialogue skeleton initialization, iterative refinement via mask-and-fill operations, and a hybrid offline verification step, yielding superior data quality and efficiency compared to conventional multi-agent simulation (MAS) approaches (Zeng et al., 18 Aug 2025).

1. Motivation and Comparative Advantages

Traditional simulation-based agentic dialogue data generation—typically the Multi-Agent Simulation (MAS) paradigm—grows conversations stepwise via autoregressive LLM calls. MAS is characterized by high computational overhead, as each agent turn (user, assistant, tool) triggers a separate LLM invocation. This leads to high latency and inflated API usage, constraining scalability for large datasets or real-time simulation. MAS also suffers from implicit task complexity (inability to prescribe subtask or turn structure in advance) and limited global context, resulting in inconsistent tool usage and factual drift during generation.
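The cost asymmetry can be illustrated with a toy call-count model (a minimal sketch, not the paper's accounting; the function names and per-pass figures are assumptions):

```python
# Toy model of LLM-call costs: MAS invokes the model once per generated turn,
# so calls grow linearly with dialogue length, while a skeleton-first pipeline
# pays one skeleton call plus a bounded number of mask-and-fill calls.

def mas_call_count(n_turns: int) -> int:
    """One LLM invocation per user/assistant/tool turn under MAS."""
    return n_turns

def skeleton_call_count(n_refinement_passes: int, fills_per_pass: int) -> int:
    """One skeleton generation plus refinement fills, independent of length."""
    return 1 + n_refinement_passes * fills_per_pass

# A 20-turn dialogue with 5 refinement passes of 2 fills each:
print(mas_call_count(20))         # 20
print(skeleton_call_count(5, 2))  # 11
```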

In contrast, ToolACE-MT employs a non-autoregressive, full-trajectory generation approach. By generating a structurally complete conversation skeleton in a single pass, ToolACE-MT can globally enforce task decomposition, control the number of subtasks, and explicitly introduce or suppress complexity independent of underlying model stochasticity. Quantitatively, ToolACE-MT reduced API calls from 275,000 (MAS) to 188,000 for 8,000 samples and improved the pass rate in offline verification from 61.1% (MAS) to 72.3% [(Zeng et al., 18 Aug 2025), §4.3.1].

2. Framework Architecture and Methodology

ToolACE-MT comprises three sequential stages:

  1. Coarse-Grained Initialization: Samples metadata such as the number of subtasks m, tools per subtask, and steps per subtask. The pipeline then synthesizes an initial conversational skeleton C^{(0)} = (o^0, a^1, o^1, …, a^n), where o^0 is the initial user message, a^i are assistant/tool responses, and o^i are subsequent user messages or tool outputs. This stage guarantees that all required workflow steps and tool calls are present and appropriately structured.
  2. Iterative Refinement: Alternates between complexity injection (introducing phenomena such as clarification requests, tool failures, or non-function turns) and "reasonability refinement" (masking and regenerating dialogue turns to enhance coherence and realism). Mask-and-fill operations are performed with explicit logging to prevent redundant modifications. Each refinement pass samples a set of non-adjacent turns M, masks them, and then either replaces them with LLM-generated alternatives or retains the higher-quality existing content, as determined by a judger LLM. This stage can be interpreted as locally optimizing the (masked) likelihood:

\mathcal{L} = -\sum_{i \in M} \log P_{\text{LLM}}(\text{turn}_i \mid \text{context}_{\neg M})

although ToolACE-MT is not end-to-end trainable as a neural NAT model [(Zeng et al., 18 Aug 2025), §3.4].

  3. Offline Verification: Applies rule-based checks (e.g., JSON or [func_name(params)] syntax, tool executability, repetition, detection of hallucinated parameters) and model-based semantic checks (decomposing logical consistency into sub-questions answered by specialized LLM "experts"). Failure to pass any sub-check results in the rejection of the generated trajectory.
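Taken together, the three stages can be sketched as a short pipeline. The following is a minimal runnable illustration, not the paper's implementation; the turn representation, the `regenerate` and `judge` callables, and the rule checks are hypothetical stand-ins for the LLM-backed components.

```python
# Minimal sketch of the three-stage pipeline: skeleton initialization,
# mask-and-fill refinement with a judger, and rule-based verification.

def init_skeleton(n_turns: int) -> list:
    """Stage 1: emit a structurally complete, role-alternating skeleton."""
    return [{"role": "user" if i % 2 == 0 else "assistant",
             "text": f"<turn {i}>"} for i in range(n_turns)]

def mask_and_fill(dialogue, idx, regenerate, judge):
    """Stage 2: regenerate one masked turn, keep the judger-preferred version."""
    candidate = dict(dialogue[idx], text=regenerate(dialogue, idx))
    if judge(candidate, dialogue[idx]):
        dialogue[idx] = candidate

def verify(dialogue) -> bool:
    """Stage 3: rule-based checks; here role alternation and non-empty text."""
    alternates = all(a["role"] != b["role"] for a, b in zip(dialogue, dialogue[1:]))
    return alternates and all(t["text"] for t in dialogue)

dialogue = init_skeleton(6)
mask_and_fill(dialogue, 2,
              regenerate=lambda d, i: "refined turn",
              judge=lambda cand, old: True)  # toy judger always accepts
print(verify(dialogue))  # True
```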

3. Detailed Pipeline and Generation Process

The initialization step first constructs a POMDP-formalized task decomposition:

  • S: state set;
  • U: user subtasks, sampled as {u^1, …, u^m}, with m ∈ [2, 5];
  • A: assistant tool calls and natural-language replies;
  • O: observations (user messages or tool outputs).

For each subtask u^t, a trajectory C_t = (o_t^0, a_t^1, o_t^1, …, a_t^{s_t}, o_t^{s_t}) is synthesized; the trajectories are then concatenated across all m subtasks to yield the full dialogue C. Alternation of user/assistant turns is strictly enforced. This initial structure ensures coverage of all substantive action chains before local details are filled in.
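A minimal sketch of this initialization, assuming a toy turn representation and seeded sampling (the helper name and per-subtask step range are illustrative, not the paper's values):

```python
import random

def init_skeleton(rng: random.Random) -> list:
    """Sample m in [2, 5] subtasks and concatenate their trajectories."""
    m = rng.randint(2, 5)
    dialogue = []
    for t in range(m):
        dialogue.append(("user", f"u{t}"))               # o_t^0: opens subtask t
        for s in range(rng.randint(1, 3)):               # s_t steps in subtask t
            dialogue.append(("assistant", f"a{t}.{s}"))  # tool call or NL reply
            dialogue.append(("tool", f"o{t}.{s}"))       # observation / output
    return dialogue

skeleton = init_skeleton(random.Random(0))
# The skeleton always opens with a user message and covers every subtask.
```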

Iterative refinement begins with complexity injection. For injection type σ (e.g., clarification), a turn X at index t is masked and replaced via an LLM-invoked mask-and-extend operation. The model may also inject additional turns to simulate realistic human-AI interaction phenomena. Successive refinement selects further non-adjacent sets M of turns, masks and fills them, and uses an LLM-based judger to adopt the most coherent candidate.

The refinement schedule alternates between complexity and reasonability passes according to fixed parameters (e.g., five reasonability passes and 1–3 injections). Selection probabilities p_i for each turn are adaptively halved upon selection to ensure coverage of all turns.
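The halving rule can be illustrated with a small sketch. The initial probability of 0.8 and the non-adjacency test are assumptions made for this example:

```python
import random

# Each turn i starts with selection probability p_i; p_i is halved whenever
# turn i is chosen, so later passes favour turns not yet refined.

def select_turns(probs: list, rng: random.Random) -> list:
    """Pick a non-adjacent set of turn indices, halving their probabilities."""
    chosen = []
    for i, p in enumerate(probs):
        if rng.random() < p and (not chosen or i - chosen[-1] > 1):
            chosen.append(i)
            probs[i] /= 2  # halve upon selection to spread coverage
    return chosen

rng = random.Random(0)
probs = [0.8] * 6
first_pass = select_turns(probs, rng)
# Selected turns drop to 0.4; unselected turns keep probability 0.8.
```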

Offline verification both checks for syntactic and semantic acceptability and decomposes trajectory-wide questions into atomic subproblems, answered by secondary LLMs. Only dialogues passing majority checks are retained for training.
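The rule-based side of verification can be sketched as follows, assuming tool calls are serialized as JSON objects with "name" and "params" keys (a format assumption for this example; the paper also accepts a [func_name(params)] syntax):

```python
import json

def check_tool_call(raw: str, known_tools: dict) -> bool:
    """Syntax check, tool existence, and hallucinated-parameter detection."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False                    # malformed JSON
    if call.get("name") not in known_tools:
        return False                    # call to a nonexistent tool
    allowed = known_tools[call["name"]]
    # Reject any parameter the tool's schema does not declare.
    return all(p in allowed for p in call.get("params", {}))

tools = {"get_weather": {"city", "unit"}}
print(check_tool_call('{"name": "get_weather", "params": {"city": "Oslo"}}', tools))  # True
print(check_tool_call('{"name": "get_weather", "params": {"date": "today"}}', tools)) # False
```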

4. Experimental Results and Comparative Performance

ToolACE-MT was empirically evaluated against standard MAS on a series of benchmarks, including BFCL-v3 (Berkeley Function Calling Leaderboard), τ-Bench (retail and airline domains), and ACEBench. Training employed LLaMA-3.1-8B-Inst with LoRA adaptation, and both generation methods used GPT-4o-2024-11-20 as the underlying generator; 8,000 samples were generated per method.

Key metrics are summarized in the table:

| Benchmark         | Metric         | ToolACE-MT | MAS    | LLaMA-3.1-8B (raw) |
|-------------------|----------------|------------|--------|--------------------|
| BFCL-v3           | Multi-Turn Acc | 40.25%     | 31.38% | 9.25%              |
| BFCL-v3           | Single-Turn    | 84.94%     | 80.29% | -                  |
| ACEBench          | Multi-Turn     | 51.0%      | 48.0%  | -                  |
| ACEBench          | Agent PA       | 34.0%      | 15.0%  | -                  |
| τ-Bench           | Pass@1         | 20.6%      | 15.9%  | -                  |
| Offline Pass Rate | % Verified     | 72.3%      | 61.1%  | -                  |
| API Calls         | Count          | 188k       | 275k   | -                  |

Ablation studies confirmed that removing offline verification (-7.75 percentage points, multi-turn accuracy) or iterative refinement (-19.37 pp) substantially degrades quality. This suggests both stages are synergistically critical for achieving high-quality data [(Zeng et al., 18 Aug 2025), §4.2.1, 4.2.2].

Scaling studies further demonstrate that ToolACE-MT enables improved multi-turn performance for models of various backbone sizes (3B, 7B), and that iterative refinement plus verification together are necessary to close the gap to MAS-generated data as model size increases.

5. Data Efficiency, Task Completion, and Practical Gains

ToolACE-MT demonstrates markedly superior data and task efficiency. Across 8,000 instances, it realized a 32% reduction in API calls, with a concomitantly higher trajectory verification pass rate. On τ-Bench, models fine-tuned with ToolACE-MT data completed tasks in an average of 13.7 turns, versus 15.4 for MAS-trained models [(Zeng et al., 18 Aug 2025), §4.3.2].

A plausible implication is that the reduced length and higher concentration of meaningful actions in ToolACE-MT data leads to more concise agentic completion policies in downstream LLMs, supporting improved real-world utility.

6. Generalization, Limitations, and Future Directions

ToolACE-MT's pipeline is adaptable to any tool-augmented LLM scenario demanding multi-turn, multi-step interactions, including robotic control, API orchestration, and scientific workflow simulation. Complexity injection templates can be reparameterized for domain-specific concerns, while offline verification may be extended with external validators such as type checkers or domain simulators.

Identified limitations include:

  • Reliance on strong LLMs for data generation; pass rates degrade with weaker models [(Zeng et al., 18 Aug 2025), Table 4].
  • Absence of a formal, end-to-end differentiable learning objective; currently, the iterative stages are procedural rather than trainable.
  • Hand-crafted verification sub-questions and thresholds potentially miss corner-case inconsistencies.

Future directions proposed are the automation of verification parameters (via human-in-the-loop or reinforcement learning), integration of learned non-autoregressive (NAT) decoders for joint skeleton and refinement learning, and curriculum-based progressive data generation strategies [(Zeng et al., 18 Aug 2025), §7].

7. Significance and Implications

ToolACE-MT establishes a new paradigm for agentic data construction by integrating global, skeleton-based initialization, controlled complexity injection, iterative refinement, and rigorous verification. This combination enables efficient, high-quality, and generalizable data generation for training tool-augmented LLMs, advancing simulation and deployment for agentic reasoning tasks. The empirical results demonstrate both improved model performance and reduced resource consumption, suggesting broad relevance for the development of scalable AI agents requiring dynamic, tool-using conversational abilities (Zeng et al., 18 Aug 2025).
