ToolACE-MT: Non-Autoregressive Iterative Generation
- The paper introduces a three-stage non-autoregressive iterative generation pipeline that replaces expensive autoregressive simulations with a coarse initialization, iterative refinement, and offline verification process.
- It leverages GPT-4o for coarse-grained initialization and employs masking, tool-aware injections, and judge-assisted fills to enhance semantic polish at the turn level.
- Empirical results demonstrate significant improvements, including a 40.25% multi-turn accuracy and a 72.3% pass rate, while reducing API call costs compared to traditional MAS pipelines.
Non-Autoregressive Iterative Generation (ToolACE-MT) is a data generation and modeling framework designed to construct high-quality, multi-turn, agentic dialogues for LLMs engaged in complex tool-augmented interactions. It replaces traditional, expensive autoregressive multi-agent simulation pipelines with a three-stage non-autoregressive process—coarse-grained initialization, iterative refinement, and offline verification. The result is increased data quality, generalizability, and efficiency for downstream agentic and function-calling LLM benchmarks, with strong empirical advantages in accuracy and computational cost (Zeng et al., 18 Aug 2025).
1. Formal Definition and Overall Pipeline
Let be a tool pool and a sequence of user subtasks. The generation objective is to synthesize a dialogue , where are assistant turns (possibly with function calls) and are user or tool outputs. Unlike autoregressive methods that maximize
ToolACE-MT factorizes at the turn level, proceeding as: where is a coarse skeleton, is the dialogue after refinement steps, and 0, 1 are LLM-driven generation functions. The final output 2 is accepted only if it passes offline verification.
The generative process is formalized as: 3 with a budget on the number of refinement steps 4. In practice, explicit joint training is not performed; each phase employs a prompted LLM instance (Zeng et al., 18 Aug 2025).
2. Three-Stage Iterative Generation Procedure
2.1 Coarse-Grained Initialization
Algorithmically:
- Select subtasks 5, each annotated with tools 6 and step count 7.
- For 8:
- 9 user requests 0 (parameters abstract).
- Generate sub-trajectory 1 in parallel; for each function call in 2, emit associated tool output 3.
- Concatenate all 4 to produce skeleton 5.
This stage imposes role alternation but does not enforce semantic polish or coherence.
2.2 Iterative Refinement
Given skeleton 6, perform 7 alternating refinement steps:
- Complexity Injection (even 8): Select an injection type 9 (clarify, tool-aware, error, non-func). Mask a random turn 0, then apply a fill-extend operation.
- Reasonability Refinement (odd 1): Mask a small set 2 of non-adjacent turns. For each, generate candidate fills; a “judger” LLM model accepts/rejects fills, updating 3.
4
Iteration halts at 5 steps or once all turns have been refined at least once. Each step only maximizes a stepwise plausibility score, not a global likelihood (Zeng et al., 18 Aug 2025).
2.3 Offline Verification
Candidate 6 undergoes:
- Rule-based checks: Format compliance (e.g., JSON, function-call syntax), tool executability, and non-repetition of identifiers.
- Model-based checks: Decompose coherence into sub-questions; prompt LLM “experts” to answer. Accept if both all rules pass and most sub-questions are affirmed.
This stage is critical for semantic consistency, syntactic validity, and agentic action correctness.
3. Model Architectures and Training Regimen
- Data Generation: All three stages employ GPT-4o-2024-11-20, prompted as a black-box LLM; GPT-4o-mini exhibits elevated hallucination and failure rates.
- Fine-Tuning: Downstream function-calling models use LLaMA3.1-8B-Inst, fine-tuned via LoRA (rank 16, 7 32), cross-entropy loss, batch size 64, learning rate 8, cosine schedule with 10% warmup.
- Masking: Reasonability refinement employs span-masking at the full-turn granularity (not token-level) for surrogate masked infilling.
No parameters are updated in the data-generation LLMs; all adaptation is via prompt engineering and LoRA for downstream models (Zeng et al., 18 Aug 2025).
4. Empirical Evaluation and Results
ToolACE-MT and baseline Multi-Agent Simulation (MAS) pipelines were both used to synthesize 8,000 dialogues with identical tool pools and verification.
Benchmark Performance
| Benchmark | MAS (LLaMA3.1-8B) | ToolACE-MT | GPT-4o Upper Bound |
|---|---|---|---|
| BFCL-v3 Multi-Turn Accuracy (%) | 31.38 | 40.25 | 50.00 |
| ACEBench Agent Process Accuracy | 15.0 | 34.0 | — |
| τ-Bench Overall pass@1 (%) | 15.9 | 20.6 | — |
Additional metrics: MAS yields ≈275K API calls for 8K valid samples (61.1% pass rate); ToolACE-MT requires ≈188K (72.3% pass rate). On τ-Bench, average assistant turns per task are 13.7 (ToolACE-MT) vs 15.4 (MAS). These results establish significant improvements in both efficiency and quality (Zeng et al., 18 Aug 2025).
5. Ablation Analysis and System Limitations
Ablation Findings
- Offline Verification: Removing it degrades BFCL-v3 from 40.25% to 32.50% (–7.75 pp).
- No Iterative Refinement: BFCL-v3 drops to 20.88%, showing that semantic polishing is essential.
- Refinement Scaling: Increasing the number of reasonability passes reduces discrepancy between with/without verification, but even 30 passes cannot fully eliminate the necessity for offline checks.
- Backbone Independence: Other model families (Qwen2.5-7B, Qwen3-8B) gain ~6 pp with ToolACE-MT versus MAS.
- Model Scaling: Gains from fine-tuning ToolACE-MT outputs are more pronounced as model size increases (0.5B→7B).
Noted Limitations
- Strong performance hinges on use of a high-end LLM (GPT-4o); smaller or less capable models yield incoherent output and reduce verified sample rates.
- The framework lacks an end-to-end differentiable objective; system quality crucially depends on LLM prompt design and the strength of offline filtering.
- There is an inherent trade-off between multi-turn caution (high clarification rates) and “live” single-turn efficiency (Zeng et al., 18 Aug 2025).
6. Context Within Non-Autoregressive and Iterative Generation Paradigms
ToolACE-MT adapts iterative non-autoregressive sequence modeling (see Mask-Predict, Levenshtein Transformer, NAT with iterative refinement (Lee et al., 2018)) for the agentic, structured multi-turn generation setting. Like Gu et al. (Lee et al., 2018), ToolACE-MT leverages iterative refinement to bridge the quality gap to autoregressive models while drastically reducing inference cost; but it is unique in operating at the turn rather than token level and in introducing agentic action modeling.
Comparable approaches in non-autoregressive sequence generation—including EM-style joint AR/NAR optimization (Sun et al., 2020), continuous-space iterative refinement (Lee et al., 2020), and discrete diffusion-based NAR models for dialogue and text (Zhou et al., 2023)—have informed development of multi-pass, mask-and-fill procedures. ToolACE-MT's modular, deterministic pipeline, together with strong LLM-based masking and judges, establishes a practical blueprint for scalable agentic data synthesis at scale.
7. Connections to Broader NAR Iterative Refinement Literature
- Token- and Sequence-Level NAR: ToolACE-MT's refinement steps are structurally analogous to NAT with iterative masked guessing (Lee et al., 2018), but expanded to include agentic dialogue acts and external tool calls.
- Diffusion and Denoising: Mask-and-fill iterative updates align closely with discrete diffusion and masked language modeling frameworks (Zhou et al., 2023). Flexible injection types and mask strategies enable incorporation of tool outputs and user feedback directly into intermediate states.
- EM and Posterior Regularization: The multi-pass, judge-assisted pipeline can be interpreted as alternating E- and M-steps in the sense of Sun & Yang (Sun et al., 2020), but conducted in open-ended conversational state spaces rather than sentence-pairing for translation.
- Continuous-Space Alternatives: Continuous-latent refinement as proposed by Lee et al. (Lee et al., 2020) remains compatible in situations permitting differentiable agentic latent states, but ToolACE-MT targets discrete, highly structured agentic workflow graphs.
In sum, ToolACE-MT operationalizes non-autoregressive iterative refinement for agentic, tool-calling LLM tasks, achieving improvements in both data quality and synthetic dialogue generation cost, validated by consistent and significant empirical gains over strong simulation-based and NAR baselines (Zeng et al., 18 Aug 2025, Lee et al., 2018, Zhou et al., 2023, Sun et al., 2020, Lee et al., 2020).