Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky (2507.03336v1)

Published 4 Jul 2025 in cs.AI, cs.CL, and cs.LG

Abstract: LLMs are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models with reasoning traces across 3B - 70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.

Summary

  • The paper demonstrates that disambiguation-centric finetuning improves tool invocation success by up to 49 percentage points.
  • The DiaFORGE pipeline generates realistic multi-turn dialogues and uses dynamic evaluation to reduce both false-positive tool calls and tool-call abstentions.
  • Methodological innovations reveal that targeted data and parameter-efficient adaptation can outperform larger models in enterprise API disambiguation.

Disambiguation-Centric Finetuning for Robust Enterprise Tool-Calling LLMs

The paper "Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky" (2507.03336) addresses a critical challenge in deploying LLM-based agents for enterprise tool invocation: the reliable disambiguation among near-duplicate APIs and the proactive elicitation of underspecified arguments in multi-turn dialogues. The authors introduce DiaFORGE, a modular pipeline for data generation, supervised finetuning, and dynamic evaluation, specifically designed to align LLM behavior with the operational requirements of enterprise environments.

Problem Context and Motivation

Enterprise environments typically expose thousands of APIs, many of which are minor variants tailored to different business domains. LLM-based agents tasked with tool invocation must not only select the correct API from a dense, overlapping set but also manage incomplete or ambiguous user requests. Existing benchmarks and training paradigms often fail to capture these challenges, as they rely on static, pre-scripted dialogues that do not require the agent to iteratively clarify intent or fill missing parameters. This leads to two prevalent failure modes in production: premature or incorrect tool invocation and incomplete argument specification, both of which can have significant operational and financial consequences.

DiaFORGE Pipeline

The DiaFORGE framework consists of three tightly integrated stages:

  1. Synthetic Dialogue Generation (UTC-Gen):
    • A multi-agent engine simulates realistic, persona-driven, multi-turn dialogues. Each dialogue is seeded with a ground-truth tool and a set of semantically similar distractor tools, sampled with a frozen sentence encoder over tool metadata (a minimal sketch of this sampling step appears after this list).
    • The user-proxy agent issues deliberately under-specified requests, compelling the assistant to ask clarifying questions to resolve tool ambiguity and to elicit all required arguments.
    • Dialogue synthesis is governed by strict validation: format, relevancy, and LLM-based critique validators ensure only high-quality, coherent dialogues are included in the training corpus.
  2. Supervised Fine-Tuning:
    • Instruction-tuned, decoder-only LLMs (3B–70B parameters) are further finetuned on the validated DiaFORGE corpus using a turn-slicing strategy: each assistant turn is paired with its dialogue prefix, and loss masking restricts the training signal to the assistant response itself (see the turn-slicing sketch after this list).
    • LoRA is used for parameter-efficient adaptation, and only the DiaFORGE-generated data is used for this stage, without additional general-domain SFT data.
  3. Dynamic Evaluation (DiaBENCH):
    • Models are evaluated both statically (isolated response quality) and dynamically (end-to-end, on-policy interaction with a user-proxy agent).
    • Dynamic evaluation measures the model’s ability to maintain contextual coherence, self-correct, and issue schema-conformant tool calls in realistic, multi-turn settings.
    • A multi-sampling and voting strategy is used for user utterance generation to minimize hallucinations and evaluation noise.
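
For concreteness, here is a minimal sketch of the distractor-sampling step in stage 1: tool metadata is embedded with a frozen sentence encoder and the nearest neighbors of the ground-truth tool serve as distractors. The encoder checkpoint, the metadata fields, and the number of distractors are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of distractor sampling for stage 1. The encoder checkpoint,
# metadata fields, and k are illustrative assumptions, not the paper's settings.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # frozen; used only for retrieval

def sample_distractors(tools, ground_truth_id, k=4):
    """Return the k tools whose metadata is most similar to the ground-truth tool."""
    # Embed each tool's name and description; in practice these would be precomputed.
    texts = [f"{t['name']}: {t['description']}" for t in tools]
    embs = encoder.encode(texts, normalize_embeddings=True)

    gt_idx = next(i for i, t in enumerate(tools) if t["id"] == ground_truth_id)
    sims = embs @ embs[gt_idx]      # cosine similarity, since embeddings are normalized
    sims[gt_idx] = -np.inf          # never pick the ground-truth tool as its own distractor
    return [tools[i] for i in np.argsort(-sims)[:k]]
```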

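The turn-slicing and loss-masking scheme in stage 2 can be sketched as follows, assuming a Hugging Face tokenizer with a chat template; the tokenizer choice and chat formatting are illustrative, not the paper's exact setup. Each assistant turn becomes one training example, and labels of -100 mask the dialogue prefix so that only the assistant response contributes to the loss.

```python
# Minimal sketch of turn-slicing with loss masking. The tokenizer is a placeholder;
# any chat-templated tokenizer would work the same way.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def slice_dialogue(dialogue):
    """Yield one (input_ids, labels) example per assistant turn in the dialogue."""
    examples = []
    for i, turn in enumerate(dialogue):
        if turn["role"] != "assistant":
            continue
        prefix = dialogue[:i]  # all turns preceding this assistant response
        prefix_ids = tokenizer.apply_chat_template(prefix, add_generation_prompt=True)
        target_ids = tokenizer(turn["content"], add_special_tokens=False)["input_ids"]
        target_ids = target_ids + [tokenizer.eos_token_id]

        # Loss masking: -100 is ignored by the cross-entropy loss, so only the
        # assistant response is predicted.
        examples.append({
            "input_ids": prefix_ids + target_ids,
            "labels": [-100] * len(prefix_ids) + target_ids,
        })
    return examples
```

In the paper's setup, the resulting examples feed LoRA-based parameter-efficient adaptation; the adapter configuration is omitted from this sketch.
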
Empirical Results

The evaluation on DiaBENCH demonstrates that DiaFORGE-finetuned models achieve substantial improvements over both open-source and proprietary baselines:

  • Tool Invocation Success: On dynamic evaluation, DiaFORGE-finetuned models raise tool-invocation success by 27 percentage points over GPT-4o and by 49 percentage points over Claude-3.5-Sonnet, even when both baselines are evaluated with optimized system prompts.
  • Failure Modes: Both false-positive tool-call rate (FTR) and tool-call abstention rate (TAR) are significantly reduced in DiaFORGE models, indicating improved disambiguation and reduced risk of missed or spurious tool invocations.
  • Model Scale: Notably, model size is not a monotonic predictor of performance. Mid-sized models (e.g., Llama-3.3-Nemotron-DiaFORGE-49B) outperform much larger models (e.g., Llama-3.3-DiaFORGE-70B), underscoring the importance of targeted data and finetuning over brute parameter scaling.
  • Conversational Quality: DiaFORGE finetuning preserves or improves conversational relevance and diversity, as measured by ConvRel, type-token ratio (TTR), and n-gram diversity, with no statistically significant degradation relative to instruction-tuned or proprietary models (illustrative implementations of the lexical-diversity measures follow this list).
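
As a point of reference, the two lexical-diversity measures can be computed as below; the paper's exact formulations (tokenization, aggregation over dialogues) may differ, so treat this as an illustrative implementation.

```python
# Illustrative implementations of two lexical-diversity measures (whitespace tokenization).

def type_token_ratio(text: str) -> float:
    """TTR: number of unique tokens divided by total tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def distinct_n(text: str, n: int = 2) -> float:
    """Distinct-n: fraction of n-grams that are unique, a common diversity proxy."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

print(type_token_ratio("which ledger account should I use for this transfer"))  # 1.0
print(distinct_n("which ledger account should I use for this transfer", n=2))   # 1.0
```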

Methodological Contributions

  • Disambiguation-Centric Data Synthesis: The synthetic data engine explicitly constructs scenarios with high tool ambiguity and incomplete argument specification, forcing the model to learn clarification and slot-filling strategies that are essential in enterprise deployments.
  • Dynamic, On-Policy Evaluation: The use of a live agentic loop for evaluation surfaces cascading errors and robustness issues that static benchmarks cannot detect, providing a more realistic assessment of model readiness for production (a minimal sketch of such a loop follows this list).
  • Open Corpus Release: The release of a rigorously validated corpus of ~5,000 enterprise API specifications and disambiguation-focused dialogues provides a practical blueprint for the community to build and benchmark reliable tool-calling agents.
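
To make the dynamic-evaluation idea concrete, the sketch below shows one possible shape of an on-policy evaluation episode; assistant_respond and user_proxy_respond are hypothetical stand-ins for model calls, and the task/reply schema is assumed for illustration rather than taken from DiaBENCH.

```python
# Minimal sketch of one dynamic-evaluation episode with a user-proxy agent.
# `assistant_respond` and `user_proxy_respond` are hypothetical model wrappers.

def evaluate_episode(task, assistant_respond, user_proxy_respond, max_turns=8):
    """Run a live dialogue and check whether the gold tool is invoked with the gold arguments."""
    history = [{"role": "user", "content": task["initial_request"]}]
    for _ in range(max_turns):
        reply = assistant_respond(history)  # {"type": "tool_call", ...} or {"type": "message", ...}
        history.append({"role": "assistant", "content": reply})
        if reply["type"] == "tool_call":
            correct = (reply["name"] == task["gold_tool"]
                       and reply["arguments"] == task["gold_arguments"])
            return {"invoked": True, "correct": correct, "turns": len(history)}
        # The assistant asked a clarifying question; the user-proxy answers on-policy.
        history.append({"role": "user", "content": user_proxy_respond(history, task)})
    return {"invoked": False, "correct": False, "turns": len(history)}  # abstention within budget
```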

Implications and Future Directions

The results demonstrate that disambiguation-centric finetuning is essential for deploying LLM agents in high-stakes, tool-rich enterprise environments. The approach mitigates key risks—incorrect tool selection and incomplete argument gathering—that are not adequately addressed by existing benchmarks or generic instruction tuning. The findings also challenge the assumption that larger models or prompt engineering alone suffice for robust tool use; instead, targeted data and evaluation protocols are necessary.

Several open challenges remain. Extending DiaFORGE to multi-tool, planning-intensive dialogues would further enhance realism and benchmark the agent’s capacity for sequencing and recovery. Automating user-proxy validation to reduce human oversight in dynamic evaluation is another avenue for scaling deployment. Finally, integrating retrieval-augmented generation or hybrid symbolic-neural approaches may further improve disambiguation in extremely dense API surfaces.

Conclusion

This work establishes a new standard for training and evaluating enterprise tool-calling LLMs, demonstrating that disambiguation-centric finetuning yields substantial gains in both reliability and safety. The modular pipeline, open corpus, and dynamic evaluation protocol together provide a foundation for future research and deployment of LLM agents in complex, real-world operational settings.
