DiaFORGE Pipeline: API Disambiguation
- DiaFORGE Pipeline is a disambiguation-centric system designed to handle underspecified and ambiguous API calls in enterprise environments.
- It integrates synthetic multi-turn dialogue generation, supervised fine-tuning with embedded reasoning traces, and dynamic agentic evaluation to ensure schema compliance and endpoint accuracy.
- Empirical results demonstrate substantial improvements, reaching up to 0.89 tool-call accuracy and reducing erroneous API invocations relative to strong proprietary baselines.
DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation) is a disambiguation-centric pipeline designed to improve the reliability and realism of LLMs tasked with enterprise-level API invocation. The pipeline integrates synthetic multi-turn dialogue generation, supervised fine-tuning with embedded reasoning traces, and dynamic evaluation using agentic benchmarks. By targeting scenarios where intent ambiguity and underspecified tool arguments lead to frequent LLM failures, DiaFORGE provides a structured methodology for building tool-calling agents capable of sequential clarification, robust schema fulfillment, and measurable end-to-end goal completion (Hathidara et al., 4 Jul 2025).
1. Motivation and Formalization of the Disambiguation Problem
Modern enterprise environments routinely expose thousands of narrowly specialized APIs, each accompanied by a formal JSON schema enumerating required and optional parameters. Common user requests to these systems are prone to two principal deficiencies: underspecification, where mandatory parameters such as currencyCode, date, or accountId are omitted, and ambiguity, where the intended functionality maps to several near-duplicate endpoints (e.g., CreateCustomer versus CreateUser).
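To make the two failure modes concrete, below is a hypothetical pair of near-duplicate endpoint specs in the JSON-schema style described above; the names, fields, and layout are illustrative assumptions, not records from the DiaFORGE corpus.

```python
# Hypothetical near-duplicate endpoint specs in the JSON-schema style described
# above. Names, fields, and layout are illustrative, not taken from the corpus.
CREATE_CUSTOMER = {
    "name": "CreateCustomer",
    "description": "Create a billing customer record.",
    "parameters": {
        "accountId":    {"type": "string", "required": True},
        "currencyCode": {"type": "string", "required": True},
        "date":         {"type": "string", "required": False},
    },
}

CREATE_USER = {  # near-duplicate endpoint that makes user intent ambiguous
    "name": "CreateUser",
    "description": "Create a login identity for a person.",
    "parameters": {
        "email": {"type": "string", "required": True},
        "role":  {"type": "string", "required": False},
    },
}
```

A request like "create an account for ACME" is then both ambiguous (either endpoint could match) and underspecified (accountId and currencyCode are missing).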
Primary failure modes observed in conventional tool-calling LLM deployments include incorrect endpoint selection, hallucinated calls to nonexistent APIs, and argument maps that omit required schema keys. Formally, the tool-disambiguation and slot-filling scenario is defined as follows: given a catalog $\mathcal{T} = \{t_1, \dots, t_N\}$ of callable tools, each tool $t_i$ with parameter set $P_i$ and required subset $R_i \subseteq P_i$, the assistant must construct a candidate subset $C \subseteq \mathcal{T}$ containing the ground-truth tool $t^{*}$, iteratively query the user for clarifying information to reduce the candidate set to $\{t^{*}\}$, and collect values for all parameters in the required set $R^{*}$ prior to API invocation. This defines a sequential decision process whose dual objectives are minimization of tool-level ambiguity and completion of all formal schema slots.
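The sequential decision process admits a compact sketch. The code below is a minimal illustration under stated assumptions: the `Tool` dataclass, the substring-matching narrowing heuristic, and the `ask_user` callback are placeholders of my own, not DiaFORGE's actual implementation.

```python
# Minimal sketch of the two-phase loop: shrink the candidate set to the
# ground-truth tool, then fill every required slot before invocation.
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    required: set[str]                       # required parameter names (R_i)
    optional: set[str] = field(default_factory=set)


def disambiguate_and_fill(candidates: list[Tool], ask_user) -> tuple[Tool, dict]:
    """Return (tool, args) after clarification and slot filling."""
    # Phase 1: tool selection -- query until exactly one candidate remains.
    while len(candidates) > 1:
        names = " or ".join(t.name for t in candidates)
        answer = ask_user(f"Did you mean {names}?")
        narrowed = [t for t in candidates if t.name.lower() in answer.lower()]
        candidates = narrowed or candidates[:1]  # crude fallback, avoids looping forever
    tool = candidates[0]
    # Phase 2: argument completion -- request each required slot in turn.
    args = {slot: ask_user(f"What value should I use for '{slot}'?")
            for slot in sorted(tool.required)}
    return tool, args


# Console-driven demo with two ambiguous candidates.
tools = [Tool("CreateCustomer", {"accountId", "currencyCode"}),
         Tool("CreateUser", {"email"})]
tool, args = disambiguate_and_fill(tools, input)
```

The key design point mirrored here is that no API call is emitted until both objectives are met: a singleton candidate set and a complete required-argument map.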
2. Pipeline Architecture: DiaFORGE’s Three Stages
DiaFORGE implements a multi-stage architecture to address the disambiguation-centric tool-calling challenge.
2.1 Dialogue Synthesis via UTC-Gen
UTC-Gen (Unified Tool-Calling Generator) automates conversation trace construction in three phases:
- Metadata Construction: For each seed tool $t^{*}$, a corporate persona $\rho$ is sampled from a pool of 12,000, along with a goal $g$, a set of semantic distractors $D$, and a validated parameter map $M$ for $t^{*}$.
- Two-Phase Dialogue Synthesis:
  - In the tool-selection phase, the user-proxy deliberately reveals minimal information through utterances $u_1, u_2, \dots$, while the assistant issues clarifying questions to eliminate distractors.
  - Once the ground-truth tool is isolated, the argument-completion phase commences: the assistant requests missing parameters sequentially, and the user-proxy responds truthfully with values from $M$.
- Validation Cascade: Each candidate dialogue passes format, tool-call, argument, relevancy, and LLM-critique validators, with periodic human spot-checking (a minimal sketch of this cascade follows the list). Stopping criteria require either a schema-conformant tool call or reaching the maximum turn budget $T_{\max}$.
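The following sketch models the validation cascade under stated assumptions: each validator is a callable returning `(passed, reason)`, and the validator names mirror the list above; their internals are placeholders, not the released implementation.

```python
# Cascade of dialogue validators applied in order, with early rejection.
from typing import Callable

Validator = Callable[[dict], tuple[bool, str]]


def run_cascade(dialogue: dict, validators: list[tuple[str, Validator]]) -> bool:
    """Accept a candidate dialogue only if every validator passes, in order."""
    for name, check in validators:
        passed, reason = check(dialogue)
        if not passed:
            print(f"{name} rejected dialogue: {reason}")
            return False
    return True


# Ordering mirrors the cascade described above; the LLM-critique validator
# plausibly runs last because it is the most expensive check.
CASCADE_ORDER = ["format", "tool_call", "argument", "relevancy", "llm_critique"]
```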
2.2 Supervised Fine-Tuning with Reasoning Traces
Assistant responses are stored as JSON objects containing:
- “thought”: A private chain-of-thought encoded in `<think>` tags
- “tool_calls”: Optional stub for tool-invocation serialization
- “content”: Public assistant reply

Dialogues are turn-sliced into samples $(x_k, y_k)$, where $x_k$ is the system-prompt-plus-history context and $y_k$ is the assistant JSON reply. Loss masking ensures only reply tokens contribute to the standard next-token cross-entropy objective:

$$\mathcal{L} = -\sum_{k} \sum_{j=1}^{|y_k|} \log p_\theta\!\left(y_{k,j} \mid x_k,\, y_{k,<j}\right)$$

Parameter-efficient fine-tuning is implemented using LoRA adapters (rank $r$, scaling $\alpha$) atop instruction-tuned, decoder-only model backbones at multiple scales: Llama-3.2 (3B), Gemma-3 (4B, 12B, 27B), Llama-3.3 (70B), and Nemotron-Super (49B). Training uses AdamW for one epoch, 8-bit weights, batch size one, and a fixed peak learning rate.

2.3 Agentic Dynamic Evaluation: DiaBENCH

Models are deployed in a live agentic loop inside UTC-Gen, with the user-proxy policy frozen. Each turn:

- User utterances are sampled via multi-sampling and voting to reduce hallucination risk
- Assistant responses are generated by the fine-tuned model conditioned on the dialogue history and the candidate set $C$

The trajectory up to the turn budget $T_{\max}$ allows computation of end-to-end metrics over $N$ evaluation dialogues:

| Metric | Formula | Description |
|---------------------------------|--------------------------|------------------------------------------|
| Tool-Call Accuracy (Acc) | $N_{\text{correct}}/N$ | Correct tool with complete argument map |
| False-Positive Tool-Call Rate (FTR) | $N_{\text{wrong}}/N$ | Wrong tool invocations |
| Tool-Call Abstention Rate (TAR) | $N_{\text{abstain}}/N$ | Dialogues with no API invocation |

Goal completion (Success Rate, SR) is equated to Acc: $N_{\text{correct}}$ counts dialogues ending in a correct call.

3. DiaBENCH Dynamic Benchmarking Results

Under dynamic evaluation using optimized conversational prompting (CAPO), DiaFORGE-finetuned models outperform leading proprietary offerings. Llama-3.2-DiaFORGE-3B achieves an accuracy (Acc) of 0.80 versus GPT-4o’s 0.62 and Claude-3.5-Sonnet’s 0.39. Nemotron-49B-DiaFORGE yields an Acc of 0.89, outperforming GPT-4o by 27 percentage points and Claude-3.5 by 49 points. The False-Positive Tool-Call Rate (FTR) and Tool-Call Abstention Rate (TAR) trade off favorably: the proprietary baselines achieve low FTR largely by abstaining, whereas DiaFORGE models invoke the correct endpoint with schema-compliant arguments far more often.

| Model | Acc | FTR | TAR |
|-------------------------|------|------|------|
| Llama-3.2-DiaFORGE-3B | 0.80 | 0.08 | 0.06 |
| GPT-4o (CAPO prompt) | 0.62 | 0.02 | 0.36 |
| Claude-3.5-Sonnet (CAPO)| 0.39 | 0.03 | 0.55 |

This suggests that the DiaFORGE methodology yields tangible gains in real-world tool-calling effectiveness and robustness that are not captured by static benchmarks or generic prompting protocols.

4. Released Open Corpus: Composition and Usage

DiaFORGE’s corpus comprises approximately 5,000 validated enterprise API specifications, each paired with multi-turn dialogues engineered to maximize tool ambiguity and slot underspecification. Each corpus record contains:

- Tool specification (name, description, parameter schema)
- Persona assignment
- Distractor set
- Parameter map
- Full dialogue with embedded reasoning traces and final tool calls

Annotation leverages automated validators for format, argument compliance, and relevance, with additional LLM-based critique and periodic human spot validation. The corpus is distributed via HuggingFace (sap-ai-research/diaforge-utc-r-0725); schema fields support direct integration into retrieval pipelines, few-shot configurations, or backbone fine-tuning protocols for tool-calling LLM development in enterprise contexts.
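A minimal sketch of pulling the released corpus from the Hugging Face Hub is shown below. The dataset ID comes from the text above; the split name and record fields are assumptions, so inspect the schema before relying on them.

```python
# Load the DiaFORGE corpus and inspect its record schema.
from datasets import load_dataset

ds = load_dataset("sap-ai-research/diaforge-utc-r-0725", split="train")
print(ds.column_names)  # verify field names (tool spec, persona, dialogue, ...)
print(ds[0])            # one record: spec, distractors, parameter map, dialogue
```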
5. Impact, Reliability, and Prospective Extensions

DiaFORGE enforces rigorous handling of ambiguity and underspecification, mitigating risk by lowering the rate of hallucinated or premature API invocations and ensuring schema-conformant calls. This closes the gap between static evaluation regimes and the demands of interactive, real-world tool orchestration.

Proposed extensions include multi-tool orchestration in which dialogues involve sequential or composite API invocations, reinforcement learning via on-policy feedback in DiaBENCH, enhanced automated filtering of user-proxy hallucinations, and refined auditing of assistant chain-of-thought activity for compliance and explainability. A plausible implication is broader application to more complex workflows and additional transparency for high-stakes enterprise deployments.

DiaFORGE integrates targeted synthetic data generation, turn-aware SFT with private reasoning traces, and dynamic interactive benchmarks to produce enterprise agents exhibiting improved realism and reduced operational risk in automated API processes (Hathidara et al., 4 Jul 2025).