DiaFORGE Pipeline: API Disambiguation
- DiaFORGE Pipeline is a disambiguation-centric system designed to handle underspecified and ambiguous API calls in enterprise environments.
- It integrates synthetic multi-turn dialogue generation, supervised fine-tuning with embedded reasoning traces, and dynamic agentic evaluation to ensure schema compliance and endpoint accuracy.
- Empirical results demonstrate substantial improvements, reaching up to 0.89 tool-call accuracy and reducing erroneous API invocations relative to strong proprietary baselines.
DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation) is a disambiguation-centric pipeline designed to improve the reliability and realism of LLMs tasked with enterprise-level API invocation. The pipeline integrates synthetic multi-turn dialogue generation, supervised fine-tuning with embedded reasoning traces, and dynamic evaluation using agentic benchmarks. By targeting scenarios where intent ambiguity and underspecified tool arguments lead to frequent LLM failures, DiaFORGE provides a structured methodology for building tool-calling agents capable of sequential clarification, robust schema fulfillment, and measurable end-to-end goal completion (Hathidara et al., 4 Jul 2025).
1. Motivation and Formalization of the Disambiguation Problem
Modern enterprise environments routinely expose thousands of narrowly specialized APIs, each accompanied by a formal JSON schema enumerating required and optional parameters. Common user requests to these systems are prone to two principal deficiencies: underspecification, where mandatory parameters such as currencyCode, date, or accountId are omitted, and ambiguity, where the intended functionality maps to several near-duplicate endpoints (e.g., CreateCustomer versus CreateUser).
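To make the two failure modes concrete, below is a hypothetical pair of near-duplicate endpoint specs in the JSON-schema style described above; the names, fields, and layout are illustrative assumptions, not records from the DiaFORGE corpus.

```python
# Hypothetical near-duplicate endpoint specs in the JSON-schema style described
# above. Names, fields, and layout are illustrative, not taken from the corpus.
CREATE_CUSTOMER = {
    "name": "CreateCustomer",
    "description": "Create a billing customer record.",
    "parameters": {
        "accountId":    {"type": "string", "required": True},
        "currencyCode": {"type": "string", "required": True},
        "date":         {"type": "string", "required": False},
    },
}

CREATE_USER = {  # near-duplicate endpoint that makes user intent ambiguous
    "name": "CreateUser",
    "description": "Create a login identity for a person.",
    "parameters": {
        "email": {"type": "string", "required": True},
        "role":  {"type": "string", "required": False},
    },
}
```

A request like "create an account for ACME" is then both ambiguous (either endpoint could match) and underspecified (accountId and currencyCode are missing).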
Primary failure modes observed in conventional tool-calling LLM deployments include incorrect endpoint selection, hallucinated calls to nonexistent APIs, and argument maps that omit required schema keys. Formally, the tool-disambiguation and slot-filling scenario is defined as follows: given a catalog $\mathcal{T} = \{t_1, \dots, t_N\}$ of callable tools, each tool $t_i$ with parameter set $P_i$ and required subset $R_i \subseteq P_i$, the assistant must construct a candidate subset $C \subseteq \mathcal{T}$ containing the ground-truth tool $t^{*}$, iteratively query the user for clarifying information to reduce the candidate set to $\{t^{*}\}$, and collect values for all parameters in the required set $R^{*}$ prior to API invocation. This defines a sequential decision process whose dual objectives are minimization of tool-level ambiguity and completion of all formal schema slots.
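The sequential decision process admits a compact sketch. The code below is a minimal illustration under stated assumptions: the `Tool` dataclass, the substring-matching narrowing heuristic, and the `ask_user` callback are placeholders of my own, not DiaFORGE's actual implementation.

```python
# Minimal sketch of the two-phase loop: shrink the candidate set to the
# ground-truth tool, then fill every required slot before invocation.
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    required: set[str]                       # required parameter names (R_i)
    optional: set[str] = field(default_factory=set)


def disambiguate_and_fill(candidates: list[Tool], ask_user) -> tuple[Tool, dict]:
    """Return (tool, args) after clarification and slot filling."""
    # Phase 1: tool selection -- query until exactly one candidate remains.
    while len(candidates) > 1:
        names = " or ".join(t.name for t in candidates)
        answer = ask_user(f"Did you mean {names}?")
        narrowed = [t for t in candidates if t.name.lower() in answer.lower()]
        candidates = narrowed or candidates[:1]  # crude fallback, avoids looping forever
    tool = candidates[0]
    # Phase 2: argument completion -- request each required slot in turn.
    args = {slot: ask_user(f"What value should I use for '{slot}'?")
            for slot in sorted(tool.required)}
    return tool, args


# Console-driven demo with two ambiguous candidates.
tools = [Tool("CreateCustomer", {"accountId", "currencyCode"}),
         Tool("CreateUser", {"email"})]
tool, args = disambiguate_and_fill(tools, input)
```

The key design point mirrored here is that no API call is emitted until both objectives are met: a singleton candidate set and a complete required-argument map.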
2. Pipeline Architecture: DiaFORGE’s Three Stages
DiaFORGE implements a multi-stage architecture to address the disambiguation-centric tool-calling challenge.
2.1 Dialogue Synthesis via UTC-Gen
UTC-Gen (Unified Tool-Calling Generator) automates conversation trace construction in three phases:
- Metadata Construction: For each seed tool $t^{*}$, a corporate persona $\rho$ is sampled from a pool of 12,000, along with a goal $g$, a set of semantic distractors $D$, and a validated parameter map $M$ for $t^{*}$.
- Two-Phase Dialogue Synthesis:
  - In the tool-selection phase, the user-proxy deliberately reveals minimal information through utterances $u_1, u_2, \dots$, while the assistant issues clarifying questions to eliminate distractors.
  - Once the ground-truth tool is isolated, the argument-completion phase commences: the assistant requests missing parameters sequentially, and the user-proxy responds truthfully with values from $M$.
- Validation Cascade: Each candidate dialogue passes format, tool-call, argument, relevancy, and LLM-critique validators, with periodic human spot-checking (a minimal sketch of this cascade follows the list). Stopping criteria require either a schema-conformant tool call or reaching the maximum turn budget $T_{\max}$.
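The following sketch models the validation cascade under stated assumptions: each validator is a callable returning `(passed, reason)`, and the validator names mirror the list above; their internals are placeholders, not the released implementation.

```python
# Cascade of dialogue validators applied in order, with early rejection.
from typing import Callable

Validator = Callable[[dict], tuple[bool, str]]


def run_cascade(dialogue: dict, validators: list[tuple[str, Validator]]) -> bool:
    """Accept a candidate dialogue only if every validator passes, in order."""
    for name, check in validators:
        passed, reason = check(dialogue)
        if not passed:
            print(f"{name} rejected dialogue: {reason}")
            return False
    return True


# Ordering mirrors the cascade described above; the LLM-critique validator
# plausibly runs last because it is the most expensive check.
CASCADE_ORDER = ["format", "tool_call", "argument", "relevancy", "llm_critique"]
```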
2.2 Supervised Fine-Tuning with Reasoning Traces
Assistant responses are stored as JSON objects containing:
- “thought”: A private chain-of-thought encoded in `<think>` tags
- “tool_calls”: Optional stub for tool-invocation serialization
- “content”: Public assistant reply

Dialogues are turn-sliced into samples $(x_k, y_k)$, where $x_k$ is the system-prompt-plus-history context and $y_k$ is the assistant JSON reply. Loss masking ensures only reply tokens contribute to the standard next-token cross-entropy objective:

$$\mathcal{L} = -\sum_{k} \sum_{j=1}^{|y_k|} \log p_\theta\!\left(y_{k,j} \mid x_k,\, y_{k,<j}\right)$$

Parameter-efficient fine-tuning is implemented using LoRA adapters (rank $r$, scaling $\alpha$) atop instruction-tuned, decoder-only model backbones at multiple scales: Llama-3.2 (3B), Gemma-3 (4B, 12B, 27B), Llama-3.3 (70B), and Nemotron-Super (49B). Training uses AdamW for one epoch, 8-bit weights, batch size one, and a fixed peak learning rate.

2.3 Agentic Dynamic Evaluation: DiaBENCH

Models are deployed in a live agentic loop inside UTC-Gen, with the user-proxy policy frozen. Each turn:

- User utterances are sampled via multi-sampling and voting to reduce hallucination risk
- Assistant responses are generated by the fine-tuned model conditioned on the dialogue history and the candidate set $C$

The trajectory up to the turn budget $T_{\max}$ allows computation of end-to-end metrics over $N$ evaluation dialogues:

| Metric | Formula | Description |
|---------------------------------|--------------------------|------------------------------------------|
| Tool-Call Accuracy (Acc) | $N_{\text{correct}}/N$ | Correct tool with complete argument map |
| False-Positive Tool-Call Rate (FTR) | $N_{\text{wrong}}/N$ | Wrong tool invocations |
| Tool-Call Abstention Rate (TAR) | $N_{\text{abstain}}/N$ | Dialogues with no API invocation |

Goal completion (Success Rate, SR) is equated to Acc: $N_{\text{correct}}$ counts dialogues ending in a correct call.

3. DiaBENCH Dynamic Benchmarking Results

Under dynamic evaluation using optimized conversational prompting (CAPO), DiaFORGE-finetuned models outperform leading proprietary offerings. Llama-3.2-DiaFORGE-3B achieves an accuracy (Acc) of 0.80 versus GPT-4o’s 0.62 and Claude-3.5-Sonnet’s 0.39. Nemotron-49B-DiaFORGE yields an Acc of 0.89, outperforming GPT-4o by 27 percentage points and Claude-3.5 by 49 points. The False-Positive Tool-Call Rate (FTR) and Tool-Call Abstention Rate (TAR) trade off favorably: the proprietary baselines achieve low FTR largely by abstaining, whereas DiaFORGE models invoke the correct endpoint with schema-compliant arguments far more often.

| Model | Acc | FTR | TAR |
|-------------------------|------|------|------|
| Llama-3.2-DiaFORGE-3B | 0.80 | 0.08 | 0.06 |
| GPT-4o (CAPO prompt) | 0.62 | 0.02 | 0.36 |
| Claude-3.5-Sonnet (CAPO)| 0.39 | 0.03 | 0.55 |

This suggests that the DiaFORGE methodology yields tangible gains in real-world tool-calling effectiveness and robustness that are not captured by static benchmarks or generic prompting protocols.

4. Released Open Corpus: Composition and Usage

DiaFORGE’s corpus comprises approximately 5,000 validated enterprise API specifications, each paired with multi-turn dialogues engineered to maximize tool ambiguity and slot underspecification. Each corpus record contains:

- Tool specification (name, description, parameter schema)
- Persona assignment
- Distractor set
- Parameter map
- Full dialogue with embedded reasoning traces and final tool calls

Annotation leverages automated validators for format, argument compliance, and relevance, with additional LLM-based critique and periodic human spot validation. The corpus is distributed via HuggingFace (sap-ai-research/diaforge-utc-r-0725); schema fields support direct integration into retrieval pipelines, few-shot configurations, or backbone fine-tuning protocols for tool-calling LLM development in enterprise contexts.
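A minimal sketch of pulling the released corpus from the Hugging Face Hub is shown below. The dataset ID comes from the text above; the split name and record fields are assumptions, so inspect the schema before relying on them.

```python
# Load the DiaFORGE corpus and inspect its record schema.
from datasets import load_dataset

ds = load_dataset("sap-ai-research/diaforge-utc-r-0725", split="train")
print(ds.column_names)  # verify field names (tool spec, persona, dialogue, ...)
print(ds[0])            # one record: spec, distractors, parameter map, dialogue
```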
5. Impact, Reliability, and Prospective Extensions

DiaFORGE enforces rigorous handling of ambiguity and underspecification, mitigating risk by lowering the rate of hallucinated or premature API invocations and ensuring schema-conformant calls. This closes the gap between static evaluation regimes and the demands of interactive, real-world tool orchestration.

Proposed extensions include multi-tool orchestration in which dialogues involve sequential or composite API invocations, reinforcement learning via on-policy feedback in DiaBENCH, enhanced automated filtering of user-proxy hallucinations, and refined auditing of assistant chain-of-thought activity for compliance and explainability. A plausible implication is broader application to more complex workflows and additional transparency for high-stakes enterprise deployments.

DiaFORGE integrates targeted synthetic data generation, turn-aware SFT with private reasoning traces, and dynamic interactive benchmarks to produce enterprise agents exhibiting improved realism and reduced operational risk in automated API processes (Hathidara et al., 4 Jul 2025).