
DiaFORGE Pipeline: API Disambiguation

Updated 3 January 2026
  • DiaFORGE Pipeline is a disambiguation-centric system designed to handle underspecified and ambiguous API calls in enterprise environments.
  • It integrates synthetic multi-turn dialogue generation, supervised fine-tuning with embedded reasoning traces, and dynamic agentic evaluation to ensure schema compliance and endpoint accuracy.
  • Empirical results show substantial gains, with up to 0.89 tool-call accuracy and fewer false-positive API invocations than proprietary baselines such as GPT-4o and Claude-3.5-Sonnet.

DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation) is a disambiguation-centric pipeline designed to improve the reliability and realism of LLMs tasked with enterprise-level API invocation. The pipeline integrates synthetic multi-turn dialogue generation, supervised fine-tuning with embedded reasoning traces, and dynamic evaluation using agentic benchmarks. By targeting scenarios where intent ambiguity and underspecified tool arguments lead to frequent LLM failures, DiaFORGE provides a structured methodology for building tool-calling agents capable of sequential clarification, robust schema fulfillment, and measurable end-to-end goal completion (Hathidara et al., 4 Jul 2025).

1. Motivation and Formalization of the Disambiguation Problem

Modern enterprise environments routinely expose thousands of narrowly specialized APIs, each accompanied by a formal JSON schema enumerating its required and optional parameters. User requests to these systems are prone to two principal deficiencies: underspecification, where mandatory parameters such as currencyCode, date, or accountId are omitted, and ambiguity, where the intended functionality maps to several near-duplicate endpoints (e.g., CreateCustomer versus CreateUser).
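To make the two failure modes concrete, here is an illustrative catalog fragment as a Python literal; the two endpoints and their fields are hypothetical stand-ins that mirror the examples above, not entries from any real enterprise catalog:

```python
# Illustrative stand-ins for two near-duplicate enterprise endpoints.
# The field names (currencyCode, accountId, ...) mirror the examples in
# the text; the schemas themselves are hypothetical.
TOOL_CATALOG = {
    "CreateCustomer": {
        "description": "Create a customer master record for billing.",
        "parameters": {
            "name": "string",
            "currencyCode": "string",   # required, often omitted by users
            "accountId": "string",      # required
            "notes": "string",          # optional
        },
        "required": ["name", "currencyCode", "accountId"],
    },
    "CreateUser": {
        "description": "Provision a login user in the identity system.",
        "parameters": {
            "name": "string",
            "email": "string",          # required
            "role": "string",           # optional
        },
        "required": ["name", "email"],
    },
}
```

A request like “set up Acme Corp in the system” plausibly matches either entry while omitting every required value, exhibiting both deficiencies at once.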

Primary failure modes observed in conventional tool-calling LLM deployments include incorrect endpoint selection, hallucinated calls to nonexistent APIs, and argument maps that omit required schema keys. Formally, the tool-disambiguation and slot-filling scenario is defined as follows: given the catalog $\mathcal{T} = \{\tau_1, \ldots, \tau_{|\mathcal{T}|}\}$ of callable tools, each with parameters $\text{params}(\tau_i)$ and required set $\mathcal{R}(\tau_i)$, the assistant must construct a candidate subset $\mathcal{C}_k(u_1) \subset \mathcal{T}$ containing the ground-truth tool $\tau^*$, iteratively query for clarifying information until the candidate set shrinks to $|\mathcal{C}_k| = 1$, and collect all required argument values prior to API invocation. This defines a sequential decision process whose dual objectives are minimizing tool-level ambiguity and completing all formal schema slots.
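A runnable toy of this decision process, reusing the illustrative TOOL_CATALOG above; the yes/no probing and the truthful parameter map are simplistic stand-ins for real clarification turns, so only the control flow is the point:

```python
# Toy of the sequential decision process formalized above: first shrink
# the candidate set C_k to a single tool, then fill every required slot
# before emitting the call.
def disambiguate_and_fill(catalog, ground_truth, param_map, t_max=10):
    candidates = list(catalog)          # C_k(u_1): start from the full catalog
    filled = {}
    for _ in range(t_max):
        if len(candidates) > 1:         # objective 1: reduce |C_k| to 1
            probe = candidates[0]       # "is it this tool?" clarification
            if probe == ground_truth:   # simulated user confirms the probe
                candidates = [probe]
            else:
                candidates.remove(probe)
        else:                           # objective 2: fill R(tau*)
            tool = candidates[0]
            missing = [r for r in catalog[tool]["required"] if r not in filled]
            if not missing:             # all slots filled: invoke the API
                return {"tool": tool, "arguments": filled}
            slot = missing[0]
            filled[slot] = param_map[slot]   # simulated user answers truthfully
    return None                         # T_max reached without a valid call

print(disambiguate_and_fill(
    TOOL_CATALOG, "CreateCustomer",
    {"name": "Acme Corp", "currencyCode": "EUR", "accountId": "A-1042"},
))
```

In DiaFORGE, LLM policies replace these toy probes on both sides of the dialogue.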

2. Pipeline Architecture: DiaFORGE’s Three Stages

DiaFORGE implements a multi-stage architecture to address the disambiguation-centric tool-calling challenge.

2.1 Dialogue Synthesis via UTC-Gen

UTC-Gen (Unified Tool-Calling Generator) automates conversation trace construction in three phases:

  • Metadata Construction: For each seed tool $\tau^* \in \mathcal{T}$, a corporate persona $p \sim \pi_\text{rand}(k \mid \tau^*, \text{PersonaHub})$ is sampled from a pool of 12,000, along with a goal $g \sim P_\text{goal}(\cdot \mid \tau^*, p)$, a set of semantic distractors $\mathcal{D}_k(\tau^*) = \operatorname{arg\,top}_k \{\langle \phi(\tau^*), \phi(\tau)\rangle \mid \tau \neq \tau^*\}$ (see the distractor-selection sketch after this list), and a validated parameter map $\mathcal{V}^* = \{(r_i, v_i)\}$ for $r_i \in \mathcal{R}(\tau^*)$.
  • Two-Phase Dialogue Synthesis:
    • In the tool-selection phase, the user-proxy deliberately reveals minimal information through utterances $u_t \sim P_{\theta_u}(\cdot \mid \tau^*, p, g, \mathcal{D}_k, \mathcal{V}^*, \text{history}_{u_t})$, while the assistant issues clarifying questions $a_t \sim P_{\theta_a}(\cdot \mid \mathcal{C}_k, \text{history}_{a_t})$ to eliminate distractors.
    • Once the ground-truth tool is isolated, the argument-completion phase commences: the assistant requests missing parameters sequentially, and the user-proxy responds truthfully with values from $\mathcal{V}^*$.
  • Validation Cascade: Each candidate dialogue passes format, tool-call, argument, relevancy, and LLM-critique validators, with periodic human spot checking. Stopping criteria require either a schema-conformant tool call or reaching the maximum turn count $T_\text{max}$.
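Below, a runnable toy of the distractor-selection step from the metadata-construction phase: the arg top-$k$ of embedding inner products, with a bag-of-words counter standing in for the real embedding model $\phi$:

```python
# Toy of distractor selection: pick the top-k catalog neighbors of the
# seed tool by embedding inner product. A bag-of-words Counter stands in
# for the real embedding model phi.
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def inner(u, v):
    return sum(u[w] * v[w] for w in u)

def top_k_distractors(tool_star, catalog, k=2):
    target = embed(catalog[tool_star]["description"])
    scored = [(inner(target, embed(spec["description"])), name)
              for name, spec in catalog.items() if name != tool_star]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

# With the illustrative TOOL_CATALOG from Section 1:
print(top_k_distractors("CreateCustomer", TOOL_CATALOG, k=1))  # ['CreateUser']
```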

2.2 Supervised Fine-Tuning with Reasoning Traces

Assistant responses are stored as JSON objects containing:

  • “thought”: A private chain-of-thought encoded in dedicated tags
  • “tool_calls”: An optional stub for tool-invocation serialization
  • “content”: The public assistant reply

Dialogues are turn-sliced into samples $(x_{i,t}, y_{i,t})$, where $x_{i,t}$ is the system-prompt context and $y_{i,t}$ is the assistant's JSON reply. Loss masking ensures that only reply tokens contribute to the standard next-token cross-entropy objective:

$$L(\phi) = -\sum_{i,t} \sum_{j=1}^{|y_{i,t}|} \log p_\phi\!\left(y_{i,t}[j] \mid x_{i,t},\, y_{i,t}[{<}j]\right)$$

Parameter-efficient fine-tuning is implemented with LoRA adapters ($r = 16$, $\alpha = 16$) atop instruction-tuned, decoder-only backbones at multiple scales: Llama-3.2 (3B), Gemma-3 (4B, 12B, 27B), Llama-3.3 (70B), and Nemotron-Super (49B). Training uses AdamW for one epoch with 8-bit weights, batch size one, and a peak learning rate of $1 \times 10^{-4}$.
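A minimal sketch of this loss masking, assuming the common convention that label positions set to −100 are skipped by the cross-entropy (the usual ignore_index default in training stacks); the token IDs are illustrative:

```python
# Minimal sketch of turn-sliced loss masking: prompt/context tokens get the
# ignore label (-100), so only assistant-reply tokens contribute to L(phi).
IGNORE = -100

def build_sample(context_ids, reply_ids):
    """One (x_{i,t}, y_{i,t}) sample: inputs are context followed by reply;
    labels mask the context so the sum in L(phi) runs over reply tokens only."""
    input_ids = list(context_ids) + list(reply_ids)
    labels = [IGNORE] * len(context_ids) + list(reply_ids)
    return {"input_ids": input_ids, "labels": labels}

# Toy usage: a 5-token context followed by a 3-token assistant JSON reply.
sample = build_sample([101, 7, 42, 9, 2], [311, 512, 102])
assert sample["labels"][:5] == [IGNORE] * 5
print(sample)
```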
2.3 Agentic Dynamic Evaluation: DiaBENCH

Models are deployed in a live agentic loop inside UTC-Gen, freezing the user-proxy policy $P_{\theta_u}$. Each turn:

  • User utterances $\hat{u}_t$ are sampled via multi-sampling and voting to reduce hallucination risk.
  • Assistant responses $\hat{a}_t$ are generated by the fine-tuned model $f_\phi$ conditioned on $\text{history}_{a_t}$ and the candidate set $\mathcal{C}_k$.

The trajectory $d_f = \langle(\hat{u}_1, \hat{a}_1), \ldots, (\hat{u}_{T'}, \hat{a}_{T'})\rangle$, truncated at $T_\text{max}$, supports computation of end-to-end metrics:

| Metric | Formula | Description |
|---|---|---|
| Tool-Call Accuracy (Acc) | $\frac{1}{\lvert S\rvert}\sum_d \mathbf{1}\{c(d) = g(d)\}$ | Correct tool and argument map |
| False-Positive Tool-Call Rate (FTR) | $\frac{1}{\lvert S\rvert}\sum_d \#\{\text{invoked } \tau \neq \tau^*\}$ | Wrong-tool invocations |
| Tool-Call Abstention Rate (TAR) | $\frac{1}{\lvert S\rvert}\sum_d \mathbf{1}\{\text{no tool call}\}$ | Dialogues with no API invocation |

Goal completion (Success Rate $S$) is equated to Acc; optionally, $S_\text{end}$ counts dialogues ending in a correct call.

3. DiaBENCH Dynamic Benchmarking Results

Under dynamic evaluation with optimized conversational prompting (CAPO), DiaFORGE-finetuned models outperform leading proprietary offerings. Llama-3.2-DiaFORGE-3B achieves an accuracy (Acc) of 0.80 versus GPT-4o’s 0.62 and Claude-3.5-Sonnet’s 0.39. Nemotron-49B-DiaFORGE reaches an Acc of 0.89, outperforming GPT-4o by 27 percentage points and Claude-3.5 by 49 points. The False-Positive Tool-Call Rate (FTR) and Tool-Call Abstention Rate (TAR) also improve, indicating both more reliable endpoint selection and stronger schema compliance.

| Model | Acc | FTR | TAR |
|---|---|---|---|
| Llama-3.2-DiaFORGE-3B | 0.80 | 0.08 | 0.06 |
| GPT-4o (CAPO prompt) | 0.62 | 0.02 | 0.36 |
| Claude-3.5-Sonnet (CAPO) | 0.39 | 0.03 | 0.55 |

This suggests that the DiaFORGE methodology yields tangible gains in real-world tool-calling effectiveness and robustness relative to static benchmarks or generic prompting protocols.

4. Released Open Corpus: Composition and Usage

DiaFORGE’s corpus comprises approximately 5,000 validated enterprise API specifications, each paired with multi-turn dialogues engineered to elicit tool ambiguity and slot underspecification. Each corpus record contains:

  • Tool specification (name, description, parameter schema)
  • Persona assignment
  • Distractor set
  • Parameter map
  • Full dialogue with embedded reasoning traces and final tool calls

Annotation leverages automated validators for format, argument compliance, and relevance, supplemented by LLM-based critique and periodic human spot validation. The corpus is distributed via HuggingFace (sap-ai-research/diaforge-utc-r-0725); its schema fields support direct integration into retrieval pipelines, few-shot configurations, or backbone fine-tuning protocols for tool-calling LLM development in enterprise contexts.

5. Impact, Reliability, and Prospective Extensions

DiaFORGE enforces rigorous handling of ambiguity and underspecification, mitigating risk by lowering the rate of hallucinated or premature API invocations and ensuring schema-conformant calls. This closes the gap between static evaluation regimes and the demands of interactive, real-world tool orchestration.

Proposed extensions include multi-tool orchestration, where dialogues involve sequential or composite API invocations; reinforcement learning via on-policy feedback in DiaBENCH; enhanced automated filtering of user-proxy hallucination; and refined auditing of assistant chain-of-thought activity for compliance and explainability. A plausible implication is broader applicability to more complex workflows, along with additional transparency for high-stakes enterprise deployments.

DiaFORGE integrates targeted synthetic data generation, turn-aware SFT with private reasoning traces, and dynamic interactive benchmarking to produce enterprise agents that exhibit improved realism and reduced operational risk in automated API processes (Hathidara et al., 4 Jul 2025).
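Returning to the corpus release in Section 4, a minimal loading sketch with the HuggingFace datasets library; only the dataset ID is taken from the text above, so split and field names are printed rather than assumed:

```python
# Minimal sketch: pull the released DiaFORGE corpus from HuggingFace and
# inspect its structure. Split and field names are printed, not assumed,
# since only the dataset ID appears in the text above.
from datasets import load_dataset

corpus = load_dataset("sap-ai-research/diaforge-utc-r-0725")
for split_name, split in corpus.items():
    print(split_name, len(split), list(split.features))
```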
References

  1. Hathidara et al., 4 Jul 2025.
