Benchmarking Web API Code Generation
- Web API integration code generation is the automated synthesis of source code for invoking REST endpoints, guided by OpenAPI specifications and natural language input; benchmarking measures how reliably LLMs perform this task.
- The benchmark constructs a synthetic dataset pairing natural language tasks with canonical API request configurations, with strict specification compliance ensured through manual curation.
- Empirical evaluations reveal challenges like endpoint hallucinations, incomplete argument mapping, and syntax errors, highlighting the need for robust LLM-driven methods.
Web API integration code generation refers to the automated synthesis of source code that correctly invokes web APIs—typically REST endpoints defined in public OpenAPI specifications, service SDKs, or proprietary platforms—given a developer’s intent, natural language description, or functional requirements. Benchmarking in this context aims to rigorously assess and compare the ability of automated tools, particularly LLMs, to produce conformant, correct, and executable code that performs the intended integration operation. This task is particularly challenging due to the diversity of APIs, complex parameterization, incomplete contextual knowledge, and the need for strict adherence to endpoint definitions and argument schemas.
1. Benchmark Dataset Construction
Benchmark datasets for Web API integration code generation are characterized by their grounding in real-world API specifications and their systematic pairing of concise, task-oriented natural language input with unambiguous, specification-compliant invocation code.
The referenced benchmark introduces a dataset composed of triples (a, t, c):
- a: The web API target (e.g., Asana, Google Calendar, Google Sheets, Slack)
- t: A natural language task (e.g., "Create a new secondary calendar named 'Example Calendar' with time zone 'America/Los_Angeles'.")
- c: The canonical request configuration, a JSON structure that combines the full endpoint URL, HTTP method, and all required parameters (segregated into data, headers, and query parameters).
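For concreteness, a hypothetical sample in this format might look as follows; the field names and the token placeholder are illustrative assumptions, only the overall URL/method/data/headers/query layout is taken from the benchmark description:

```typescript
// Hypothetical (a, t, c) sample; field names and values are illustrative.
const sample = {
  api: "google-calendar", // a: the target web API
  task:
    "Create a new secondary calendar named 'Example Calendar' " +
    "with time zone 'America/Los_Angeles'.", // t: the natural language task
  config: {
    // c: canonical request configuration
    url: "https://www.googleapis.com/calendar/v3/calendars",
    method: "post",
    data: { summary: "Example Calendar", timeZone: "America/Los_Angeles" },
    headers: { Authorization: "Bearer <token>" },
    params: {}, // query parameters (none required for this task)
  },
};
```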
Construction proceeds synthetically, using an LLM (Gemini 1.5 Pro) and OpenAPI specifications to generate a corpus of nearly 400 carefully curated samples. Real-world repositories typically lack sufficiently rich and unambiguous annotated data, hence the reliance on synthetic generation and subsequent manual curation. Each task-configuration pair is cross-validated for compliance with the respective OpenAPI specification, thus ensuring that the target code represents an executable and correct API call (Maninger et al., 24 Sep 2025).
2. Evaluation Pipeline and Experimental Setup
The benchmarking pipeline is methodically staged:
- Dataset Creation and Curation: Task and correct configuration pairs, (t, c), are generated from OpenAPI specs via zero-shot prompting of an LLM, to avoid biasing the data toward specific code structures or endpoints.
- Prompted Code Generation: Each model under evaluation receives the API identifier a and the task description t, embedded as a comment in the provided starter code. Two setups are tested:
- Full completion: Only the comment and a code stub (e.g., `axios.`) are provided; the model must generate the full API call, including endpoint selection.
- Argument completion: The HTTP method and correct endpoint are given; the model must generate only the correct arguments.
- Controlled Code Execution: Each generated code sample is executed in a mock environment. For instance, in JavaScript/Axios, the runtime intercepts the Axios invocation to extract the produced request configuration c′ just before the external HTTP request would be made (see the sketch after this list).
- Configuration-Level Correctness Comparison: The generated configuration c′ is compared against the ground truth c. Comparison is fine-grained, examining the URL, HTTP method, and parameter sets at all argument positions. Executable code is further divided into those that are fully correct, have the correct URL, or use only legal parameters.
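A minimal sketch of the controlled-execution idea, not the benchmark's actual harness: in Axios, a custom adapter can record the request configuration and return a fake response, so generated code runs without network traffic. The endpoint, body fields, and helper names below are illustrative assumptions.

```typescript
import axios, { AxiosRequestConfig, InternalAxiosRequestConfig } from "axios";

const captured: AxiosRequestConfig[] = [];

// Replace the default adapter: record the produced configuration c′ and
// return a fake successful response instead of performing an HTTP request.
axios.defaults.adapter = async (config: InternalAxiosRequestConfig) => {
  captured.push(config);
  return { data: {}, status: 200, statusText: "OK", headers: {}, config };
};

async function runGeneratedSample(): Promise<void> {
  // Example of code a model might generate for the calendar task in Section 1.
  await axios.post(
    "https://www.googleapis.com/calendar/v3/calendars",
    { summary: "Example Calendar", timeZone: "America/Los_Angeles" },
    { headers: { Authorization: "Bearer <token>" } }
  );
}

runGeneratedSample().then(() => {
  // captured[0] now holds c′: URL, method, headers, params, and the request
  // body (Axios serializes `data` to a JSON string before the adapter runs).
  const c = captured[0];
  console.log(c.method, c.url, c.data);
});
```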
The process and its input-output relationships can be summarized as (t, c) = Gen(s, a), with the generated code M(a, t) then executed to obtain c′, where Gen is the generation process over the OpenAPI specification s and API a, and M is the model output given its input.
3. Metrics and Error Analysis
This benchmark introduces rigorous, element-wise metrics:
- Correct Implementations: Proportion of code samples for which c′ matches c exactly, across all top-level elements.
- Correct URLs / Illegal URLs: Fraction of code that references the precise specification endpoint; fraction that generates an endpoint not listed in the spec.
- Correct Methods / Illegal Methods: Proportion of code using the correct HTTP method or introducing an illegal method for the endpoint.
- Argument Metrics:
- Mean Precision: For executable code, the percentage of provided arguments that are correct.
- Mean Recall: The proportion of required arguments that are generated.
- Conditional Value Accuracy: Given a correct argument name, the rate at which its value matches the reference configuration.
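A minimal sketch of the argument-level metrics, assuming generated and reference arguments are flattened into name-to-value maps; the flattening and function names are assumptions, not the benchmark's implementation:

```typescript
type Args = Record<string, string>;

function argumentMetrics(generated: Args, reference: Args) {
  const genNames = Object.keys(generated);
  const refNames = Object.keys(reference);

  const correctNames = genNames.filter((n) => n in reference);
  const correctValues = correctNames.filter((n) => generated[n] === reference[n]);

  return {
    // Precision: share of provided arguments that appear in the reference.
    precision: genNames.length ? correctNames.length / genNames.length : 0,
    // Recall: share of required arguments that were provided.
    recall: refNames.length ? correctNames.length / refNames.length : 1,
    // Conditional value accuracy: correct values among correctly named arguments.
    valueAccuracy: correctNames.length
      ? correctValues.length / correctNames.length
      : 0,
  };
}

// Example: one required argument missing, one value correct.
console.log(
  argumentMetrics(
    { summary: "Example Calendar" },
    { summary: "Example Calendar", timeZone: "America/Los_Angeles" }
  )
); // { precision: 1, recall: 0.5, valueAccuracy: 1 }
```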
Metrics are reported as both “(t)” values (over all samples) and “(e)” values (over samples producing executable code). This distinction is essential because models frequently generate non-executable code, particularly in the full-completion setting, where failures to complete the code stub are prevalent.
This systematic breakdown yields detailed insight into specific points of failure: hallucinated endpoints, illegal parameter usage, incomplete argument provision, or outright syntactic errors (Maninger et al., 24 Sep 2025).
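For illustration, the classification of executable samples (fully correct, correct URL, only legal parameters) can be sketched as follows; the RequestConfig shape, the flattening of parameter names, and the classify helper are assumptions, not the benchmark's code.

```typescript
// Assumes both c and c′ follow the url/method/data/headers/params layout and
// that the legal parameter names for the endpoint are known from the OpenAPI spec.
interface RequestConfig {
  url: string;
  method: string;
  data: Record<string, unknown>;
  headers: Record<string, unknown>;
  params: Record<string, unknown>;
}

function classify(cPrime: RequestConfig, c: RequestConfig, legalParams: Set<string>) {
  const paramNames = (cfg: RequestConfig) =>
    [...Object.keys(cfg.data), ...Object.keys(cfg.headers), ...Object.keys(cfg.params)];

  const correctUrl = cPrime.url === c.url;
  const correctMethod = cPrime.method.toLowerCase() === c.method.toLowerCase();
  const onlyLegalParams = paramNames(cPrime).every((n) => legalParams.has(n));
  // Simplified full-match check; a real comparison would normalize key order
  // and compare nested values element-wise.
  const fullyCorrect =
    correctUrl &&
    correctMethod &&
    JSON.stringify(cPrime.data) === JSON.stringify(c.data) &&
    JSON.stringify(cPrime.params) === JSON.stringify(c.params) &&
    JSON.stringify(cPrime.headers) === JSON.stringify(c.headers);

  return { fullyCorrect, correctUrl, correctMethod, onlyLegalParams };
}
```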
4. Empirical Findings and Model Behavior
The evaluation of multiple open-source LLMs (and commercial baselines such as GPT-4o) produces several salient findings:
- Endpoint Memorization vs. Hallucination: Open-source models frequently hallucinate endpoint URLs, with illegal endpoint rates reaching 39% and incorrect parameter names up to 31%. This is substantial even though these models are exposed to API usage patterns in pretraining. Commercial LLMs exhibit better, but still imperfect, endpoint accuracy.
- Argument Completion Challenges: Even with the endpoint and HTTP method supplied, models inconsistently provide all required arguments and may include extraneous or invalid parameters. Argument precision for executable code varies from 51% to 75%; recall is lower (33–69%), indicating frequent omission of necessary arguments.
- Code Syntax and Executability: Some models (notably in full completion) often produce code that fails to execute due to incomplete stubs, missing parentheses, or incomplete code constructs, yielding 0% correct implementations in the strictest setting.
- Size and Architecture Effects: Larger models do not uniformly outperform their smaller and medium-sized counterparts. Model family and pretraining exposure to API patterns, rather than parameter count alone, appear to be the dominant factors.
- API-Specific Performance Deviation: Success rates vary widely across target APIs, suggesting a strong dependence on prior exposure in model training data.
Commercial models currently outperform all tested open-source models: GPT-4o achieves up to 77% correctness in argument completion, while full-completion correctness for open-source LLMs remains below the 40% threshold (Maninger et al., 24 Sep 2025).
5. Diagnostic Insights and Implications
The benchmarking reveals the underlying complexities of web API integration code generation:
- Persistent Hallucinations: The tendency to hallucinate endpoints—despite the presence of specification-compliant guidance in the prompt and rich training data—suggests that ungrounded endpoint and parameter generation is a core challenge. Even argument completion does not eliminate this effect due to inconsistent mapping between intent and parameter schema.
- Validation and Execution: Evaluation must move beyond static analysis or string-matching to runtime validation. The use of a controlled mock runtime that intercepts API invocations prior to network transmission is an efficient strategy for scalable correctness assessment when live endpoint access is unnecessary or impractical.
- Specification Compliance: Strict, element-wise compliance with OpenAPI specifications exposes a wider range of errors than pass/fail or text-based comparison alone, providing a more realistic measure of code usability in integration scenarios.
- Necessity of Synthetic Datasets: In the absence of high-quality, real-world repositories containing full contextual traces and canonical solutions, synthetic and semi-synthetic datasets—grounded in verified specifications and curated by experts—are required to provide robust evaluation and development targets.
Taken together, these insights imply that future progress will likely require the use of retrieval-augmented, context-aware generation methods that tightly bind code generation to up-to-date external API specifications, combined with more sophisticated constraint checking and perhaps controlled decoding to avoid "hallucination" errors (Maninger et al., 24 Sep 2025).
6. Benchmarking as a Catalyst for Advancing Model Robustness
The presented framework offers a rigorous foundation for ongoing research in LLM-powered web API integration code synthesis. Its capabilities include:
- Fine-grained element-wise error diagnosis, enabling model developers to isolate failure modes in endpoint, method, and argument generation.
- Robust task definitions that reflect real developer needs rather than contrived or underspecified code completion.
- Scalability and extensibility to a wide range of APIs, as new OpenAPI specifications can be programmatically included.
- The potential to expose generalization gaps and adverse effects of data memorization biases, informing the design of training curricula or architectural adjustments.
By systematically exposing hallucination rates, argument mapping errors, and syntax failures, the benchmark establishes a baseline that both illuminates current limitations and sets clear targets for future, specification-compliant model improvement.