Benchmarking Web API Integration Code Generation (2509.20172v1)
Abstract: API integration is a cornerstone of our digital infrastructure, enabling software systems to connect and interact. However, as shown by many studies, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models (LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. To address this, we present a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models were able to solve more than 40% of the tasks.
Explain it Like I'm 14
Overview
This paper is about testing how well AI coding assistants (the tools that help you write code) can correctly call web APIs. A web API is like a set of rules for talking to an online service, such as Google Calendar or Slack. The authors built a special set of tasks and a safe testing system to see how well different AI models can write the code needed to use these web APIs without making mistakes.
Key Questions
The main question the paper asks is simple: How often do AI models write correct web API code, and what kinds of mistakes do they make?
In everyday terms: If you tell an AI, “Create a new calendar called Example Calendar,” can it write the exact code that sends the right request to the right place, with the right settings? And when it fails, what went wrong?
How They Studied It
To answer the question, the authors created a benchmark—a fair, repeatable way to test models—using four well-known APIs: Asana, Google Calendar, Google Sheets, and Slack. They built 395 tasks and a safe evaluation pipeline.
Here’s the process in four steps:
- Dataset creation: They made pairs of “task” descriptions and “correct answers.” A task is a clear instruction in plain language (for example, “Create a new calendar named ‘Example Calendar’ with time zone ‘America/Los_Angeles’”). The correct answer is a “request configuration”: the exact URL, HTTP method (GET, POST, etc.), and the right arguments placed in the right spots (body, headers, query string). A sketch of such a configuration appears after this list.
- Code generation: They asked different AI models to write the code to solve each task using JavaScript and Axios (a popular library for sending HTTP requests).
- Safe execution: They ran the AI-written code in a controlled “mock” environment. Think of this like a flight simulator for code—it doesn’t send real requests to the internet but records what the code would have sent.
- Correctness analysis: They compared what the AI’s code tried to send (the captured configuration) against the correct configuration and also checked if it followed the API’s official rules (the API specification).
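To make these pieces concrete, here is a minimal sketch in JavaScript with Axios (the setup the paper uses) of a ground-truth request configuration for the calendar task and a capture adapter standing in for the paper's mock environment. The field names, the URL, and the `captureAdapter` helper are illustrative assumptions, not the paper's exact dataset format or pipeline code.

```javascript
const axios = require('axios');

// Illustrative ground-truth "request configuration" for the task
// "Create a new calendar named 'Example Calendar' with time zone 'America/Los_Angeles'".
// (Assumed shape; the paper's dataset schema may differ.)
const expected = {
  method: 'post',
  url: 'https://www.googleapis.com/calendar/v3/calendars',
  data: { summary: 'Example Calendar', timeZone: 'America/Los_Angeles' }, // request body
  headers: { Authorization: 'Bearer <TOKEN>' },                           // request headers
  params: {},                                                             // query string
};

// A custom Axios adapter plays the role of the "flight simulator": it records the
// outgoing request instead of sending it over the network, then returns a fake
// success response so the generated code keeps running.
const captured = [];
const captureAdapter = async (config) => {
  captured.push({
    method: config.method,
    url: config.url,
    // Note: Axios usually serializes the body to a JSON string before the adapter runs,
    // so a real pipeline would parse it back before comparing against `expected`.
    data: config.data,
    headers: config.headers,
    params: config.params,
  });
  return { data: {}, status: 200, statusText: 'OK', headers: {}, config };
};

// Model-generated code would be executed with the adapter in place, for example:
(async () => {
  await axios({
    adapter: captureAdapter,
    method: 'post',
    url: 'https://www.googleapis.com/calendar/v3/calendars',
    headers: { Authorization: 'Bearer <TOKEN>' },
    data: { summary: 'Example Calendar', timeZone: 'America/Los_Angeles' },
  });
  console.log(captured[0]); // compared field by field against `expected`
})();
```

The correctness analysis then boils down to comparing the captured configuration with the expected one and checking both against the API specification.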
Two testing setups helped separate different skills (a sketch contrasting them follows this list):
- Full completion: The AI had to pick the right endpoint (URL + method) and arguments from scratch.
- Argument completion: The AI was given the right endpoint and only had to fill in the correct arguments. This is like being given the correct address and asked to complete the form correctly.
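As a rough illustration of how the starter code might differ between the two setups (this is not the paper's exact prompt template; the endpoint and values reuse the assumed calendar example from above):

```javascript
const axios = require('axios');

async function demo() {
  // Full completion (illustrative): the starter code ends right after `axios(`,
  // so everything inside the call below would have to be model-generated:
  // method, URL, headers, body, and query parameters.
  const fromScratch = await axios({
    method: 'post',
    url: 'https://www.googleapis.com/calendar/v3/calendars',
    headers: { Authorization: 'Bearer <TOKEN>' },
    data: { summary: 'Example Calendar', timeZone: 'America/Los_Angeles' },
  });

  // Argument completion (illustrative): the method and URL are already given,
  // so only the headers, body, and query parameters are model-generated.
  const argumentsOnly = await axios({
    method: 'post',                                          // given
    url: 'https://www.googleapis.com/calendar/v3/calendars', // given
    headers: { Authorization: 'Bearer <TOKEN>' },            // model-generated
    data: { summary: 'Example Calendar', timeZone: 'America/Los_Angeles' }, // model-generated
  });

  return [fromScratch.data, argumentsOnly.data];
}
```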
Helpful analogies:
- The URL is like the street address of the service.
- The HTTP method is like the action you plan to take there (GET = read, POST = create).
- Arguments are the details you include, like what you’re creating or updating. These can live in different places: the request body, headers, or query string.
“Hallucination” means the AI makes something up that doesn’t exist—like inventing a fake endpoint or parameter name.
Main Findings
What the authors found matters for anyone using AI to write code:
- Open-source AI models struggled. Even the best open-source model got only about 30% correct in full completion and 40% in argument completion.
- A commercial model (GPT-4o) did better: about 60% correct in full completion and 77% in argument completion.
- Common mistakes included:
- Wrong or made-up URLs (hallucinated endpoints).
- Wrong or illegal parameter names and placements (for example, putting a value in headers when it belongs in the body).
- Missing required arguments or extra ones that aren’t allowed.
- Predicting the HTTP method (GET, POST, etc.) was easier than getting the URL or arguments right.
- When models guessed argument names correctly, they usually got the values right too—so finding the correct names was the harder part.
- Bigger models were not always better; sometimes mid-sized ones performed worse than both smaller and larger versions.
- Models did not have the API documentation in their prompt; they had to rely on memory. This made the tasks realistic but hard.
Why this is important: Web APIs are everywhere. If AI tools can’t reliably write correct API code, developers must add strong checks, tests, or give the models better information to avoid bugs and wasted time.
Implications and Impact
This research shows that AI coding assistants are promising but not yet trustworthy for writing web API integration code without help. The benchmark and testing pipeline the authors created can be used to:
- Measure and compare models more fairly.
- Guide improvements like:
- Retrieval-augmented generation (giving the model the exact API docs during coding),
- Special fine-tuning for API tasks,
- Better prompts and instructions,
- Guardrails that prevent illegal requests,
- Feedback loops and constrained decoding to keep code within the rules.
Bottom line: AI can help with API code, but it still needs a safety net—clear documentation in context, tests, and tools that check correctness. This benchmark is a starting point for making that help reliable.
Knowledge Gaps
Below is a consolidated list of concrete gaps, limitations, and open questions that remain after this work and that future researchers can act on.
- Impact of in-context documentation: How does providing relevant OpenAPI fragments (or full specs) via retrieval-augmented generation affect correctness versus the no-context setting used here?
- Constrained decoding against OpenAPI: To what extent do grammar- or schema-constrained decoding methods (e.g., spec-derived CFGs) reduce illegal endpoints/arguments and improve correctness?
- Fine-tuning on API specs: What gains come from supervised finetuning on (API spec, task, invocation) triplets, and how well does it generalize to unseen APIs?
- Reasoning and self-correction: Do chain-of-thought, tool-use, or iterative self-repair loops meaningfully reduce endpoint and argument hallucinations?
- Decoding strategy sensitivity: How do non-greedy decoding settings (temperature, nucleus/beam search) and stop-sequence tuning affect executability and correctness?
- Human-in-the-loop workflows: What is the marginal benefit of brief human hints (e.g., endpoint name only, or 1–2 key parameters) on final correctness and time-to-correct?
- Baseline comparisons: How do LLMs compare to (a) template-based OpenAPI-to-Axios generators, (b) search/retrieval plus string templating, and (c) SDK code generators?
- Dataset realism: The synthetic tasks often underuse optional parameters and placeholder values—what shifts in performance occur on human-authored tasks with realistic inputs?
- Coverage beyond four APIs: Do findings hold across more domains (e.g., payments, geospatial, e-commerce), less-documented APIs, and long-tail/enterprise APIs?
- Temporal generalization and API drift: How does correctness vary with API version age (pre-/post-model cutoff)? Can temporal holdouts quantify memorization vs. generalization?
- Multi-step and stateful workflows: How do models perform on multi-call sequences (create → fetch → update), pagination, id propagation, and conditional branching?
- Authentication flows and scopes: Can models correctly implement OAuth exchanges, refresh tokens, and choose appropriate scopes—beyond adding a static Authorization header?
- End-to-end execution and side effects: What changes when evaluating against live sandbox/test environments to validate server acceptance, side-effects, and idempotency?
- Richer schema compliance: How well do models handle nested objects, arrays, oneOf/anyOf/allOf, enums, formats (e.g., RFC3339), and cross-field constraints/dependencies?
- Argument value semantics: Current value checks rely on exact matches to reference values—can equivalence-class or property-based testing better capture “correct-enough” values?
- Error taxonomy and root causes: What fine-grained categories (e.g., path templating errors, method–URL mismatches, param location confusion, version drift) dominate, and why?
- Robustness to prompt phrasing: How sensitive are results to paraphrases, verbosity, and language (non-English prompts), and can prompt canonicalization improve stability?
- Library and language generalization: Do results transfer to Python (requests/httpx), TypeScript (fetch), Java (OkHttp), Go (net/http), and other HTTP clients?
- Axios-specific bias in evaluation: Does the mocking/truncation approach handle axios.create, baseURL, interceptors, defaults merging, and non-standard call patterns without penalizing correct code?
- Non-REST APIs: How do conclusions change for GraphQL, gRPC, and event/streaming APIs (SSE/WebSockets), which require different invocation patterns and constraints?
- Security correctness: Do models introduce insecure patterns (hard-coded tokens, incorrect TLS options, missing idempotency keys), and can automated checks catch them?
- Internationalization and encoding: Can models correctly handle URL encoding, reserved characters, locales, time zones, and multipart/form-data (including file uploads)?
- Illegal-but-plausible calls: How “close” are illegal invocations to valid ones (distance metrics), and can nearest-valid corrections be automatically suggested?
- Model calibration and confidence: Can uncertainty estimates predict when generated invocations are likely wrong so systems can fall back to retrieval, prompts, or human review?
- Sensitivity to seeds and runs: What is intra-model variance across multiple generations, and can n-best reranking against the spec boost correctness?
- Response handling and robustness: Beyond requests, can models correctly parse responses, handle errors/retries/rate limits/back-off, and manage partial failures?
- Generalization to unseen APIs: How well do approaches perform on APIs that are demonstrably absent from pretraining, and how much RAG/finetuning is needed to adapt?
- Spec quality and validator coverage: How can automated validation incorporate free-text constraints, derived invariants, and richer JSON Schema rules to reduce evaluation blind spots?
- Fairness and task difficulty: How does performance correlate with endpoint complexity (number/depth of params), and can a difficulty-indexed benchmark reveal scaling trends?
- Interactive IDE settings: In realistic IDE use (with partial code, context windows, and minimal latency), what starter-code variants and UI affordances maximize correctness?
Practical Applications
Immediate Applications
These applications can be deployed now by leveraging the paper’s open-source dataset, evaluation pipeline, and empirical insights.
- Industry: Model selection and procurement for AI coding assistants (software)
- Use the benchmark to compare internal or commercial AI assistants on web API integration tasks relevant to your stack (e.g., Asana, Google Calendar/Sheets, Slack).
- Potential tools/products/workflows: “API Codegen Benchmark Tool” integrated with internal APIs; procurement scorecards based on metrics such as correct URLs/methods, illegal arguments, and argument precision/recall.
- Assumptions/dependencies: Availability of accurate OpenAPI specs; adaptation of the pipeline to your APIs; consistent prompts across models for fair comparison.
- Industry: CI/CD guardrails for AI-generated API calls (software, finance, healthcare)
- Embed the paper’s controlled execution environment (Mock) in CI to intercept and validate Axios calls before any real network transmission; fail builds on illegal endpoints/arguments. A minimal validator sketch appears after this list.
- Potential tools/products/workflows: “API Call Validator” as a pre-commit hook or CI step; test suites that assert request configurations against OpenAPI-ground-truth.
- Assumptions/dependencies: Current pipeline targets JavaScript/Axios; adapters needed for other languages/HTTP clients; robust handling of auth tokens and secrets.
- Industry: Endpoint-first coding workflow to reduce hallucinations (software)
- Given the higher accuracy on argument completion, adopt a workflow where developers fix method+URL (endpoint) and let AI complete parameters, headers, and bodies.
- Potential tools/products/workflows: IDE plugin that scaffolds endpoints from OpenAPI (e.g., “Endpoint Scaffolder”) then invokes AI for argument filling.
- Assumptions/dependencies: Up-to-date OpenAPI with example values; team discipline to pre-specify endpoints.
- Industry and Academia: Data-driven API documentation improvement (software)
- Use fine-grained metrics to identify endpoints/parameters with high error rates; improve OpenAPI clarity, add examples, formalize constraints where feasible.
- Potential tools/products/workflows: “SpecLint for OpenAPI” to detect ambiguous or under-specified fields; auto-generation of example-rich docs from specs.
- Assumptions/dependencies: Willingness to maintain richer specs; coordination with API product teams.
- Academia: Rigorous evaluation of code LLMs on web APIs
- Run comparative studies on fine-tuning, retrieval-augmented generation (RAG), reasoning prompts, and constrained decoding using the dataset and metrics.
- Potential tools/products/workflows: Course labs and research benchmarks; reproducible evaluation pipelines for publishable studies.
- Assumptions/dependencies: Access to models and compute; careful prompt normalization and reporting.
- Policy and Governance: Internal standards for safe AI-assisted integrations
- Establish guidelines requiring evaluation of AI assistants against web API correctness metrics before production use; maintain audit logs for illegal requests.
- Potential tools/products/workflows: Org-level policies for model acceptance; internal audits using the benchmark; “AI Integration Risk Register.”
- Assumptions/dependencies: Governance alignment; cross-team adoption; standardization around OpenAPI.
- Daily Life / Citizen Developers: Safe execution in no-code/low-code platforms (software)
- Add mock interceptors and spec-based validation to no-code connectors to prevent accidental calls to wrong or deprecated endpoints.
- Potential tools/products/workflows: “Safe Connector Runner” that blocks illegal endpoints and flags missing required args.
- Assumptions/dependencies: Platform support for spec ingestion and request interception; curated connector catalogs.
- Model Developers and Tool Vendors: Targeted fine-tuning and RAG (software)
- Fine-tune models on internal API specs and examples; add spec retrieval at generation time to reduce endpoint/argument hallucinations.
- Potential tools/products/workflows: “OpenAPI-Aware Codegen” that ingests spec fragments per task; evaluation gates based on the pipeline’s metrics.
- Assumptions/dependencies: High-quality spec data; careful handling of proprietary/regulated APIs; prompt+context management.
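A minimal sketch of the CI guardrail idea from the “CI/CD guardrails” item above. The `spec` object is a hand-written stand-in for a parsed OpenAPI document and `validateRequest` is a hypothetical helper, not the paper's pipeline code:

```javascript
// Hand-written stand-in for a parsed OpenAPI spec: one entry per endpoint,
// listing required and allowed argument names across body, headers, and query.
const spec = {
  'POST https://www.googleapis.com/calendar/v3/calendars': {
    required: ['summary'],
    allowed: ['summary', 'timeZone', 'Authorization'],
  },
};

// Hypothetical helper: checks a captured request configuration against the spec.
function validateRequest(captured) {
  const key = `${captured.method.toUpperCase()} ${captured.url}`;
  const endpoint = spec[key];
  if (!endpoint) {
    return { ok: false, reason: `illegal endpoint: ${key}` };
  }
  const argNames = [
    ...Object.keys(captured.data ?? {}),
    ...Object.keys(captured.headers ?? {}),
    ...Object.keys(captured.params ?? {}),
  ];
  const illegal = argNames.filter((name) => !endpoint.allowed.includes(name));
  const missing = endpoint.required.filter((name) => !argNames.includes(name));
  if (illegal.length > 0 || missing.length > 0) {
    return { ok: false, reason: `illegal args: [${illegal}], missing args: [${missing}]` };
  }
  return { ok: true };
}

// In a CI step, fail the build when any captured request does not validate:
// if (!validateRequest(capturedConfig).ok) process.exit(1);
```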
Long-Term Applications
These applications require further research, scaling, or development—often building on constrained decoding, RAG, formal specifications, and multi-language support.
- Industry: Spec-constrained code generation at scale (software, finance, healthcare)
- Enforce OpenAPI constraints during decoding to prevent illegal endpoints/methods/arguments.
- Potential tools/products/workflows: “Spec-Constrained Codegen” engines integrated with IDEs/CI; enterprise-grade guardrails for regulated sectors.
- Assumptions/dependencies: Efficient constraint integration into decoding; robust spec parsing; performance trade-offs.
- Industry: Autonomous API Ops agents with safety loops (software, DevOps)
- Agents that plan, generate, validate, and adapt API integrations, with feedback loops and runtime policy checks.
- Potential tools/products/workflows: “API Ops Agent” managing integration changes, spec updates, and regression tests; change impact analysis on endpoints.
- Assumptions/dependencies: High-quality evaluation signals; safe execution sandboxes; secure credential handling.
- Cross-language expansion of the evaluation pipeline (software)
- Extend the mock/validation framework to Python (requests/httpx), Java (OkHttp/Apache HttpClient), Go (net/http), C# (HttpClient), etc.
- Potential tools/products/workflows: “Multi-language API Invocation Benchmark” with adapters and shared metrics.
- Assumptions/dependencies: Client-specific interceptors; consistent normalization of request configs; community maintenance.
- Certification and compliance programs for AI coding tools (policy)
- Establish external certification of API-coding assistants using standard benchmarks (accuracy, illegal request rates, spec compliance).
- Potential tools/products/workflows: “API Codegen Scorecard” services; industry labels akin to safety grades.
- Assumptions/dependencies: Consensus on test suites; accredited evaluators; updates as APIs/model capabilities evolve.
- Standards evolution: More formal, machine-checkable API constraints (policy, industry)
- Extend OpenAPI with richer constraint semantics (e.g., schemas with invariants, cross-parameter dependencies) to improve automated validation and constrained decoding.
- Potential tools/products/workflows: “OpenAPI++” spec profiles; validators that enforce field semantics beyond types.
- Assumptions/dependencies: Standard body buy-in; backward compatibility; tooling support across ecosystems.
- Runtime enforcement: Policy proxies to validate AI-generated requests (software, security)
- Deploy sidecar or gateway proxies to verify requests against specs at runtime, block illegal calls, and log violations. A middleware sketch appears after this list.
- Potential tools/products/workflows: “API Policy Enforcement Proxy” with spec caches and anomaly detection.
- Assumptions/dependencies: Minimal latency overhead; compatibility with service meshes; secure token forwarding.
- Automated API misuse detection and repair (software engineering)
- Combine static and dynamic analyses with spec-aware reasoning to auto-fix incorrect invocations (e.g., missing headers, wrong query params).
- Potential tools/products/workflows: “API Fixer” that proposes patches and CLI bots that remediate PRs.
- Assumptions/dependencies: High-precision analyses to avoid harmful auto-fixes; developer oversight loops.
- Sector-specific reliable integrations in regulated domains
- Healthcare (FHIR/HL7), finance (FIX/OAuth scopes), energy (OCPP), robotics/IoT (RESTful device APIs): enforce correctness with spec-aware codegen and evaluation layers.
- Potential tools/products/workflows: “Regulated API Copilot” with certification-ready metrics; connectors validated against domain specs.
- Assumptions/dependencies: Domain-standard specs quality; regulatory acceptance of automated validation; strong audit trails.
- Education and workforce development
- Curricula and training modules using the benchmark to teach API correctness, spec literacy, and AI guardrails; competitions to drive innovation in spec-aware codegen.
- Potential tools/products/workflows: University courses, MOOCs, and hackathons built around the dataset/pipeline.
- Assumptions/dependencies: Sustained open-source availability; community contributions; model access for students.
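As a hedged sketch of the runtime-enforcement idea from the policy-proxy item above, an Express middleware (the route table and its entries are illustrative assumptions, not a real spec) could reject requests whose method and path are not defined in the specification before forwarding them upstream:

```javascript
const express = require('express');

// Illustrative allow-list derived from an OpenAPI spec (HTTP method + path pattern).
const allowedRoutes = [
  { method: 'POST', path: /^\/calendar\/v3\/calendars$/ },
  { method: 'GET', path: /^\/calendar\/v3\/calendars\/[^/]+$/ },
];

const app = express();

// Policy check: block and log any request that does not match a spec-derived route.
app.use((req, res, next) => {
  const allowed = allowedRoutes.some(
    (route) => route.method === req.method && route.path.test(req.path)
  );
  if (!allowed) {
    console.warn(`blocked illegal request: ${req.method} ${req.path}`);
    return res.status(403).json({ error: 'request not allowed by API specification' });
  }
  next(); // in a real gateway, the request would now be forwarded upstream
});

app.listen(8080);
```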
Glossary
- Argument completion: An evaluation setup where the correct method and URL are provided and the model must generate the appropriate arguments for the API call. "Argument completion additionally includes the correct method and URL"
- Authorization header: An HTTP header used to provide credentials (e.g., bearer tokens) to authorize requests to an API. "The high argument precision can be attributed to the Authorization header argument"
- Axios: A popular JavaScript HTTP client for making web requests and explicitly configuring request components. "use Axios, the leading JavaScript library for invoking web APIs"
- BigQuery: Google’s serverless data warehouse used here to search large code corpora for API-related examples. "Using BigQuery (https://cloud.google.com/bigquery/) for this purpose."
- Constrained decoding: A generation technique that restricts the model’s output to satisfy predefined constraints. "constrained decoding"
- Context window: The maximum number of tokens a model can process in a single prompt, limiting how much specification text can be provided. "for its very large context window (2M tokens)"
- Controlled execution environment: A sandboxed setup that safely intercepts and serializes outgoing requests from generated code for evaluation. "we implement a controlled execution environment"
- Endpoint: In this paper, the combination of a URL and HTTP method that uniquely identifies an API operation. "we use the term endpoint to refer to the combination of URL and (HTTP) method."
- Execution-based correctness analysis: Assessing code by running it and verifying functional outcomes rather than relying on textual similarity. "which can be used for execution-based correctness analysis."
- Few-shot prompting: Providing a small set of examples in the prompt to guide model behavior. "avoid pitfalls of few-shot prompting"
- Functional correctness: The degree to which generated code behaves according to the intended specification and task requirements. "is based on functional correctness rather than syntactic similarity"
- Full completion: An evaluation setup where the model generates the entire API invocation (including choosing the endpoint). "Full completion includes the beginning of API invocation"
- Greedy decoding: A decoding strategy that selects the highest-probability token at each step without exploration. "We used greedy decoding for all experiments."
- Hallucinate: When a model generates plausible but incorrect content not grounded in the specification or data. "they frequently hallucinate endpoint URLs (up to 39%)"
- HTTP method: The verb (e.g., GET, POST, PUT) indicating the action to perform on a web resource. "an HTTP method (GET, POST, etc.)"
- IANA Time Zone Database: A standardized registry of time zone identifiers commonly used in APIs. "(Formatted as an IANA Time Zone Database name, e.g. "Europe/Zurich".)"
- IDE: An Integrated Development Environment used by developers; referenced as a realistic setting for code completion workflows. "real-world scenarios inside an IDE"
- Illegal arguments: Request arguments that are not permitted by the API specification for a given endpoint. "Illegal arguments"
- Illegal methods: HTTP methods that are not defined for the generated URL in the API specification. "Illegal methods"
- Illegal URLs: URLs that are not defined in the API specification. "Illegal URLs"
- LLM (Large Language Model): A model trained on large-scale data to generate and understand code and natural language. "Large language models (LLMs) have great potential to boost productivity"
- Mean argument precision: The probability that generated argument names are correct according to the specification. "Mean argument precision"
- Mean argument recall: The probability that the correct argument names are produced by the model. "Mean argument recall"
- Mean argument value conditional accuracy: The probability that an argument’s value is correct given that its name is correct. "Mean argument value conditional accuracy"
- Mock: The controlled environment used to intercept and serialize API requests for safe evaluation. "Mock(i) = c'"
- OData: The Open Data Protocol for querying and updating data via RESTful services. "Open Data Protocol (OData)"
- OAuth2: An authorization framework used to secure API access and define scopes. "Oauth2:"
- OpenAPI: A widely used industry standard for describing web API specifications. "We target real-world web APIs that adhere to OpenAPI"
- Query parameters: Key-value pairs appended to the URL to refine or filter API requests. "params: { arg4: 'query parameters' }"
- Request body: The payload sent in an HTTP request, often in JSON, containing the operation’s parameters. "a request body, a request header, and query parameters."
- Request configuration: A structured representation of all expected components of a correct API request used for evaluation. "a so-called request configuration, which captures all properties an API request must have in order to solve t"
- RESTful web services: Web services following REST constraints, typically exposed over HTTP with resource-oriented endpoints. "Information systems → RESTful web services"
- Retrieval-augmented generation: Enhancing generations by retrieving relevant external documents or specifications at inference time. "retrieval-augmented generation"
- SDK heterogeneity: Variations across SDKs that hinder generalization of API invocation approaches. "do not generalize due to SDK heterogeneity."
- SDK-wrapped web APIs: Web APIs accessed through language-specific SDKs rather than direct HTTP calls. "SDK-wrapped web APIs"
- Specification-compliance: The extent to which generated requests adhere to the API’s documented specification. "correctness and specification-compliance"
- Starter code: Pre-supplied scaffold in the prompt that the model must complete to implement the API call. "completing the starter code"
- Static analysis: Program analysis without executing code, used to detect API misuses and other issues. "Several (static analysis) methods have been proposed"
- YAML: A human-readable data serialization format used for writing OpenAPI specifications. "OpenAPI specification (YAML)"
- Zero-shot prompt: A prompt that provides instructions but no task-specific examples. "in a zero-shot prompt to avoid pitfalls of few-shot prompting"