Benchmarking Web API Integration Code Generation (2509.20172v1)
Abstract: API integration is a cornerstone of our digital infrastructure, enabling software systems to connect and interact. However, as shown by many studies, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models (LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. To address this, we present a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models were able to solve more than 40% of the tasks.
Explain it Like I'm 14
Overview
This paper is about testing how well AI coding assistants (the tools that help you write code) can correctly call web APIs. A web API is like a set of rules for talking to an online service, such as Google Calendar or Slack. The authors built a special set of tasks and a safe testing system to see how well different AI models can write the code needed to use these web APIs without making mistakes.
Key Questions
The main question the paper asks is simple: How often do AI models write correct web API code, and what kinds of mistakes do they make?
In everyday terms: If you tell an AI, “Create a new calendar called Example Calendar,” can it write the exact code that sends the right request to the right place, with the right settings? And when it fails, what went wrong?
How They Studied It
To answer the question, the authors created a benchmark—a fair, repeatable way to test models—using four well-known APIs: Asana, Google Calendar, Google Sheets, and Slack. They built 395 tasks and a safe evaluation pipeline.
Here’s the process in four steps:
- Dataset creation: They made pairs of “task” descriptions and “correct answers.” A task is a clear instruction in plain language (for example, “Create a new calendar named ‘Example Calendar’ with time zone ‘America/Los_Angeles’”). The correct answer is a “request configuration”: the exact URL, HTTP method (GET, POST, etc.), and the right arguments placed in the right spots (body, headers, query string). A sketch of such a configuration appears after this list.
- Code generation: They asked different AI models to write the code to solve each task using JavaScript and Axios (a popular library for sending HTTP requests).
- Safe execution: They ran the AI-written code in a controlled “mock” environment. Think of this like a flight simulator for code—it doesn’t send real requests to the internet but records what the code would have sent.
- Correctness analysis: They compared what the AI’s code tried to send (the captured configuration) against the correct configuration and also checked if it followed the API’s official rules (the API specification).
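To make these pieces concrete, here is a minimal sketch in JavaScript with Axios (the setup the paper uses) of a ground-truth request configuration for the calendar task and a capture adapter standing in for the paper's mock environment. The field names, the URL, and the `captureAdapter` helper are illustrative assumptions, not the paper's exact dataset format or pipeline code.

```javascript
const axios = require('axios');

// Illustrative ground-truth "request configuration" for the task
// "Create a new calendar named 'Example Calendar' with time zone 'America/Los_Angeles'".
// (Assumed shape; the paper's dataset schema may differ.)
const expected = {
  method: 'post',
  url: 'https://www.googleapis.com/calendar/v3/calendars',
  data: { summary: 'Example Calendar', timeZone: 'America/Los_Angeles' }, // request body
  headers: { Authorization: 'Bearer <TOKEN>' },                           // request headers
  params: {},                                                             // query string
};

// A custom Axios adapter plays the role of the "flight simulator": it records the
// outgoing request instead of sending it over the network, then returns a fake
// success response so the generated code keeps running.
const captured = [];
const captureAdapter = async (config) => {
  captured.push({
    method: config.method,
    url: config.url,
    // Note: Axios usually serializes the body to a JSON string before the adapter runs,
    // so a real pipeline would parse it back before comparing against `expected`.
    data: config.data,
    headers: config.headers,
    params: config.params,
  });
  return { data: {}, status: 200, statusText: 'OK', headers: {}, config };
};

// Model-generated code would be executed with the adapter in place, for example:
(async () => {
  await axios({
    adapter: captureAdapter,
    method: 'post',
    url: 'https://www.googleapis.com/calendar/v3/calendars',
    headers: { Authorization: 'Bearer <TOKEN>' },
    data: { summary: 'Example Calendar', timeZone: 'America/Los_Angeles' },
  });
  console.log(captured[0]); // compared field by field against `expected`
})();
```

The correctness analysis then boils down to comparing the captured configuration with the expected one and checking both against the API specification.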
Two testing setups helped separate different skills (a sketch contrasting them follows this list):
- Full completion: The AI had to pick the right endpoint (URL + method) and arguments from scratch.
- Argument completion: The AI was given the right endpoint and only had to fill in the correct arguments. This is like being given the correct address and asked to complete the form correctly.
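As a rough illustration of how the starter code might differ between the two setups (this is not the paper's exact prompt template; the endpoint and values reuse the assumed calendar example from above):

```javascript
const axios = require('axios');

async function demo() {
  // Full completion (illustrative): the starter code ends right after `axios(`,
  // so everything inside the call below would have to be model-generated:
  // method, URL, headers, body, and query parameters.
  const fromScratch = await axios({
    method: 'post',
    url: 'https://www.googleapis.com/calendar/v3/calendars',
    headers: { Authorization: 'Bearer <TOKEN>' },
    data: { summary: 'Example Calendar', timeZone: 'America/Los_Angeles' },
  });

  // Argument completion (illustrative): the method and URL are already given,
  // so only the headers, body, and query parameters are model-generated.
  const argumentsOnly = await axios({
    method: 'post',                                          // given
    url: 'https://www.googleapis.com/calendar/v3/calendars', // given
    headers: { Authorization: 'Bearer <TOKEN>' },            // model-generated
    data: { summary: 'Example Calendar', timeZone: 'America/Los_Angeles' }, // model-generated
  });

  return [fromScratch.data, argumentsOnly.data];
}
```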
Helpful analogies:
- The URL is like the street address of the service.
- The HTTP method is like the action you plan to take there (GET = read, POST = create).
- Arguments are the details you include, like what you’re creating or updating. These can live in different places: the request body, headers, or query string.
“Hallucination” means the AI makes something up that doesn’t exist—like inventing a fake endpoint or parameter name.
Main Findings
What the authors found matters for anyone using AI to write code:
- Open-source AI models struggled. Even the best open-source model got only about 30% correct in full completion and 40% in argument completion.
- A commercial model (GPT-4o) did better: about 60% correct in full completion and 77% in argument completion.
- Common mistakes included:
- Wrong or made-up URLs (hallucinated endpoints).
- Wrong or illegal parameter names and placements (for example, putting a value in headers when it belongs in the body).
- Missing required arguments or extra ones that aren’t allowed.
- Predicting the HTTP method (GET, POST, etc.) was easier than getting the URL or arguments right.
- When models guessed argument names correctly, they usually got the values right too—so finding the correct names was the harder part.
- Bigger models were not always better; sometimes mid-sized ones performed worse than both smaller and larger versions.
- Models did not have the API documentation in their prompt; they had to rely on memory. This made the tasks realistic but hard.
Why this is important: Web APIs are everywhere. If AI tools can’t reliably write correct API code, developers must add strong checks, tests, or give the models better information to avoid bugs and wasted time.
Implications and Impact
This research shows that AI coding assistants are promising but not yet trustworthy for writing web API integration code without help. The benchmark and testing pipeline the authors created can be used to:
- Measure and compare models more fairly.
- Guide improvements like:
- Retrieval-augmented generation (giving the model the exact API docs during coding),
- Special fine-tuning for API tasks,
- Better prompts and instructions,
- Guardrails that prevent illegal requests,
- Feedback loops and constrained decoding to keep code within the rules.
Bottom line: AI can help with API code, but it still needs a safety net—clear documentation in context, tests, and tools that check correctness. This benchmark is a starting point for making that help reliable.
Knowledge Gaps
Below is a consolidated list of concrete gaps, limitations, and open questions that remain after this work and that future researchers can act on.
- Impact of in-context documentation: How does providing relevant OpenAPI fragments (or full specs) via retrieval-augmented generation affect correctness versus the no-context setting used here?
- Constrained decoding against OpenAPI: To what extent do grammar- or schema-constrained decoding methods (e.g., spec-derived CFGs) reduce illegal endpoints/arguments and improve correctness?
- Fine-tuning on API specs: What gains come from supervised finetuning on (API spec, task, invocation) triplets, and how well does it generalize to unseen APIs?
- Reasoning and self-correction: Do chain-of-thought, tool-use, or iterative self-repair loops meaningfully reduce endpoint and argument hallucinations?
- Decoding strategy sensitivity: How do non-greedy decoding settings (temperature, nucleus/beam search) and stop-sequence tuning affect executability and correctness?
- Human-in-the-loop workflows: What is the marginal benefit of brief human hints (e.g., endpoint name only, or 1–2 key parameters) on final correctness and time-to-correct?
- Baseline comparisons: How do LLMs compare to (a) template-based OpenAPI-to-Axios generators, (b) search/retrieval plus string templating, and (c) SDK code generators?
- Dataset realism: The synthetic tasks often underuse optional parameters and placeholder values—what shifts in performance occur on human-authored tasks with realistic inputs?
- Coverage beyond four APIs: Do findings hold across more domains (e.g., payments, geospatial, e-commerce), less-documented APIs, and long-tail/enterprise APIs?
- Temporal generalization and API drift: How does correctness vary with API version age (pre-/post-model cutoff)? Can temporal holdouts quantify memorization vs. generalization?
- Multi-step and stateful workflows: How do models perform on multi-call sequences (create → fetch → update), pagination, id propagation, and conditional branching?
- Authentication flows and scopes: Can models correctly implement OAuth exchanges, refresh tokens, and choose appropriate scopes—beyond adding a static Authorization header?
- End-to-end execution and side effects: What changes when evaluating against live sandbox/test environments to validate server acceptance, side-effects, and idempotency?
- Richer schema compliance: How well do models handle nested objects, arrays, oneOf/anyOf/allOf, enums, formats (e.g., RFC3339), and cross-field constraints/dependencies?
- Argument value semantics: Current value checks rely on exact matches to reference values—can equivalence-class or property-based testing better capture “correct-enough” values?
- Error taxonomy and root causes: What fine-grained categories (e.g., path templating errors, method–URL mismatches, param location confusion, version drift) dominate, and why?
- Robustness to prompt phrasing: How sensitive are results to paraphrases, verbosity, and language (non-English prompts), and can prompt canonicalization improve stability?
- Library and language generalization: Do results transfer to Python (requests/httpx), TypeScript (fetch), Java (OkHttp), Go (net/http), and other HTTP clients?
- Axios-specific bias in evaluation: Does the mocking/truncation approach handle axios.create, baseURL, interceptors, defaults merging, and non-standard call patterns without penalizing correct code?
- Non-REST APIs: How do conclusions change for GraphQL, gRPC, and event/streaming APIs (SSE/WebSockets), which require different invocation patterns and constraints?
- Security correctness: Do models introduce insecure patterns (hard-coded tokens, incorrect TLS options, missing idempotency keys), and can automated checks catch them?
- Internationalization and encoding: Can models correctly handle URL encoding, reserved characters, locales, time zones, and multipart/form-data (including file uploads)?
- Illegal-but-plausible calls: How “close” are illegal invocations to valid ones (distance metrics), and can nearest-valid corrections be automatically suggested?
- Model calibration and confidence: Can uncertainty estimates predict when generated invocations are likely wrong so systems can fall back to retrieval, prompts, or human review?
- Sensitivity to seeds and runs: What is intra-model variance across multiple generations, and can n-best reranking against the spec boost correctness?
- Response handling and robustness: Beyond requests, can models correctly parse responses, handle errors/retries/rate limits/back-off, and manage partial failures?
- Generalization to unseen APIs: How well do approaches perform on APIs that are demonstrably absent from pretraining, and how much RAG/finetuning is needed to adapt?
- Spec quality and validator coverage: How can automated validation incorporate free-text constraints, derived invariants, and richer JSON Schema rules to reduce evaluation blind spots?
- Fairness and task difficulty: How does performance correlate with endpoint complexity (number/depth of params), and can a difficulty-indexed benchmark reveal scaling trends?
- Interactive IDE settings: In realistic IDE use (with partial code, context windows, and minimal latency), what starter-code variants and UI affordances maximize correctness?
Practical Applications
Immediate Applications
These applications can be deployed now by leveraging the paper’s open-source dataset, evaluation pipeline, and empirical insights.
- Industry: Model selection and procurement for AI coding assistants (software)
- Use the benchmark to compare internal or commercial AI assistants on web API integration tasks relevant to your stack (e.g., Asana, Google Calendar/Sheets, Slack).
- Potential tools/products/workflows: “API Codegen Benchmark Tool” integrated with internal APIs; procurement scorecards based on metrics such as correct URLs/methods, illegal arguments, and argument precision/recall.
- Assumptions/dependencies: Availability of accurate OpenAPI specs; adaptation of the pipeline to your APIs; consistent prompts across models for fair comparison.
- Industry: CI/CD guardrails for AI-generated API calls (software, finance, healthcare)
- Embed the paper’s controlled execution environment (Mock) in CI to intercept and validate Axios calls before any real network transmission; fail builds on illegal endpoints/arguments. A minimal validator sketch appears after this list.
- Potential tools/products/workflows: “API Call Validator” as a pre-commit hook or CI step; test suites that assert request configurations against OpenAPI-ground-truth.
- Assumptions/dependencies: Current pipeline targets JavaScript/Axios; adapters needed for other languages/HTTP clients; robust handling of auth tokens and secrets.
- Industry: Endpoint-first coding workflow to reduce hallucinations (software)
- Given the higher accuracy on argument completion, adopt a workflow where developers fix method+URL (endpoint) and let AI complete parameters, headers, and bodies.
- Potential tools/products/workflows: IDE plugin that scaffolds endpoints from OpenAPI (e.g., “Endpoint Scaffolder”) then invokes AI for argument filling.
- Assumptions/dependencies: Up-to-date OpenAPI with example values; team discipline to pre-specify endpoints.
- Industry and Academia: Data-driven API documentation improvement (software)
- Use fine-grained metrics to identify endpoints/parameters with high error rates; improve OpenAPI clarity, add examples, formalize constraints where feasible.
- Potential tools/products/workflows: “SpecLint for OpenAPI” to detect ambiguous or under-specified fields; auto-generation of example-rich docs from specs.
- Assumptions/dependencies: Willingness to maintain richer specs; coordination with API product teams.
- Academia: Rigorous evaluation of code LLMs on web APIs
- Run comparative studies on fine-tuning, retrieval-augmented generation (RAG), reasoning prompts, and constrained decoding using the dataset and metrics.
- Potential tools/products/workflows: Course labs and research benchmarks; reproducible evaluation pipelines for publishable studies.
- Assumptions/dependencies: Access to models and compute; careful prompt normalization and reporting.
- Policy and Governance: Internal standards for safe AI-assisted integrations
- Establish guidelines requiring evaluation of AI assistants against web API correctness metrics before production use; maintain audit logs for illegal requests.
- Potential tools/products/workflows: Org-level policies for model acceptance; internal audits using the benchmark; “AI Integration Risk Register.”
- Assumptions/dependencies: Governance alignment; cross-team adoption; standardization around OpenAPI.
- Daily Life / Citizen Developers: Safe execution in no-code/low-code platforms (software)
- Add mock interceptors and spec-based validation to no-code connectors to prevent accidental calls to wrong or deprecated endpoints.
- Potential tools/products/workflows: “Safe Connector Runner” that blocks illegal endpoints and flags missing required args.
- Assumptions/dependencies: Platform support for spec ingestion and request interception; curated connector catalogs.
- Model Developers and Tool Vendors: Targeted fine-tuning and RAG (software)
- Fine-tune models on internal API specs and examples; add spec retrieval at generation time to reduce endpoint/argument hallucinations.
- Potential tools/products/workflows: “OpenAPI-Aware Codegen” that ingests spec fragments per task; evaluation gates based on the pipeline’s metrics.
- Assumptions/dependencies: High-quality spec data; careful handling of proprietary/regulated APIs; prompt+context management.
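A minimal sketch of the CI guardrail idea from the “CI/CD guardrails” item above. The `spec` object is a hand-written stand-in for a parsed OpenAPI document and `validateRequest` is a hypothetical helper, not the paper's pipeline code:

```javascript
// Hand-written stand-in for a parsed OpenAPI spec: one entry per endpoint,
// listing required and allowed argument names across body, headers, and query.
const spec = {
  'POST https://www.googleapis.com/calendar/v3/calendars': {
    required: ['summary'],
    allowed: ['summary', 'timeZone', 'Authorization'],
  },
};

// Hypothetical helper: checks a captured request configuration against the spec.
function validateRequest(captured) {
  const key = `${captured.method.toUpperCase()} ${captured.url}`;
  const endpoint = spec[key];
  if (!endpoint) {
    return { ok: false, reason: `illegal endpoint: ${key}` };
  }
  const argNames = [
    ...Object.keys(captured.data ?? {}),
    ...Object.keys(captured.headers ?? {}),
    ...Object.keys(captured.params ?? {}),
  ];
  const illegal = argNames.filter((name) => !endpoint.allowed.includes(name));
  const missing = endpoint.required.filter((name) => !argNames.includes(name));
  if (illegal.length > 0 || missing.length > 0) {
    return { ok: false, reason: `illegal args: [${illegal}], missing args: [${missing}]` };
  }
  return { ok: true };
}

// In a CI step, fail the build when any captured request does not validate:
// if (!validateRequest(capturedConfig).ok) process.exit(1);
```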
Long-Term Applications
These applications require further research, scaling, or development—often building on constrained decoding, RAG, formal specifications, and multi-language support.
- Industry: Spec-constrained code generation at scale (software, finance, healthcare)
- Enforce OpenAPI constraints during decoding to prevent illegal endpoints/methods/arguments.
- Potential tools/products/workflows: “Spec-Constrained Codegen” engines integrated with IDEs/CI; enterprise-grade guardrails for regulated sectors.
- Assumptions/dependencies: Efficient constraint integration into decoding; robust spec parsing; performance trade-offs.
- Industry: Autonomous API Ops agents with safety loops (software, DevOps)
- Agents that plan, generate, validate, and adapt API integrations, with feedback loops and runtime policy checks.
- Potential tools/products/workflows: “API Ops Agent” managing integration changes, spec updates, and regression tests; change impact analysis on endpoints.
- Assumptions/dependencies: High-quality evaluation signals; safe execution sandboxes; secure credential handling.
- Cross-language expansion of the evaluation pipeline (software)
- Extend the mock/validation framework to Python (requests/httpx), Java (OkHttp/Apache HttpClient), Go (net/http), C# (HttpClient), etc.
- Potential tools/products/workflows: “Multi-language API Invocation Benchmark” with adapters and shared metrics.
- Assumptions/dependencies: Client-specific interceptors; consistent normalization of request configs; community maintenance.
- Certification and compliance programs for AI coding tools (policy)
- Establish external certification of API-coding assistants using standard benchmarks (accuracy, illegal request rates, spec compliance).
- Potential tools/products/workflows: “API Codegen Scorecard” services; industry labels akin to safety grades.
- Assumptions/dependencies: Consensus on test suites; accredited evaluators; updates as APIs/model capabilities evolve.
- Standards evolution: More formal, machine-checkable API constraints (policy, industry)
- Extend OpenAPI with richer constraint semantics (e.g., schemas with invariants, cross-parameter dependencies) to improve automated validation and constrained decoding.
- Potential tools/products/workflows: “OpenAPI++” spec profiles; validators that enforce field semantics beyond types.
- Assumptions/dependencies: Standard body buy-in; backward compatibility; tooling support across ecosystems.
- Runtime enforcement: Policy proxies to validate AI-generated requests (software, security)
- Deploy sidecar or gateway proxies to verify requests against specs at runtime, block illegal calls, and log violations. A middleware sketch appears after this list.
- Potential tools/products/workflows: “API Policy Enforcement Proxy” with spec caches and anomaly detection.
- Assumptions/dependencies: Minimal latency overhead; compatibility with service meshes; secure token forwarding.
- Automated API misuse detection and repair (software engineering)
- Combine static and dynamic analyses with spec-aware reasoning to auto-fix incorrect invocations (e.g., missing headers, wrong query params).
- Potential tools/products/workflows: “API Fixer” that proposes patches and CLI bots that remediate PRs.
- Assumptions/dependencies: High-precision analyses to avoid harmful auto-fixes; developer oversight loops.
- Sector-specific reliable integrations in regulated domains
- Healthcare (FHIR/HL7), finance (FIX/OAuth scopes), energy (OCPP), robotics/IoT (RESTful device APIs): enforce correctness with spec-aware codegen and evaluation layers.
- Potential tools/products/workflows: “Regulated API Copilot” with certification-ready metrics; connectors validated against domain specs.
- Assumptions/dependencies: Domain-standard specs quality; regulatory acceptance of automated validation; strong audit trails.
- Education and workforce development
- Curricula and training modules using the benchmark to teach API correctness, spec literacy, and AI guardrails; competitions to drive innovation in spec-aware codegen.
- Potential tools/products/workflows: University courses, MOOCs, and hackathons built around the dataset/pipeline.
- Assumptions/dependencies: Sustained open-source availability; community contributions; model access for students.
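As a hedged sketch of the runtime-enforcement idea from the policy-proxy item above, an Express middleware (the route table and its entries are illustrative assumptions, not a real spec) could reject requests whose method and path are not defined in the specification before forwarding them upstream:

```javascript
const express = require('express');

// Illustrative allow-list derived from an OpenAPI spec (HTTP method + path pattern).
const allowedRoutes = [
  { method: 'POST', path: /^\/calendar\/v3\/calendars$/ },
  { method: 'GET', path: /^\/calendar\/v3\/calendars\/[^/]+$/ },
];

const app = express();

// Policy check: block and log any request that does not match a spec-derived route.
app.use((req, res, next) => {
  const allowed = allowedRoutes.some(
    (route) => route.method === req.method && route.path.test(req.path)
  );
  if (!allowed) {
    console.warn(`blocked illegal request: ${req.method} ${req.path}`);
    return res.status(403).json({ error: 'request not allowed by API specification' });
  }
  next(); // in a real gateway, the request would now be forwarded upstream
});

app.listen(8080);
```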
Glossary
- Argument completion: An evaluation setup where the correct method and URL are provided and the model must generate the appropriate arguments for the API call. "Argument completion additionally includes the correct method and URL"
- Authorization header: An HTTP header used to provide credentials (e.g., bearer tokens) to authorize requests to an API. "The high argument precision can be attributed to the Authorization header argument"
- Axios: A popular JavaScript HTTP client for making web requests and explicitly configuring request components. "use Axios, the leading JavaScript library for invoking web APIs"
- BigQuery: Google’s serverless data warehouse used here to search large code corpora for API-related examples. "Using BigQuery (https://cloud.google.com/bigquery/) for this purpose."
- Constrained decoding: A generation technique that restricts the model’s output to satisfy predefined constraints. "constrained decoding"
- Context window: The maximum number of tokens a model can process in a single prompt, limiting how much specification text can be provided. "for its very large context window (2M tokens)"
- Controlled execution environment: A sandboxed setup that safely intercepts and serializes outgoing requests from generated code for evaluation. "we implement a controlled execution environment"
- Endpoint: In this paper, the combination of a URL and HTTP method that uniquely identifies an API operation. "we use the term endpoint to refer to the combination of URL and (HTTP) method."
- Execution-based correctness analysis: Assessing code by running it and verifying functional outcomes rather than relying on textual similarity. "which can be used for execution-based correctness analysis."
- Few-shot prompting: Providing a small set of examples in the prompt to guide model behavior. "avoid pitfalls of few-shot prompting"
- Functional correctness: The degree to which generated code behaves according to the intended specification and task requirements. "is based on functional correctness rather than syntactic similarity"
- Full completion: An evaluation setup where the model generates the entire API invocation (including choosing the endpoint). "Full completion includes the beginning of API invocation"
- Greedy decoding: A decoding strategy that selects the highest-probability token at each step without exploration. "We used greedy decoding for all experiments."
- Hallucinate: When a model generates plausible but incorrect content not grounded in the specification or data. "they frequently hallucinate endpoint URLs (up to 39%)"
- HTTP method: The verb (e.g., GET, POST, PUT) indicating the action to perform on a web resource. "an HTTP method (GET, POST, etc.)"
- IANA Time Zone Database: A standardized registry of time zone identifiers commonly used in APIs. "(Formatted as an IANA Time Zone Database name, e.g. "Europe/Zurich".)"
- IDE: An Integrated Development Environment used by developers; referenced as a realistic setting for code completion workflows. "real-world scenarios inside an IDE"
- Illegal arguments: Request arguments that are not permitted by the API specification for a given endpoint. "Illegal arguments"
- Illegal methods: HTTP methods that are not defined for the generated URL in the API specification. "Illegal methods"
- Illegal URLs: URLs that are not defined in the API specification. "Illegal URLs"
- LLM (Large Language Model): A model trained on large-scale data to generate and understand code and natural language. "Large language models (LLMs) have great potential to boost productivity"
- Mean argument precision: The probability that generated argument names are correct according to the specification. "Mean argument precision"
- Mean argument recall: The probability that the correct argument names are produced by the model. "Mean argument recall"
- Mean argument value conditional accuracy: The probability that an argument’s value is correct given that its name is correct. "Mean argument value conditional accuracy"
- Mock: The controlled environment used to intercept and serialize API requests for safe evaluation. "Mock(i) = c'"
- OData: The Open Data Protocol for querying and updating data via RESTful services. "Open Data Protocol (OData)"
- OAuth2: An authorization framework used to secure API access and define scopes. "Oauth2:"
- OpenAPI: A widely used industry standard for describing web API specifications. "We target real-world web APIs that adhere to OpenAPI"
- Query parameters: Key-value pairs appended to the URL to refine or filter API requests. "params: { arg4: 'query parameters' }"
- Request body: The payload sent in an HTTP request, often in JSON, containing the operation’s parameters. "a request body, a request header, and query parameters."
- Request configuration: A structured representation of all expected components of a correct API request used for evaluation. "a so-called request configuration, which captures all properties an API request must have in order to solve t"
- RESTful web services: Web services following REST constraints, typically exposed over HTTP with resource-oriented endpoints. "Information systems → RESTful web services"
- Retrieval-augmented generation: Enhancing generations by retrieving relevant external documents or specifications at inference time. "retrieval-augmented generation"
- SDK heterogeneity: Variations across SDKs that hinder generalization of API invocation approaches. "do not generalize due to SDK heterogeneity."
- SDK-wrapped web APIs: Web APIs accessed through language-specific SDKs rather than direct HTTP calls. "SDK-wrapped web APIs"
- Specification-compliance: The extent to which generated requests adhere to the API’s documented specification. "correctness and specification-compliance"
- Starter code: Pre-supplied scaffold in the prompt that the model must complete to implement the API call. "completing the starter code"
- Static analysis: Program analysis without executing code, used to detect API misuses and other issues. "Several (static analysis) methods have been proposed"
- YAML: A human-readable data serialization format used for writing OpenAPI specifications. "OpenAPI specification (YAML)"
- Zero-shot prompt: A prompt that provides instructions but no task-specific examples. "in a zero-shot prompt to avoid pitfalls of few-shot prompting"