Self-Supervised API Use Discovery
- Self-supervised API use discovery is a method that enables LLMs to autonomously discover and utilize unfamiliar APIs without additional training.
- The approach decomposes coding tasks into simpler API invocation subtasks and leverages execution feedback for iterative debugging.
- Empirical results show significant performance improvements, with higher pass@k scores compared to naive retrieval-based methods.
Self-supervised API use discovery refers to the autonomous identification and correct invocation of previously unseen application programming interfaces (APIs) by LLMs, without explicit supervision or additional model training. This paradigm addresses the inability of LLMs to reliably generate code using APIs not present within their training corpora, an issue of increasing importance given the rapidly evolving and proprietary nature of modern software libraries. The ExploraCoder framework exemplifies a state-of-the-art, training-free, self-supervised methodology that allows LLMs to incrementally discover, validate, and employ unfamiliar APIs through chained task decomposition and interpretive execution feedback (Wang et al., 2024).
1. Motivation and Problem Setting
The coding capabilities of LLMs, such as GPT-3.5-turbo and GPT-4, are fundamentally limited by the static API knowledge encoded during their offline pretraining. As software ecosystems frequently introduce new or private APIs, it is not feasible to continuously retrain LLMs to maintain up-to-date coverage. This gap impedes real-world adoption for tasks like program synthesis or code completion involving recently released or proprietary libraries. The core task is: given a programming requirement and a full set of API documentation potentially outside the model’s training set, generate a working code solution via correct invocation of the requisite (and potentially unfamiliar) APIs—without any parameter updates.
2. Task Planning via LLM-based Decomposition
ExploraCoder begins by decomposing the high-level programming requirement ψ into a sequence of simple API-invocation subtasks t₁…tₙ. This is achieved by prompting an LLM with a small set of in-context few-shot planner exemplars 𝒟 and a library overview s extracted from the README. The LLM stochastically generates discrete subtasks, each targeting one or two API calls and maintaining atomicity. This divide-and-conquer mechanism, sketched in the pseudocode below, reduces code generation complexity and narrows the search space at each step:
```text
Input:  requirement ψ, few-shot planner exemplars 𝒟, library overview s
Output: subtasks t₁…tₙ

prompt ← "Given requirement ψ and examples 𝒟 plus overview s,
          outline the sequence of simple API subtasks."
t₁…tₙ ← LLM_generate(prompt)
return [t₁…tₙ]
```
A small n (proportional to the anticipated number of API invocations) is chosen to ensure tractability.
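A minimal Python sketch of this planning step is given below. The `llm_complete` helper, the prompt wording, and the numbered-list parsing are illustrative assumptions, not ExploraCoder's actual implementation.

```python
# Hypothetical sketch of ExploraCoder-style task planning (not the official code).
# `llm_complete` stands in for any chat/completion API call returning plain text.

def plan_subtasks(requirement: str, exemplars: list[str], overview: str,
                  llm_complete) -> list[str]:
    """Decompose a programming requirement into atomic API-invocation subtasks."""
    prompt = (
        "You are a planner. Given a library overview and example decompositions, "
        "break the requirement into a numbered list of simple subtasks, "
        "each needing only one or two API calls.\n\n"
        f"Library overview:\n{overview}\n\n"
        + "\n\n".join(f"Example:\n{ex}" for ex in exemplars)
        + f"\n\nRequirement:\n{requirement}\n\nSubtasks:"
    )
    response = llm_complete(prompt)

    # Parse lines such as "1. Load the dataset with ..." into a list of subtasks.
    subtasks = []
    for line in response.splitlines():
        line = line.strip()
        if line and line[0].isdigit() and "." in line:
            subtasks.append(line.split(".", 1)[1].strip())
    return subtasks
```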
3. Chain-of-API-Exploration Process
Each subtask tᵢ undergoes an iterative chain-of-API-exploration ("CoAE"). For each subtask, the framework performs the following steps (a code sketch follows the list):
- API Candidate Recommendation: Each candidate API document (given as an import path and descriptor) is embedded using text-embedding-ada-002. Cosine similarity against the embedded subtask identifies the top-k semantically closest APIs, which the LLM then re-ranks to prune irrelevant entries, yielding a compact candidate set.
- Experimental Code Generation: Over several trials, the LLM generates short candidate code snippets using only the recommended documentation and prior chain experiences.
- Execution and Feedback: Each snippet is executed in a sandbox, returning an observation that records success or failure status, error messages, and printed values; these observations form the subtask's experience set.
- Self-Debug (Optional): If all trial snippets fail, the LLM attempts to repair and re-execute them, enriching the experience set.
- Experience Selection: One successful (or, if none, one failed) experience is randomly selected and appended to the global exploration chain.
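A condensed Python sketch of one CoAE iteration is shown below. The `embed`, `llm_complete`, and `run_sandboxed` helpers, as well as the parameter names, are illustrative assumptions rather than ExploraCoder's actual interfaces; the LLM re-ranking and self-debug steps are omitted for brevity. The goal is only to make the retrieve, generate, execute, select loop concrete.

```python
# Illustrative sketch of a single CoAE iteration (assumed helpers, not the official code).
import random
import subprocess
import tempfile

import numpy as np


def cosine_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int) -> list[int]:
    """Return indices of the k API documents most similar to the query embedding."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return np.argsort(-sims)[:k].tolist()


def run_sandboxed(snippet: str, timeout: float = 10.0) -> dict:
    """Execute a snippet in a subprocess and capture its runtime observation."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(snippet)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True,
                              text=True, timeout=timeout)
        return {"ok": proc.returncode == 0, "stdout": proc.stdout,
                "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": "timeout"}


def explore_subtask(subtask: str, api_docs: list[str], doc_vecs: np.ndarray,
                    chain: list[dict], embed, llm_complete,
                    k: int = 5, trials: int = 3) -> dict:
    """Retrieve candidate APIs, try experimental snippets, and select one experience."""
    # 1. API candidate recommendation via embedding similarity.
    candidates = [api_docs[i] for i in cosine_top_k(embed(subtask), doc_vecs, k)]

    # 2. Experimental code generation and sandboxed execution.
    experiences = []
    for _ in range(trials):
        snippet = llm_complete(
            f"Subtask: {subtask}\nRelevant API docs:\n" + "\n".join(candidates)
            + "\nPrior experiences:\n" + "\n".join(e["snippet"] for e in chain)
            + "\nWrite a short Python snippet for this subtask."
        )
        obs = run_sandboxed(snippet)
        experiences.append({"snippet": snippet, **obs})

    # 3. Experience selection: prefer a random successful trial, else a failed one.
    successes = [e for e in experiences if e["ok"]]
    return random.choice(successes if successes else experiences)
```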
This iterative process allows the LLM to acquire, via real-world program execution feedback, working knowledge of unfamiliar API signatures, arguments, and output conventions—analogous to human exploratory programming with a REPL.
4. Training-Free and Self-Supervised Properties
The ExploraCoder approach is fundamentally training-free: no LLM parameters are updated or fine-tuned at any stage. Self-supervision arises from the looped exploitation of program execution feedback to drive successive code generation and selection. Instead of correctness labels or ground-truth programs, the only feedback signal is the observable runtime behavior of candidate snippets, confining all adaptation to inference-time reasoning. Chained experiences allow the LLM to bootstrap its understanding of completely unseen APIs, including their signatures, argument orders, and return types, solely through its interaction with the documentation and execution outcomes (Wang et al., 2024).
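To make the inference-time-only nature of this adaptation explicit, the hypothetical driver below chains the planner and per-subtask exploration from the sketches above; the accumulated execution observations are the only supervision signal, and no model weights are ever updated.

```python
# Hypothetical outer loop reusing plan_subtasks and explore_subtask from the sketches above.
def solve_with_api_exploration(requirement, exemplars, overview,
                               api_docs, doc_vecs, embed, llm_complete):
    chain = []  # accumulated execution experiences; the only feedback signal
    for subtask in plan_subtasks(requirement, exemplars, overview, llm_complete):
        chain.append(explore_subtask(subtask, api_docs, doc_vecs, chain,
                                     embed, llm_complete))
    # The final solution is generated from the full experience chain,
    # again purely at inference time.
    return llm_complete(
        f"Requirement: {requirement}\nWorking experiences:\n"
        + "\n".join(e["snippet"] for e in chain)
        + "\nWrite the complete solution."
    )
```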
5. Experimental Evaluation and Metrics
Two program synthesis benchmarks are utilized: Torchdata-Github (50 tasks, 3–8 unseen APIs per task, mean 4.64 invocations, 228-document pool) and Torchdata-Manual (50 tasks, 8–14 APIs, mean 12 invocations, same doc pool). Both "API-untrained" (e.g., GPT-3.5-turbo-0125, GPT-4-0613) and "API-pretrained" (e.g., GPT-4-1106-preview, CodeQwen-1.5-7B, DeepSeekCoder-6.7B) models are assessed.
Performance is measured with:
- pass@k: Fraction of tasks for which at least one of k randomly sampled solutions is functionally correct (an estimator sketch follows this list).
- success@k: Fraction of tasks for which at least one sampled snippet executes without error within a timeout.
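The section does not spell out how pass@k is estimated; the sketch below uses the standard unbiased combinatorial estimator popularized by the Codex evaluation, 1 − C(n−c, k)/C(n, k) for n samples of which c are correct, which is a common but assumed choice here rather than a confirmed detail of ExploraCoder's evaluation scripts.

```python
# Standard unbiased pass@k estimator (assumed, not confirmed as the paper's exact script).
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n samples are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))


# Example: with 10 generations per task and 2 correct, pass@10 = 1.0 and pass@1 = 0.2.
print(pass_at_k(n=10, c=2, k=10))  # 1.0
print(pass_at_k(n=10, c=2, k=1))   # 0.2
```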
Key results (pass@10, GPT-3.5-turbo) include:
| Benchmark | Naive RAG + GPT-3.5 | ExploraCoder | Absolute Δ |
|---|---|---|---|
| Torchdata-Github | 14.67% | 21.67% | +7.00% |
| Torchdata-Manual | 0.95% | 11.61% | +10.66% |
Averaged over both models and benchmarks, ExploraCoder achieves a pass@10 increment of +11.24% over naive retrieval-augmented generation (RAG) and +14.07% over API-pretrained baselines. Self-debugging via ExploraCoder* further increases pass@1 (Torchdata-Manual) from 4.48% to 8.76%.
6. Comparative Frameworks and Observed Advantages
ExploraCoder is compared with DocPrompting / CAPIR (decomposed API retrieval), EpiGen (subtask CoT + RAG), and Self-Repair (end-to-end debug). On the more complex Torchdata-Manual, ExploraCoder’s gains are most pronounced, highlighting the criticality of stepwise exploration and execution feedback for multi-API synthesis. The combination of divide-and-conquer planning, API-focused retrieval, trial-based code generation with live execution, and selection based solely on runtime feedback leads to a robust, adaptive framework for API use discovery in the absence of prior exposure or finetuning (Wang et al., 2024).
7. Implications and Future Directions
The described methodology decouples API use acquisition from model retraining, aligning with the needs of practical development environments where API churn is inevitable. This suggests a plausible path toward more generalizable code generation agents—even in environments with poorly documented or rapidly evolving libraries. Future inquiry may consider scaling these mechanisms to larger API spaces, integrating hierarchical planning, or exploring more sophisticated self-debugging and experience selection policies to further improve the efficiency and reliability of self-supervised API use discovery.