Doc2Agent: Scalable Tool-Using AI Framework

Updated 14 April 2026

Doc2Agent is a framework that transforms unstructured API documentation into Python-callable tools using an LLM-driven iterative refinement process.
It combines structured API extraction, direct and target-oriented tool generation, and automated validation, so that only tools that pass testing are marked as verified and deployed.
Reported results include a 55% relative improvement in WebArena success rate at roughly 10% of the baseline cost per task, along with validated tools for real-world and scientific APIs.

Doc2Agent is a scalable framework for the automatic generation of tool-using AI agents directly from unstructured API documentation. By constructing validated Python-based tool wrappers from real-world, research, and domain-specific APIs, and employing iterative refinement with LLM-driven agents, Doc2Agent enables deployment-ready agents capable of robust API interaction across heterogeneous domains. The approach has demonstrated strong gains in success rate and efficiency over prior baselines in standard benchmarks, and generalizes to complex, knowledge-rich scientific workflows (Ni et al., 24 Jun 2025).

1. Pipeline Overview: From API Documentation to Executable Agents

Doc2Agent consists of four principal stages, transforming unstructured HTML/Markdown API documentation into an agent with Python-callable tools wrapping those APIs:

API Extraction: An LLM in "structured" mode parses API documentation and fills a schema capturing endpoint names, HTTP methods, URLs, and parameters. Parameter patterns in the documentation (e.g., :param, {param}) are handled explicitly.
Tool Generation:
- Direct Tool Generation: For endpoints with well-defined JSON schemas, Doc2Agent directly converts these into Python wrappers, propagating example/default values where possible.
- Target-Oriented Tool Generation: For more flexible or underspecified endpoints, the system requests a "fingerprint" consisting of a minimal interface plus an input/output specification, then expands this into a full wrapper focused on a common use case, such as parameterized search.
Validation and Refinement:
- Generated tools are invoked with example arguments; outputs are automatically compared to LLM-predicted "expected" results for verification.
- Failures trigger either parameter value inference (via embedding-based retrieval from a parameter database) or repair by a code-debugging agent (e.g., Claude Sonnet), which corrects code structure, parameter handling, and docstrings according to error traces and documentation context.
- This auto-validation/refinement loop typically repeats for up to three rounds, recovering almost half of initially failing tools.
Deployment:
- Validated tools are published via Model Context Protocol (MCP) servers (FastAPI + MCP) or as OpenAPI specs.
- Agents using frameworks such as LangGraph, AutoGen, LlamaIndex, CodeAct, Qwen-Agent, or Claude Desktop can load these wrappers natively, enabling immediate integration (Ni et al., 24 Jun 2025).
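The extraction and direct tool generation stages can be pictured with a small sketch. Everything named here is an assumption made for illustration: the `Endpoint` record, its field names, the `render_wrapper` helper, and the example URL are hypothetical and are not taken from the Doc2Agent code base. To keep the sketch runnable offline, the generated wrapper only prepares the request instead of sending it.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlencode

# Matches {param} or :param; requiring a leading letter avoids ports like :8080.
PLACEHOLDER = re.compile(r"\{(\w+)\}|:([A-Za-z_]\w*)")


@dataclass
class Endpoint:
    """Hypothetical extraction record (field names are illustrative)."""
    name: str
    method: str
    url: str
    params: dict = field(default_factory=dict)  # name -> example/default value


def path_params(url):
    """Return placeholder names found in a URL template."""
    return [m.group(1) or m.group(2) for m in PLACEHOLDER.finditer(url)]


def render_wrapper(ep):
    """Emit Python source for a wrapper, using example values as defaults."""
    in_path = path_params(ep.url)
    # Normalise :param to {param} so str.format can fill the template.
    template = re.sub(r":([A-Za-z_]\w*)", r"{\1}", ep.url)
    names = in_path + [p for p in ep.params if p not in in_path]
    # Parameters without an example value stay required and must come first.
    required = [n for n in names if n not in ep.params]
    optional = [n for n in names if n in ep.params]
    sig = ", ".join(required + [f"{n}={ep.params[n]!r}" for n in optional])
    path_kwargs = ", ".join(f"{n}={n}" for n in in_path)
    query_items = ", ".join(f"{n!r}: {n}" for n in names if n not in in_path)
    return (
        f"def {ep.name}({sig}):\n"
        f"    '''{ep.method} {ep.url}'''\n"
        f"    url = {template!r}.format({path_kwargs})\n"
        f"    query = {{{query_items}}}\n"
        f"    return {{'method': {ep.method!r}, 'url': url, "
        f"'query': urlencode(query)}}\n"
    )


ep = Endpoint(
    name="list_repos",
    method="GET",
    url="https://api.example.com/users/:user_id/repos",  # made-up URL
    params={"user_id": "octocat", "per_page": 10},
)
source = render_wrapper(ep)

# Load the generated source into a namespace and call the new tool.
namespace = {"urlencode": urlencode}
exec(source, namespace)
list_repos = namespace["list_repos"]
result = list_repos()
print(source)
print(result)
```

Propagating example values as defaults means the wrapper can be called with no arguments, which is what makes automatic validation with example arguments possible in the next stage.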

2. Iterative Validation and Code Agent Refinement

After initial tool construction, Doc2Agent employs an auto-validation loop:

Each tool is tested with its example arguments, and responses are classified (information, code error, request error, server error) using an LLM evaluator.
On failure, the system performs parameter value inference: unknown parameters are embedded, matched semantically to candidates from parameter databases, and tested iteratively.
If failures persist, the code agent receives the error trace, original code, documentation, and candidate parameters, and produces a fixed function with corrected URL handling, signature, assertions, and examples.
After each refinement, tools are retested and the parameter database updated with passing values.
This process continues for up to three rounds, maximizing recovery rates.
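The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the LLM evaluator, the parameter-candidate source, and the code-repair agent are replaced by plain callables (`classify`, `candidates`, `repair`), and all function names are hypothetical.

```python
MAX_ROUNDS = 3  # the text reports up to three refinement rounds


def run_once(tool, args, classify):
    """Call the tool and label the outcome; exceptions count as code errors."""
    try:
        return classify(tool(**args)), ""
    except Exception as exc:  # a generated wrapper may fail in arbitrary ways
        return "code error", repr(exc)


def validate_and_refine(tool, args, classify, candidates, repair):
    """Return (tool, args, verified, rounds_used)."""
    label, trace = run_once(tool, args, classify)
    rounds = 0
    while label != "information" and rounds < MAX_ROUNDS:
        rounds += 1
        # Step 1: try alternative parameter values before touching the code.
        for alt in candidates(args):
            label, trace = run_once(tool, alt, classify)
            if label == "information":
                args = alt
                break
        else:
            # Step 2: hand the failure to the (stubbed) repair agent, retest.
            tool = repair(tool, trace)
            label, trace = run_once(tool, args, classify)
    return tool, args, label == "information", rounds


# Stand-ins for a failing generated tool and the agent's repaired version.
def broken_tool(city):
    raise KeyError("citty")


def fixed_tool(city):
    return {"city": city, "temp_c": 21}


def classify(response):
    """Stub for the LLM evaluator: any non-empty response is informative."""
    return "information" if response else "request error"


tool, args, verified, rounds = validate_and_refine(
    broken_tool,
    {"city": "Paris"},
    classify,
    candidates=lambda current: [],          # no stored parameter values
    repair=lambda old, trace: fixed_tool,   # pretend the agent fixed the code
)
print(tool.__name__, verified, rounds)
```

Trying parameter values before code repair mirrors the order given in the text, and it keeps the cheaper fix first.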

Validated tools are marked as verified, and in large-scale experiments, ~60% of real-world and >80% of research API tools pass the complete pipeline (Ni et al., 24 Jun 2025).
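The parameter value inference step in this loop matches unknown parameters against a parameter database by embedding similarity. The sketch below shows the retrieval idea only. It swaps in a toy character-trigram embedding so that it runs without any model; a real system would use a learned embedding model. The database contents, parameter names, and helper names are all made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: character-trigram counts (stand-in for a real model)."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_values(param_name, param_db, top_k=3):
    """Collect stored values from the top_k most similar parameter names."""
    query = embed(param_name)
    ranked = sorted(
        param_db, key=lambda name: cosine(query, embed(name)), reverse=True
    )
    values = []
    for name in ranked[:top_k]:
        values.extend(param_db[name])
    return values


# Hypothetical database of values that passed validation for other tools.
param_db = {
    "glytoucan_id": ["EXAMPLE_GLYCAN_ID"],
    "uniprot_accession": ["EXAMPLE_PROTEIN_ID"],
    "page_size": [20],
}

# An undocumented parameter on a new endpoint borrows values from its
# nearest neighbour; each candidate would then be tested against the API.
print(candidate_values("glytoucan_ac", param_db, top_k=1))
```

Passing values are written back to the database after each refinement, so later tools benefit from earlier successes.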

3. Empirical Results and Benchmark Metrics

Doc2Agent achieves substantial performance and efficiency gains across multiple benchmarks. Key findings include:

Benchmark	Endpoints	Tools Validated (Rate)	Result	Relative Improvement	Token Cost/Task
Real-world APIs	744	443 (59.5%)	-	-	-
Glycoscience APIs	131	70 (81.5%)	-	-	-
WebArena (avg. over 5 tasks)	-	-	Success rate: 45.3% (vs. 29.2% baseline)	55%	$0.12 (10% of baseline)
Glycomaterial Science (Pass@10)	70	-	Doc2Agent: 33 vs. GPT-4o: 17	-	-

Refinement rounds recover 47.6% of initially failing tools.
Glycoscience agent: on 50 research tasks, success_all ranged from 30.0% to 36.0% depending on the agent framework, with success_filtered up to 58.1%.
WebArena: relative improvement is computed as Δ = (A_ours - A_baseline) / A_baseline × 100%; for example, Δ = 57.8% for Shopping.
Cost: 90% lower cost per task than direct API-calling baselines (Ni et al., 24 Jun 2025).
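As a quick check of the relative improvement formula, the snippet below applies it to the WebArena averages quoted in this section (45.3% versus 29.2%). The function name is an illustrative choice.

```python
def relative_improvement(ours, baseline):
    """Relative improvement in percent: (ours - baseline) / baseline * 100."""
    return (ours - baseline) / baseline * 100


webarena_avg = relative_improvement(45.3, 29.2)
print(f"{webarena_avg:.1f}%")  # 55.1%
```

The result rounds to the 55% figure reported for the WebArena average.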

4. Evaluation Protocols and Benchmarks

The evaluation suite spans several API types:

167 real-world public APIs (no key required; 744 endpoints)
WebArena, covering Wiki, Map, Shopping (customer/admin), and GitLab environments
Glycoscience: 10 APIs spanning major glycoinformatics sources

Baselines include direct API-calling agents (prompting on raw documentation without prior tool extraction/validation). Metrics include:

Tool validation rate (proportion of endpoints yielding functioning Python wrappers)
Success rate per task/domain
Pass@k for parameter inference
Token/compute cost per task
Comparative agent performance (e.g., Claude Desktop, CodeAct, Qwen-Agent) (Ni et al., 24 Jun 2025).
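Two of these metrics are simple enough to sketch. The validation rate is a plain ratio. For Pass@k, the source does not spell out the definition used for parameter inference, so the version below assumes one: a case passes if an accepted value appears among the first k ranked candidates, and the score is the number of passing cases. The function names and the toy cases are hypothetical.

```python
def validation_rate(validated, endpoints):
    """Share of endpoints that yield a working wrapper, in percent."""
    return 100 * validated / endpoints


def pass_at_k(ranked_candidates, accepted, k):
    """True if any of the first k candidates is an accepted value."""
    return any(value in accepted for value in ranked_candidates[:k])


def count_pass_at_k(cases, k):
    """Number of (ranked_candidates, accepted) cases that pass at k."""
    return sum(pass_at_k(ranked, accepted, k) for ranked, accepted in cases)


# Real-world API figures quoted in this document: 443 of 744 endpoints.
print(f"{validation_rate(443, 744):.1f}%")  # 59.5%

# Toy cases: each pairs a ranked candidate list with the accepted values.
cases = [
    (["a", "b", "c"], {"c"}),  # correct value is ranked third
    (["x", "y"], {"z"}),       # correct value never proposed
]
print(count_pass_at_k(cases, k=2), count_pass_at_k(cases, k=3))  # 0 1
```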

5. Adaptability and Domain-Specific Agent Construction

Doc2Agent demonstrates adaptability beyond typical web APIs, as evidenced by its deployment for glycomaterial science:

70 validated tools generated from 16 glycoscience sites (81.5%)
Outperforms LLM baselines in parameter inference (Pass@10: 33 vs. 17 for GPT-4o)
Robust on auto-generated research workflows (success_all up to 36.0%)
Handles complex data integration across knowledge-rich tasks such as cross-database compound matching and glycobiology analytics (Ni et al., 24 Jun 2025).

This illustrates Doc2Agent's suitability for domains requiring integration of API reading, parameter induction, and scientific reasoning.

6. Robustness, Scalability, and Limitations

The Doc2Agent pipeline exhibits strong robustness features:

Automated parameter inference and error recovery mechanisms
Environment-agnostic tool generation
Deployment as interoperable MCP/OpenAPI endpoints
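To make the OpenAPI route concrete, the sketch below turns one validated endpoint into a minimal OpenAPI 3.0 document using only the standard library. The source does not describe how Doc2Agent builds its specs, so the `to_openapi` helper, the example URL, and the choice to type every parameter as a string are assumptions made for illustration.

```python
import json
import re
from urllib.parse import urlsplit


def to_openapi(operation_id, method, url, query_params=()):
    """Build a minimal OpenAPI 3.0 document for a single endpoint."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Path placeholders are always required in OpenAPI; query ones are not.
    parameters = [
        {"name": n, "in": "path", "required": True,
         "schema": {"type": "string"}}
        for n in re.findall(r"\{(\w+)\}", path)
    ] + [
        {"name": n, "in": "query", "required": False,
         "schema": {"type": "string"}}
        for n in query_params
    ]
    operation = {
        "operationId": operation_id,
        "parameters": parameters,
        "responses": {"200": {"description": "Successful response"}},
    }
    return {
        "openapi": "3.0.3",
        "info": {"title": "Generated tools", "version": "0.1.0"},
        "servers": [{"url": f"{parts.scheme}://{parts.netloc}"}],
        "paths": {path: {method.lower(): operation}},
    }


spec = to_openapi(
    "list_repos",
    "GET",
    "https://api.example.com/users/{user_id}/repos",  # made-up URL
    query_params=["per_page"],
)
print(json.dumps(spec, indent=2))
```

Because the document is plain JSON, any agent framework with an OpenAPI loader can consume it without knowing how the tool was generated.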

Nevertheless, certain limitations remain:

Incomplete or poorly documented APIs can prevent complete agentification
Manual benchmark creation is still necessary for rigorous evaluation
Extension to non-English documentation has not been fully tested
Multi-agent collaborative workflows and persistent agent state require further development

Future directions identified include scaling to large scientific corpora, integrating LLM-based judges for auto-evaluation, supporting multilingual and multi-document workflows, and developing user collaboration and permission controls (Ni et al., 24 Jun 2025).

7. Extensions and Generalization to Other Document Types

Doc2Agent's methodology can be extended to non-API document types via adaptation of each pipeline stage:

Ingestion: handle PDF, Word DOCX, Markdown, and HTML with domain-specific section classifiers
Representation: support graph-structured document content (e.g., clause graphs in legal/regulatory documents, exercises in textbooks)
Schema augmentation: add resources for definitions, requirements, Q&A, and compliance checks
Example: extracting ISO standard requirements and exposing "check_requirement_X(data)" tools for compliance automation

This generalizes the architecture beyond API documentation, supporting the construction of interactive agents from arbitrary structured or semi-structured documents and facilitating a broad shift toward agentic, document-driven computation (Ni et al., 24 Jun 2025).
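The "check_requirement_X(data)" idea can be sketched as a small factory that turns an extracted requirement into a callable tool. This is an illustration under assumptions: the `Requirement` record, the `make_tool` helper, and the clause shown are invented for the example and do not come from any actual standard or from the Doc2Agent code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Requirement:
    """Hypothetical record extracted from a standards document."""
    clause: str
    text: str
    check: Callable[[dict], bool]


def make_tool(req):
    """Wrap a requirement as a tool named check_requirement_<clause>."""
    def tool(data):
        return {
            "clause": req.clause,
            "requirement": req.text,
            "passed": bool(req.check(data)),
        }
    tool.__name__ = "check_requirement_" + req.clause.replace(".", "_")
    tool.__doc__ = req.text
    return tool


# Made-up clause used purely to exercise the factory.
owner_rule = Requirement(
    clause="7.2",
    text="Each record shall name a responsible person.",
    check=lambda data: bool(data.get("owner")),
)
check_owner = make_tool(owner_rule)

print(check_owner.__name__)                      # check_requirement_7_2
print(check_owner({"owner": "A. Lee"})["passed"])  # True
print(check_owner({})["passed"])                   # False
```

Setting the function name and docstring from the requirement gives an agent the same kind of signature and description it would get from an API wrapper.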

References

Doc2Agent: Scalable Generation of Tool-Using Agents from API Documentation (Ni et al., 24 Jun 2025)
