- The paper demonstrates that OpenAPI documentation often lacks essential semantic cues for reliable agent consumption, as shown by pervasive response, lazy, and input smells.
- It introduces Hermes, which employs specialized agents and token-efficient endpoint analysis, achieving high alignment (F1_micro=0.92) in multi-label smell detection.
- It reports that targeting remediation at priority endpoints instead of the whole ecosystem cuts estimated engineering effort by 89% (42 versus 385 hours), turning documentation assessment into a preventive governance mechanism.
Multi-Agent LLM-Based Documentation Assessment for Agent-Ready OpenAPI APIs
Motivation and Industrial Context
The paper "Making OpenAPI Documentation Agent-Ready: Detecting Documentation and REST Smells with a Multi-Agent LLM System" (2605.14312) addresses the challenge of preparing API ecosystems for reliable consumption by autonomous agents. Recent advances in AI agents and protocols such as the Model Context Protocol (MCP) have prompted industrial organizations to expose their legacy REST APIs for agent-based automation. However, systematic failures in proof-of-concept experiments revealed that successful agent interaction is constrained not only by model limitations but also by the quality of the underlying documentation. Although the APIs were structurally valid and stable within their microservice contexts, their OpenAPI specifications often lacked the semantic richness and explicitness that AI agent workflows require.
Methodology: Hermes System and Smell Taxonomy
Hermes, the proposed multi-agent LLM-based system, operationalizes documentation assessment at ecosystem scale. The system employs specialized agents, each targeting distinct categories of documentation and REST-related smells, including LAZY, BLOATED, TANGLED, FRAGMENTED (documentation smells), and PATH, METHOD, INPUT, RESPONSE, SECURITY (REST smells). The agents leverage structured prompting, few-shot classification, and endpoint-centric reduced OpenAPI representations to enhance both analytical focus and LLM reliability. Endpoint-level analysis enables scalable, token-efficient assessment of large API ecosystems, addressing limitations of monolithic inspection.
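The endpoint-centric reduction described above can be illustrated with a minimal sketch. It is not Hermes's actual implementation: the function names `resolve_refs` and `iter_endpoint_views`, and the shape of the returned view, are assumptions made for illustration. The sketch assumes the specification is already parsed into a Python dictionary and only inlines local `$ref` pointers (JSON Pointer escape sequences are not handled).

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def resolve_refs(node, spec, seen=()):
    """Inline local '#/...' references; cyclic or external refs are left as-is."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith("#/") and ref not in seen:
            target = spec
            for part in ref[2:].split("/"):
                target = target[part]
            return resolve_refs(target, spec, seen + (ref,))
        return {key: resolve_refs(value, spec, seen) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_refs(value, spec, seen) for value in node]
    return node


def iter_endpoint_views(spec):
    """Yield one self-contained view per (path, method) pair (hypothetical format)."""
    for path, item in spec.get("paths", {}).items():
        shared_params = item.get("parameters", [])
        for method, operation in item.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys such as 'parameters' or 'summary'
            operation = dict(operation)
            # Path-level parameters apply to every operation under the path.
            operation["parameters"] = shared_params + operation.get("parameters", [])
            yield {
                "path": path,
                "method": method.upper(),
                "operation": resolve_refs(operation, spec),
                # Operation-level security overrides the global declaration.
                "security": operation.get("security", spec.get("security", [])),
            }
```

Each yielded view carries only what one endpoint needs, so a specialized agent can be prompted with a small, focused document instead of the full specification.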
Before large-scale smell detection, candidate LLMs were compared on a gold-standard subset, with gpt-oss:120b achieving the highest alignment with the gold labels (Jaccard = 0.85, F1_micro = 0.92, Hamming Loss = 0.07). This selection step was intended to support robust multi-label classification and to limit model-induced bias in the large-scale analysis.
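For readers unfamiliar with these multi-label metrics, the following sketch shows one common way to compute them from per-endpoint label sets. The paper's exact averaging choices are not stated here, so this assumes a sample-averaged Jaccard index, micro-averaged F1, and Hamming loss over the nine smell categories named above; the function name `multilabel_scores` is hypothetical.

```python
LABELS = ["LAZY", "BLOATED", "TANGLED", "FRAGMENTED",
          "PATH", "METHOD", "INPUT", "RESPONSE", "SECURITY"]


def multilabel_scores(gold, pred, labels=LABELS):
    """Compare predicted label sets against gold label sets, one pair per endpoint."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp = fp = fn = 0
    jaccard_sum = 0.0
    for gold_set, pred_set in zip(gold, pred):
        gold_set, pred_set = set(gold_set), set(pred_set)
        tp += len(gold_set & pred_set)   # labels found in both
        fp += len(pred_set - gold_set)   # predicted but not in gold
        fn += len(gold_set - pred_set)   # in gold but missed
        union = gold_set | pred_set
        # Assumption: two empty label sets count as perfect agreement.
        jaccard_sum += len(gold_set & pred_set) / len(union) if union else 1.0
    n = len(gold)
    denom = 2 * tp + fp + fn
    return {
        "jaccard": jaccard_sum / n,
        "f1_micro": 2 * tp / denom if denom else 1.0,
        # Fraction of all (endpoint, label) decisions that are wrong.
        "hamming_loss": (fp + fn) / (n * len(labels)),
    }
```

Higher Jaccard and F1 values and a lower Hamming loss all indicate closer agreement between the model's labels and the gold standard.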
Results: Smell Prevalence and Practitioner Validation
Across 600 endpoints within 16 APIs, Hermes detected a total of 2,450 documentation and REST-related smells. Every endpoint exhibited at least one smell, with the Response, Lazy, and Input categories dominating (Response: 100%, Lazy: 90%, Input: 88%). Frequent deficiencies included minimalistic endpoint summaries, absent or generic descriptions, poorly specified parameters, and opaque response schemas (often DTO-based, with arbitrarily shaped 'data' fields that carry no explicit constraints). Security-related smells were also prominent (68%), most often as missing operational guidance for authentication mechanisms. REST design inconsistencies appeared in 53% of endpoints, manifesting as action-oriented URIs and improper HTTP method usage.
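Prevalence figures of this kind are the share of endpoints on which a category was flagged at least once. A minimal sketch of that aggregation follows; the input format (a mapping from endpoint identifier to its detected labels) and the function name `smell_prevalence` are assumptions for illustration, not the paper's tooling.

```python
from collections import Counter


def smell_prevalence(detections):
    """Return, per smell category, the fraction of endpoints exhibiting it.

    detections: mapping of endpoint identifier -> iterable of smell labels.
    """
    total = len(detections)
    # set() ensures a category is counted at most once per endpoint.
    counts = Counter(label for labels in detections.values() for label in set(labels))
    return {label: count / total for label, count in counts.items()}
```

For example, `smell_prevalence({"GET /a": {"RESPONSE", "LAZY"}, "POST /a": {"RESPONSE"}})` gives a prevalence of 1.0 for RESPONSE and 0.5 for LAZY.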
Practitioner validation confirmed high agreement with agent-based smell detection for explicit deficiencies. However, contested cases exposed sociotechnical factors, where minimal documentation sufficed for internal use but failed for external or autonomous consumption. Notably, the evaluation process itself acted as a reflective instrument, sensitizing developers to documentation quality for broader consumer contexts.
Strategic Implications and Cost Analysis
The study demonstrates that structural validity of OpenAPI specifications is not synonymous with agent-readiness. Documented failures in agent planning and tool selection were traced back to insufficient semantic cues within the specifications rather than to API instability. Cost analysis estimated ecosystem-wide remediation at 385 engineering hours, which was judged infeasible, whereas targeted adaptation of priority endpoints required an estimated 42 hours, a reduction in effort of about 89% (1 − 42/385 ≈ 0.89).
Selective adaptation emerged as a strategic response: endpoints relevant for defined automation scenarios were prioritized, documentation standards redefined, and Hermes integrated into API governance workflows. This approach reduced technological risk and prevented propagation of systemic failures that would arise from indiscriminate conversion.
Theoretical and Practical Insights
The paper refines the understanding of documentation smells as structural debt that is activated by novel consumption paradigms. It positions artifact-level readiness evaluation, rather than runtime validation or model upgrades, as the decisive gatekeeping mechanism for AI adoption. Integrating Hermes into governance workflows shifts documentation assessment from a diagnostic activity to a preventive one, institutionalizing quality control over software artifacts.
Industrial adoption of AI agents is thus recast as a sociotechnical shift, where latent conventions, legacy practices, and implicit knowledge dependencies surface as operational risks. Conformance to OpenAPI or REST principles is necessary but not sufficient for agent-based consumption; explicit, self-contained semantic cues become the critical enabler for reliable autonomous workflows.
Limitations and Future Directions
While grounded in a single industrial microservice ecosystem, the study's central finding—that documentation readiness constrains agent-based API consumption—is theoretically applicable to public and open API environments. Model selection constraints, estimation of remediation efforts, and scope of automation scenarios are noted as limiting factors. Longitudinal tracking of documentation governance, controlled studies correlating remediation with agent performance, and replication across diverse domains are proposed as directions for future research.
Conclusion
The empirical evidence indicates that artifact-level documentation assessment is a prerequisite for strategic AI-agent integration. Structural and semantic deficiencies embedded in OpenAPI specifications represent not only technical debt but also hard constraints on reliable agent consumption. Systematic evaluation, selective adaptation, and integrated governance mechanisms are needed to mitigate technological risk and improve the readiness of the software ecosystem. Documentation quality therefore becomes an explicit architectural concern in the era of agent-driven automation, shaping both practical decision-making and theoretical perspectives in industrial software engineering.