Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents

Published 21 May 2026 in cs.SE and cs.AI | (2605.22634v2)

Abstract: Skills have become a practical packaging mechanism for agent instructions, workflows, scripts, and reference materials. In enterprise settings, however, a skill often needs to express more than task guidance: goals, input boundaries, permissions, human approval points, evidence requirements, output contracts, quality criteria, verification steps, and handoff rules. This paper proposes contractual skills, a GovernSpec-inspired design framework for organizing SKILL.md files as readable task contracts while preserving lightweight skill discovery and progressive loading. The framework clarifies the boundary between contractual skills, GovernSpec YAML contracts, Model Context Protocol (MCP) surfaces, tool adapters, runtime guardrails, tracing, and evaluation systems. We evaluate the framework with three offline empirical studies. The first text-generation experiment covers three enterprise skills, fifteen synthetic tasks, four instruction conditions, and eight generation models, producing 960 outputs and 1680 cross-judge score records. The second study is a public-skill A/B expansion: eight public skills are compared with contractual rewrites across forty-eight synthetic tasks, six generation models, two repeats, 1152 outputs, and two complete judge files. In this setting, contractual skills raise mean quality from 4.692 to 4.914 and reduce critical-error rate from 0.083 to 0.013. The third study is an offline tool-calling challenge with eight models and 192 simulated tool-call records. The results suggest that contractual skills are best understood as a governance layer that makes task intent, boundaries, and acceptance criteria explicit, not as a standalone safety mechanism.


Authors (1)

Ting Liu

Summary

The paper introduces contractual skills, a novel framework that converts AI agent instructions into explicit, auditable task contracts.
It reports that contractual rewrites of public skills improve output consistency and reduce critical errors, with the critical-error rate falling from 0.083 to 0.013 in the public-skill A/B study.
The framework separates skill intent from enforcement, facilitating scalable governance and streamlined audit processes in enterprise settings.


Introduction and Motivation

The proliferation of LLM-driven agents in enterprise environments has created a need for robust, auditable mechanisms to specify, review, and govern agent behavior. Traditional "skills" (modular instruction packages commonly used to encapsulate agent capabilities) lack the structure required for explicit governance of permissions, evidence management, and workflow handoffs. Most skill encodings rely on informal or loosely structured Markdown, which impedes reviewability, consistency, and integration with runtime safety mechanisms. The paper "Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents" (2605.22634) introduces contractual skills as a disciplined organization of SKILL.md files that uses GovernSpec-style fields to encode task boundaries and intent. The aim is to turn skills from prompt fragments into explicit, inspectable task contracts, giving the model, maintainers, and evaluators a shared understanding of execution semantics.

Figure 1: Contractual skills position themselves between a structured task contract and runtime enforcement, making task intent and boundaries inspectable, while enforcement relies on tool adapters and guardrails.

The Contractual Skills Framework

Contractual skills codify agent instructions by organizing SKILL.md files into stable, semantically meaningful sections: goal, audience, required/optional inputs, context, workflow, permissions, human approval gates, evidence standards, output contract, quality bar, verification, and handoff. This approach, motivated by the GovernSpec paradigm, separates policy declarations from enforcement mechanisms, clarifying the system boundary between skill logic, canonical governance contracts (GovernSpec YAML), Model Context Protocol (MCP) surfaces, tool adapters, runtime guardrails, and tracing infrastructure. The contractual skills mechanism thus (1) increases reviewability and testability, (2) enables progressive and selective adoption, and (3) supports downstream policy enforcement by making explicit what is enforceable.

Key design principles include preserving the lightweight nature of skills for discoverability, emphasizing contract fields for explicitness without bureaucratic overhead, ensuring gradual adoption, and maintaining the distinction between intent articulation (skills) and enforcement (guardrails, adapters).
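The section layout described above can be illustrated with a small sketch that renders an empty SKILL.md skeleton. This is only an illustration under stated assumptions: the heading names are derived from the field list in the text (with required and optional inputs split into two sections), and the `name`/`description` front matter, the `##` heading level, and the function name `render_skill_skeleton` are hypothetical choices, not the paper's template.

```python
# Hypothetical section names, derived from the field list in the text.
# The paper's actual SKILL.md template may name or order them differently.
CONTRACT_SECTIONS = [
    "Goal",
    "Audience",
    "Required Inputs",
    "Optional Inputs",
    "Context",
    "Workflow",
    "Permissions",
    "Human Approval Gates",
    "Evidence Standards",
    "Output Contract",
    "Quality Bar",
    "Verification",
    "Handoff",
]


def render_skill_skeleton(name: str, description: str) -> str:
    """Return a SKILL.md skeleton with one heading per contract section.

    The short front matter (an assumed convention) keeps the skill cheap to
    discover; the contract sections below it are only read when needed.
    """
    lines = ["---", f"name: {name}", f"description: {description}", "---", ""]
    for section in CONTRACT_SECTIONS:
        lines.append(f"## {section}")
        lines.append("TODO")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_skill_skeleton("finance-contract", "Review a finance contract."))
```

Keeping the section list in one place means the same list can drive both skeleton generation and any later structure check.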

Empirical Evaluation

Text-Generation Study

The primary evaluation involves three enterprise skills (sales-growth, finance-contract, code-review-pro), fifteen synthetic tasks, four instruction conditions (no skill, minimal, plain expanded, contractual), and eight generation models. For each model-condition-task combination, two outputs are collected, totaling 960 generations and 1680 cross-judge scoring records.
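The generation count follows directly from the design grid described above. The short sketch below reproduces that arithmetic; the variable names and condition labels are illustrative, and the 1680 judge records are not derived here because the text does not describe how judges were assigned.

```python
# Design grid for the text-generation study, as described in the text.
tasks = 15
conditions = ["no_skill", "minimal", "plain_expanded", "contractual"]
models = 8
outputs_per_cell = 2  # two outputs per model-condition-task combination

cells = tasks * len(conditions) * models
total_outputs = cells * outputs_per_cell

print(cells)          # 480 model-condition-task combinations
print(total_outputs)  # 960 generations
```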

Contractual skills outperform the no-skill and minimal-skill baselines in mean model-judge scores on all models. When controlling for information volume via the plain expanded condition, contractual skills yield slight improvements on six of eight models, with small or negative differences on the remainder. The main advantage derived from contractual fields is the stabilization of output structure and checkability, rather than large generic performance gains.

Figure 2: Cross-judge text-generation scores by model and instruction condition; contractual skills consistently outperform no-skill and minimal-skill baselines, with smaller gains compared to plain expanded skills.

Figure 3: Matched score differences under the contractual skill condition: gains over no-skill are universal and sizeable, but improvements over plain expanded skills are modest.

The main practical benefit is observed in output consistency: in the gpt-5.5 run, a 30/30 pass rate is achieved for required section structure under an automated checker aligned with the contractual skill schema.
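A structure checker of this kind can be approximated in a few lines. The sketch below is not the paper's checker: the required section names, the Markdown-heading convention, and the functions `missing_sections` and `pass_rate` are assumptions made for illustration.

```python
import re

# Hypothetical required sections for one skill's output contract.
REQUIRED_SECTIONS = ["Summary", "Findings", "Evidence", "Next Steps"]

# Matches Markdown ATX headings such as "## Findings".
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", flags=re.MULTILINE)


def missing_sections(output: str, required=REQUIRED_SECTIONS) -> list[str]:
    """Return the required section names that have no matching heading."""
    headings = {m.group(1).lower() for m in HEADING_RE.finditer(output)}
    return [name for name in required if name.lower() not in headings]


def pass_rate(outputs: list[str]) -> tuple[int, int]:
    """Return (number of outputs with every required section, total outputs)."""
    passed = sum(1 for text in outputs if not missing_sections(text))
    return passed, len(outputs)


if __name__ == "__main__":
    good = "# Summary\nok\n## Findings\nok\n## Evidence\nok\n## Next Steps\nok\n"
    bad = "# Summary\nok\n## Findings\nok\n## Next Steps\nok\n"
    print(missing_sections(bad))   # ['Evidence']
    print(pass_rate([good, bad]))  # (1, 2)
```

A check like this only confirms that the required headings are present; it says nothing about whether the content under each heading is correct.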

Market-Validated Skill A/B Expansion

To test whether the approach transfers to publicly available skills, eight public skills are rewritten in contractual form and compared with the originals across 48 synthetic tasks, six generation models, and two repeats, yielding 1152 outputs scored in two complete judge files. Contractual rewrites raise mean quality from 4.692 to 4.914 and, most notably, reduce the critical-error rate from 0.083 to 0.013. The effect is most pronounced in error control and consistency, not in boosting already high-quality outputs.
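The headline numbers of this study can be put in perspective with some simple arithmetic. The sketch below assumes the design stated in the abstract (48 tasks, six generation models, two repeats, and two arms: original and contractual); the decomposition of 1152 into those factors is an inference from the reported totals, and the derived figures are computed here rather than quoted from the paper.

```python
# Assumed design: 48 tasks x 6 models x 2 repeats x 2 arms (original, contractual).
tasks, models, repeats, arms = 48, 6, 2, 2
total_outputs = tasks * models * repeats * arms

# Reported means from the text.
quality_before, quality_after = 4.692, 4.914
error_before, error_after = 0.083, 0.013

quality_gain = quality_after - quality_before
absolute_error_drop = error_before - error_after
relative_error_reduction = absolute_error_drop / error_before

print(total_outputs)                       # 1152
print(round(quality_gain, 3))              # 0.222
print(round(absolute_error_drop, 3))       # 0.07
print(round(relative_error_reduction, 3))  # 0.843
```

On these figures the quality gain is small in absolute terms, while the critical-error rate falls by roughly 84% relative to the original skills, which matches the text's emphasis on error control over raw quality.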

Tool-Calling Safety Challenge

A simulated tool-calling experiment assesses whether skill structure reduces high-risk tool invocation. High-risk tool attempts (e.g., impermissible write actions) decrease under all skill-augmented conditions compared with no skill. The contractual format generally reduces risk, but it does not suppress unsafe attempts on every model. No model falsely claims completion after a blocked attempt.

Figure 4: High-risk tool attempts in the tool-calling challenge. Skill inclusion usually reduces risky actions, but the effect size is model-dependent.

Figure 5: Under the contractual skill condition, some models are conservative in read-tool completions, apparently to avoid high-risk tool usage; this tendency should be taken into account when interpreting the tool-safety results.

Theoretical and Practical Implications

Contractual skills act as a governance layer: they contribute explicit task semantics (intent, boundaries, evidence requirements, output shape) to the broader system stack without directly enforcing policies at runtime. This modular separation aligns with best practices for artifact-level governance (2605.22634). It enables scalable asset management for skill repositories, facilitates audit and maintenance, and systematically incorporates role, handoff, and evidence policies that are often only tacit in enterprise operations. Adopting contractual skills supports consistent review, reduces maintenance complexity, and eases integration with runtime enforcement (adapters, guardrails), evaluation, and monitoring systems.
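The split between declared intent and runtime enforcement can be sketched as follows. This is a minimal illustration, not the paper's implementation: `SkillPermissions`, `ToolCallResult`, `guard_tool_call`, and the tool names are hypothetical, and a real adapter would also need logging, identity checks, and actual tool execution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillPermissions:
    """Permissions as declared in a skill (hypothetical structure)."""
    allowed_tools: frozenset
    approval_required: frozenset = frozenset()


@dataclass(frozen=True)
class ToolCallResult:
    tool: str
    status: str  # "executed", "needs_approval", or "blocked"


def guard_tool_call(perms: SkillPermissions, tool: str,
                    approved: bool = False) -> ToolCallResult:
    """Enforce declared permissions outside the model.

    The skill text only states what is allowed; this guard decides what
    actually runs, regardless of what the model attempts.
    """
    if tool not in perms.allowed_tools:
        return ToolCallResult(tool, "blocked")
    if tool in perms.approval_required and not approved:
        return ToolCallResult(tool, "needs_approval")
    return ToolCallResult(tool, "executed")


if __name__ == "__main__":
    perms = SkillPermissions(
        allowed_tools=frozenset({"read_contract", "update_record"}),
        approval_required=frozenset({"update_record"}),
    )
    print(guard_tool_call(perms, "read_contract").status)   # executed
    print(guard_tool_call(perms, "update_record").status)   # needs_approval
    print(guard_tool_call(perms, "delete_record").status)   # blocked
```

Because the guard reads the same permission fields a reviewer sees in the skill, what is declared and what is enforced can be compared directly.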

The empirical evidence suggests that skill structure helps most with behavioral stabilization, output checkability, and critical-error mitigation rather than with generic performance improvement. The effect depends on the task and on the model's baseline: larger gains appear where output boundaries are weaker or agent uncertainty is otherwise poorly managed.

Remaining limitations include the synthetic nature of the task scenarios, the reliance on automated model-judge assessment (rather than expert human rater panels), offline simulation of tool calls without system state, and likely variation in field taxonomy across organizations.

Speculation and Future Directions

The contractual skills framework supplies a practical, composable protocol for embedding governance in agent-facing assets. As enterprise adoption of agentic systems intensifies, formalized skill contracts may become a prerequisite for compliance, transparency, and maintainability. There is opportunity for further integration with machine-verifiable policy compilers (GovernSpec), extension of checkers to support end-to-end workflow certification, and harmonization with runtime enforcement at the tool, session, or organizational level. Evaluations with human raters, stateful tool environments, and real-world enterprise data are natural next steps for refining error modes, field granularity, and cross-team adoption strategies.

Conclusion

Contractual skills, as formulated in this work, supply a GovernSpec-aligned template for structuring agent instructions as explicit, auditable task contracts. The framework improves structural adherence, checkability, and error control in enterprise agent skills without replacing runtime enforcement. These findings support contractual skills as a useful governance layer, not a standalone safety mechanism, for moving enterprise agents from isolated prototypes to production-grade organizational assets, with benefits for both maintainability and safe operation.
