SkillSmith Compiler Framework
- SkillSmith is a compiler-runtime framework that transforms developer-authored skill packages into efficient boundary contracts with explicit operational conditions.
- It reduces redundant computation by compiling skill assets offline and minimizing runtime reinterpretation, which lowers token usage and speeds up execution.
- On seven SkillsBench tasks, it is reported to cut solve-stage tokens by 57.44%, solve time by 50.57%, and thinking iterations by 42.99% relative to raw skill execution.
SkillSmith is a boundary-first compiler-runtime framework for LLM agent skills. In the skill literature, an agentic skill is commonly formalized as a tuple $S = (C, \pi, T, R)$, comprising an applicability condition $C$, an executable policy $\pi$, a termination condition $T$, and a reusable interface $R$. SkillSmith targets developer-authored skill packages (typically a SKILL.md entry document together with scripts, templates, and reference assets) and compiles them offline into minimal executable interfaces that expose operational boundaries rather than full prompt-time payloads. Its central claim is that much of the cost of skill use in agents arises not from execution itself, but from repeatedly injecting and reinterpreting monolithic skill descriptions at runtime (Xu et al., 12 May 2026; Jiang et al., 24 Feb 2026).
1. Skills, packages, and the object of compilation
Within SkillSmith, the unit of compilation is the skill package $P = (d, A, m, h)$, where $d$ is the entry SKILL.md document, $A$ is the set of packaged assets, $m$ is metadata, and $h$ is a content hash of the canonicalized package. Each asset $a \in A$ records its relative path, inferred type, byte size, and sha256 digest. When such a package is used directly inside an agent’s reasoning loop, by injecting SKILL.md and referencing assets without compilation, it is termed a raw skill (Xu et al., 12 May 2026).
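The asset records and content hash described above can be sketched in Python. This is a minimal illustration, not the framework's implementation: the names `AssetRecord`, `make_asset`, and `package_hash` are hypothetical, the inferred type is approximated by the file extension, and canonicalization is assumed to mean sorted-key JSON with assets ordered by path, which the source does not specify.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class AssetRecord:
    path: str    # path relative to the package root
    kind: str    # inferred type (assumption: just the file extension)
    size: int    # byte size
    sha256: str  # hex digest of the asset bytes


def make_asset(path: str, data: bytes) -> AssetRecord:
    kind = PurePosixPath(path).suffix.lstrip(".") or "unknown"
    return AssetRecord(path, kind, len(data), hashlib.sha256(data).hexdigest())


def package_hash(skill_md: str, assets: list, metadata: dict) -> str:
    """Hash a canonical JSON form so asset order and key order do not matter."""
    canonical = json.dumps(
        {
            "skill_md": skill_md,
            "assets": [asdict(a) for a in sorted(assets, key=lambda a: a.path)],
            "metadata": metadata,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


assets = [
    make_asset("scripts/run.py", b"print('hi')\n"),
    make_asset("templates/letter.md", b"Dear {name},\n"),
]
h1 = package_hash("# Demo skill\n", assets, {"name": "demo", "version": "0.1"})
h2 = package_hash("# Demo skill\n", list(reversed(assets)), {"version": "0.1", "name": "demo"})
print(h1 == h2)  # True: the hash is independent of asset and key ordering
```

Hashing a canonical form rather than raw bytes on disk is one way to make the package identity stable across file-system ordering, which matters if compiled artifacts are cached by hash.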
This framing places SkillSmith within a broader movement that treats skills as callable procedural modules rather than one-off plans or atomic tool calls. The SoK on agentic skills distinguishes skills from tools, plans, episodic memory, and prompt templates by emphasizing reusable interfaces, explicit applicability conditions, and termination criteria. SkillSmith can therefore be read as a system that operationalizes one part of the skill lifecycle—storage, retrieval, execution, and update—through compiler and runtime abstractions rather than pure prompt engineering (Jiang et al., 24 Feb 2026).
A common misconception is that SkillSmith is primarily a prompt-compression method. The framework is instead organized around operational boundaries: what a skill can do, under what conditions, with what operator schemas, risks, validation levels, and fallback paths. This boundary-first view is meant to transform skills from long mixed-format documents into ABI-like runtime contracts (Xu et al., 12 May 2026).
2. Redundancy in raw-skill execution
SkillSmith is motivated by two empirical inefficiencies in standard raw-skill execution. First, agents often inject an entire skill package into context even when only a small subset is relevant. On seven SkillsBench tasks, the reported average load was around 17.8K skill-source tokens per execution, of which 9.1K tokens (51.21%) were irrelevant to the observed execution trace. Second, skill-specific planning is repeatedly rediscovered online: across tasks sharing the same skill, the reported average reasoning-trace similarity was 45.5%, indicating that nearly half of the reasoning structure was being recomputed rather than reused (Xu et al., 12 May 2026).
In the conventional ReAct-style loop, the agent detects that a task appears to match a skill, retrieves the full package, injects it into the context window, rereads the instructions, reconstructs a plan, invokes tools, observes outputs, and iterates. SkillSmith treats this as a poor execution substrate for reusable procedural knowledge because the skill remains a monolithic text-and-assets resource rather than a structured runtime interface (Xu et al., 12 May 2026).
This diagnosis is important because it shifts the bottleneck from model capability to system architecture. The claimed inefficiency is not merely that models are slow or verbose; it is that the same skill knowledge is reinterpreted from scratch every time it is used. SkillSmith’s response is to move most of this interpretation offline.
3. Boundary contracts and internal representations
The compiled product of SkillSmith is a boundary contract
$$B = (\tau, O, \mathit{io}, \rho, v, \pi_{\text{act}}, \pi_{\text{sel}}, \phi),$$
where $\tau$ is the boundary type, $O$ is the set of operators, $\mathit{io}$ is the input/output contract, $\rho$ is the set of risk flags, $v$ is the validation level, $\pi_{\text{act}}$ is the action policy, $\pi_{\text{sel}}$ is the selection policy, and $\phi$ is fallback metadata. The contract is the public runtime ABI of the skill: it exposes only what is needed to invoke reusable procedures, check policies and preconditions, and return to the original package when necessary (Xu et al., 12 May 2026).
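One way to picture the contract is as an immutable record with one field per listed component. The sketch below is an assumption-laden illustration: the class names, field names, and example values (`Operator`, `BoundaryContract`, `"schema_checked"`, `"reads_files"`) are hypothetical, and the source does not say how SkillSmith serializes or types these components.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    input_schema: dict
    output_schema: dict


@dataclass(frozen=True)
class BoundaryContract:
    boundary_type: str      # boundary type
    operators: tuple        # set of operators (tuple of Operator)
    io_contract: dict       # input/output contract
    risk_flags: frozenset   # set of risk flags
    validation_level: str   # validation level
    action_policy: dict     # action policy
    selection_policy: dict  # selection policy
    fallback: dict          # fallback metadata pointing back to the package

    def handle_summary(self) -> str:
        """Short text a runtime could advertise instead of the full package."""
        ops = ", ".join(op.name for op in self.operators)
        risks = ", ".join(sorted(self.risk_flags)) or "none"
        return f"[{self.boundary_type}] {ops} | risks: {risks}"


contract = BoundaryContract(
    boundary_type="dispatcher",
    operators=(
        Operator("extract_text", {"path": "str"}, {"text": "str"}),
        Operator("count_pages", {"path": "str"}, {"pages": "int"}),
    ),
    io_contract={"accepts": ["pdf"], "returns": ["text", "int"]},
    risk_flags=frozenset({"reads_files"}),
    validation_level="schema_checked",
    action_policy={"allow_network": False},
    selection_policy={"default_operator": "extract_text"},
    fallback={"entry": "SKILL.md"},
)
print(contract.handle_summary())
# [dispatcher] extract_text, count_pages | risks: reads_files
```

Keeping the record frozen mirrors the ABI analogy: a runtime consumes the contract as published and does not mutate it.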
Compilation takes as input
$$x = (P, \mathcal{T}, \mathcal{E}, \Pi),$$
where $P$ is the skill package, $\mathcal{T}$ is the available tool interface, $\mathcal{E}$ is the execution environment, and $\Pi$ is the set of compilation policies. The overall mapping, which yields the boundary contract $B$, is
$$\mathrm{Compile}: (P, \mathcal{T}, \mathcal{E}, \Pi) \mapsto B.$$
SkillSmith first classifies the package by source shape, lowers it into an internal representation, normalizes and validates that representation, and finally synthesizes the public boundary contract (Xu et al., 12 May 2026).
| Source shape | Typical evidence | Compiler-local lowering |
|---|---|---|
| workflow | ordered steps, control-flow headings, verification instructions | workflow graph |
| dispatcher | bundled scripts, API descriptions, independent operations | dispatcher capabilities and typed operators |
| reference | prose, tables, formulas, examples, templates | indexed reference sections |
| insufficient | missing or ambiguous structure | diagnostic with runtime fallback |
For workflow skills, the internal representation is a step-level directed acyclic graph whose nodes contain step identifiers, names, types, execution specifications, input/output contracts, dependencies, provenance links to SKILL.md or assets, risk annotations, and cacheability flags. Validation checks include existence of referenced nodes, acyclicity, runnable specifications for executable nodes, and upstream availability of required inputs. Nodes that cannot be grounded are explicitly marked for runtime LLM assistance rather than deterministic execution (Xu et al., 12 May 2026).
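The four validation checks listed above can be sketched with Python's standard-library `graphlib`. The node layout (`type`, `exec`, `deps`, `inputs`, `outputs`), the `llm_assisted` type name, and the `task_inputs` parameter are hypothetical choices for illustration; the source names the checks but not a concrete schema.

```python
from graphlib import CycleError, TopologicalSorter


def validate_workflow(nodes: dict, task_inputs: set) -> list:
    """Return human-readable validation errors (empty if the graph is valid)."""
    errors = []
    for nid, node in nodes.items():
        # Check 1: every referenced dependency must exist.
        for dep in node["deps"]:
            if dep not in nodes:
                errors.append(f"{nid}: unknown dependency '{dep}'")
        # Check 2: executable nodes need a runnable specification.
        if node["type"] == "executable" and not node.get("exec"):
            errors.append(f"{nid}: executable node has no exec spec")

    # Check 3: acyclicity, via the standard-library topological sorter.
    graph = {nid: [d for d in node["deps"] if d in nodes] for nid, node in nodes.items()}
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError:
        return errors + ["graph contains a cycle"]

    # Check 4: each required input must be produced upstream or given by the task.
    available = {}
    for nid in order:
        upstream = set(task_inputs)
        for dep in graph[nid]:
            upstream |= available[dep]
        for name in sorted(set(nodes[nid]["inputs"]) - upstream):
            errors.append(f"{nid}: input '{name}' is not available upstream")
        available[nid] = upstream | set(nodes[nid]["outputs"])
    return errors


workflow = {
    "load": {"type": "executable", "exec": "python scripts/load.py",
             "deps": [], "inputs": ["scan_path"], "outputs": ["mesh"]},
    "measure": {"type": "executable", "exec": "python scripts/measure.py",
                "deps": ["load"], "inputs": ["mesh"], "outputs": ["volume"]},
    # A node that could not be grounded is marked for runtime LLM assistance.
    "report": {"type": "llm_assisted",
               "deps": ["measure"], "inputs": ["volume", "units"], "outputs": ["summary"]},
}
print(validate_workflow(workflow, task_inputs={"scan_path"}))
# ["report: input 'units' is not available upstream"]
```

In this sketch the ungrounded `report` node is exempt from the exec-spec check but still subject to the input-availability check, which is one plausible reading of how LLM-assisted nodes coexist with deterministic ones.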
SkillSmith is therefore neither a universal workflow IR nor a fully compiled solver. It extracts and normalizes only the reusable boundary of a skill, preserving a lossless fallback capsule for the original package.
4. Offline compilation and guarded runtime
The offline compiler uses structural signals from SKILL.md and packaged assets—such as headings, ordered lists, imperative verbs, command blocks, scripts, API signatures, and reference density—together with environment and tool information. A compile-time LLM uses these features to classify source shape, propose graph nodes or operators, and annotate risk and validation information. Across GPT‑5.5 and Claude Opus 4.7 compilation runs, producing one reusable artifact cost 3,104 tokens and 13.22 seconds on average (Xu et al., 12 May 2026).
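A few of the structural signals named above can be counted with plain text processing before any model is involved. The sketch below is only a rough illustration: the function name, the signal names, the small verb list, and the script suffixes are all hypothetical, and it does not attempt the classification step, which the source assigns to a compile-time LLM.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example self-contained
IMPERATIVE_VERBS = {"run", "install", "create", "check", "verify", "open", "write"}  # illustrative only
SCRIPT_SUFFIXES = (".py", ".sh", ".js")  # illustrative only


def structural_signals(skill_md: str, asset_paths: list) -> dict:
    signals = {"headings": 0, "ordered_items": 0, "imperative_items": 0,
               "command_blocks": 0, "scripts": 0}
    in_fence = False
    for line in skill_md.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if not in_fence:
                signals["command_blocks"] += 1  # count each opening fence once
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # ignore content inside code blocks (e.g. shell comments)
        if re.match(r"#{1,6}\s+\S", stripped):
            signals["headings"] += 1
        item = re.match(r"\d+[.)]\s+(\w+)", stripped)
        if item:
            signals["ordered_items"] += 1
            if item.group(1).lower() in IMPERATIVE_VERBS:
                signals["imperative_items"] += 1
    signals["scripts"] = sum(1 for p in asset_paths if p.endswith(SCRIPT_SUFFIXES))
    return signals


doc = "\n".join([
    "# PDF report skill",
    "## Steps",
    "1. Install the dependencies.",
    "2. Run the converter script.",
    "3. The output is then checked by hand.",
    FENCE + "bash",
    "# not a heading: this is a shell comment",
    "python scripts/convert.py in.pdf",
    FENCE,
])
signals = structural_signals(doc, ["scripts/convert.py", "templates/report.md"])
print(signals)
# {'headings': 2, 'ordered_items': 3, 'imperative_items': 2, 'command_blocks': 1, 'scripts': 1}
```

Counts like these would presumably be passed to the compile-time model as features alongside the environment and tool information, rather than used as a classifier on their own.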
At runtime, compiled skills are advertised as compact handles, such as run_{skill}, together with short boundary summaries. Detailed operator schemas and fallback material are not injected initially; they are disclosed only when the agent chooses that handle. The runtime then interprets the boundary contract as a guarded state machine. It checks applicability and policies, chooses an operator when required, evaluates risk and validation metadata, and returns a canonical envelope containing status, contribution type, selected operator, outputs, trace references, and a continuation flag (Xu et al., 12 May 2026).
The three principal outcomes are blocked, guidance, and execute. A blocked result returns the reason and deoptimization hints. A guidance result returns reference-style material without side effects. An execute result runs a typed operator or bound script and returns typed evidence or solver-like output. This design makes the skill contribution explicitly partial rather than silently end-to-end (Xu et al., 12 May 2026).
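The guarded dispatch and its three outcomes can be sketched as a single function that returns an envelope. Everything named here is hypothetical: the `Envelope` fields follow the list in the text, but the dictionary layout of the contract, the `granted` risk set, and the rule that an unselected operator yields guidance are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Envelope:
    status: str                # "blocked", "guidance", or "execute"
    contribution: str          # kind of help the skill provided
    operator: Optional[str]    # selected operator, if any
    outputs: dict
    trace_refs: list
    needs_continuation: bool   # True when the agent must keep working


def invoke(contract: dict, operator: Optional[str], args: dict, granted: set) -> Envelope:
    # Guard 1: every risk flag on the contract must be permitted by the caller.
    denied = sorted(set(contract["risk_flags"]) - granted)
    if denied:
        outputs = {"reason": f"risk not permitted: {denied}",
                   "deopt_hint": "fall back to the original package"}
        return Envelope("blocked", "none", None, outputs, [], True)
    # Guard 2: with no operator selected, return reference material only (no side effects).
    if operator is None:
        return Envelope("guidance", "reference", None,
                        {"text": contract["guidance"]}, ["SKILL.md"], True)
    # Guard 3: the operator must be declared on the contract.
    fn = contract["operators"].get(operator)
    if fn is None:
        outputs = {"reason": f"unknown operator: {operator}",
                   "deopt_hint": "fall back to the original package"}
        return Envelope("blocked", "none", None, outputs, [], True)
    # All guards passed: run the bound operator and return typed evidence.
    return Envelope("execute", "typed_evidence", operator,
                    {"result": fn(**args)}, [f"operator:{operator}"], True)


contract = {
    "risk_flags": ["reads_files"],
    "guidance": "Use box_volume for axis-aligned bounding boxes.",
    "operators": {"box_volume": lambda w, h, d: w * h * d},
}
box = {"w": 2, "h": 3, "d": 4}
print(invoke(contract, "box_volume", box, {"reads_files"}).outputs)  # {'result': 24}
print(invoke(contract, None, {}, {"reads_files"}).status)            # guidance
print(invoke(contract, "box_volume", box, set()).status)             # blocked
```

The continuation flag is always true in this sketch, reflecting the point that a skill's contribution is partial and the agent remains responsible for finishing the task.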
SkillSmith also decouples the compiler model from the runtime model. A stronger model may analyze and compile a skill once, while a smaller or cheaper runtime model later consumes the compiled artifact. This cross-model reuse is one of the framework’s central claims.
5. Evaluation on SkillsBench
SkillSmith was evaluated on SkillsBench, which in the reported checkout contained 87 runnable tasks and 227 task-local skill packages. The main evaluation used seven representative tasks spanning document generation, numerical computation, clustering, transcription, and file manipulation: 3d-scan-calc, mars-clouds-clustering, video-tutorial-indexer, citation-check, jax-computing-basics, pptx-reference-formatting, and offer-letter-generator (Xu et al., 12 May 2026).
Across all seven tasks in the main GPT‑5.5 + Agent-H setting, SkillSmith reported the following solve-stage totals:
| System | Tokens | Time | Iterations |
|---|---|---|---|
| SkillSmith | 620K | 494 s | 61 |
| Raw-Skills | 1.5M | 999 s | 107 |
| SkVM | 1.2M | 933 s | 75 |
Relative to Raw-Skills, these totals correspond to a 57.44% reduction in solve-stage token usage, 42.99% reduction in thinking iterations, 50.57% reduction in solve time, and 57.44% reduction in token-proportional monetary cost. Relative to SkVM, SkillSmith reported 46.49% fewer tokens, 47.04% less time, and 18.67% fewer iterations (Xu et al., 12 May 2026).
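The percentages can be checked against the table with the usual relative-reduction formula, 100 × (1 − value / baseline). The short script below uses only the time and iteration totals from the table; the helper name `reduction` is ours.

```python
def reduction(baseline: float, value: float) -> float:
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (1.0 - value / baseline)


# Solve-stage totals copied from the table above.
totals = {
    "SkillSmith": {"time_s": 494, "iterations": 61},
    "Raw-Skills": {"time_s": 999, "iterations": 107},
    "SkVM": {"time_s": 933, "iterations": 75},
}

ours = totals["SkillSmith"]
for baseline in ("Raw-Skills", "SkVM"):
    base = totals[baseline]
    print(
        f"vs {baseline}: "
        f"time -{reduction(base['time_s'], ours['time_s']):.2f}%, "
        f"iterations -{reduction(base['iterations'], ours['iterations']):.2f}%"
    )
# vs Raw-Skills: time -50.55%, iterations -42.99%
# vs SkVM: time -47.05%, iterations -18.67%
```

The iteration figures reproduce the reported 42.99% and 18.67% exactly. The time figures land within a few hundredths of a point of the reported 50.57% and 47.04%, presumably because the table rounds to whole seconds. The token column (620K, 1.5M, 1.2M) is rounded too coarsely to reproduce the reported token percentages.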
Cross-model results used artifacts compiled by Claude Opus 4.7 and reused them across runtime models. In success-preserving comparisons, the reported average gains were 38.33% time reduction, 32.83% token reduction, and 23.89% iteration reduction. The paper also reports cases where compiled skills succeeded while raw skills failed for DeepSeek V4 Flash, specifically on offer-letter-generator, pptx-reference-formatting, and video-tutorial-indexer. At the same time, the framework is not accuracy-monotone: regressions were observed for some Qwen and DeepSeek settings, and one Claude + PPTX case, showing that compilation can over-constrain or misrepresent skill usage for some model-task pairs (Xu et al., 12 May 2026).
Harness-level comparisons also showed broad efficiency gains. Relative to OpenCode Raw, tokens were reduced by 55.8% and time by 27.4%; relative to Codex Raw, tokens were reduced by 77.0% and time by 52.7%; relative to Agent-H Raw, tokens were reduced by 57.4%, time by 50.6%, and iterations by 43.0% (Xu et al., 12 May 2026).
6. Relation to adjacent frameworks and security questions
SkillSmith belongs to a family of recent systems that elevate skills into first-class systems objects, but adjacent work differs sharply in what is being optimized. SkCC compiles SKILL.md into a strongly typed IR called SkIR and emits platform-specific skill formats for Claude Code, Codex CLI, Gemini CLI, and Kimi CLI. Its emphasis is portability and compile-time security via Anti-Skill Injection. On SkillsBench, SkCC reported pass-rate improvements from 21.1% to 33.3% on Claude Code and from 35.1% to 48.7% on Kimi CLI, together with sub-10ms compilation latency, a 94.8% proactive security trigger rate, and 10–46% runtime token savings across platforms (Ouyang et al., 5 May 2026).
A different complementary line is SkillTTA, which keeps the solver fixed and synthesizes a temporary task-specific SKILL.md at test time from a small set of retrieved trajectories. On SpreadsheetBench, task-specific skills improved Pass@1 from 0.397 to 0.505 relative to static trajectory-to-skill synthesis, and on BigCodeBench from 0.517 to 0.651; on ALFWorld the method reached 0.872–0.879 success while producing the shortest successful trajectories among reported methods (Wang et al., 16 May 2026). SkillTTA thus reduces genericity by synthesizing ephemeral skills online, whereas SkillSmith reduces redundancy by compiling developer-authored skills offline into runtime interfaces.
Security research has also made clear that skill compilation does not eliminate ecosystem-level risk. The SoK on agentic skills documents supply-chain threats, prompt injection via skill payloads, trust-tiered execution, and the ClawHavoc case, in which nearly 1,200 malicious skills infiltrated a major marketplace (Jiang et al., 24 Feb 2026). Separately, the SkillReact study of compositional risk reports that among 651 individually safe skills, the 211,575 unordered pairs contained 47,075 structural compositional-risk candidates, a 22.25% candidate rate; after pattern-stratified human calibration, the headline population-weighted validity estimate was 18.2% for flagged pair-pattern memberships (Wang et al., 30 May 2026). These results imply that any SkillSmith-like deployment requires more than per-skill compilation: it also requires install-time compositional checks, capability isolation, provenance, and runtime governance.
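The pair count and candidate rate quoted above follow directly from the number of skills, since 651 skills form C(651, 2) unordered pairs. A quick check in Python:

```python
from math import comb

n_skills = 651
candidates = 47_075

pairs = comb(n_skills, 2)   # number of unordered pairs of distinct skills
rate = candidates / pairs

print(pairs)            # 211575
print(f"{rate:.2%}")    # 22.25%
```

The quadratic growth of the pair count is what makes install-time compositional checks expensive: doubling the number of installed skills roughly quadruples the number of pairs to examine.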
7. Broader uses of the “SkillSmith” idea
Although SkillSmith is a specific compiler-runtime system, adjacent papers use the name more broadly as a design metaphor for systems that forge, synthesize, or compose skills. In robotics, RobotSmith explicitly reframes complex manipulation as a joint design-and-control problem and states that, under a “SkillSmith” lens, a skill is represented as a triple.
Its pipeline combines collaborative VLM agents, high-level program synthesis over move, grasp, and release, and CMA-ES optimization in the Genesis physics engine. Across nine tasks, RobotSmith reported an average success rate of 50.0%, versus 21.4% for 3D generation and 11.1% for tool retrieval, and it describes itself as a concretely implemented subsystem of a hypothetical SkillSmith (Lin et al., 17 Jun 2025).
In mathematical reasoning, MathSmith is likewise framed as a skill-forging system: it synthesizes new problems from PlanetMath concept–explanation pairs, shapes them with nine predefined difficulty strategies, and uses reinforcement learning to optimize structural validity, reasoning complexity, and answer consistency. Across five benchmarks, it reports consistent gains over baselines on hard benchmarks under both short and long chain-of-thought settings; for example, with long-CoT Qwen3‑8B, MathSmith‑HC raised hard-benchmark average performance from 65.4 to 71.8 (Zhan et al., 7 Aug 2025). This suggests a broader interpretation of “SkillSmith” as a class of systems that move beyond static prompt-time guidance and instead engineer reusable procedural competence through compilation, synthesis, optimization, or controlled generation.
Taken together, these strands indicate that SkillSmith names both a specific boundary-guided runtime framework and an emerging systems perspective: skills are treated as explicit artifacts that can be packaged, compiled, verified, retrieved, refined, and, in some domains, automatically invented.