
Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Published 26 Mar 2026 in cs.AI | (2603.25158v1)

Abstract: Equipping LLM agents with domain-specific skills is critical for tackling complex tasks. Yet, manual authoring creates a severe scalability bottleneck. Conversely, automated skill generation often yields fragile or fragmented results because it either relies on shallow parametric knowledge or sequentially overfits to non-generalizable trajectory-local lessons. To overcome this, we introduce Trace2Skill, a framework that mirrors how human experts author skills: by holistically analyzing broad execution experience before distilling it into a single, comprehensive guide. Instead of reacting sequentially to individual trajectories, Trace2Skill dispatches a parallel fleet of sub-agents to analyze a diverse pool of executions. It extracts trajectory-specific lessons and hierarchically consolidates them into a unified, conflict-free skill directory via inductive reasoning. Trace2Skill supports both deepening existing human-written skills and creating new ones from scratch. Experiments in challenging domains, such as spreadsheet, VisionQA and math reasoning, show that Trace2Skill significantly improves upon strong baselines, including Anthropic's official xlsx skills. Crucially, this trajectory-grounded evolution does not merely memorize task instances or model-specific quirks: evolved skills transfer across LLM scales and generalize to OOD settings. For example, skills evolved by Qwen3.5-35B on its own trajectories improved a Qwen3.5-122B agent by up to 57.65 absolute percentage points on WikiTableQuestions. Ultimately, our results demonstrate that complex agent experience can be packaged into highly transferable, declarative skills -- requiring no parameter updates, no external retrieval modules, and utilizing open-source models as small as 35B parameters.

Summary

  • The paper introduces a framework that extracts and consolidates trajectory-local lessons into a unified skill set using parallel, hierarchical merging.
  • It demonstrates significant performance gains and cross-model transferability, achieving improvements of up to +57.7 absolute percentage points on OOD benchmarks.
  • The approach decouples skill evolution from agent parameter updates, enabling scalable and efficient development of robust procedural skills.

Trace2Skill: Distilling Trajectory-Local Lessons Into Transferable Agent Skills

Problem Setting and Motivation

The scalability and adaptability of LLM-driven agents hinge fundamentally on the integration of domain-specific, procedural skills. Hand-authoring such skills does not scale and often yields brittle benefits, since human-crafted skills may neither generalize across agent architectures nor deliver matching gains across task distributions. Automated skill induction from parametric knowledge is typically shallow and lacks the contextual specificity needed for robust performance in complex environments, as demonstrated by the negligible uplift of parametric skill drafts over no-skills baselines. Existing techniques for automatic skill evolution often overfit sequentially to episodic mistakes in online regimes, resulting in fragmentation and limited transferability.

Trace2Skill addresses this foundational bottleneck by introducing a formalism and algorithmic pipeline for holistic skill construction: parallelizing the extraction of trajectory-local lessons and consolidating them via inductive, conflict-aware merging. The core hypothesis is that robust, generalizable skills can be mined from sufficiently diverse agent execution experiences by treating trajectory analysis as an inductive process, yielding skills that transfer across model scales and out-of-distribution (OOD) tasks without agent parameter updates.

Methodology

Trace2Skill formalizes skill evolution as an artifact-level adaptation problem: given an agent $\pi_\theta$, a fixed initial skill $\mathcal{S}_0$ (draft or human-authored), and an evolving set of task trajectories, the objective is to construct an improved skill $\mathcal{S}^*$ such that zero-shot transfer to $\mathcal{D}_{\text{test}}$ yields measurable gains. The framework decomposes the process into three key stages:

  • Trajectory Generation: The base agent operates over tasks, generating a labeled trajectory corpus partitioned into successes and failures.
  • Parallel Patch Proposal (Analyst Swarm): A swarm of sub-agents processes each trajectory in isolation, producing targeted patches. Error analysts leverage a ReAct-style multi-turn workflow to causally diagnose and validate failures, while success analysts decompose effective behaviors into generalizable patterns.
  • Hierarchical Conflict-Free Consolidation: All proposed patches are merged in a logarithmic hierarchy, leveraging the same LLM to perform prevalence-aware inductive reasoning and programmatic conflict elimination, resulting in a single, well-structured skill directory.
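The shape of stages 2 and 3 can be sketched as a parallel map over trajectories. All names here (Trajectory, analyze_trajectory, propose_patches) are hypothetical stand-ins: the paper's analysts are LLM sub-agents, stubbed below as plain functions to show the data flow only.

```python
# Illustrative sketch of the analyst-swarm stage: each trajectory is analyzed
# in isolation, fully in parallel, yielding patch proposals per trajectory.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Trajectory:
    steps: list[str]
    success: bool


def analyze_trajectory(traj: Trajectory) -> list[str]:
    """Stand-in for an analyst sub-agent returning proposed skill patches."""
    if traj.success:
        # Success analyst: decompose effective behavior into patterns.
        return [f"pattern: {s}" for s in traj.steps]
    # Error analyst: diagnose the failure and propose a preventive warning.
    return [f"warning: avoid {traj.steps[-1]}"]


def propose_patches(trajectories: list[Trajectory]) -> list[list[str]]:
    # Stage 2: independent, order-free analysis of every trajectory.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(analyze_trajectory, trajectories))


trajs = [Trajectory(["probe schema", "join tables"], True),
         Trajectory(["guess column name"], False)]
patches = propose_patches(trajs)  # one patch list per trajectory
```

Because analysts never see each other's outputs, the stage parallelizes trivially; all cross-trajectory reasoning is deferred to the consolidation stage.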

Crucially, this pipeline enables two modes: (1) skill deepening, refining a human-written prior; and (2) skill creation, evolving functional skills de novo from parametric drafts. All phases are fully parallel over tasks, and the entire system is architecturally independent of model parameter updates or external retrieval modules.

Empirical Results

In-Distribution and Cross-Model Generalization

Trace2Skill demonstrates substantial, robust gains over strong baselines on SpreadsheetBench and WikiTableQuestions (OOD). Across all model scales (Qwen3.5-35B, Qwen3.5-122B), skills distilled by Trace2Skill on one agent generalize effectively to others:

  • Skill Deepening: On top of Anthropic’s human-written xlsx skill, deepening via error analysis yields up to +27.0 absolute percentage points (pp) improvement for 35B agents and +21.5 pp for 122B models, with similar cross-model transfer.
  • Skill Creation from Scratch: De novo skills evolved from purely parametric drafts yield +22.8 pp (122B) and +57.7 pp (35B-to-122B, OOD) over baseline, matching or surpassing human-authored reference skills.

The skills exhibit strong transfer to OOD domains and unseen agents, directly contradicting prevalent assumptions that trajectory-derived experience is inherently episodic and model-bound. By contrast, episodic retrieval-based techniques (e.g., ReasoningBank (Ouyang et al., 29 Sep 2025)) show far less cross-task and cross-model generalization.

Efficiency and Structural Analysis

The parallel, many-to-one consolidation operator outperforms both sequential online updates and retrieval-memories in efficiency and performance. Hierarchical merging prevents order-dependent drift and overfitting, instead prioritizing patterns recurring across independent sub-trajectories. For instance, merge-driven Patch Consolidation completes in ~3 minutes, a 20× speed-up over online sequential editing, while yielding higher absolute gains.
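The logarithmic merge hierarchy can be sketched as a pairwise reduction over patch sets. Here `merge_pair` is a hypothetical stand-in for the LLM merge operator, approximated by an order-independent set union; the real operator additionally performs prevalence weighting and conflict elimination.

```python
# Minimal sketch of hierarchical consolidation: patch sets merge pairwise,
# round by round, so merge depth grows as log2(n) rather than linearly.

def merge_pair(a: set[str], b: set[str]) -> set[str]:
    """Hypothetical stand-in for the LLM merge call (here: set union)."""
    return a | b


def consolidate(patch_sets: list[set[str]]) -> set[str]:
    rounds = 0
    while len(patch_sets) > 1:
        # Merge adjacent pairs; an odd leftover passes through unchanged.
        nxt = [merge_pair(patch_sets[i], patch_sets[i + 1])
               if i + 1 < len(patch_sets) else patch_sets[i]
               for i in range(0, len(patch_sets), 2)]
        patch_sets, rounds = nxt, rounds + 1
    return patch_sets[0]


skill = consolidate([{"a"}, {"a", "b"}, {"c"}, {"b"}])
# 4 leaf patch sets collapse in 2 merge rounds, independent of input order.
```

Because each round's merges are independent, every level of the tree can itself run in parallel, which is where the reported wall-clock speed-up over sequential editing comes from.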

The superior performance of the framework also derives from the agentic error analysis sub-agents, which—unlike single-call LLM error analysis—iteratively validate, debug, and patch failure traces via artifact inspection and fix verification, anchoring the resulting skills in verified causal mechanisms rather than superficial logging artifacts.

Domain Generalization: Math Reasoning and Multimodal VQA

Trace2Skill is domain-agnostic, as demonstrated by competitive uplifts in both mathematical reasoning (DAPO-Math-Test, AIME 2026) and Visual QA (DocVQA). In math, error-analyst-evolved skills yield up to +5 pp gains across models, maintaining transferability. In multimodal VQA, skills authored by higher-capacity models provide large accuracy gains for both the native and smaller agents, pinpointing a dissociation between task execution and effective skill induction.

Analysis and Ablations

Consolidation Strategy: Parallel, hierarchical consolidation is strictly superior to sequential or batched online editing in both accuracy and engineering efficiency due to its global prevalence bias and conflict resolution mechanisms.

Retrieval Baseline Comparison: Trace2Skill's declarative skills outperform retrieval-memories, especially for small models, as retrieval degrades with test-query divergence and incurs runtime competition for LLM context/memory.

Agentic vs. Single-Call Patching: Agentic, multistep error analysis produces more causal, generalizable skill sections; single-turn LLM approaches misattribute and hallucinate error causes, reducing transferability, especially in OOD and cross-agent contexts.

Skill Structure: The nature of the induced skill directory mirrors established expert practices: high-frequency error/failure patterns emerge as explicit “critical warning” checklist items, while low-frequency or contextual quirks are modularized into references/.
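A consolidated skill directory along these lines might look as follows; the exact file names are illustrative assumptions, beyond the `references/` split the text describes:

```
skill/
├── SKILL.md          # main guide: usage, steps, critical-warning checklist
└── references/       # low-frequency, contextual quirks split out as modules
    └── edge_cases.md
```

High-prevalence lessons live in the main guide, where every invocation sees them; rare, situational notes are modularized so they do not crowd the prompt.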

Theoretical Implications and Practical Consequences

The Trace2Skill paradigm provides strong evidence that nonparametric experience can be systematically compressed into declarative, modular, transferable skills with no agent parameter updates. This enables decoupling agent model advances from skill evolution pipelines: mid-scale, open-source LLMs are sufficient to induce strong procedural skills, which can then be ported to frontier models (or vice versa).

The hierarchical, many-to-one inductive strategy can be interpreted as an in-silico analog of how human experts codify procedural knowledge, effectively operationalizing large-scale domain generalization via high-throughput, conflict-aware abstract pattern mining. This matters as the agent-skills ecosystem expands, favoring approaches that maximize transfer, minimize fragmentation, and permit human inspection and pruning.

Limitations and Future Directions

The current pipeline applies patch consolidation holistically; as a result, the marginal contributions of, and interferences between, specific skill edits are not causally isolated, impeding precise attribution and automated pruning. Tracing skill utility over the agent’s inference process (tracking which skill sections are actually invoked and beneficial) also remains open for future research. Such advances would enable more rigorous skill refinement, e.g., counterfactual editing and section-wise attribution, and tighter integration with agent harnesses.

Conclusion

Trace2Skill demonstrates that systematic, large-scale mining of agent execution trajectories—processed in parallel and hierarchically merged—yields artifact-level skills that are demonstrably transferable across model architectures, agent scales, and task distributions. The pipeline achieves these gains without reliance on proprietary LLMs, without fine-tuning, and without runtime retrieval or memory modules. This work provides a clear blueprint for scalable skill evolution and codification, relevant for practitioners seeking robust, portable foundation agent skills (2603.25158).


Explain it Like I'm 14

What is this paper about?

This paper is about teaching AI assistants (powered by LLMs) better “skills” so they can solve hard, real-world tasks. A “skill” here is like a clear how-to guide: when to use a method, steps to follow, common mistakes to avoid, and small helper tools or scripts. The authors introduce Trace2Skill, a way to automatically build and improve these skills by studying many past attempts the AI made, then turning what it learns into one clean, easy-to-use guide.

What questions did the researchers ask?

  • How can we create or improve AI skills without relying on humans to write everything by hand?
  • Can we learn from many past attempts at once (like a teacher reviewing a whole homework stack) and combine the lessons into one strong guide, rather than updating the guide one attempt at a time?
  • Will the skills we create this way be sturdy and transferable—usable by different AI models and on new, different tasks?

How does Trace2Skill work?

Think of Trace2Skill like making a great study guide from lots of practice tries.

  • Step 1: Collect attempts
    • The AI tries many tasks and records its step-by-step process and result. Each recorded attempt is called a “trajectory” (like a play-by-play of everything it did).
    • We label these attempts as successes (worked) or failures (didn’t work).
  • Step 2: Many “assistant reviewers” analyze in parallel
    • Multiple small helper AIs each study one attempt and suggest “patches” (short edits) to improve the skill guide.
    • One kind of helper looks at successful attempts to capture good habits.
    • Another, more careful helper looks at failed attempts, digs into what went wrong (like checking files or comparing answers), and writes fixes to avoid those mistakes next time.
  • Step 3: Merge suggestions into one clean guide
    • All patches are combined into a single, consistent skill document.
    • The system removes duplicates, resolves conflicts, and keeps only ideas that show up across many different attempts. This is like “inductive reasoning”: noticing patterns that happen often and turning them into general rules.

Two ways to use it:

  • Deepening: start with a human-written skill and make it better.
  • Creation from scratch: start with a rough, weak draft and turn it into a truly useful skill by learning from attempts.

In plain terms: instead of reacting to each attempt one by one, Trace2Skill studies a big batch all at once, finds common lessons, and writes a single, solid guide.
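The "keep lessons that show up often" idea can be shown with a toy counter. This is only an illustration: the real system uses an LLM for the merging, not a frequency count, and the threshold below is a made-up number.

```python
# Toy version of prevalence-based lesson keeping: count how often each lesson
# appears across attempts, keep only the ones that recur.
from collections import Counter


def keep_common_lessons(lesson_lists, min_count=2):
    counts = Counter(lesson for lessons in lesson_lists for lesson in lessons)
    return {lesson for lesson, n in counts.items() if n >= min_count}


attempts = [["check units", "show work"],
            ["check units", "guess"],
            ["check units", "show work"]]
guide = keep_common_lessons(attempts)
# "guess" appeared only once, so it does not make it into the study guide.
```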

What did they find?

  • Better performance on tough tasks:
    • Spreadsheets: The system made noticeable improvements over strong baselines, even beating an official, expert-made spreadsheet skill in many cases.
    • Table questions (WikiTableQuestions): A skill evolved by a smaller model (35B) raised a larger model’s score by up to 57.65 percentage points—showing strong transfer.
    • Math: From scratch, learned math skills improved success on both a math test set and AIME-style problems.
    • Vision + documents (DocVQA): A learned skill significantly boosted accuracy on reading and reasoning over document images.
  • Transferability:
    • Skills created using one model also helped different-sized models.
    • Skills created on one dataset still worked on different, related tasks (out-of-distribution or OOD), not just the exact problems seen during practice.
  • Stronger than common alternatives:
    • Parallel consolidation (analyzing many attempts at once) beat sequential, one-by-one updates in both quality and speed.
    • A single compact skill document worked better than “retrieval memory” systems that store example tips and try to fetch them later.
    • Letting a helper agent deeply analyze failures (step-by-step) produced more useful fixes than asking a model to summarize errors in a single quick pass.
  • Practical and efficient:
    • No need to retrain the AI’s internal parameters.
    • No extra memory or retrieval modules at runtime.
    • Works well even with open-source models as small as 35B parameters.

Why is this important?

If we want AI assistants to handle real, complicated jobs—like editing spreadsheets, solving math problems, or answering questions from images—we need reliable, reusable “how-to” guides. Writing and maintaining these by hand is slow and doesn’t scale. Trace2Skill shows we can automatically turn messy real-world experience into clear, transferable skills that:

  • Make different models better,
  • Generalize to new tasks,
  • Require no retraining,
  • And come in a single, portable document that’s easy to share and use.

Bottom line and impact

Trace2Skill is like a skilled teacher who reviews many student attempts at once, picks out the most common successes and mistakes, and writes a simple, strong study guide. This guide helps future students (AI models) do better—not just on the same homework, but also on new kinds of problems. That means faster, cheaper, and more reliable AI improvement across many domains without constant human rewriting or retraining.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper that future work could address.

  • Generality across model families: The study uses Qwen3.5 (35B/122B) for authoring and usage; it does not evaluate whether skills authored by one model family (e.g., Qwen) transfer to different families/architectures (e.g., Llama, GPT-4, Claude), nor whether cross-family author→user transfer remains robust.
  • Author model selection criteria: The work observes that a strong task performer (35B on DocVQA) may be a weak skill author; it does not identify predictive properties (e.g., chain-of-thought quality, calibration, self-evaluation skills) that determine which model is best for authoring.
  • Iterative evolution dynamics: The pipeline is applied as a single offline pass (Stage 1–3). It lacks analysis of multi-round evolution (running additional cycles of trajectory generation and consolidation), convergence behavior, and how skill quality evolves over repeated iterations.
  • Scaling to larger experience pools: Experiments use hundreds of trajectories; it is unclear how the hierarchical merge, conflict detection, and prevalence-weighted induction scale to thousands or millions of traces in terms of latency, memory, and merge quality.
  • Prevalence thresholds and merge policy: The “prevalence-weighted” consolidation is conceptually described but not formalized (e.g., thresholds, weighting, tie-breaking). There is no ablation on how different prevalence rules affect generalization and stability.
  • Conflict-resolution limitations: Patches targeting the same file-line ranges are rejected, which can drop complementary or nuanced alternatives. There is no semantic conflict resolution (e.g., hierarchical policy priorities) or tie-breaking beyond line-range collisions.
  • Self-confirmation bias: Using the same LLM for trajectory generation, patching, and merging risks reinforcing its own biases. There is no analysis of whether a separate “editor” model or ensemble mitigates confirmation bias or improves generalization.
  • Ground-truth dependence: Error analysts compare to gold answers to verify causal fixes. Many real-world settings lack clean ground truth. The approach for noisy labels, partial credit, or proxy signals (e.g., rewards, human feedback) is not investigated.
  • Robustness to noisy or adversarial experience: There is no study of how the framework handles mislabeled data, poisoned trajectories, or adversarial patches, and no defenses beyond basic guardrails.
  • Safety and ethical filtering: The paper does not specify safeguards to prevent distilling harmful, biased, or policy-violating behaviors from trajectories into skills, or mechanisms for safety review and red-teaming of evolved skills.
  • Success-patch volatility: Authors note success-derived patches are volatile and sometimes harmful. There is no concrete method to filter/weight success patches (e.g., causal tests, uncertainty estimates, diversity penalties) beyond merge consolidation.
  • OOD generalization breadth: OOD tests largely remain within similar task structures (e.g., WikiTQ converted to spreadsheets). Generalization to fundamentally different domains, tools, and interaction patterns (e.g., web agents, robotics, API orchestration) is not explored.
  • Multi-skill composition: The framework assumes one comprehensive skill per domain. It does not study how to compose multiple skills for multi-domain tasks or how to modularize skills to avoid oversized, monolithic prompts.
  • Prompt budget and inference cost: Skills are injected into the system prompt. The paper does not quantify context-length overhead, latency, or cost impacts at inference time, especially as skills grow with evolution.
  • Script/resource generation and security: Auxiliary scripts/resources are mentioned, but there’s no analysis of how they are generated, validated, sandboxed, or audited, nor of the security risks of executing evolved scripts.
  • Human-centered evaluation: Skill quality is assessed by downstream task metrics only. There is no human evaluation of skill clarity, actionability, maintainability, or usability for human operators.
  • Reproducibility and determinism: Merging is LLM-driven and may be non-deterministic. Details on random seeds, temperatures, and prompt variants for consolidation are not reported, and variance across runs of Trace2Skill is not quantified.
  • Baseline breadth and tuning: The retrieval-memory baseline uses one embedding model and top-1 retrieval; broader baselines (e.g., top-k, reranking, hybrid retrieval+skill, memory editing) and hyperparameter sweeps are not presented, raising fairness questions.
  • Comparison to parameter-updating methods: The paper argues for no-update portability but does not compare to light-weight fine-tuning (e.g., LoRA) or preference optimization on the same experience, nor hybrid approaches (skills + small parameter updates).
  • Detection of overfitting or negative transfer: Some configurations degrade performance. There is no automated criterion to prevent overfitting during evolution (e.g., held-out validation, regularization, patch confidence scores) or to roll back harmful patches.
  • Domain boundary definition: How to decide when to create a new skill vs. extend an existing one is unspecified; automated domain segmentation and taxonomy management for large organizations remains unsolved.
  • Skill growth control: As patches accumulate, skills may bloat. There is no mechanism for pruning, compressing, or re-factoring skills, nor measurements of skill size growth and its effect on performance.
  • Partial observability and long-horizon settings: The approach is not evaluated in environments with sparse rewards, long horizons, or partial observability where causal attribution is harder and success is rare.
  • Data privacy: Using real trajectories may entail sensitive information; there is no discussion of privacy-preserving evolution (e.g., redaction, differential privacy) or compliance constraints.
  • Generality across languages and modalities: Aside from DocVQA, there is little exploration of multilingual tasks or richer multimodal settings; the portability of distilled skills across languages and diverse input formats is untested.
  • Author–user mismatch policies: When author and user models differ, especially across families, there is no guidance on when an authored skill should be accepted, adapted, or re-authored, nor on automatic compatibility checks.
  • Fine-grained causal attribution: Although error analysts aim for causal fixes, the evaluation does not isolate which specific patches drive gains; more granular A/B testing of patches or rule-level attribution remains open.
  • Continual maintenance under drift: The paper focuses on static datasets; managing skill freshness and validity under non-stationary task distributions and tools (e.g., software updates) is not addressed.
  • Formal guarantees: There are no theoretical guarantees about the inductive merging process (e.g., bounds on error propagation, conditions for generalization), leaving the method empirically justified but not formally grounded.

Practical Applications

Immediate Applications

The following applications can be deployed with today’s LLM agents, standard logging, and the Trace2Skill workflow (trajectory collection → parallel analyst patches → consolidated skill/SOP).

  • Enterprise spreadsheet automation — Software, Finance
    • Use case: Convert employee/agent spreadsheet task logs (e.g., reporting, reconciliation, forecasting) into a single, robust SOP that reduces formula/tool misuse and systematically addresses recurring errors.
    • Tools/workflow: Integrate Trace2Skill with RPA/BI platforms (e.g., Power Automate, Google Sheets Apps Script) and SpreadsheetBench-like evaluators; deploy consolidated SKILL.md as a system prompt for spreadsheet agents.
    • Assumptions/dependencies: Access to representative spreadsheet trajectories with pass/fail signals; permission to instrument file I/O; basic computing resources for parallel analysis.
  • Document processing SOP evolution — Software, Public Sector, Finance, Insurance
    • Use case: Improve document Q&A/extraction accuracy for forms, invoices, claims, FOIA requests by distilling DocVQA error patterns into a reusable, auditable skill (e.g., layout-aware reading order, cross-field validation).
    • Tools/workflow: Connect to existing DocVQA pipelines; feed validation metrics (ANLS/accuracy) and failure cases into Trace2Skill; redeploy evolved SKILL.md with tool-use templates (OCR, layout parsers).
    • Assumptions/dependencies: Sufficient diversity of documents/questions; evaluation harness to label success/failure; data privacy controls.
  • Customer support and back-office SOP consolidation — Customer Service, Operations
    • Use case: Turn logs from ticket resolution agents (email/chat) into a consolidated SOP that captures best resolution patterns and prevents frequent failure modes (e.g., missing KYC step, wrong escalation path).
    • Tools/workflow: Ingest agent ReAct traces and tool calls (CRM queries, knowledge base retrieval); run parallel analysts; merge into a concise resolution SOP and decision tree.
    • Assumptions/dependencies: Structured logging of tool usage and outcomes; basic ground-truth labels (customer satisfaction/first-contact resolution).
  • Compliance-ready SOPs with audit trails — Finance, Insurance, Public Sector
    • Use case: Produce auditable, declarative SOPs derived from historical agent decisions to support regulatory reviews (e.g., underwriting checklists, AML/KYC verification steps) without relying on retrieval modules.
    • Tools/workflow: Maintain versioned SKILL.md with diff-style patch history from Trace2Skill’s consolidation; integrate into compliance review dashboards.
    • Assumptions/dependencies: Access to compliant, de-identified traces; governance for human approval of evolved SOPs.
  • Data analysis assistants for BI/SQL — Software, Analytics
    • Use case: Distill common success/failure patterns in SQL/BI agent traces (e.g., joins, aggregation pitfalls, schema discovery) into a skill that improves query correctness and dashboard creation.
    • Tools/workflow: Capture SQL execution traces and tests; run analysts; merge into a “DataOps SKILL.md” with schema probing steps and error-prevention checklists.
    • Assumptions/dependencies: Query outcome labels (tests, golden answers); permission to introspect schemas and sample data.
  • Math tutoring assistants with mistake-aware guidance — Education
    • Use case: From student/agent solution traces, extract recurrent reasoning mistakes and winning strategies to create a teaching skill (e.g., unit tracking, boundary checks, exploit symmetries).
    • Tools/workflow: Use problem sets with auto-grading; run error analysts to identify causal mistakes; deploy a distilled SKILL.md as a tutoring prompt for step-by-step guidance.
    • Assumptions/dependencies: Access to labeled solutions; coverage across problem types; instructor oversight for pedagogy.
  • Runbook evolution for SRE/IT operations — Software/DevOps
    • Use case: Convert incident response timelines and bot-assisted remediation logs into consolidated, conflict-free runbooks (e.g., prioritized diagnostics, safe rollback steps).
    • Tools/workflow: Instrument ChatOps/ITSM bot traces; apply Trace2Skill; publish updated runbooks in internal wikis; enforce format validation and conflict checks.
    • Assumptions/dependencies: Reliable labeling of successful mitigations; secure handling of logs; change-control approvals.
  • Model-agnostic skill packaging for cross-vendor portability — Software Procurement, IT
    • Use case: Create standard SKILL.md packages from one model’s runs that improve other models’ performance, reducing vendor lock-in.
    • Tools/workflow: Author with open-source models (e.g., Qwen-35B) and deploy with larger/proprietary models; maintain versioned skill registries.
    • Assumptions/dependencies: Access to similar tool stacks across models; comparable prompting interfaces.
  • Retrieval-free agent optimization in constrained environments — Edge/On-prem
    • Use case: Replace retrieval/memory modules with compact SOPs to save latency and simplify deployment (e.g., air-gapped or low-latency settings).
    • Tools/workflow: Prepend evolved SKILL.md to system prompts; disable retrieval; monitor downstream KPIs.
    • Assumptions/dependencies: Sufficient trajectory diversity to produce robust skills; careful prompt-token budgeting.
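Retrieval-free deployment of this kind reduces to prompt assembly. A minimal sketch, assuming the evolved skill is shipped as a single SKILL.md string (the section header and prompt shape below are illustrative assumptions):

```python
# Sketch of retrieval-free skill injection: the whole consolidated skill
# travels inside the system prompt, so no memory module runs at inference.

def build_system_prompt(skill_text: str, base_prompt: str) -> str:
    """Prepend the base instructions, then append the evolved skill."""
    return f"{base_prompt}\n\n# Skill\n{skill_text}"


prompt = build_system_prompt("## xlsx skill\n- probe headers first",
                             "You are a spreadsheet agent.")
```

The trade-off is context budget: the skill consumes prompt tokens on every call, so skill size needs monitoring as evolution accumulates patches.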
  • SkillOps in CI/CD — Software Engineering
    • Use case: Treat skills like code: auto-generate patches from fresh traces, enforce linting/validation, run A/B skill tests, and auto-rollback on regressions.
    • Tools/workflow: Add a “SkillOps” stage in CI; include format checkers, conflict detection, and benchmark gates before merge-to-main.
    • Assumptions/dependencies: Test suites for target tasks; version control for skills; policies for human-in-the-loop approvals.
  • Internal training content generation — HR/Training, Education
    • Use case: Turn logs from trainees/interactions with training agents into concise, role-specific SOPs and checklists (e.g., onboarding, compliance training).
    • Tools/workflow: Aggregate exercise traces; apply analysts; produce SKILL.md with level-appropriate scaffolding and examples.
    • Assumptions/dependencies: Consent to use training logs; coverage of key scenarios.
  • Public-sector digital services workflows — Government Services
    • Use case: Improve accuracy and consistency of case-processing agents (permits, benefits) by deriving standardized, explainable SOPs from prior case logs.
    • Tools/workflow: Instrument service bots; apply Trace2Skill; publish SOPs internally for audit and cross-agency reuse.
    • Assumptions/dependencies: De-identification, legal clearance, transparent change management.

Long-Term Applications

These applications require further research, scale-up, domain validation, or integration with safety-critical processes.

  • Safety-critical clinical SOP distillation — Healthcare
    • Use case: Derive declarative SOPs for administrative clinical workflows (e.g., coding, prior auth, scheduling) and, with rigorous validation, extend to clinical decision support (triage protocols).
    • Tools/workflow: Combine Trace2Skill with medical ontologies (ICD/CPT), EHR tool integration, and human expert review boards; maintain audit trails and versioned releases.
    • Assumptions/dependencies: Strict oversight, clinical validation, bias analysis, and regulatory approval; high-quality ground truth.
  • Robotics task generalization from demonstrations — Robotics, Manufacturing
    • Use case: Convert heterogeneous robot execution logs/demos into language-level skills that specify step sequences, failure checks, and recovery behaviors for assembly or warehouse tasks.
    • Tools/workflow: Translate robot sensor/action logs into structured “trajectories” consumable by LLM analysts; integrate with motion planning/tool APIs.
    • Assumptions/dependencies: Reliable log-to-text abstraction; safety certification; synchronization between symbolic SOP and low-level controllers.
  • Grid and plant operations SOPs — Energy, Utilities
    • Use case: From historical outage/maintenance interventions, distill robust SOPs that prioritize diagnostics and ensure safety barriers, supporting operator copilots.
    • Tools/workflow: Couple with simulators/digital twins to generate labeled success/failure trajectories; continuously evaluate on scenario sets.
    • Assumptions/dependencies: Safety-critical validation, expert sign-off, high-fidelity logs and simulators.
  • Financial risk and audit copilot skills — Finance
    • Use case: Evolve skills for reconciliations, risk reviews, audit sampling, and investigation workflows that consistently enforce controls and detect anomalies.
    • Tools/workflow: Integrate with ledger/datawarehouse tools; feed back audit outcomes and exception cases; maintain segregated, audited skill registries.
    • Assumptions/dependencies: Regulatory alignment, traceability, de-identification, model governance.
  • Cross-organization skill exchanges and marketplaces — Software Ecosystem
    • Use case: Share portable, model-agnostic SKILL.md packages across teams/orgs (e.g., “Doc Intake v3”), with provenance, metadata, and performance cards.
    • Tools/workflow: Build a “SkillHub” with signing, versioning, and compatibility tags; automated benchmarking on community suites.
    • Assumptions/dependencies: Standardized skill schemas; trust and security frameworks; IP and licensing models.
  • Agent governance and policy frameworks — Policy, Compliance
    • Use case: Mandate explicit, auditable SOPs for deployed AI agents; require change logs and benchmark-based gating for skill updates; certify skills for critical domains.
    • Tools/workflow: Reference Trace2Skill-like pipelines in regulatory guidance; require retention of trajectory evidence and patch histories.
    • Assumptions/dependencies: Cross-stakeholder consensus, capability to run compliance test batteries.
  • Meta-skill learning for multi-domain agents — Software, Research
    • Use case: Develop higher-level “skill induction” patterns that automatically choose which analyst configurations and merge heuristics work best per domain.
    • Tools/workflow: Auto-tune merge batch sizes, prevalence thresholds, and analyst prompts via meta-optimization; maintain per-domain performance profiles.
    • Assumptions/dependencies: Large, diverse trajectory corpora; rigorous ablation infrastructure.
  • Human-in-the-loop supervisory consoles — Software Tools
    • Use case: Provide reviewers with explainable patch proposals, prevalence stats, and conflict flags; enable selective merges and targeted regress tests.
    • Tools/workflow: Build a GUI over Trace2Skill artifacts (patch pools, diffs, validation results); integrate with ticketing/approval systems.
    • Assumptions/dependencies: Usable UX, organizational processes for review, role-based access control.
  • Personalized assistants that learn household/office SOPs — Daily Life, SMB
    • Use case: From repeated user interactions (budgeting, scheduling, email triage), distill private SOPs that reduce mistakes and standardize outputs across devices.
    • Tools/workflow: Local/on-device or private-cloud pipelines; incremental trajectory collection; privacy-preserving skill evolution.
    • Assumptions/dependencies: Consent and privacy guarantees; sufficient task repetition and labeling; lightweight compute.
  • Cross-modal, tool-rich agents with unified skills — Multimodal AI
    • Use case: Create SOPs spanning text, vision, and structured tools (e.g., “read document + extract table + reconcile with spreadsheet + draft memo”), enabling end-to-end workflows.
    • Tools/workflow: Aggregate multimodal trajectories; harmonize evaluation metrics; extend consolidation to multi-file, multi-tool conflict detection.
    • Assumptions/dependencies: Mature multimodal toolchains; coherent logging across modalities; robust evaluators.
  • Continual skill evolution with drift detection — AgentOps
    • Use case: Monitor performance drift (new document templates, schema changes) and trigger targeted re-evolution; maintain backward-compatible skill branches.
    • Tools/workflow: Drift detectors, canary tests, scheduling for periodic Trace2Skill runs, automatic rollback strategies.
    • Assumptions/dependencies: Stable telemetry; safe deployment practices; clear SLAs/SLIs.
  • Standardized benchmarks and skill-linters for reproducibility — Academia, Open Source
    • Use case: Provide open, domain-specific evolution sets and test suites plus “skill linters” to enforce clarity, actionability, and conflict-free structure.
    • Tools/workflow: Publish reference Trace2Skill implementations, patch validators, and reporting templates; encourage community skill submissions.
    • Assumptions/dependencies: Funding/maintenance for benchmarks; community participation; licensing clarity.
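The "skill linter" idea above can be made concrete with a small sketch. This is a hypothetical checker, not tooling from the paper: the required section names and the vague-phrasing/conflict heuristics are illustrative assumptions about what a community linter for SKILL.md files might enforce.

```python
# Hypothetical SKILL.md linter: enforces clarity, actionability, and
# conflict-free structure. Section names and rules are illustrative
# assumptions, not the paper's implementation.
REQUIRED_SECTIONS = ["# Overview", "# Steps", "# Pitfalls"]

def lint_skill(text: str) -> list[str]:
    """Return human-readable lint warnings (empty list = clean)."""
    warnings = []
    for section in REQUIRED_SECTIONS:
        if section not in text:
            warnings.append(f"missing required section: {section}")
    # Flag vague, non-actionable phrasing.
    for lineno, line in enumerate(text.splitlines(), start=1):
        if "etc." in line or "and so on" in line:
            warnings.append(f"line {lineno}: vague phrasing, spell out the cases")
    # Toy conflict heuristic: absolute rules pulling in both directions.
    if "always" in text.lower() and "never" in text.lower():
        warnings.append("contains both 'always' and 'never' rules; check for conflicts")
    return warnings

skill = "# Overview\nParse the sheet.\n# Steps\n1. Always load headers, etc.\n"
print(lint_skill(skill))
```

A real linter would add schema validation and benchmark-gated checks; the point is that a declarative SKILL.md is cheap to audit mechanically.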

Notes on feasibility and dependencies across applications

  • Ground-truthing: Most deployments need a way to label success/failure (tests, business KPIs, human adjudication). Quality of skills depends on this signal.
  • Trajectory quality: Rich, instrumented traces (reasoning, tool calls, observations) materially improve error analysis; sparse logs reduce efficacy.
  • Analyst design: Error analysts were more reliable than success analysts; conservative merge policies (prevalence-weighted, conflict checks) improve generalization.
  • Generalization: Cross-model and OOD transfer was strong in many cases but is not guaranteed (e.g., the 35B-authored VQA skill underperformed); validate per domain.
  • Governance and safety: For regulated or high-stakes domains, human review, versioning, audits, and benchmark gates should be mandatory.
  • Compute: Stage-2 parallelism accelerates throughput; minimal GPU-hours are needed for moderate-scale evolutions, but large-scale or multimodal scenarios may require more resources.
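The "conservative merge policies (prevalence-weighted, conflict checks)" point above can be sketched as follows. The concrete threshold and the string-set patch representation are assumptions for illustration, not the paper's implementation: a candidate lesson survives consolidation only if it recurs across a sufficient fraction of analyst-produced patch pools.

```python
# Minimal sketch of a prevalence-weighted merge policy (illustrative
# assumption, not the paper's implementation): keep a lesson only if
# enough independent trajectory analyses proposed it.
from collections import Counter

def conservative_merge(patch_pools: list[list[str]],
                       min_prevalence: float = 0.3) -> list[str]:
    """Return lessons proposed in at least min_prevalence of the pools."""
    counts = Counter(lesson for pool in patch_pools for lesson in set(pool))
    n = len(patch_pools)
    return sorted(l for l, c in counts.items() if c / n >= min_prevalence)

pools = [
    ["check header row", "cast dates to ISO"],
    ["check header row"],
    ["check header row", "avoid merged cells"],
    ["cast dates to ISO"],
]
print(conservative_merge(pools))  # only lessons seen in >=30% of trajectories
```

Filtering by prevalence is what separates generalizable patterns from trajectory-local quirks: "avoid merged cells" appears in only one pool, so it is dropped.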

Glossary

  • Agent swarms: Multi-agent systems where many lightweight agents operate in parallel to process information or tasks efficiently. "This reflects the core design wisdom of agent swarms \citep{kimi2026agentswarm}, which process multiple information sources efficiently using parallelized sub-agents."
  • Agentic error analysis: An interactive, tool-using diagnostic process where an agent iteratively investigates failures to identify root causes and propose fixes. "Agentic error analysis produces more transferable patches."
  • ANLS (Average Normalized Levenshtein Similarity): A string-similarity metric (normalized edit distance) commonly used to evaluate answer quality in document VQA. "We report ANLS (Average Normalized Levenshtein Similarity, the official metric) and Accuracy (ANLS $\geq 0.5$, \%)."
  • Compositional semantic parsing: Mapping natural language questions into logical forms by composing operators over semi-structured data like tables. "which differs in data source (Wikipedia) and task type (compositional semantic parsing);"
  • Conflict-Free Consolidation: A stage that merges many proposed edits into a single coherent update while programmatically preventing conflicts and format errors. "(3) Conflict-Free Consolidation: Sub-agent-proposed patches are hierarchically merged into a coherent update to the skill directory, utilizing programmatic conflict detection and format validation at each step."
  • Declarative skills: Skill documents encoded as explicit, human-readable procedures and rules rather than learned parameters, enabling portability without retraining. "transferable, declarative skills—requiring no parameter updates, no external retrieval modules, and utilizing open-source models as small as 35B parameters."
  • Diff-style edit operations: Programmatic file changes represented as line-based additions/deletions (diffs), enabling deterministic application of patches. "The final $p^*$ is translated into diff-style edit operations and applied programmatically."
  • DocVQA: A benchmark for document visual question answering that requires reasoning over text-rich images like forms and invoices. "we apply it to Visual Question Answering (VQA) using DocVQA~\cite{mathew2020docvqa} as the target benchmark."
  • Error Analyst: A specialized sub-agent role that performs interactive, ReAct-style diagnosis on failure trajectories to propose grounded fixes. "Error Analyst ($\mathcal{A}^-$)."
  • Episodic memories: Stored, trajectory-specific experiences retrieved later to aid problem solving; contrasted here with distilled, portable skills. "This challenges the common assumption that experience is inherently model- and task-specific and must be managed through the retrieval of episodic memories \citep{ouyang2026reasoningbankscalingagentselfevolving,wang2024agentworkflowmemory,qian2024investigateconsolidateexploitgeneralstrategyintertask,nottingham2024skillsetoptimizationreinforcing,liu2025contextualexperiencereplayselfimprovement}."
  • Evolving set: A dynamically growing set of tasks used to collect diverse execution trajectories for learning or analysis. "Stage~1: a frozen agent $\pi_\theta$ rolls out on the evolving set using an initial skill $\mathcal{S}_0$ (human-written or LLM-drafted), producing labeled trajectories $\mathcal{T}^-$ (failures) and $\mathcal{T}^+$ (successes)."
  • Frozen agent: An agent whose model parameters remain fixed (no fine-tuning) while only external skill documents change. "Stage~1: a frozen agent $\pi_\theta$ rolls out on the evolving set using an initial skill $\mathcal{S}_0$ (human-written or LLM-drafted), producing labeled trajectories $\mathcal{T}^-$ (failures) and $\mathcal{T}^+$ (successes)."
  • Hierarchical merge: A multi-level synthesis procedure that combines groups of patches stepwise, deduplicating and resolving conflicts at each level. "The hierarchical merge then performs inductive reasoning over the full population of trajectory-local observations simultaneously, selecting patterns that recur across diverse trajectories rather than patterns that recur in the most recent updates."
  • Inductive reasoning: Inferring general rules from many specific examples; used here to abstract common patterns from trajectory-derived patches. "First, this acts as an inductive reasoning process \citep{xiong-etal-2025-co,li2025mirageevaluatingexplaininginductive,lin2025llmbasedscientificinductivereasoning} that mines generalizable patterns from experience-specific patches, building a high-level understanding of the domain analogous to a human expert's prior knowledge."
  • Mixture-of-Experts (MoE): A model architecture that routes inputs to different expert subnetworks to improve capacity and efficiency. "We experiment with two Qwen3.5 MoE models: Qwen3.5-122B-A10B and Qwen3.5-35B-A3B."
  • Out-of-distribution (OOD): Evaluation on data drawn from a different distribution than the training/evolving set, testing generalization. "For out-of-distribution (OOD) generalization, we evaluate on WikiTableQuestions \citep{pasupat2015compositionalsemanticparsingsemistructured} (WikiTQ), which differs in data source (Wikipedia) and task type (compositional semantic parsing);"
  • Parametric knowledge: Knowledge encoded in a model’s parameters (from pretraining/fine-tuning) as opposed to external, explicit documents or tools. "However, synthesizing skills relying solely on an LLM's parametric knowledge yields limited benefits, even with leading proprietary models, primarily because parametric knowledge lacks information about the specifics and common pitfalls of the target domain"
  • ReAct: A prompting framework that interleaves reasoning steps with tool-use actions to generate trajectories with observations. "We adopt ReAct \citep{yao2023reactsynergizingreasoningacting} as the agent harness."
  • Reasoning Bank: A retrieval-based baseline that stores and later retrieves generalizable lessons from trajectories to guide inference. "Reasoning Bank \citep{ouyang2026reasoningbankscalingagentselfevolving} that first saves generalizable lessons from each trajectory, and retrieve useful experiences at inference time based on task similarity."
  • Retrieval index: A searchable memory structure used to fetch past experiences at inference time; Trace2Skill avoids needing one. "The evolved skill $\mathcal{S}^* = (M^*, \mathcal{R}^*)$ replaces $\mathcal{S}_0$ and is used directly at inference without any retrieval index."
  • Retrieval-based reasoning banks: Systems that rely on similarity search to fetch prior reasoning episodes during inference rather than distilling them into a skill. "single comprehensive skill outperforms retrieval-based reasoning banks;"
  • Skill creation from scratch: Building a useful skill document starting from a weak, parametric-only draft by grounding it in trajectory analysis. "Skill creation from scratch."
  • Skill deepening: Refining an existing human-written skill by integrating trajectory-derived successes and failure fixes. "Skill deepening."
  • Skill directory: The organized, multi-file representation of a skill (e.g., SKILL.md plus resources) used to guide an agent. "a unified, conflict-free skill directory via inductive reasoning."
  • Skill evolution: The process of improving a skill document using trajectory evidence without changing model parameters. "The objective of skill evolution is to construct an improved skill from trajectories on $\mathcal{D}_\text{evolve}$, without updating $\theta$, such that:"
  • Skill patch: A proposed, localized edit to the skill (instructions, checklists, scripts) derived from analyzing a single trajectory. "and outputs a skill patch:"
  • Success Analyst: A sub-agent that extracts generalizable, effective strategies from successful trajectories to reinforce in the skill. "Success Analyst ($\mathcal{A}^+$)."
  • Trajectory: A sequence of reasoning steps, tool calls, observations, and outcomes produced by an agent solving a task. "yielding a trajectory:"
  • Trajectory-local lessons: Insights specific to individual executions that may not generalize unless consolidated across many trajectories. "sequentially overfits to non-generalizable trajectory-local lessons."
  • vLLM: A high-throughput LLM serving system optimized for efficient inference. "Models are served with vLLM \citep{kwon2023efficientmemorymanagementlarge} using the recommended Qwen3.5 generation configuration"
  • WikiTableQuestions (WikiTQ): A benchmark for answering questions over semi-structured Wikipedia tables, used here to test transfer. "we evaluate on WikiTableQuestions \citep{pasupat2015compositionalsemanticparsingsemistructured} (WikiTQ)"
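The glossary entry on diff-style edit operations can be illustrated with a short sketch. The operation format ("insert"/"delete" dicts with 1-based line numbers) is an assumption for illustration; the paper only states that the final patch is translated into line-based diffs and applied programmatically.

```python
# Minimal sketch of deterministic, diff-style patch application to a
# skill file. The op format is a hypothetical representation.
def apply_patch(lines: list[str], ops: list[dict]) -> list[str]:
    """Apply line-based edits; processing higher line numbers first
    keeps earlier line numbers valid throughout."""
    out = list(lines)
    for op in sorted(ops, key=lambda o: o["line"], reverse=True):
        i = op["line"] - 1  # convert 1-based line number to list index
        if op["kind"] == "delete":
            del out[i]
        elif op["kind"] == "insert":
            out.insert(i, op["text"])
    return out

skill = ["# Steps", "1. Load sheet", "2. Guess column types"]
ops = [
    {"kind": "delete", "line": 3},
    {"kind": "insert", "line": 3, "text": "2. Read declared column types"},
]
print(apply_patch(skill, ops))
```

Applying edits programmatically (rather than asking an LLM to rewrite the whole file) is what makes consolidation deterministic and cheap to validate.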

Open Problems

We found no open problems mentioned in this paper.
