Trace2Skill Framework
- Trace2Skill is a methodology that transforms raw execution traces into declarative, transferable skills, enhancing LLM agents' generalizability.
- It employs parallel patch extraction and hierarchical merging to distill standard operating procedures and failure checklists from diverse agent trajectories.
- Empirical results demonstrate significant performance gains and robust out-of-distribution transfer across various domains and model scales.
The Trace2Skill framework is a methodology for converting raw agent execution traces into transferable, declarative skills that substantially enhance the generalizability and robustness of LLM agents. By orchestrating parallel analysis of diverse agent trajectories and consolidating resulting lessons through hierarchical merging, Trace2Skill systematically distills domain-general standard operating procedures (SOPs), tool-use guidelines, and failure-checklists, eliminating the overfitting and fragmentation common in traditional sequential or parametric approaches. Evolved skills are encoded as static skill directories and have demonstrated significant empirical gains over handcrafted, parametric, and retrieval-based baselines, including robust out-of-distribution (OOD) transfer across LLM scales and task domains (Ni et al., 26 Mar 2026).
1. Motivation and Core Challenges
Trace2Skill was motivated by the bottlenecks of skill authoring for LLM agents in domains such as spreadsheet manipulation, VisionQA, and mathematical reasoning. Manually written skills are time-consuming to produce and often fail to transfer across models or adapt to novel task variants, occasionally degrading performance in weaker agents or in OOD settings. Automated skill generation from LLMs’ world knowledge, lacking direct exposure to agent errors, produces shallow, brittle skills. Sequential, experience-driven editing—where new lessons are appended to skills as failures are encountered—tends to yield fragmented, overly specialized skills that are hard to maintain and inefficient to retrieve.
Trace2Skill addresses these issues by emulating expert skill formation: aggregating broad execution experience before distilling it into a coherent, singular skill document. This approach focuses on maximizing generalizability and minimizing the risk of model- or instance-specific overfitting (Ni et al., 26 Mar 2026).
2. Methodology: Parallel Patch Extraction and Hierarchical Consolidation
The Trace2Skill pipeline comprises three primary stages:
Stage 1: Trajectory Generation
A frozen agent operates on an evolving set of tasks, each time using the current skill $S_0$. Trajectories are recorded and labeled as successes ($\mathcal{T}^{+}$) or failures ($\mathcal{T}^{-}$). Each trajectory $\tau$ consists of a query $q$, reasoning logs $r_{1:n}$, tool actions $a_{1:n}$, observations $o_{1:n}$, and a final success indicator $y \in \{0, 1\}$.
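As an illustration of the record described above, a trajectory can be held in a small data class and partitioned by outcome. This is a minimal sketch: the class name, field names, and the `split_by_outcome` helper are assumptions made for the example, not identifiers from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """One recorded agent run (field names are illustrative assumptions)."""
    query: str               # task query
    reasoning: List[str]     # reasoning logs, one entry per step
    actions: List[str]       # tool actions, one entry per step
    observations: List[str]  # tool observations, one entry per step
    success: bool            # final success indicator


def split_by_outcome(trajectories):
    """Partition trajectories into (successes, failures)."""
    successes = [t for t in trajectories if t.success]
    failures = [t for t in trajectories if not t.success]
    return successes, failures


runs = [
    Trajectory("sum column B", ["plan"], ["write_formula"], ["ok"], True),
    Trajectory("fix date format", ["plan"], ["edit_cell"], ["error"], False),
]
successes, failures = split_by_outcome(runs)
```

The two resulting lists correspond to the success and failure sets that Stage 2 routes to different analysts.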
Stage 2: Parallel Patch Proposal
A parallel fleet of sub-agents processes each trajectory:
- Success Analysts ($A^{+}$): Extract generalizable positive patterns from successful traces ($\tau \in \mathcal{T}^{+}$).
- Error Analysts ($A^{-}$): Use a multi-turn ReAct-style loop to diagnose failures ($\tau \in \mathcal{T}^{-}$), inspect artifacts, propose minimal corrective patches, and validate causal contributions.
Each analyst emits a trajectory-specific patch $p_i$, a structured edit/diff to $S_0$.
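The fan-out in this stage can be sketched with a thread pool that routes each trajectory to the matching analyst. The two analyst functions below are placeholders that return empty patches; in the framework they would be LLM sub-agents. All names and the dictionary layout are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_success(traj):
    """Placeholder success analyst: would extract positive patterns."""
    return {"source": traj["id"], "kind": "success", "edits": []}


def analyze_failure(traj):
    """Placeholder error analyst: a real one would loop (inspect, propose, validate)."""
    return {"source": traj["id"], "kind": "failure", "edits": []}


def propose_patches(trajectories, max_workers=8):
    """Run one analyst per trajectory in parallel; output order follows input order."""
    def dispatch(traj):
        analyst = analyze_success if traj["success"] else analyze_failure
        return analyst(traj)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(dispatch, trajectories))


trajectories = [
    {"id": "t1", "success": True},
    {"id": "t2", "success": False},
    {"id": "t3", "success": False},
]
patches = propose_patches(trajectories)
```

Because each analyst sees only its own trajectory, the calls are independent and can run concurrently; any cross-trajectory reasoning is deferred to the merge stage.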
Stage 3: Conflict-Free Hierarchical Skill Merging
All patches $\{p_i\}_{i=1}^{N}$ are recursively batched and merged by an LLM-based operator $\mathcal{M}$, forming higher-level patches up to a final, consolidated patch $p^{*}$. Key constraints enforced during merging:
- Edits targeting nonexistent file regions are rejected.
- Overlapping line-range edits are flagged and withheld.
- Only sufficiently prevalent edits (appearing in at least a fraction $\theta$ of source patches) are retained, which imposes an inductive bias toward generalizable lessons.
The final evolved skill $S^{*}$ is obtained by applying the consolidated patch $p^{*}$ to the initial skill $S_0$, written here as $S^{*} = S_0 \oplus p^{*}$. Skill directories consist of a root SKILL.md (Markdown SOP/checklists) and auxiliary resources (scripts, references), represented internally as JSON objects encoding operation type, edit target, and content.
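Two of the merge constraints, rejecting edits to nonexistent regions and keeping only prevalent edits, can be sketched as a simple filter. The sketch assumes that two edits count as the same lesson only when their line range and content match exactly; an LLM merge operator would match them semantically. The edit schema and function names are hypothetical.

```python
from collections import Counter


def in_bounds(edit, n_lines):
    """True if the edit's 1-indexed inclusive line range exists in the file."""
    return 1 <= edit["start"] <= edit["end"] <= n_lines


def edit_key(edit):
    """Assumption: edits are 'the same' only if range and content match exactly."""
    return (edit["start"], edit["end"], edit["content"])


def filter_edits(patches, n_lines, theta):
    """Keep in-bounds edits that appear in at least a theta fraction of patches."""
    counts = Counter()
    examples = {}
    for patch in patches:
        valid = [e for e in patch if in_bounds(e, n_lines)]
        # Count each distinct edit at most once per source patch.
        for key in {edit_key(e) for e in valid}:
            counts[key] += 1
        for e in valid:
            examples.setdefault(edit_key(e), e)
    kept = [examples[k] for k, c in counts.items() if c / len(patches) >= theta]
    return sorted(kept, key=lambda e: e["start"])


recalc = {"start": 10, "end": 10, "content": "Run recalc.py after editing formulas."}
rare = {"start": 3, "end": 4, "content": "Handle merged header cells."}
ghost = {"start": 90, "end": 95, "content": "Edit past end of file."}
patches = [[recalc], [recalc, rare], [recalc, ghost], [ghost]]
kept = filter_edits(patches, n_lines=40, theta=0.5)
```

In this toy input the `recalc` edit appears in three of four patches and is kept, `rare` appears in one of four and falls below the threshold, and `ghost` targets lines beyond a 40-line file and is rejected regardless of how often it recurs.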
Pseudocode Outline
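The stages after trajectory generation can be outlined as follows. This is a simplified sketch under stated assumptions, not the paper's pseudocode: patches are plain lists of lesson strings, `merge_batch` is a deduplicating union standing in for the LLM merge operator, applying a patch just appends its lessons, and the conflict and prevalence checks are omitted. All function names are hypothetical.

```python
def merge_batch(patches):
    """Stand-in for the LLM merge operator: union of edits, duplicates removed."""
    merged = []
    for patch in patches:
        for edit in patch:
            if edit not in merged:
                merged.append(edit)
    return merged


def hierarchical_merge(patches, batch_size=4):
    """Merge patches level by level until one consolidated patch remains."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    if not patches:
        return []
    level = list(patches)
    while len(level) > 1:
        level = [
            merge_batch(level[i:i + batch_size])
            for i in range(0, len(level), batch_size)
        ]
    return level[0]


def evolve_skill(skill_lines, trajectories, analyze, batch_size=4):
    """Propose one patch per trajectory, merge hierarchically, then apply."""
    patches = [analyze(t) for t in trajectories]
    final_patch = hierarchical_merge(patches, batch_size)
    return skill_lines + final_patch  # append-only stand-in for patch application


def toy_analyst(traj):
    """Toy analyst: each trajectory already carries its lesson."""
    return [traj["lesson"]]


trajs = (
    [{"lesson": "Run recalc.py after editing formulas."}] * 3
    + [{"lesson": "Read back cell values to verify."}] * 2
)
skill = evolve_skill(["# Spreadsheet SOP"], trajs, toy_analyst, batch_size=2)
```

With a batch size of 2, the five toy patches are reduced over three merge levels to a single patch containing each lesson once.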
3. Formal Characterization and Inductive Bias
The core objects are:
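- $\tau = (q, r_{1:n}, a_{1:n}, o_{1:n}, y)$: full trajectory
- $\mathcal{L}(\tau)$: set of lessons (generalizable patterns or failures) extracted by analysts
- $p_i$: trajectory-specific patch, encoded as structured diffs to $S_0$
- $\mathcal{M}$: hierarchical LLM-based consolidation
- Conflict detection: Edit intervals $[s_i, e_i]$ and $[s_j, e_j]$ conflict if their ranges overlap, i.e. $\max(s_i, s_j) \le \min(e_i, e_j)$; conflicting edits are flagged and not merged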
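During merge, an edit is admitted only if it meets the prevalence threshold $\theta$, that is, if it appears in at least a fraction $\theta$ of the $N$ source patches. This ensures that only systematically recurring lessons from diverse traces are encoded. The skill-evolution objective is to maximize the agent's task success under the evolved skill $S^{*}$, where $S^{*}$ is the initial skill $S_0$ with the consolidated patch $p^{*}$ applied.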
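The overlap rule for conflict detection can be made concrete with a short sweep over edits sorted by start line. The sketch assumes edits are given as inclusive `(start, end)` line ranges; the function names are hypothetical.

```python
def find_conflicts(edits):
    """Return index pairs (i, j) of edits whose inclusive line ranges overlap."""
    order = sorted(range(len(edits)), key=lambda i: edits[i][0])
    conflicts = []
    for pos, i in enumerate(order):
        _, end_i = edits[i]
        for j in order[pos + 1:]:
            start_j, _ = edits[j]
            if start_j > end_i:
                break  # sorted by start, so no later edit can overlap edit i
            conflicts.append((min(i, j), max(i, j)))
    return sorted(conflicts)


def withhold_conflicting(edits):
    """Split edits into (mergeable, withheld) according to the overlap rule."""
    flagged = {k for pair in find_conflicts(edits) for k in pair}
    mergeable = [e for k, e in enumerate(edits) if k not in flagged]
    withheld = [e for k, e in enumerate(edits) if k in flagged]
    return mergeable, withheld


edits = [(1, 3), (10, 12), (3, 5), (20, 20)]
mergeable, withheld = withhold_conflicting(edits)
```

Here the ranges `(1, 3)` and `(3, 5)` share line 3, so both are withheld, while the other two edits remain mergeable.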
4. Modes of Skill Evolution
Trace2Skill supports two operation modes:
- Deepening: Input $S_0$ is a human-written SOP; Trace2Skill appends synthesized failure checklists and reinforces successful practice patterns.
- Creation: Input $S_0$ is a parametric LLM draft or empty; Trace2Skill constructs a functional skill directory de novo, grounded exclusively in trajectory evidence.
Resulting skill files are static and declarative, with no runtime retrieval or parameter update requirements. Internally, skills and patches are represented as JSON objects and mapped to unified-diff hunks. The LLM merge operator enforces deduplication, atomic reference creation/linking, and conflict resolution.
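To illustrate how a JSON-encoded edit might map to a unified-diff hunk, the sketch below applies one edit to an in-memory SKILL.md and renders the change with Python's standard `difflib`. The JSON schema (`op`, `file`, `start`, `end`, `content`) and the helper names are assumptions made for the example; the paper's exact format is not reproduced here.

```python
import difflib
import json


def apply_edit(lines, edit):
    """Apply one JSON-style edit to a list of lines (1-indexed, inclusive ranges)."""
    start, end = edit["start"], edit["end"]
    new = edit["content"].splitlines()
    if edit["op"] == "insert_after":
        return lines[:start] + new + lines[start:]
    if edit["op"] == "replace":
        return lines[:start - 1] + new + lines[end:]
    raise ValueError(f"unknown op: {edit['op']}")


def to_unified_diff(lines, edit):
    """Render the edit as unified-diff text against the current file content."""
    updated = apply_edit(lines, edit)
    path = edit["file"]
    diff = difflib.unified_diff(
        lines, updated, fromfile=f"a/{path}", tofile=f"b/{path}", lineterm=""
    )
    return "\n".join(diff)


skill_md = ["# Spreadsheet SOP", "1. Inspect the workbook.", "2. Edit formulas."]
edit = json.loads("""
{"op": "insert_after", "file": "SKILL.md", "start": 3, "end": 3,
 "content": "3. After editing formulas, run recalc.py."}
""")
diff_text = to_unified_diff(skill_md, edit)
print(diff_text)
```

Keeping the edit as structured data and deriving the diff from it means range validation and conflict checks can work on the `start` and `end` fields directly, while the diff text remains a human-readable view.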
5. Empirical Results and Performance Gains
Trace2Skill was benchmarked on SpreadsheetBench-Verified, DocVQA, and DAPO-Math domains against baselines including:
- No Skill
- Human-written skills (e.g., Anthropic’s xlsx skill)
- Parametric skills (auto-drafted by Qwen3.5-122B)
- Retrieval-based ReasoningBank
- Sequential online editing
Key results (units are given per row; gains for percentage scores are in percentage points):
| Setting | Baseline | Trace2Skill | Gain |
|---|---|---|---|
| SpreadsheetBench-Verified (%) | 48.3 | 69.8 | +21.5 pp |
| Parametric + Error (%) | 26.2 | 49.0 | +22.8 pp |
| ReasoningBank (%) | 56.0 | n/a | n/a |
| DocVQA, No Skill baseline (ANLS) | 0.6424 | 0.8063 | +0.1639 |
| DAPO-Math, No Skill baseline (%) | 92.0 | 95.0 | +3.0 pp |
Trace2Skill’s trajectory-grounded skills consistently outperform the no-skill, handcrafted, parametric, and retrieval-based baselines. In transfer evaluations, skills distilled by Qwen3.5-35B raised WikiTableQuestions OOD accuracy by 57.7 absolute points for a Qwen3.5-122B agent. Skills authored by the 122B model also transferred to weaker models with marked improvements (+5 pp in math, +13.6 pp in DocVQA) (Ni et al., 26 Mar 2026).
6. Generalization Across Models and Tasks
A central property of Trace2Skill is strong cross-model and cross-domain transfer. Skills are not tuned to idiosyncrasies of any one LLM or task variant. The inductive merge process selectively admits only patterns that recur across diverse trajectories, focusing on universally valid practices (e.g., “after editing formulas, always run recalc.py,” “read back cell values to verify”). This yields skills that apply robustly across model scales and OOD settings without parameter tuning or retrieval (Ni et al., 26 Mar 2026).
7. Limitations and Future Work
Trace2Skill’s inductive merge and prevalence filtering, while critical for generalization, can suppress infrequent but crucial edge-case lessons if the prevalence threshold is set too high. The LLM merge operator may not always fully resolve subtle semantic conflicts between SOPs. For small, homogeneous trajectory sets, overfitting to the observed data remains possible. Potential remedies include:
- Causal patch ablation: Systematically ablating individual edits to quantify downstream performance impact.
- Usage attribution: Mapping agent decisions to specific skill sections to enable pruning of unused instructions.
- Adaptive prevalence: Variable thresholding for different categories of SOPs.
- Human-in-the-loop modes: Integrating expert review in domains with high reliability requirements (Ni et al., 26 Mar 2026).
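As an illustration of the first remedy, causal patch ablation can be approximated by a leave-one-out loop that re-scores the skill with each edit removed. This is only a sketch of the idea: `evaluate` stands in for a full benchmark run, and the toy scorer and its values are invented for the example and are not results from the paper.

```python
def ablate_edits(edits, evaluate):
    """Leave-one-out ablation: score drop when each edit is removed."""
    full_score = evaluate(edits)
    impact = {}
    for i, edit in enumerate(edits):
        without = edits[:i] + edits[i + 1:]
        impact[edit] = full_score - evaluate(without)
    return impact


# Toy scorer standing in for a benchmark run: each edit adds a fixed amount.
TOY_VALUE = {"run recalc.py": 10, "read back cells": 5, "prefer blue headers": 0}


def toy_evaluate(edits):
    return 50 + sum(TOY_VALUE[e] for e in edits)


impact = ablate_edits(list(TOY_VALUE), toy_evaluate)
prunable = [e for e, delta in impact.items() if delta <= 0]
```

Edits whose removal does not lower the score are candidates for pruning. A real evaluation would be noisy, so repeated runs or a significance margin would be needed before dropping an edit.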
Future instantiations may further automate or tighten verification, support fine-grained edit management, and broaden applicability to non-LLM-based agentic frameworks.
References
- "Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills" (Ni et al., 26 Mar 2026)