
Trace2Skill Framework

Updated 28 April 2026
  • Trace2Skill is a methodology that transforms raw execution traces into declarative, transferable skills, enhancing LLM agents' generalizability.
  • It employs parallel patch extraction and hierarchical merging to distill standard operating procedures and failure checklists from diverse agent trajectories.
  • Empirical results demonstrate significant performance gains and robust out-of-distribution transfer across various domains and model scales.

The Trace2Skill framework is a methodology for converting raw agent execution traces into transferable, declarative skills that substantially enhance the generalizability and robustness of LLM agents. By orchestrating parallel analysis of diverse agent trajectories and consolidating resulting lessons through hierarchical merging, Trace2Skill systematically distills domain-general standard operating procedures (SOPs), tool-use guidelines, and failure checklists, eliminating the overfitting and fragmentation common in traditional sequential or parametric approaches. Evolved skills are encoded as static skill directories and have demonstrated significant empirical gains over handcrafted, parametric, and retrieval-based baselines, including robust out-of-distribution (OOD) transfer across LLM scales and task domains (Ni et al., 26 Mar 2026).

1. Motivation and Core Challenges

Trace2Skill was motivated by the bottlenecks of skill authoring for LLM agents in domains such as spreadsheet manipulation, VisionQA, and mathematical reasoning. Manually written skills are time-consuming to produce and often fail to transfer across models or adapt to novel task variants, occasionally degrading performance in weaker agents or in OOD settings. Automated skill generation from LLMs’ world knowledge, lacking direct exposure to agent errors, produces shallow, brittle skills. Sequential, experience-driven editing—where new lessons are appended to skills as failures are encountered—tends to yield fragmented, overly specialized skills that are hard to maintain and inefficient to retrieve.

Trace2Skill addresses these issues by emulating expert skill formation: aggregating broad execution experience before distilling it into a coherent, singular skill document. This approach focuses on maximizing generalizability and minimizing the risk of model- or instance-specific overfitting (Ni et al., 26 Mar 2026).

2. Methodology: Parallel Patch Extraction and Hierarchical Consolidation

The Trace2Skill pipeline comprises three primary stages:

Stage 1: Trajectory Generation

A frozen agent $\pi_\theta$ operates on an evolving set $D_e$ of tasks, each time using the current skill $S_0$. Trajectories $\tau_1, \ldots, \tau_N$ are recorded and labeled as successes ($y=1$) or failures ($y=0$). Each trajectory $\tau$ consists of a query $q$, reasoning logs $r_k$, tool actions $a_k$, observations $o_k$, and a final success indicator $y$.
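As a minimal sketch of the trajectory record described above, the components could be held in two small dataclasses. The class and field names (`Step`, `Trajectory`, `split_by_outcome`) are illustrative assumptions, not identifiers from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Step:
    reasoning: str    # reasoning log r_k
    action: str       # tool action a_k
    observation: str  # observation returned after the action


@dataclass
class Trajectory:
    query: str                                   # task query q
    steps: List[Step] = field(default_factory=list)
    success: int = 0                             # label y: 1 = success, 0 = failure


def split_by_outcome(trajs: List[Trajectory]) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Partition trajectories into (successes, failures) by their label."""
    successes = [t for t in trajs if t.success == 1]
    failures = [t for t in trajs if t.success == 0]
    return successes, failures


# Hypothetical example data.
demo = [
    Trajectory("sum column B", [Step("need a formula", "write_cell", "ok")], success=1),
    Trajectory("fix broken chart", [Step("inspect chart", "read_sheet", "error")], success=0),
]
wins, losses = split_by_outcome(demo)
```

Keeping the label on the record lets Stage 2 route each trajectory to the right analyst without re-evaluating the task.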

Stage 2: Parallel Patch Proposal

A parallel fleet of sub-agents processes each trajectory:

  • Success Analysts: Extract generalizable positive patterns from successful traces ($y=1$).
  • Error Analysts: Use a multi-turn ReAct-style loop to diagnose failures ($y=0$), inspect artifacts, propose minimal corrective patches, and validate causal contributions.

Each analyst emits a trajectory-specific patch: a structured edit/diff to $S_0$.
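The fan-out in this stage can be sketched with a thread pool that routes each trajectory to one of two analyst functions. Both analysts here are placeholders standing in for LLM calls, and the dictionary keys (`id`, `y`, `note`, `kind`, `edits`) are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def success_analyst(traj):
    # Placeholder for an LLM call that extracts a reusable positive pattern.
    return {
        "source": traj["id"],
        "kind": "success",
        "edits": [{"op": "append", "target": "SKILL.md",
                   "content": f"Practice: {traj['note']}"}],
    }


def error_analyst(traj):
    # Placeholder for the multi-turn diagnosis loop; a real analyst would
    # inspect artifacts and validate the proposed fix before emitting it.
    return {
        "source": traj["id"],
        "kind": "error",
        "edits": [{"op": "append", "target": "SKILL.md",
                   "content": f"Checklist: {traj['note']}"}],
    }


def propose_patches(trajectories, max_workers=4):
    """Return one patch per trajectory, in input order."""
    def analyse(traj):
        analyst = success_analyst if traj["y"] == 1 else error_analyst
        return analyst(traj)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyse, trajectories))  # map preserves order


trajs = [
    {"id": 0, "y": 1, "note": "read back cell values to verify"},
    {"id": 1, "y": 0, "note": "run recalc.py after editing formulas"},
]
patches = propose_patches(trajs)
```

Because each analyst sees only its own trajectory, the calls are independent and can run concurrently; threads suit this sketch because real analyst calls would be I/O-bound.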

Stage 3: Conflict-Free Hierarchical Skill Merging

All patches are recursively batched and merged by an LLM-based merge operator, forming higher-level patches until a single consolidated patch remains. Key constraints enforced during merging:

  • Edits targeting nonexistent file regions are rejected.
  • Overlapping line-range edits are flagged and withheld.
  • Only sufficiently prevalent edits (those appearing in at least a threshold fraction of source patches) are retained, which imposes an inductive bias toward general lessons.

The final evolved skill is obtained by applying the consolidated patch to $S_0$. Skill directories consist of a root SKILL.md (Markdown SOP/checklists) and auxiliary resources (scripts, references), represented internally as JSON objects encoding operation type, edit target, and content.
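The first two constraints can be sketched as a screening pass over JSON-like edits. The edit schema below (`file`, `start`, `end` with 1-indexed inclusive line ranges) and the function names are assumptions for illustration; the paper's actual schema may differ.

```python
def in_bounds(edit, file_lengths):
    """True if the edit targets an existing file and a valid line range."""
    length = file_lengths.get(edit["file"])
    if length is None:
        return False
    return 1 <= edit["start"] <= edit["end"] <= length


def overlaps(a, b):
    """True if two edits touch intersecting line ranges of the same file."""
    return (a["file"] == b["file"]
            and a["start"] <= b["end"]
            and b["start"] <= a["end"])


def screen_edits(edits, file_lengths):
    """Split edits into (accepted, rejected, flagged).

    rejected: target a nonexistent file or line region
    flagged:  overlap another candidate edit and are withheld
    """
    rejected = [e for e in edits if not in_bounds(e, file_lengths)]
    candidates = [e for e in edits if in_bounds(e, file_lengths)]
    flagged_idx = set()
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            if overlaps(candidates[i], candidates[j]):
                flagged_idx.update((i, j))
    accepted = [e for k, e in enumerate(candidates) if k not in flagged_idx]
    flagged = [e for k, e in enumerate(candidates) if k in flagged_idx]
    return accepted, rejected, flagged


lengths = {"SKILL.md": 10}
e1 = {"file": "SKILL.md", "start": 1, "end": 2}
e2 = {"file": "SKILL.md", "start": 2, "end": 3}    # overlaps e1
e3 = {"file": "SKILL.md", "start": 5, "end": 6}
e4 = {"file": "SKILL.md", "start": 50, "end": 60}  # beyond end of file
e5 = {"file": "missing.md", "start": 1, "end": 1}  # nonexistent file
accepted, rejected, flagged = screen_edits([e1, e2, e3, e4, e5], lengths)
```

In this sketch both members of an overlapping pair are withheld, which is one reading of "flagged and withheld"; keeping one of the two would be a different policy.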

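Pseudocode Outline

  1. For each task in $D_e$, run the frozen agent $\pi_\theta$ with skill $S_0$, record the trajectory, and label it as a success or failure.
  2. In parallel, send each successful trajectory to a success analyst and each failed trajectory to an error analyst; collect one patch per trajectory.
  3. Batch the patches and merge each batch with the LLM-based operator, repeating on the merged results until one consolidated patch remains.
  4. Apply the consolidated patch to $S_0$ to obtain the evolved skill (Ni et al., 26 Mar 2026).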
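The recursive batching of Stage 3 can be sketched as a loop that merges fixed-size batches level by level. Here `merge_batch` is a deliberately simple stand-in for the LLM-based merge operator (it concatenates edits and drops exact duplicates), and `batch_size` is a hypothetical parameter.

```python
def merge_batch(patches):
    """Stand-in for the LLM merge operator: union of edits, first-seen order."""
    merged, seen = [], set()
    for patch in patches:
        for edit in patch:
            if edit not in seen:
                seen.add(edit)
                merged.append(edit)
    return merged


def hierarchical_merge(patches, batch_size=4):
    """Merge patches in batches, level by level, until one patch remains."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 to make progress")
    if not patches:
        return []
    level = list(patches)
    while len(level) > 1:
        level = [merge_batch(level[i:i + batch_size])
                 for i in range(0, len(level), batch_size)]
    return level[0]


# Each patch is a list of edit strings in this toy example.
toy_patches = [["a", "b"], ["b", "c"], ["c", "d"], ["a"], ["e"]]
final_patch = hierarchical_merge(toy_patches, batch_size=2)
```

Merging in batches keeps each merge call's input small; with N patches and batch size b, the number of levels grows roughly as the logarithm of N in base b.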

3. Formal Characterization and Inductive Bias

The core objects are:

  • Trajectory $\tau$: the full execution record of one task attempt
  • Lessons: the set of generalizable patterns and failure causes extracted by analysts
  • Patches: lessons encoded as structured diffs to $S_0$
  • Merge operator: hierarchical LLM-based consolidation of patches
  • Conflict detection: two edits conflict if their line intervals overlap; conflicting edits are flagged and not merged

During merging, an edit is admitted only if it meets a minimum prevalence threshold, so that only lessons recurring systematically across diverse traces are encoded. The skill-evolution objective is to maximize the success rate of the frozen agent $\pi_\theta$ when it uses the evolved skill, that is, $S_0$ with the consolidated patch applied.
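A minimal sketch of the prevalence filter follows. It assumes edits have already been normalized to comparable keys (here plain strings) so that recurrence can be counted by exact match; in practice, deciding that two edits express the same lesson is left to the LLM merge operator. The function name and `threshold` parameter are illustrative.

```python
from collections import Counter


def prevalent_edits(patches, threshold):
    """Keep edits that appear in at least `threshold` fraction of patches."""
    n = len(patches)
    if n == 0:
        return []
    counts = Counter()
    for patch in patches:
        counts.update(set(patch))  # count each edit at most once per patch
    return sorted(e for e, c in counts.items() if c / n >= threshold)


source_patches = [
    ["recalc", "readback"],
    ["recalc"],
    ["recalc", "rare-edge-case"],
    ["readback"],
]
kept = prevalent_edits(source_patches, threshold=0.5)
```

With four source patches and a threshold of 0.5, an edit must appear in at least two patches, so the edit seen once is dropped.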

4. Modes of Skill Evolution

Trace2Skill supports two operation modes:

  • Deepening: The input $S_0$ is a human-written SOP; Trace2Skill appends synthesized failure checklists and reinforces successful practice patterns.
  • Creation: The input $S_0$ is a parametric LLM draft or empty; Trace2Skill constructs a functional skill directory de novo, grounded exclusively in trajectory evidence.

Resulting skill files are static and declarative, with no runtime retrieval or parameter update requirements. Internally, skills and patches are represented as JSON objects and mapped to unified-diff hunks. The LLM merge operator enforces deduplication, atomic reference creation/linking, and conflict resolution.
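To illustrate how a JSON-style edit might map to a unified-diff hunk, the sketch below applies an edit to the lines of a skill file and renders the change with Python's standard `difflib`. The edit schema (`op`, `start`, `end`, `content`) and the two supported operations are assumptions, not the paper's format.

```python
import difflib


def apply_edit(lines, edit):
    """Return a new list of lines with the edit applied (input is not mutated)."""
    if edit["op"] == "append":
        return lines + edit["content"]
    if edit["op"] == "replace":
        start, end = edit["start"], edit["end"]  # 1-indexed, inclusive
        return lines[:start - 1] + edit["content"] + lines[end:]
    raise ValueError(f"unknown op: {edit['op']}")


def to_unified_diff(lines, edit, path="SKILL.md"):
    """Render the effect of one edit as unified-diff lines."""
    new_lines = apply_edit(lines, edit)
    return list(difflib.unified_diff(
        lines, new_lines,
        fromfile=f"a/{path}", tofile=f"b/{path}", lineterm=""))


skill = ["# Spreadsheet skill", "1. Edit formulas."]
checklist_edit = {
    "op": "append",
    "content": ["2. After editing formulas, always run recalc.py."],
}
diff = to_unified_diff(skill, checklist_edit)
print("\n".join(diff))
```

An append like this corresponds to the deepening mode, where checklist items are added to an existing SOP; in creation mode the starting `lines` list would be a draft or empty.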

5. Empirical Results and Performance Gains

Trace2Skill was benchmarked on SpreadsheetBench-Verified, DocVQA, and DAPO-Math domains against baselines including:

  • No Skill
  • Human-written skills (e.g., Anthropic’s xlsx skill)
  • Parametric skills (auto-drafted by Qwen3.5-122B)
  • Retrieval-based ReasoningBank
  • Sequential online editing

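Key results (percent, with gains in percentage points, except the DocVQA row, which reports ANLS):

| Setting | Baseline | Trace2Skill | Gain |
| --- | --- | --- | --- |
| SpreadsheetBench Verified | 48.3 | 69.8 | +21.5 |
| Parametric + Error | 26.2 | 49.0 | +22.8 |
| ReasoningBank | 56.0 | n/a | n/a |
| DocVQA (ANLS, No Skill) | 0.6424 | 0.8063 | +0.1639 |
| DAPO-Math (No Skill) | 92.0 | 95.0 | +3.0 |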

Trace2Skill’s trajectory-grounded skills consistently outperform baseline and retrieval-based approaches. Notably, in transfer evaluations, skills distilled by Qwen3.5-35B led to +57.7 absolute points in WikiTableQuestions OOD accuracy for a Qwen3.5-122B agent; 122B-authored skills transferred to weaker models with marked improvements (+5pp in math, +13.6pp in DocVQA) (Ni et al., 26 Mar 2026).

6. Generalization Across Models and Tasks

A central property of Trace2Skill is strong cross-model and cross-domain transfer. Skills are not tuned to idiosyncrasies of any one LLM or task variant. The inductive merge process selectively admits only patterns that recur across diverse trajectories, focusing on universally valid practices (e.g., “after editing formulas, always run recalc.py,” “read back cell values to verify”). This yields skills that apply robustly across model scales and OOD settings without parameter tuning or retrieval (Ni et al., 26 Mar 2026).

7. Limitations and Future Work

Trace2Skill’s inductive merge and prevalence filtering mechanism, while critical for generalization, can occlude infrequent but crucial edge-case lessons if the prevalence threshold is set improperly. The LLM merge operator may not always fully resolve subtle semantic conflicts between SOPs. For limited, homogeneous trajectory sets, overfitting to observed data remains possible. Potential remedies include:

  • Causal patch ablation: Systematically ablating individual edits to quantify downstream performance impact.
  • Usage attribution: Mapping agent decisions to specific skill sections to enable pruning of unused instructions.
  • Adaptive prevalence: Variable thresholding for different categories of SOPs.
  • Human-in-the-loop modes: Integrating expert review in domains with high reliability requirements (Ni et al., 26 Mar 2026).
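The causal patch ablation idea can be sketched as a leave-one-out loop. The `evaluate` callback is hypothetical: it stands for running the agent with a skill built from the given edits and returning a score. The toy scorer below returns made-up integer counts purely to show the mechanics.

```python
def ablate_edits(edits, evaluate):
    """Leave-one-out impact: score with all edits minus score without each one."""
    baseline = evaluate(edits)
    impact = {}
    for i, edit in enumerate(edits):
        without = edits[:i] + edits[i + 1:]
        impact[edit] = baseline - evaluate(without)
    return impact


def toy_evaluate(edits):
    # Hypothetical scorer: tasks solved out of 10 (illustrative numbers only).
    score = 5
    if "recalc" in edits:
        score += 2
    if "bad-advice" in edits:
        score -= 1
    return score


impact = ablate_edits(["recalc", "bad-advice", "unused"], toy_evaluate)
```

A positive impact means removing the edit hurts performance, a negative value suggests the edit is harmful, and zero marks a candidate for pruning. Leave-one-out needs one extra evaluation per edit and does not capture interactions between edits.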

Future instantiations may further automate or tighten verification, support fine-grained edit management, and broaden applicability to non-LLM-based agentic frameworks.

