Ctx2Skill: Context-to-Skill Framework
- Ctx2Skill is a context-to-skill conversion framework that autonomously extracts, refines, and assembles procedural natural-language skills from complex contexts.
- It employs techniques like adversarial self-play, retrieval-conditioned synthesis, and hierarchical consolidation to generate robust skill representations.
- Empirical evaluations demonstrate improved task execution and procedural learning across diverse benchmarks in both context learning and software engineering.
Ctx2Skill is a self-evolving, inference-time skill construction framework for context learning in which an LLM autonomously discovers, refines, and selects context-specific natural-language skills from a complex context without human annotation, external feedback, or model-weight updates (Si et al., 30 Apr 2026). More broadly, the name has become a useful organizing lens for a family of methods that transform task context, trajectories, external knowledge records, or skill documents into reusable procedural assets, including Markdown skill files, structured skill contracts, dynamic skill banks, and even LoRA adapters generated at test time (Si et al., 30 Apr 2026).
1. Concept and scope
In the original formulation, “context learning” means learning relevant knowledge directly from a provided context and then using that knowledge to solve tasks whose required information lies outside parametric memory (Si et al., 30 Apr 2026). Ctx2Skill addresses this by replacing repeated online extraction from a long, technically dense context with inference-time skill augmentation: rules and procedures are distilled into a short Markdown skill file that is prepended to the model prompt (Si et al., 30 Apr 2026).
This framing generalizes naturally to several adjacent research programs. In some systems, the source context is a document or manual; in others it is a pool of trajectories, task metadata, failed interactions, or a repository snapshot. Likewise, the target “skill” varies. Trace-oriented systems produce declarative skill folders or SKILL.md files (Ni et al., 26 Mar 2026); retrieval-conditioned systems synthesize temporary task-specific skills at test time (Wang et al., 16 May 2026); memory-oriented systems maintain skill banks with dynamic addition, retrieval, and pruning (Tu et al., 30 Mar 2026); and parametric systems compile textual skills into LoRA adapters so that downstream execution no longer rereads the skill text (Zhao et al., 29 Jun 2026).
A recurring distinction in this literature is between declarative evidence and procedural guidance. Standard RAG retrieves passages or logs, but leaves the agent to infer procedures repeatedly at inference time. Anything2Skill makes this distinction explicit: the goal is to compile latent procedural knowledge from heterogeneous records into reusable skills, so that retrieval provides declarative evidence while the SkillBank provides executable guidance (Pan et al., 8 Jun 2026). This suggests that Ctx2Skill is best understood not as a single architecture, but as a general conversion problem: transforming context into a durable procedure representation.
2. Skill representations
The dominant representation is textual. In Ctx2Skill, the skill set is a short Markdown file prepended to the system prompt (Si et al., 30 Apr 2026). ParametricSkills defines a skill as a structured textual recipe capturing mature problem-solving experience or procedural knowledge; its appendix specifies a canonical structure with purpose, usage conditions, required inputs, implementation recipe, verification checks, failure modes, and anti-patterns (Zhao et al., 29 Jun 2026). Trace2Skill formalizes a skill as a pair consisting of a root Markdown document and a resource collection containing scripts, references, and assets (Ni et al., 26 Mar 2026). SkillAdaptor stores skills as SKILL.md records with fields such as title, principle, applicability, procedure, qualification criteria, and negative example (Yu et al., 31 May 2026). XSkill represents a skill as a structured object with metadata, a workflow sequence, and reusable tool templates, while experiences are stored separately as compact condition-action items (Jiang et al., 12 Mar 2026).
Anything2Skill adopts the most explicit contract view. Its skill object is a structured record with invocation conditions, contraindications, action moves, workflow steps, constraints, output specifications, supporting evidence, and a confidence score (Pan et al., 8 Jun 2026). This contract representation is designed for compilation, reconciliation, versioning, and skill-tree projection rather than mere prompt insertion (Pan et al., 8 Jun 2026).
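As an illustration of the contract view, the listed fields can be written down as a typed record. This is a minimal sketch, not Anything2Skill's actual schema: the class name `SkillContract`, the field names, the `is_applicable` helper, and the example values are all hypothetical and chosen only to mirror the fields named above.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SkillContract:
    """Hypothetical record mirroring the contract fields named in the text."""
    name: str
    invocation_conditions: List[str]
    contraindications: List[str] = field(default_factory=list)
    action_moves: List[str] = field(default_factory=list)
    workflow_steps: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)
    output_spec: str = ""
    supporting_evidence: List[str] = field(default_factory=list)
    confidence: float = 0.0

    def is_applicable(self, observed: Set[str]) -> bool:
        # Assumption: a skill applies when every invocation condition is
        # observed and no contraindication is observed.
        conditions_met = all(c in observed for c in self.invocation_conditions)
        blocked = any(c in observed for c in self.contraindications)
        return conditions_met and not blocked


# Illustrative instance; the values are invented for the example.
skill = SkillContract(
    name="dedupe-csv",
    invocation_conditions=["input_is_csv"],
    contraindications=["file_is_streaming"],
    workflow_steps=["inspect header", "choose key columns", "drop duplicates"],
    output_spec="deduplicated CSV file",
    confidence=0.8,
)
```

Keeping contraindications as a first-class field, rather than burying them in free text, is what makes operations such as reconciliation and versioning mechanically checkable in a sketch like this.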
A second representation family is parametric. ParametricSkills converts free-form skill text into LoRA parameters through a frozen text encoder and a hypernetwork. The hypernetwork output is a flattened skill adapter, which is reshaped into LoRA factors and merged with a frozen base model, enabling what the paper calls “context-free skill exploitation” (Zhao et al., 29 Jun 2026). This shifts the representation from visible text to behavior-modulating weights.
A third axis concerns granularity. D2Skill separates task skills from step skills, arguing that trajectory-level guidance is too coarse for local error correction (Tu et al., 30 Mar 2026). XSkill similarly separates experiences, which are action-level and tactical, from skills, which are task-level and structured (Jiang et al., 12 Mar 2026). This suggests that “skill” in Ctx2Skill research is not a single ontological category; it can denote a plan, a reusable subroutine, a local correction heuristic, or a parameterized adapter, provided it mediates between context and later behavior.
3. Conversion mechanisms
The original Ctx2Skill framework uses a five-agent self-play loop with Challenger, Reasoner, Judge, Proposer, and Generator, all instantiated with frozen LLMs (Si et al., 30 Apr 2026). The Challenger generates task-rubric pairs from the context; the Reasoner answers them using an evolving Reasoner skill set; the Judge returns binary rubric verdicts; and dedicated Proposer/Generator pairs update either the Reasoner or Challenger skills depending on whether tasks were failed or solved (Si et al., 30 Apr 2026). To prevent adversarial collapse, the framework introduces Cross-Time Replay. It stores hard and easy probe sets, evaluates all candidate Reasoner skill sets on both, and selects the final skill set using Laplace-smoothed solving rates on the hard and easy probes (Si et al., 30 Apr 2026).
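The selection step of Cross-Time Replay can be sketched as follows. The sketch rests on two loudly stated assumptions: that Laplace smoothing takes the common add-one form `(solved + 1) / (total + 2)`, and that the hard and easy rates are combined by a product. The source does not confirm either choice, and the function names and example counts are hypothetical.

```python
from typing import Dict, Tuple

# (hard_solved, hard_total, easy_solved, easy_total) for one candidate skill set
ProbeStats = Tuple[int, int, int, int]


def laplace_rate(solved: int, total: int) -> float:
    # Assumption: add-one smoothing for a binary outcome.
    return (solved + 1) / (total + 2)


def replay_score(stats: ProbeStats) -> float:
    hard_solved, hard_total, easy_solved, easy_total = stats
    # Assumption: the two smoothed rates are combined by a product, so a
    # candidate must do reasonably well on BOTH probe sets to score highly.
    return laplace_rate(hard_solved, hard_total) * laplace_rate(easy_solved, easy_total)


def select_skill_set(candidates: Dict[str, ProbeStats]) -> str:
    """Return the name of the candidate skill set with the highest score."""
    return max(candidates, key=lambda name: replay_score(candidates[name]))


# Invented counts: an early iteration that only solves easy probes, a late
# iteration that overfits to hard probes, and a balanced middle iteration.
candidates = {
    "iteration_1": (1, 10, 9, 10),
    "iteration_2": (4, 10, 8, 10),
    "iteration_3": (6, 10, 3, 10),
}
best = select_skill_set(candidates)
```

Under these assumptions the balanced `iteration_2` is selected, which illustrates the stated purpose of the mechanism: a skill set that gains on hard probes while forgetting easy ones should not win.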
Other systems instantiate different conversion pathways. SkillTTA retrieves a small set of trajectories relevant to a test task, then synthesizes a temporary textual skill from the target task context and the retrieved evidence, with a fixed downstream solver (Wang et al., 16 May 2026). Trace2Skill converts a corpus of agent trajectories into a single reusable skill directory by parallel per-trajectory patch proposal and hierarchical conflict-free consolidation, explicitly treating trajectory-local lessons as raw evidence rather than as final memory (Ni et al., 26 Mar 2026). Anything2Skill decomposes heterogeneous records into evidence windows, performs plan-and-expand extraction, filters for procedural validity, compiles document-local drafts into canonical skills, and then reconciles them against a persistent registry (Pan et al., 8 Jun 2026).
A distinct path is trajectory-driven maintenance. SkillAdaptor begins from failed trajectories, localizes the first actionable fault step, links blame to the skill subset that was actually injected, and then either revises the highest-weighted skill or generates a new skill, with acceptance gated by re-execution (Yu et al., 31 May 2026). SkillEvolver adopts a related but deployment-centered perspective: a meta-skill authors an initial domain skill from contrastive trajectory evidence, deploys that skill to fresh agents, and refines it from the failures those agents encounter while using the skill rather than from exploratory traces alone (Zhang et al., 11 May 2026).
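The localize, attribute, revise-or-create, and re-execute sequence described for SkillAdaptor can be sketched as a small control loop. Everything here is illustrative: the function names, the `blame_threshold` rule for choosing between revision and creation, and the callable stubs standing in for LLM calls are assumptions, not the paper's interface.

```python
from typing import Callable, Dict, List, Optional, Tuple

SkillBank = Dict[str, str]  # skill name -> skill text


def first_fault_step(step_ok: List[bool]) -> Optional[int]:
    """Index of the first failing step in a trajectory, or None if all passed."""
    for index, ok in enumerate(step_ok):
        if not ok:
            return index
    return None


def maintain(
    bank: SkillBank,
    blame: Dict[str, float],                  # weights over injected skills only
    revise: Callable[[str], str],             # stub for a skill-rewriting call
    create: Callable[[], Tuple[str, str]],    # stub for a skill-authoring call
    reexecute: Callable[[SkillBank], bool],   # stub: rerun the failed task
    blame_threshold: float = 0.5,             # assumption, not from the source
) -> SkillBank:
    candidate = dict(bank)
    if blame and max(blame.values()) >= blame_threshold:
        # Revise the highest-weighted injected skill.
        target = max(blame, key=lambda name: blame[name])
        candidate[target] = revise(bank[target])
    else:
        # No injected skill is clearly at fault: author a new skill instead.
        name, text = create()
        candidate[name] = text
    # Qualification gate: keep the change only if re-execution succeeds.
    return candidate if reexecute(candidate) else bank


bank = {"parse-args": "v1", "retry-io": "v1"}
accepted = maintain(bank, {"parse-args": 0.2, "retry-io": 0.7},
                    revise=lambda text: text + "+fix",
                    create=lambda: ("new-skill", "v1"),
                    reexecute=lambda candidate: True)
rejected = maintain(bank, {"retry-io": 0.7},
                    revise=lambda text: text + "+fix",
                    create=lambda: ("new-skill", "v1"),
                    reexecute=lambda candidate: False)
```

The gate at the end is the design point worth noting: a proposed edit never enters the bank on plausibility alone, which matches the description of acceptance being conditioned on re-execution.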
These mechanisms differ in supervision and timing, but they converge on a common principle: extraction quality depends on moving beyond direct summarization of raw context. Ctx2Skill uses adversarial self-play (Si et al., 30 Apr 2026), SkillTTA uses retrieval-conditioned synthesis (Wang et al., 16 May 2026), Trace2Skill uses prevalence-based hierarchical consolidation (Ni et al., 26 Mar 2026), Anything2Skill uses taxonomy-guided compilation (Pan et al., 8 Jun 2026), and ParametricSkills adds exploitation-trajectory supervision so that the compiled object encodes not only content but also how the skill should be used (Zhao et al., 29 Jun 2026).
4. Retrieval, invocation, and execution control
Once a skill exists, the next problem is deciding how it is retrieved, invoked, or composed with ongoing inference. SkillTTA retrieves the top-k trajectories by cosine similarity over stable task metadata, with a small default k, and emphasizes that the target task context is authoritative while retrieved evidence is non-binding (Wang et al., 16 May 2026). XSkill first decomposes the current multimodal task into 2–3 abstract subtasks, uses those to retrieve experiences by embedding similarity, rewrites retrieved experiences to fit the current task and images, and then adapts the global skill document using the rewritten experiences and the current visual context (Jiang et al., 12 Mar 2026). D2Skill performs dual-granularity retrieval with semantic similarity plus utility-aware exploration, and prunes low-value skills when the bank exceeds capacity (Tu et al., 30 Mar 2026).
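Two of the operations mentioned above, similarity-based top-k retrieval and capacity-based pruning, are simple enough to sketch directly. This is a generic illustration using only the standard library: the embeddings, utility scores, and function names are hypothetical, and D2Skill's utility-aware exploration term is not modeled because its exact form is not given here.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against zero vectors rather than dividing by zero.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query: Sequence[float], bank: Dict[str, Sequence[float]], k: int = 3) -> List[str]:
    """Names of the k bank entries most similar to the query embedding."""
    ranked = sorted(bank, key=lambda name: cosine(query, bank[name]), reverse=True)
    return ranked[:k]


def prune(utilities: Dict[str, float], capacity: int) -> Dict[str, float]:
    """Keep only the `capacity` highest-utility skills."""
    keep = sorted(utilities, key=lambda name: utilities[name], reverse=True)[:capacity]
    return {name: utilities[name] for name in keep}


# Invented 2-D metadata embeddings for four stored trajectories.
trajectories = {
    "t1": [1.0, 0.0],
    "t2": [0.9, 0.1],
    "t3": [0.0, 1.0],
    "t4": [-1.0, 0.0],
}
retrieved = top_k([1.0, 0.0], trajectories, k=2)
kept = prune({"s1": 0.9, "s2": 0.1, "s3": 0.5}, capacity=2)
```

In this toy setup `retrieved` contains `t1` and `t2`, and `kept` drops the lowest-utility skill `s2`. A real system would compute embeddings with a text encoder and would need a principled utility estimate, neither of which is attempted here.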
A major correction to naïve “retrieve if relevant” logic is provided by SelSkill. It argues that a skill can be semantically relevant yet still be harmful or unnecessary at the current decision point (Chen et al., 30 May 2026). In its formulation, the policy acts over both ordinary actions and skill invocations, using only visible skill metadata until invocation time (Chen et al., 30 May 2026). Training combines episode-level preference pairs with local invoke/skip counterfactuals, constructed from shared prefixes and labeled by downstream success and efficiency (Chen et al., 30 May 2026). This reframes context-to-skill not merely as routing toward a skill, but as a context-to-intervention decision.
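The idea of a local invoke/skip counterfactual can be illustrated by labeling two continuations of the same prefix. The sketch below assumes a lexicographic rule (success first, then fewer steps) and assumes that no pair is produced when both branches fail; the source says only that labels depend on downstream success and efficiency, so the precise rule, the `Rollout` type, and the step counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Rollout:
    decision: str   # "invoke" or "skip", taken at the shared prefix
    success: bool   # did the continuation solve the task?
    steps: int      # length of the continuation, used as an efficiency proxy


def preference_pair(invoke: Rollout, skip: Rollout) -> Optional[Tuple[Rollout, Rollout]]:
    """Return (preferred, rejected), or None when the pair is uninformative."""
    if invoke.success != skip.success:
        # Exactly one branch succeeded: prefer it.
        return (invoke, skip) if invoke.success else (skip, invoke)
    if not invoke.success:
        # Assumption: two failures give no usable preference signal.
        return None
    if invoke.steps != skip.steps:
        # Both succeeded: prefer the shorter continuation.
        return (invoke, skip) if invoke.steps < skip.steps else (skip, invoke)
    return None


# The skill is relevant, but skipping it also succeeds and is cheaper.
pair = preference_pair(Rollout("invoke", True, 12), Rollout("skip", True, 7))
```

Here the preferred branch is `skip`, which is the case the text emphasizes: a relevant skill can still be the wrong intervention when the agent can already solve the task more cheaply without it.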
SkillS is related but narrower. It is a state-conditioned scheduler over pretrained temporally extended skills in reinforcement learning: the scheduler selects a skill index and a duration based on the current state, while the final policy is learned separately from replay data (Vezzani et al., 2022). The paper is explicit that this is not an explicit context-embedding approach; “context” remains implicit in state and reward structure rather than in a separate learned task variable (Vezzani et al., 2022). It is therefore best viewed as a neighboring approach to context-conditioned sequencing rather than a full Ctx2Skill model.
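The control flow of a scheduler that emits a skill index together with a duration can be shown on a toy one-dimensional task. This is only a structural sketch: in SkillS the scheduler is learned, whereas the `scheduler` function here is hand-written, and the environment, the two skills, and the duration cap of 3 are invented for illustration.

```python
from typing import Callable, List, Tuple

# Two toy pretrained "skills": each maps a state to a primitive action.
SKILLS: List[Callable[[int], int]] = [
    lambda state: +1,   # skill 0: step right
    lambda state: -1,   # skill 1: step left
]


def scheduler(state: int, goal: int) -> Tuple[int, int]:
    """Hand-written stand-in for a learned scheduler: (skill_index, duration)."""
    distance = goal - state
    skill_index = 0 if distance > 0 else 1
    duration = max(1, min(abs(distance), 3))  # commit to the skill for 1 to 3 steps
    return skill_index, duration


def run_episode(start: int, goal: int, max_decisions: int = 10):
    state = start
    replay: List[Tuple[int, int, int]] = []  # (state, action, next_state)
    for _ in range(max_decisions):
        if state == goal:
            break
        skill_index, duration = scheduler(state, goal)
        for _ in range(duration):
            action = SKILLS[skill_index](state)
            next_state = state + action
            # Transitions are logged so a separate policy could be trained
            # from them later; that training step is not shown.
            replay.append((state, action, next_state))
            state = next_state
    return state, replay


final_state, replay = run_episode(start=0, goal=5)
```

The point of the duration output is visible even in this toy: the scheduler makes two decisions (three steps, then two) rather than five, which is the sense in which the skills are temporally extended.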
ParametricSkills pushes execution control into weight space. It supports multi-skill composition via weighted low-rank updates, implemented with rank concatenation and norm-based calibration before merging (Zhao et al., 29 Jun 2026). It also introduces a preliminary continual-learning mechanism with an accumulated global adapter, updated by EMA after successful tasks (Zhao et al., 29 Jun 2026). This indicates that once context has been compiled into parameters, retrieval, composition, and accumulation can all be performed in adapter space rather than in prompt space.
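Why rank concatenation can implement a weighted sum of low-rank updates follows from a standard identity: a weighted sum of products B_i A_i equals the product of the column-wise concatenation of the scaled B_i with the row-wise stack of the A_i. The sketch below checks that identity on tiny matrices in plain Python and adds a generic EMA update. The norm-based calibration step is omitted, and the function names, shapes, and `decay` value are illustrative assumptions rather than the paper's settings.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Adapter = Tuple[Matrix, Matrix]  # (B, A): B is d_out x r, A is r x d_in


def matmul(x: Matrix, y: Matrix) -> Matrix:
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def compose_by_sum(adapters: List[Adapter], weights: List[float]) -> Matrix:
    """Weighted sum of the dense updates w_i * (B_i @ A_i)."""
    d_out, d_in = len(adapters[0][0]), len(adapters[0][1][0])
    total = [[0.0] * d_in for _ in range(d_out)]
    for (b, a), w in zip(adapters, weights):
        update = matmul(b, a)
        for i in range(d_out):
            for j in range(d_in):
                total[i][j] += w * update[i][j]
    return total


def compose_by_rank_concat(adapters: List[Adapter], weights: List[float]) -> Matrix:
    """Same update, built from one concatenated pair of low-rank factors."""
    d_out = len(adapters[0][0])
    b_cat: Matrix = [[] for _ in range(d_out)]
    a_cat: Matrix = []
    for (b, a), w in zip(adapters, weights):
        for i in range(d_out):
            b_cat[i].extend(w * value for value in b[i])  # scale and append columns
        a_cat.extend(a)                                   # stack rows
    return matmul(b_cat, a_cat)


def ema_update(global_delta: Matrix, new_delta: Matrix, decay: float = 0.9) -> Matrix:
    # Generic exponential moving average; the decay value is an assumption.
    return [[decay * g + (1.0 - decay) * n for g, n in zip(g_row, n_row)]
            for g_row, n_row in zip(global_delta, new_delta)]


adapters = [([[1.0], [2.0]], [[1.0, 0.0]]), ([[0.0], [1.0]], [[0.0, 2.0]])]
weights = [0.5, 0.25]
summed = compose_by_sum(adapters, weights)
concatenated = compose_by_rank_concat(adapters, weights)
```

Both routes give the same dense update for these inputs. The concatenated form keeps the result factored at rank equal to the sum of the individual ranks, which is presumably why it is attractive before merging into the base weights.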
5. Empirical evidence across domains
Ctx2Skill itself is evaluated on CL-bench, a benchmark with 500 contexts, 1,899 tasks, and 31,607 rubrics across Domain Knowledge Reasoning, Rule System Application, Procedural Task Execution, and Empirical Discovery & Simulation (Si et al., 30 Apr 2026). Base solving rates are low even for frontier models, and the framework improves all three tested backbones: GPT-4.1 from 11.1% to 16.5%, GPT-5.1 from 21.1% to 25.8%, and GPT-5.2 from 18.2% to 21.4% (Si et al., 30 Apr 2026). The gains are especially pronounced on Procedural Task Execution and Empirical Discovery & Simulation, and removing Cross-Time Replay causes substantial degradation, including a drop from 16.5 to 14.7 on GPT-4.1 (Si et al., 30 Apr 2026).
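The reported CL-bench figures can be restated as absolute and relative gains with a few lines of arithmetic. The solving rates below are copied from the paragraph above; the helper name `gains` and the rounding to one decimal place are choices made for this illustration.

```python
# Solving rates in percent: (base, with Ctx2Skill), as reported above.
reported = {
    "GPT-4.1": (11.1, 16.5),
    "GPT-5.1": (21.1, 25.8),
    "GPT-5.2": (18.2, 21.4),
}


def gains(base: float, with_skills: float):
    """Return (absolute gain in points, relative gain in percent of base)."""
    absolute = round(with_skills - base, 1)
    relative = round(100.0 * (with_skills - base) / base, 1)
    return absolute, relative


summary = {model: gains(*rates) for model, rates in reported.items()}

# Cross-Time Replay ablation on GPT-4.1: 16.5 with replay, 14.7 without.
replay_contribution = round(16.5 - 14.7, 1)
```

The absolute gains are 5.4, 4.7, and 3.2 points respectively, and the ablation attributes 1.8 of the 5.4 points on GPT-4.1 to Cross-Time Replay. Because the base rates are low, the relative gains are considerably larger than the absolute ones, which is worth keeping in mind when comparing across backbones.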
ParametricSkills provides a complementary result in software engineering. On six SWE subtasks, it achieves average LLM-judge score 64.09 versus 57.65 for in-context skill prompting, a gain of 6.44 points, while also improving BERT and F1 on average by +1.17 and +5.53 percentage points (Zhao et al., 29 Jun 2026). In HumanEval self-evolving evaluation, a poor initial generated skill can hurt, but iterative refinement with verification lifts performance from 75.00% for base parametric skill to 84.76%, surpassing the 81.09% no-skill baseline (Zhao et al., 29 Jun 2026). In an online continual experiment on a HumanEval subset, online continual merge reaches 16/31 versus 12/31 for independent self-evolution and 9/31 for the base model (Zhao et al., 29 Jun 2026).
SkillTTA shows that adaptive skill synthesis can outperform static skill libraries while keeping the solver fixed. On SpreadsheetBench, task-specific skills improve Pass@1 from 0.397 to 0.505 over static trajectory-to-skill synthesis with GPT-5.5; on BigCodeBench, Pass@1 improves from 0.517 to 0.651; and on ALFWorld, SkillTTA reaches 0.872–0.879 success while producing the shortest successful trajectories among reported methods, within roughly four points of MemRL’s success rate (Wang et al., 16 May 2026). Trace2Skill, in a different setting, reports that skills evolved by Qwen3.5-35B on its own trajectories improved a Qwen3.5-122B agent by up to 57.65 absolute percentage points on WikiTableQuestions, supporting the claim that trajectory-grounded skills can transfer across model scales (Ni et al., 26 Mar 2026).
Several systems focus on control and maintenance rather than skill synthesis alone. SelSkill improves task success on ALFWorld by 10.9 percentage points and execution precision by 29.1 points with Qwen3-8B, and on BFCL improves success by 5.7 points and execution precision by 29.5 points (Chen et al., 30 May 2026). D2Skill reports success-rate gains of 10–20 points over skill-free baselines on ALFWorld and WebShop, with ablations showing that both dual-granularity skill modeling and dynamic skill maintenance are critical (Tu et al., 30 Mar 2026). XSkill consistently improves over tool-only and learning-based baselines across five multimodal benchmarks and shows that skills and experiences influence agent behavior differently: skills reduce syntax and tool-name errors, while experiences shift tool usage toward more targeted strategies (Jiang et al., 12 Mar 2026). Anything2Skill, finally, reports that combining compiled skills with RAG yields 98.85% success on qsv and 94.10% on GitHub-CLI, substantially outperforming RAG-only agents (Pan et al., 8 Jun 2026).
Taken together, these results indicate that the Ctx2Skill pattern is not confined to one benchmark family. It appears in context learning, software engineering, multimodal agents, long-context EDA, reinforcement learning, and command-line task execution. The common empirical pattern is that explicitly compiled procedures outperform purely declarative retrieval or static prompting when downstream tasks require repeated, structured action.
6. Limitations, misconceptions, and open problems
A common misconception is that Ctx2Skill simply means “retrieve a helpful prompt.” The literature is broader. Skills may be short Markdown guides (Si et al., 30 Apr 2026), persistent folder-like artifacts with scripts and references (Ni et al., 26 Mar 2026), structured contracts in a SkillBank (Pan et al., 8 Jun 2026), or test-time generated LoRA adapters (Zhao et al., 29 Jun 2026). Another misconception is that relevance is sufficient for invocation. SelSkill directly disputes this by showing that a relevant skill may still be harmful if invoked before preconditions are met or when current parametric knowledge already suffices (Chen et al., 30 May 2026).
Another subtle point concerns what skill compilation does not solve. ParametricSkills eliminates the need to reread skill text during execution, but not the need to read task context itself; the method trades token-time retrieval and reasoning over the skill text for one-shot parameter synthesis (Zhao et al., 29 Jun 2026). Similarly, Trace2Skill and SkillEvolver externalize experience into reusable artifacts, but they do not remove the need for environment interaction, deployment-time evaluation, or audit mechanisms (Ni et al., 26 Mar 2026, Zhang et al., 11 May 2026).
Several technical limitations recur. Retrieval quality remains critical in context-conditioned synthesis; SkillTTA shows that random or overly large retrieval sets hurt, and that top-3 retrieval is better than top-5 or top-9 in its setting (Wang et al., 16 May 2026). Context specificity is also unresolved: Ctx2Skill’s own skills are primarily designed to generalize across tasks over the same context rather than across arbitrary contexts (Si et al., 30 Apr 2026), and XSkill stores a single evolving global skill document whose scalability to open-ended lifelong settings is left open (Jiang et al., 12 Mar 2026). In weight-space systems, compression bottlenecks and semantic conflict remain. ParametricSkills notes limited adapter capacity, possible poor compression of rich or inconsistent skill documents, and unresolved interference in multi-skill merging despite Frobenius-norm normalization (Zhao et al., 29 Jun 2026).
Maintenance and attribution introduce additional cost. SkillAdaptor requires localization, attribution, rewriting, and qualification by re-execution; its gains collapse when either the Localizer/Linker or the qualifier gate is removed (Yu et al., 31 May 2026). SkillEvolver depends on repeated deployment to fresh agents and on an audit that can detect leakage, under-abstraction, and silent-bypass failure modes, where a skill appears plausible in content but is never invoked at runtime (Zhang et al., 11 May 2026). D2Skill’s results likewise show that unmanaged skill accumulation is harmful: removing skill management drops validation success from 72.7 to 57.8 in one ALFWorld setting (Tu et al., 30 Mar 2026).
Open problems are therefore structural rather than merely incremental. The field still lacks a unified treatment of skill conflict, long-horizon lifecycle management, cross-context transfer, direct multimodal indexing for skill retrieval, and rigorous compositional generalization benchmarks. The evidence nevertheless suggests a stable conceptual conclusion: the most effective Ctx2Skill systems do not merely summarize context. They convert it into reusable procedure representations, maintain those representations over time, and couple them to downstream behavior through retrieval, adaptation, invocation control, or parameter synthesis (Si et al., 30 Apr 2026).