From Context to Skills: Can Language Models Learn from Context Skillfully?

Published 30 Apr 2026 in cs.AI | (2604.27660v2)

Abstract: Many real-world tasks require LMs to reason over complex contexts that exceed their parametric knowledge. This calls for context learning, where LMs directly learn relevant knowledge from the given context. An intuitive solution is inference-time skill augmentation: extracting the rules and procedures from context into natural-language skills. However, constructing such skills for context learning scenarios faces two challenges: the prohibitive cost of manual skill annotation for long, technically dense contexts, and the lack of external feedback for automated skill construction. In this paper, we propose Ctx2Skill, a self-evolving framework that autonomously discovers, refines, and selects context-specific skills without human supervision or external feedback. At its core, a multi-agent self-play loop has a Challenger that generates probing tasks and rubrics, a Reasoner that attempts to solve them guided by an evolving skill set, and a neutral Judge that provides binary feedback. Crucially, both the Challenger and the Reasoner evolve through accumulated skills: dedicated Proposer and Generator agents analyze failure cases and synthesize them into targeted skill updates for both sides, enabling automated skill discovery and refinement. To prevent adversarial collapse caused by increasingly extreme task generation and over-specialized skill accumulation, we further introduce a Cross-time Replay mechanism that identifies the skill set achieving the best balance across representative cases for the Reasoner side, ensuring robust and generalizable skill evolution. The resulting skills can be plugged into any LLM to obtain better context learning capability. Evaluated on four context learning tasks from CL-bench, Ctx2Skill consistently improves solving rates across backbone models.

Abstract PDF Upgrade to Chat

Authors (13)

Summary

The paper introduces Ctx2Skill, a framework that automatically extracts modular skills from context, achieving up to 5.4% improvements on CL-bench tasks.
It employs a self-evolving multi-agent system and a Cross-Time Replay mechanism to mitigate adversarial collapse and optimize skill generalization.
Experimental results demonstrate robust gains in complex reasoning tasks with effective transferability across different language model backbones.

Ctx2Skill: Autonomous Discovery of Context-Specific Skills for LLMs

Motivation and Problem Formulation

Ctx2Skill addresses a critical gap in current LLM capabilities: robust context learning in scenarios where domain-specific, complex contexts must be used for reasoning, rather than relying on parametric knowledge acquired during pretraining. In realistic use cases—such as interpreting technical documentation, scientific data, or extensive legal texts—current LMs struggle to extract, synthesize, and operationalize procedural rules or latent knowledge from context alone. Previously, most skill augmentation approaches have relied on costly manual annotation or external feedback (such as programmatic task success signals). These are infeasible for dense, specialized contexts and have limited utility for black-box or closed LMs that cannot be further fine-tuned.

The Ctx2Skill framework redefines skill acquisition as an automated, feedback-free process, leveraging only the context itself to produce reusable natural language skills—modular, explicit procedural rules or strategies—packaged for injection at inference time. This approach aims to circumvent prohibitive annotation requirements and the absence of external evaluative signals ubiquitous in document-centric tasks.

Figure 1: Ctx2Skill autonomously extracts skills in natural language from complex context, requiring neither annotation nor feedback signals.

Framework Architecture

At its core, Ctx2Skill implements a self-evolving multi-agent system featuring three primary roles:

Challenger: Generates probing tasks and associated rubrics from the context and its current skill knowledge, seeking to expose overlooked or poorly captured context-specific knowledge.
Reasoner: Attempts to solve these tasks using the same context plus its current skill set, which is iteratively distilled and augmented.
Judge: Provides fine-grained binary feedback per rubric but uses no gold standard or additional external signals.

Failure-driven skill updates are coordinated via two additional sub-agents—Proposer (diagnosis) and Generator (update synthesis)—on both the Challenger and Reasoner sides. Importantly, both Challenger and Reasoner evolve through updates to their skill documents rather than model parameters, maintaining compatibility with closed-source or API-based LMs.

A unique challenge uncovered is adversarial collapse: as iterations proceed, the Challenger may hyper-specialize task generation around remaining Reasoner weaknesses, causing overfitting and diminishing broader generalization. Ctx2Skill mitigates this through a Cross-Time Replay mechanism, re-evaluating skill sets from each iteration on representative probe tasks to select the maximally generalizable configuration.

Figure 2: Schematic of Ctx2Skill: (a) self-play cycle for skill evolution, (b) Cross-Time Replay selects the optimal skill set per context for downstream robustness.

Experimental Validation

Evaluation is conducted on the CL-bench benchmark, encompassing 500 contexts and 1,899 diverse tasks across four categories: Domain Knowledge Reasoning, Rule System Application, Procedural Task Execution, and Empirical Discovery Simulation. Each task strictly requires that all specified rubrics be satisfied for a 'solved' outcome.

Ctx2Skill consistently delivers substantial improvements independent of the LM backbone, using only inference-time skill augmentation. Notable results include:

GPT-4.1: Solving rate increases from 11.1% to 16.5% (+5.4%), surpassing stronger models without skill augmentation.
GPT-5.1: Solving rate rises from 21.1% to 25.8% (+4.7%).
GPT-5.2: Solving rate grows from 18.2% to 21.4% (+3.2%).

Gains are most significant for categories requiring complex, inductive procedures and multi-step context composition. Ablation studies strongly support the necessity of adversarial co-evolution (both Challenger and Reasoner skill updates) and the Cross-Time Replay mechanism. Direct comparison with baseline single-pass skill extraction (Prompting) and prior automated techniques (AutoSkill4Doc) shows that Ctx2Skill yields superior skill quality in terms of conciseness, clarity, faithfulness, and reusability, with increases of 2–3 points in multi-dimensional GPT-based scoring.

Transfer experiments corroborate that skills discovered by stronger LMs can be leveraged by weaker LMs, but the converse transfer is less effective, highlighting backbone-dependent extraction limits.

Figure 3: Ctx2Skill achieves higher solving rates across nearly all CL-bench sub-categories compared to the unskilled base model.

Theoretical and Practical Implications

Ctx2Skill’s paradigm constitutes an explicit, modular approach for enhancing LM context learning capacity—crucially, without parameter updates or external labels. This is particularly relevant for regulated environments, closed-source commercial LMs, and scenarios where context-Shots or in-context pattern induction are insufficient due to the knowledge complexity, domain novelty, or structure of the context.

From a theoretical perspective, Ctx2Skill demonstrates that multi-agent adversarial self-play—previously mainly used for policy distillation or RL in interactive environments—can extend to the purely textual context learning setting when paired with carefully constructed cross-time selection and iterative textual feedback. The explicit natural language skills generated are also inherently interpretable and auditable, supporting downstream explainability and skill-oriented audit processes.

For practical applications, Ctx2Skill offers a scalable, annotation-free pipeline for generating domain-adapted procedural abstractions, which can be integrated into LM deployments to: (i) improve accuracy on document-centric or knowledge-grounded tasks, (ii) reduce hallucination and knowledge-conflict errors, and (iii) provide a means for continual, context-driven evolution of model behavior with no retraining.

Future Directions

Future research directions emergent from Ctx2Skill’s architecture include:

Extending the self-play and replay selection logic to multi-modal or multi-document scenarios.
Joint or federated skill evolution across multiple related contexts to enable broader transfer and compositionality.
Integration with retrieval-augmented or tool-augmented LM pipelines for end-to-end context learning with dynamic external sources.
Formal exploration of connections to meta-learning, lifelong skill composition, and automated distillation of parametric knowledge into modular natural-language representations.

Conclusion

Ctx2Skill advances the state of the art for automated extraction and operationalization of context-dependent skills in LLMs. By fully decoupling skill acquisition from costly annotation and parameter tuning, and by incorporating adversarial and cross-time replay mechanisms, Ctx2Skill establishes a scalable, robust pipeline for context learning that is both practical and theoretically grounded. The empirical results demonstrate substantial improvements in complex reasoning, with strong transferability and interpretability properties, motivating further research into context-driven agent skill learning (2604.27660).

Markdown Report Issue