SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization
This presentation explores SKILL0, a breakthrough framework that transforms how language model agents acquire and use skills. Rather than relying on external skill descriptions at runtime, SKILL0 internalizes procedural knowledge directly into model parameters through reinforcement learning and adaptive curriculum scheduling. The system achieves superior performance on complex reasoning benchmarks while dramatically reducing token overhead, enabling truly autonomous zero-shot agent behavior without runtime skill retrieval.
Current language model agents are like students who bring their textbooks to every exam. They retrieve skill descriptions at every decision point, burning tokens and staying dependent on external guidance. What if agents could internalize skills into their parameters and act autonomously?
The standard approach injects skill descriptions into context at runtime. This creates three critical bottlenecks: noisy retrieval pollutes the decision space, token costs explode with task complexity, and agents never truly learn. They remain dependent, unable to function when skills are removed.
SKILL0 flips this paradigm entirely.
The framework operates in three coordinated stages. First, skills are grouped offline by relevance to validation tasks. Then, an in-context reinforcement learning loop trains the agent with skills rendered as compressed visual context, drastically cutting token usage. Most crucially, a dynamic curriculum evaluates each skill's on-policy helpfulness and removes only those the agent has internalized, ensuring stable learning as skill context gradually vanishes.
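The curriculum-removal step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the validation probe `mock_evaluate`, and the internalization threshold `eps` are all hypothetical placeholders for whatever on-policy helpfulness measure SKILL0 actually uses.

```python
def skill_is_internalized(with_skill, without_skill, eps=0.02):
    # Hypothetical criterion: a skill counts as internalized when removing
    # it from context costs (almost) no validation success.
    return with_skill - without_skill <= eps

def curriculum_step(skills, evaluate):
    """One curriculum update: probe each skill's on-policy helpfulness and
    drop only the skills the agent no longer needs in context."""
    remaining = []
    for skill in skills:
        success_with = evaluate(skills)  # all current skills in context
        success_without = evaluate([s for s in skills if s != skill])
        if not skill_is_internalized(success_with, success_without):
            remaining.append(skill)
    return remaining

# Mock validation probe: the agent has internalized "navigate" (no benefit
# from keeping it in context) but still needs "search".
def mock_evaluate(context):
    return 0.5 + (0.3 if "search" in context else 0.0)

remaining = curriculum_step(["navigate", "search"], mock_evaluate)
# "navigate" is removed; "search" stays until the agent absorbs it too.
```

Iterating this step shrinks the skill context gradually, which is what keeps learning stable as the agent transitions to skill-free operation.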
Training dynamics reveal true internalization. With skills present, the agent learns quickly. As the curriculum removes helpful skills, skill-free validation performance catches up and stabilizes. The policy has absorbed procedural knowledge. On Qwen 2.5 vision language 3 billion, SKILL0 improves success rates by nearly 10 points on ALFWorld while using half a kilotoken per step, compared to text-prompt baselines that burn over 3 kilotokens.
On the 7 billion parameter backbone, SKILL0 reaches 89.8 on ALFWorld and 44.4 on Search Q A, outperforming memory-augmented and search-based agents. It generalizes strongly to compositional and multi-hop reasoning without ever seeing those skills at inference. This proves skill internalization works: agents can assimilate complex procedural knowledge and execute zero-shot without runtime prompts.
SKILL0 transforms agents from prompt-dependent tools into autonomous reasoners who carry their skills within. Visit EmergentMind.com to explore the full paper and create your own research video.