
Knowledge is Not Enough: Injecting RL Skills for Continual Adaptation

Published 16 Jan 2026 in cs.LG, cs.AI, and cs.CL | (2601.11258v1)

Abstract: LLMs face the "knowledge cutoff" challenge, where their frozen parametric memory prevents direct internalization of new information. While Supervised Fine-Tuning (SFT) is commonly used to update model knowledge, it often updates factual content without reliably improving the model's ability to use the newly incorporated information for question answering or decision-making. Reinforcement Learning (RL) is essential for acquiring reasoning skills; however, its high computational cost makes it impractical for efficient online adaptation. We empirically observe that the parameter updates induced by SFT and RL are nearly orthogonal. Based on this observation, we propose Parametric Skill Transfer (PaST), a framework that supports modular skill transfer for efficient and effective knowledge adaptation. By extracting a domain-agnostic Skill Vector from a source domain, we can linearly inject knowledge manipulation skills into a target model after it has undergone lightweight SFT on new data. Experiments on knowledge-incorporation QA (SQuAD, LooGLE) and agentic tool-use benchmarks (ToolBench) demonstrate the effectiveness of our method. On SQuAD, PaST outperforms the state-of-the-art self-editing SFT baseline by up to 9.9 points. PaST further scales to long-context QA on LooGLE with an 8.0-point absolute accuracy gain, and improves zero-shot ToolBench success rates by +10.3 points on average with consistent gains across tool categories, indicating strong scalability and cross-domain transferability of the Skill Vector.

Summary

  • The paper shows that SFT and RL updates occur in nearly orthogonal subspaces, justifying the modular extraction of reasoning skills.
  • It introduces the PaST framework which linearly injects RL-derived skill vectors into SFT-trained models for improved performance.
  • Empirical results on SQuAD, LooGLE, and ToolBench demonstrate significant accuracy gains and enhanced cross-domain generalization.

Injecting RL-Derived Reasoning Skills for Efficient LLM Knowledge Adaptation

Introduction and Motivation

The paper "Knowledge is Not Enough: Injecting RL Skills for Continual Adaptation" (2601.11258) addresses the functional disconnect between knowledge and reasoning capacity in LLMs during continual adaptation. Standard Supervised Fine-Tuning (SFT) updates models to incorporate new factual data but results in a brittle system unable to robustly manipulate or act on newly acquired information. Reinforcement Learning (RL) confers the necessary reasoning and decision-making abilities but remains impractically costly for online adaptation in every new context. The authors argue that knowledge acquisition (SFT) alone is insufficient for operational proficiency, positioning RL-driven skill injection as essential for robust downstream performance (Figure 1).

Figure 1: Overview of Parametric Skill Transfer (PaST) illustrating how SFT fails to handle errors while PaST injects RL-derived reasoning skills; parameter updates for knowledge (ΔW_SFT) and skill (ΔW_RL) reside in orthogonal subspaces.

Empirical Discovery: Orthogonality of SFT and RL Parameter Updates

The central technical finding is that SFT and RL induce nearly orthogonal parameter updates across model layers, as visualized via layer-wise cosine similarity analysis. This orthogonality indicates that knowledge (acquired via SFT) and procedural manipulation logic (acquired via RL) inhabit statistically disentangled subspaces in the parameter landscape. The practical implication: RL-learned skills can be extracted as a domain-agnostic vector and linearly injected into a model updated with SFT, enabling modular reasoning transfer without costly target-domain RL (Figure 2).

Figure 2: Layer-wise cosine similarity between SFT- and RL-induced weight changes on LooGLE demonstrates dominant near-zero values, evidencing orthogonality.

The authors provide theoretical justification grounded in high-dimensional statistics: orthogonality at the parameter level implies functional disentanglement at the signal level, via layer-input-norm and concentration-of-measure arguments, ensuring that injected skill components do not destructively interfere with base knowledge.
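The layer-wise analysis behind Figure 2 can be sketched as follows. This is a minimal illustration, assuming the three checkpoints (base, post-SFT, post-RL) share the same state-dict layout; the function name and flattening scheme are assumptions, not the paper's measurement code.

```python
import numpy as np

def layerwise_cosine(base, sft, rl):
    """Cosine similarity between the SFT delta (sft - base) and the
    RL delta (rl - sft) for each parameter tensor."""
    sims = {}
    for name, w0 in base.items():
        d_sft = (sft[name] - w0).ravel()
        d_rl = (rl[name] - sft[name]).ravel()
        denom = np.linalg.norm(d_sft) * np.linalg.norm(d_rl)
        if denom > 0:
            sims[name] = float(d_sft @ d_rl) / float(denom)
    return sims
```

Near-zero similarities across layers would reproduce the orthogonality pattern the authors report on LooGLE.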

Parametric Skill Transfer (PaST) Framework

PaST is a modular two-stage procedure:

  1. Skill Extraction: In a source domain, SFT adapts the base model to new documents. RL subsequently induces reasoning proficiency via trajectory-based interaction. The skill vector v_skill = θ_S^rl − θ_S^sft captures the RL-induced procedural policy, isolated from domain-specific facts.
  2. Target Domain Adaptation: Lightweight SFT internalizes target-domain knowledge. The skill vector is then injected via θ_final = θ_T^sft + λ·v_skill, linearly integrating the robust skill to enable zero-shot reasoning and tool use in the target domain without RL-phase cost.

The framework is extensible through iterative skill refinement, partitioning source data and repeatedly distilling the skill vector for increased domain invariance and generalization.
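The two stages above reduce to simple weight arithmetic. A minimal sketch, treating checkpoints as dicts of arrays; the function names and the λ default are illustrative, not taken from the paper:

```python
import numpy as np

def extract_skill_vector(theta_s_sft, theta_s_rl):
    """Stage 1: skill vector = RL weights minus source-domain SFT weights."""
    return {k: theta_s_rl[k] - theta_s_sft[k] for k in theta_s_sft}

def inject_skill(theta_t_sft, v_skill, lam=1.0):
    """Stage 2: theta_final = theta_T^sft + lam * v_skill."""
    return {k: theta_t_sft[k] + lam * v_skill[k] for k in theta_t_sft}
```

The injection is a single element-wise addition, which is what makes the approach cheap relative to running RL in the target domain.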

Experimental Evaluation

Knowledge-Based QA: SQuAD and LooGLE

SQuAD

On closed-book SQuAD, PaST achieves decisive absolute gains in mean accuracy, outperforming both SFT+synthetic baselines and the state-of-the-art SEAL meta-training method by up to +9.9 points. Injection of the skill vector on top of already-strong SFT+synthetic configurations yields further improvement (+17.2%), supporting the claim that skill transfer—not just improved data generation—is the bottleneck in practical knowledge adaptation.

LooGLE

In long-context scenarios (LooGLE, average context length >21k tokens), skill vector injection delivers an absolute accuracy gain of +8.0 points over strong SFT baselines, demonstrating scalable benefit for real-world adaptation tasks characterized by massive memory and complex retrieval logic.

Cross-Domain Agentic Tool Use: ToolBench

In StableToolBench zero-shot transfer, PaST substantially improves average success rates from 21.9% (Target SFT baseline) to 32.2%, a +10.3-point absolute increase. Crucially, it achieves success in categories where SFT-only adaptation fails completely (e.g., Advertising, SMS), and consistently outperforms the baseline across all 20 RL-unseen categories, establishing robust cross-domain generalization from a skill vector distilled solely in the Movies domain (Figure 3).

Figure 3: Zero-shot cross-domain generalization on ToolBench: PaST increases the average success rate by 10.3 points on RL-unseen categories.

Ablations: Iterative Refinement and Vector Injection Timing

Iterative skill refinement yields superior results over single-round training, validating the hypothesis that reasoning logic must be repeatedly purged of content-specific patterns to achieve domain-agnostic procedural generality.
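One plausible reading of iterative refinement is to re-extract the delta on disjoint source partitions and average the results, so that content-specific directions cancel while the shared procedural direction survives. The sketch below assumes this averaging interpretation; `sft_step` and `rl_step` are stand-ins for the actual training routines, which the summary does not specify.

```python
def refine_skill_vector(theta0, partitions, sft_step, rl_step):
    """Distill a skill vector per source-data partition, then average.

    theta0: base weights (dict); sft_step/rl_step map
    (weights, partition) -> new weights. Both are placeholders here.
    """
    vectors = []
    for part in partitions:
        theta_sft = sft_step(theta0, part)
        theta_rl = rl_step(theta_sft, part)
        # per-partition skill vector: RL delta on top of that partition's SFT
        vectors.append({k: theta_rl[k] - theta_sft[k] for k in theta0})
    n = len(vectors)
    # averaging washes out partition-specific (content) directions
    return {k: sum(v[k] for v in vectors) / n for k in theta0}
```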

Alternative transfer strategies, such as sequential fine-tuning or pre-injection before SFT, result in optimization conflicts or misalignment with the knowledge base, whereas post-hoc composition ensures maximal synergy between factual anchoring and skill grafting (Figure 4).

Figure 4: Cosine similarity between two independent SFT updates reveals strong positive correlation, confirming that the observed orthogonality is specific to the SFT-RL pair rather than a generic property of weight updates.

Control Experiments: Nontriviality of Orthogonality

Direct comparison of two independent SFT steps displays strong positive update correlation, confirming that shared-task adaptation occupies aligned subspaces. The pronounced orthogonality between SFT and RL updates is thus not a trivial outcome of parameter dimensionality, but a consequence of the fundamental mechanistic divide between memorization (SFT) and skill generalization (RL).

Practical and Theoretical Implications

From a practical perspective, PaST enables efficient, scalable adaptation to rapidly-changing domains without incurring the prohibitive cost of RL rollouts for every update. It supports robust knowledge manipulation and tool orchestration in previously unseen scenarios, empowering LLM-based agents for continual deployment. Theoretically, this work substantiates a model of modular cognitive transfer, where parametric skill vectors operate as reusable reasoning circuits decoupled from factual memory—potentially inspiring new architectures for continual learning, modular agent design, and efficient skill transfer across tasks and models.

Future research may explore auto-tuned scaling coefficients λ, broader architectural generalization, and recursive transfer across multi-hop knowledge/skill manifolds. Questions remain as to the universality of skill vectors across base models and larger scales, and their potential for hierarchical composition.
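As a concrete starting point for the auto-tuning question, λ could be chosen by a small validation sweep. This is a hypothetical sketch: the grid, the injection form, and the `evaluate` callback are assumptions, not the paper's method.

```python
def tune_lambda(theta_sft, v_skill, evaluate, grid=(0.25, 0.5, 0.75, 1.0, 1.25)):
    """Pick the injection coefficient maximizing a validation score.

    evaluate: maps a candidate weight dict to a scalar (e.g. dev-set accuracy).
    """
    def inject(lam):
        return {k: theta_sft[k] + lam * v_skill[k] for k in theta_sft}
    return max(grid, key=lambda lam: evaluate(inject(lam)))
```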

Conclusion

The paper establishes a strong empirical and theoretical foundation for modular reasoning transfer in LLMs via PaST, demonstrating that RL-derived skills and SFT-acquired knowledge inhabit decomposable, orthogonal subspaces. By distilling and injecting domain-invariant RL skill vectors, models achieve robust, efficient adaptation—raising both practical performance and conceptual understanding of continual learning in agentic language systems.
