In-Context Policy Adaptation

Updated 20 December 2025
  • In-context policy adaptation is a dynamic technique enabling immediate behavioral adjustments through contextual cues like episodic history or system events without updating underlying model parameters.
  • It leverages non-parametric adaptation by using auxiliary inputs to drive rapid improvements, avoiding traditional gradient-based fine-tuning.
  • The approach applies across reinforcement learning, robotics, and multi-agent systems, enabling real-time, sample-efficient responses to changing environments.

In-context policy adaptation refers to techniques enabling an agent, system, or policy controller to adapt its behavior dynamically in response to contextual information—whether this context is the episodic interaction history, environmental features, peer strategies, user demonstrations, or domain descriptors—without explicit parameter updates to the underlying policy model. Rather than relying on gradient-based fine-tuning, agents or systems immediately exploit the growing context seen at inference time (via prompts, buffers, memory, or dynamic input encoding) to adjust their policy, control routine, or reasoning chain. The defining characteristic is that policy adaptation proceeds “in context” using external signals, auxiliary features, or examples, not by weight changes.

1. Core Principles and Formal Foundations

In-context policy adaptation is characterized by three salient theoretical properties:

  1. Contextual Inputs: The policy (or reasoning agent) receives as input not only the current observation $s_t$ but also an auxiliary context $C_t$ that may include histories, peer rollouts, demonstration buffers, or other informative signals.
  2. Non-parametric Adaptation: The underlying policy parameters $\theta$ remain fixed during deployment/inference, leveraging context $C_t$ to drive adaptation; no post-training gradient steps, online fine-tuning, or architectural changes are made after pretraining.
  3. Immediate Responsiveness: Policy adaptation is triggered instantaneously by changes in context—such as new user preferences, teammate policy shifts, environmental non-stationarities, or demonstration examples—without explicit retraining cost. Typical input includes the full history $H_t$ (as in in-context RL (Moeini et al., 29 Sep 2025)), demonstration buffers (as in ICPI (Brooks et al., 2022, Merwe et al., 20 Aug 2025)), system events (as in VO reconfiguration (Reiff-Marganiec, 2012)), or contextual tags (as in policy-free middleware (Dearle et al., 2010)).

Formally, a generic in-context policy takes the structure

$$\pi(a_t \mid s_t, C_t; \theta),$$

with $\theta$ frozen at deployment, where $C_t$ is a potentially unbounded context variable (trajectory, peer encoding, prompt window, etc.), and policy improvement occurs exclusively via updates to $C_t$.
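
As a minimal illustration of this formulation, the following Python sketch (illustrative only, not the implementation of any cited method) shows a policy whose parameters are loaded once and never updated, with adaptation realized purely by appending to a context buffer; all class and method names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class InContextPolicy:
    """Frozen policy pi(a_t | s_t, C_t; theta): theta is never updated at deployment."""
    theta: Any                                         # pretrained weights, loaded once
    context: List[Any] = field(default_factory=list)  # C_t: history, demos, events, ...

    def act(self, s_t):
        # All adaptation comes from conditioning on (s_t, C_t); no gradient step occurs here.
        return self._forward(s_t, self.context, self.theta)

    def observe(self, item):
        # "Policy improvement" is purely an update to C_t (append, swap, summarize, ...).
        self.context.append(item)

    def _forward(self, s_t, C_t, theta):
        raise NotImplementedError  # backbone-specific: transformer, hypernet, LLM prompt, rule engine
```

Concrete instances differ only in how `_forward` consumes the context: attention over the full history $H_t$, hypernetwork-generated adapters, prompt construction for a frozen LLM, or rule matching over system events.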

2. Methodological Taxonomy

Multiple algorithmic frameworks realize in-context policy adaptation, distinguished by their operating paradigm and application scope:

  • Transformer-based in-context RL: Agents use growing histories in their context window, with adaptation manifested as improved performance as more context is observed, e.g., the EPPO approach for safe in-context RL (Moeini et al., 29 Sep 2025). The policy architecture admits both reward maximization and constraint satisfaction solely via history and cost-to-go signal inputs.
  • Skill- and Adapter-based architectures: Approaches such as Decision Adapter networks (Beukman et al., 2023) and cross-domain skill diffusion (Yoo et al., 4 Sep 2025) use explicit context encodings (e.g., physics parameters, domain descriptors, or demonstration sets) to parameterize or modulate subnetwork weights on the fly via hypernetworks or diffusion adapters, achieving zero-shot generalization to novel domains without model updates.
  • Prompt-/buffer-centric few-shot RL and imitation learning: In-Context Policy Iteration (ICPI) (Brooks et al., 2022), Instant Policy via graph diffusion (Vosylius et al., 19 Nov 2024), and iterative LLM-based manipulation policy improvement (Merwe et al., 20 Aug 2025) all represent policies or policy updates “in the prompt” as example buffers, using foundation models to run evaluation and improvement in simulation or real environments by accumulating, swapping, or conditioning on past trials. A minimal sketch of this archetype appears after this list.
  • Contextually-programmed system policies: In highly dynamic distributed systems, runtime policy adaptation may be expressed by Event-Condition-Action (ECA) policies (Reiff-Marganiec, 2012) or policy-free middleware (Dearle et al., 2010), where the context comprises incoming events, context field values, or runtime observations. Policy logic is controlled by declarative statements, pattern-matchers, or dynamic rule trees rather than hard-coded behaviors.
  • Peer/context-aware exploration: In multi-agent learning, context encodes interaction histories with peers or opponents (as in PACE (Ma et al., 4 Feb 2024) and Fastap (Zhang et al., 2023)), and the agent is rewarded for identifying peer identity and adapts its policy accordingly, often by maximizing mutual information between the latent context and hidden peer identities.
  • Online preference alignment in LMs: Large LMs adapt dynamically to user-specific preferences by using online context accumulation, as in Preference Pretrained Transformer (Lau et al., 17 Oct 2024), where user history is leveraged for rapid in-context reward optimization in a contextual bandit setting.
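
As noted above, the sketch below makes the prompt-/buffer-centric archetype concrete: an ICPI-style loop in which the only state that changes across episodes is the buffer of past trials serialized into the prompt of a frozen foundation model. The `query_llm` callable, the prompt format, and the return-ranked example selection are illustrative assumptions, not the exact procedure of the cited papers.

```python
def in_context_policy_iteration(env, query_llm, num_episodes=20, num_examples=16):
    """Schematic prompt-based policy iteration: the trial buffer *is* the policy."""
    buffer = []  # (state, action, episode_return) tuples gathered so far
    for _ in range(num_episodes):
        state, done, trajectory = env.reset(), False, []
        while not done:
            # Condition the frozen model on the most successful past trials.
            examples = sorted(buffer, key=lambda x: x[2], reverse=True)[:num_examples]
            prompt = "\n".join(f"state: {s} -> action: {a}" for s, a, _ in examples)
            prompt += f"\nstate: {state} -> action:"
            action = query_llm(prompt)                      # inference only, no fine-tuning
            next_state, reward, done, _ = env.step(action)  # assumes a gym-style step API
            trajectory.append((state, action, reward))
            state = next_state
        episode_return = sum(r for _, _, r in trajectory)
        # "Policy improvement" = changing what will appear in the next prompt.
        buffer.extend((s, a, episode_return) for s, a, _ in trajectory)
    return buffer
```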

A summary of representative methodological archetypes:

| Approach | Context Source | Adaptation Mechanism |
| --- | --- | --- |
| Transformer ICRL (Moeini et al., 29 Sep 2025) | Trajectory buffer | Gated, history-dependent inputs |
| Decision Adapter (Beukman et al., 2023) | Domain parameters | Hypernet-generated adapters |
| Skill diffusion (Yoo et al., 4 Sep 2025) | Demos, domain code | Diffusion adapter + dynamic prompting |
| Policy iteration ICPI (Brooks et al., 2022) | Trajectory buffer | Prompt-based evaluation and rollout |
| PACE/Fastap (Ma et al., 4 Feb 2024; Zhang et al., 2023) | Peer interaction | Context encoders + MI/CL objectives |
| Poise (Kang et al., 2019) | Device state/event | Hardware state registers + match/action |
| Policy-free MW (Dearle et al., 2010) | System fields | Pattern-matched policy objects |

3. Architectural and Formal Models

Modern in-context policy adaptation architectures combine feature encoders, context aggregators, and dynamic modules:

  • Context Encoders: Convert variable-length or structured context (e.g., sequence $(s_1, a_1, \ldots)$; physics parameters; peer IDs) to fixed-dimensional embeddings. Approaches range from MLP- or transformer-based aggregations (Ma et al., 4 Feb 2024, Beukman et al., 2023) to specialized domain encoders using contrastive loss (Yoo et al., 4 Sep 2025).
  • Adapter/Hypernetwork Layers: Generate weights for per-layer adapters inside the policy network, conditional on context, to restructure the computation path dynamically for new dynamics or domain shifts (Beukman et al., 2023); a schematic sketch of this pattern follows this list.
  • Graph-based Representations: For partially observable or object-centric manipulation, context is encoded as a heterogeneous graph spanning demonstrations, current state, and action candidates, with policy prediction as a conditional diffusion or generative process (Vosylius et al., 19 Nov 2024).
  • Runtime Policy Managers: In system contexts, runtime engines maintain context snapshots and stream events to a policy manager that evaluates ECA rules or matches patterns to select and enact policy modules (Dearle et al., 2010, Reiff-Marganiec, 2012).
  • Prompt Conditioning: In foundation model-driven approaches, prompts serve as containers for context—either as raw histories, few-shot buffers, or program-generated scenario sets—on which LLMs condition their rollouts, policy improvements, or world-model predictions (Brooks et al., 2022, Merwe et al., 20 Aug 2025, Huang et al., 30 Oct 2025).
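
A compact sketch of the encoder-plus-hypernetwork pattern referenced in the list above, written against PyTorch: a context vector is embedded and used to generate the weights of an adapter layer that modulates an otherwise frozen backbone. Dimensions and module layout are illustrative assumptions and do not reproduce the Decision Adapter architecture exactly.

```python
import torch
import torch.nn as nn

class HypernetAdapterPolicy(nn.Module):
    """Sketch: context embedding -> hypernet-generated adapter inside a frozen backbone."""
    def __init__(self, obs_dim=8, ctx_dim=4, hid=64, act_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(ctx_dim, hid), nn.ReLU(), nn.Linear(hid, hid))
        # Hypernetwork emits the weights and bias of a hid x hid adapter layer.
        self.hyper = nn.Linear(hid, hid * hid + hid)
        self.backbone_in = nn.Linear(obs_dim, hid)    # frozen after pretraining
        self.backbone_out = nn.Linear(hid, act_dim)   # frozen after pretraining
        self.hid = hid

    def forward(self, obs, ctx):
        z = self.encoder(ctx)                                   # fixed-size context embedding
        params = self.hyper(z)
        W = params[..., : self.hid * self.hid].view(-1, self.hid, self.hid)
        b = params[..., self.hid * self.hid :]
        h = torch.relu(self.backbone_in(obs))
        h = torch.relu(torch.bmm(W, h.unsqueeze(-1)).squeeze(-1) + b)  # context-generated adapter
        return self.backbone_out(h)
```

At deployment every parameter (encoder, hypernetwork, backbone) stays frozen; behavior changes only because the context tensor passed to `forward` changes.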

4. Empirical Results and Applications

Extensive empirical evidence supports the efficacy of in-context policy adaptation across domains:

  • Zero-shot and few-shot generalization: Decision Adapter architectures (Beukman et al., 2023) and diffusion skill adaptation (Yoo et al., 4 Sep 2025) show superior generalization to unseen parameters, domain shifts, or physical dynamics, with minimal or no loss of reward relative to full retraining baselines.
  • Real-world manipulation and robotics: In-Context Iterative Policy Improvement (Merwe et al., 20 Aug 2025) and Instant Policy (Vosylius et al., 19 Nov 2024) achieve sample-efficient adaptation in high-dimensional, underactuated manipulation and real-robot settings, with low iteration counts and rapid convergence in the absence of parameter updates.
  • Multi-agent and non-stationary settings: PACE (Ma et al., 4 Feb 2024) and Fastap (Zhang et al., 2023) demonstrate rapid in-context identification of hidden peer strategies or team policy clusters, leading to pronounced robustness under sudden non-stationarity, outperforming classical latent-policy or recurrent-policy baselines in both stationary and abrupt change regimes.
  • Online preference and task alignment: In contextual bandits and RLHF, preference adaptation via history-dependent policies (as in PPT (Lau et al., 17 Oct 2024)) enables scalable, computation-efficient personalization, suggesting clear improvements over “static” RLHF protocols.
  • Agentic policy document reasoning: For LLM agents integrating business rules and workflow policies, internalization techniques such as Category-Aware Policy Continued Pretraining (CAP-CPT) (Liu et al., 13 Oct 2025) dramatically reduce required context (up to 97.3% compression) while preserving or exceeding performance on highly complex, nested conditional specifications.

5. System Design, Implementation, and Scalability

Effectively applying in-context policy adaptation in large-scale or safety-critical systems requires attention to system architecture, context representation, and runtime guarantees:

  • Policy-Free Middleware: Middleware implementations (e.g., RAFDA (Dearle et al., 2010)) decouple adaptation logic from core protocol code, supporting hot-swappable policy objects and declarative pattern languages, with matching decision trees supporting sub-10 μs lookup for thousands of dynamic rules; a schematic ECA-style sketch follows this list.
  • Runtime Agility: Hardware-enforced policies in programmable networks (e.g., Poise (Kang et al., 2019)) yield sub-microsecond policy update latencies and resilience to control-plane overload, illustrating the feasibility of in-context adaptation in strict real-time environments.
  • Robustness and conflict resolution: Structured policy frameworks (APPEL/StPowla (Reiff-Marganiec, 2012)) resolve conflicting adaptations using explicit priority/precedence, while formal action-semantics guarantee safe and compositional structural updates to organizational or workflow entities.
  • Scalability Limits: Memory-constrained contexts (e.g., LLM prompt windows, finite event histories, or limited demonstration batches) necessitate careful design of context storage, summarization, and retrieval mechanisms; e.g., dynamic domain prompting (Yoo et al., 4 Sep 2025) and context clearing or swapping heuristics (Ma et al., 4 Feb 2024).
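
To ground the system-level style of adaptation discussed in this section, the fragment below sketches a generic event-condition-action dispatcher over a context snapshot with priority-based conflict resolution; the rule schema and example are illustrative and do not reproduce the APPEL/StPowla or RAFDA languages.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class ECARule:
    """Event-Condition-Action rule: fires when the event matches and the condition holds."""
    event: str
    condition: Callable[[Dict[str, Any]], bool]
    action: Callable[[Dict[str, Any]], None]
    priority: int = 0  # used for simple conflict resolution

def dispatch(event: str, context: Dict[str, Any], rules: List[ECARule]) -> None:
    """Evaluate rules against the current context snapshot; highest-priority match is enacted."""
    matching = [r for r in rules if r.event == event and r.condition(context)]
    for rule in sorted(matching, key=lambda r: r.priority, reverse=True)[:1]:
        rule.action(context)  # enact the selected policy module / reconfiguration

# Example: reroute requests when a node reports overload above a threshold.
rules = [ECARule("node_overload",
                 condition=lambda ctx: ctx.get("load", 0.0) > 0.9,
                 action=lambda ctx: ctx.update(route="backup"),
                 priority=10)]
ctx = {"load": 0.95}
dispatch("node_overload", ctx, rules)  # ctx["route"] becomes "backup"
```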

6. Limitations, Open Challenges, and Future Directions

While in-context policy adaptation demonstrates strong empirical and theoretical benefits, several practical and conceptual challenges remain:

  • Conditional Generalization: Performance may degrade if the context encoder overfits, relevant context variables are omitted, or test-time contexts are out-of-distribution relative to pretraining (Beukman et al., 2023, Moeini et al., 29 Sep 2025).
  • Prompt/Buffer Scaling in LLMs: The context window for prompt-based policy iteration is bounded by the model’s token capacity, and performance is sensitive to prompt format, sampling, and balancing strategy (Brooks et al., 2022, Merwe et al., 20 Aug 2025).
  • Robustness to Distractors and Irrelevant Contexts: Decision Adapter and similar architectures show increased robustness to irrelevant context dimensions, but extreme distractor presence or spurious variables can still degrade naive encoding schemes (Beukman et al., 2023).
  • Complex Reasoning Over Policy Documents: For LLM-agentic policy adaptation, policy internalization hinges on systematic categorization of rules and scenario-simulation for complex workflow logic. Failure to balance broad continued pretraining with high-quality supervised alignment can lead to forgetting or brittle performance on rare branches (Liu et al., 13 Oct 2025).
  • Dynamic Peer Adaptation: While mutual information-based peer identification yields faster adaptation in multi-agent games (Ma et al., 4 Feb 2024), the diversity of the peer policy pool and the length and structure of the context can bottleneck adaptation speed and final performance.

Future research directions include hierarchical or multi-modal context representations, more expressive/reasoned context compression and retrieval, formal safety and correctness guarantees, deeper integration with agentic workflow policy documents, and the extension of these mechanisms to joint multi-agent and human-in-the-loop deployments.

7. Representative Case Studies

To illustrate the diversity and depth of in-context policy adaptation, the following case studies highlight core mechanisms:

  • Safe In-Context Reinforcement Learning: An agent, pre-trained on center-oriented tasks, generalizes to edge-oriented out-of-distribution CMDPs by conditioning solely on history and scalar cost-to-go, fulfilling strict per-episode constraints without any online parameter update (Moeini et al., 29 Sep 2025).
  • Cross-Domain Adaptive Diffusion: ICPAD applies diffusion-based adapters conditioned on dynamically attended few-shot prompt sets, enabling immediate alignment to novel dynamics in long-horizon tasks (robotics, autonomous driving) with no target-domain retraining (Yoo et al., 4 Sep 2025).
  • Instant Policy for Imitation: Heterogeneous graph diffusion enables robots to perform new manipulation skills from one or two demonstrations by representing both context and action as nodes and edges, leveraging pseudo-demo simulation for unlimited offline pretraining (Vosylius et al., 19 Nov 2024).
  • ICPO for Mathematical Reasoning LLMs: Policy optimization incorporates off-policy rollouts generated via the model’s own few-shot in-context learning, expanding exploration beyond the current policy’s support and outperforming both vanilla GRPO and external-expert mixed methods on advanced mathematical tasks (Huang et al., 30 Oct 2025).
  • Middleware/Organizational Reconfiguration: Policy-free middleware and ECA-driven VO reconfiguration engines realize live, policy-driven adaptation of distributed applications and virtual organizations by declarative, runtime-evaluated context and action specifications (Reiff-Marganiec, 2012, Dearle et al., 2010).

In-context policy adaptation now represents a unifying paradigm bridging RL, multi-agent learning, foundation model reasoning, distributed systems, and programmable infrastructure. It is distinguished by immediate, non-parametric responsiveness to arbitrary context, decoupling learning from inference, and an empirical track record of sample efficiency, safety, robustness, and system-level compositionality across academic and applied domains (Moeini et al., 29 Sep 2025, Beukman et al., 2023, Vosylius et al., 19 Nov 2024, Ma et al., 4 Feb 2024, Liu et al., 13 Oct 2025, Yoo et al., 4 Sep 2025, Dearle et al., 2010, Reiff-Marganiec, 2012, Kang et al., 2019, Brooks et al., 2022, Merwe et al., 20 Aug 2025, Huang et al., 30 Oct 2025, Zhang et al., 2023, Lau et al., 17 Oct 2024).
