
Zero-Shot Reinforcement Learning

Updated 15 January 2026
  • Zero-shot RL is an algorithmic paradigm that enables agents to instantly deploy competent policies on unseen tasks using reward-free or pseudo-reward pre-training.
  • Approaches range from direct representations with transformer encoders to compositional methods like successor features, offering robust zero-shot generalization.
  • This paradigm has significant implications for transfer learning, offline RL, and real-world applications, while addressing challenges like partial observability and data scarcity.

Zero-shot reinforcement learning (RL) is an algorithmic paradigm enabling agents to instantly deploy competent policies for unseen tasks within a known or unknown environment, following a reward-free pre-training phase. Distinct from traditional RL, which requires environment interaction and explicit fine-tuning per task, zero-shot RL prescribes learning representations usable for arbitrary downstream objectives without additional learning at test time, often leveraging reward-free or pseudo-reward-free exploration. This approach offers a pathway toward generalist agents capable of functioning across diverse domains, with significant implications for transfer learning, offline RL, and real-world deployment where simulator or data access is limited.

1. Formal Definitions and Canonical Problem Settings

Zero-shot RL generalizes the conventional RL objective from a single Markov Decision Process (MDP) $\mathcal{M} = (S, A, p, p_0, r, \gamma)$ to a family of MDPs $\mathcal{M}^R$ sharing $(S, A, p, p_0, \gamma)$ but differing over a reward space $R$ (Ventura et al., 23 Oct 2025). The agent pre-trains without knowledge of the test-time reward $r \sim D^{\text{test}}$ and, during inference, must immediately output a policy $\pi_r$ for the sampled reward $r$ without fine-tuning, further planning, or gradient updates.
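To make the protocol concrete, here is a minimal Python sketch of the interface this definition implies: pre-training consumes reward-free transitions, and test-time inference maps a reward specification to a policy in a single forward pass. The class and method names are illustrative placeholders, not an API from any of the cited papers.

```python
from typing import Callable, Sequence, Tuple

State = Sequence[float]
Action = int


class ZeroShotAgent:
    """Illustrative interface for the zero-shot RL protocol:
    reward-free pre-training, then policy inference with no gradient updates."""

    def pretrain(self, transitions: Sequence[Tuple[State, Action, State]]) -> None:
        # Learn task-agnostic representations from reward-free transitions
        # (e.g., successor features or forward-backward embeddings).
        raise NotImplementedError

    def infer_policy(
        self, reward_samples: Sequence[Tuple[State, float]]
    ) -> Callable[[State], Action]:
        # Map a small set of (state, reward) calibration samples to a policy
        # in a single forward pass: no fine-tuning or planning at test time.
        raise NotImplementedError
```

At deployment, `infer_policy` receives only reward evaluations on a handful of states and must return a usable policy immediately.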

Key properties:

  • No additional learning at test-time: policies for unseen tasks are inferred solely via representations learned in pre-training (Touati et al., 2022).
  • Policies are functions of reward, context, or calibration data: $\pi_r(a \mid s)$ or $\pi(a \mid s, c_r)$, where $c_r$ encodes the task.
  • Success measured by performance on held-out tasks, with no adaptation.

Variants:

  • Reward-free zero-shot RL: No reward signals are used in offline data or representation learning.
  • Pseudo-reward-free methods: Agents sample pseudo-rewards from a training distribution and encode policies/values for anticipated rewards (Ventura et al., 23 Oct 2025); a minimal sketch of this relabeling loop follows this list.
  • Zero-shot generalization: The agent is evaluated not just on new reward functions but on novel context instantiations, environmental dynamics, and compositional tasks (Garcin et al., 2024, Ndir et al., 2024).
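As a minimal sketch of the pseudo-reward-free loop referenced above: task weights are drawn from a simple prior (here a white-noise-style linear prior, cf. Section 3), reward-free transitions are relabeled with the induced pseudo-reward, and a task-conditioned value function is regressed with a semi-gradient TD update. The feature map, dimensions, and data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(s: np.ndarray) -> np.ndarray:
    """Placeholder state features; in practice these are learned."""
    return np.tanh(s)

def feat(s: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Joint (state, task) features for a linear task-conditioned value function."""
    return np.einsum("ni,j->nij", phi(s), w).reshape(len(s), -1)

# Offline, reward-free transitions (s, s'); random stand-ins for a real dataset.
states = rng.normal(size=(1024, 8))
next_states = states + 0.1 * rng.normal(size=(1024, 8))

gamma, lr = 0.99, 1e-2
theta = np.zeros(8 * 8)                                  # weights of V(s, w)

for step in range(2000):
    w = rng.normal(size=8)
    w /= np.linalg.norm(w)                               # sample a pseudo-task
    pseudo_r = phi(states) @ w                           # relabel the batch with r_w(s)
    v = feat(states, w) @ theta
    v_next = feat(next_states, w) @ theta
    td_error = pseudo_r + gamma * v_next - v             # one-step TD error, no real rewards
    theta += lr * feat(states, w).T @ td_error / len(states)   # semi-gradient TD(0) update
```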

2. Representative Algorithmic Paradigms and Architectures

The field organizes methods along two main axes: representation structure (direct vs. compositional) and reward usage (reward-free vs. pseudo-reward-free) (Ventura et al., 23 Oct 2025).

Direct Representations

These models (e.g., UVFA, FRE) learn an explicit mapping from state, action, and reward/task contexts to value or policy functions, often using Transformer-based latent encoders or Hilbert-space embeddings (Frans et al., 2024, Ventura et al., 23 Oct 2025). Functional Reward Encoding (FRE) learns a transformer VAE that encodes arbitrary reward functions, given context samples $(s_i, r(s_i))$, into a latent $z$, enabling the agent to select policies $\pi(a \mid s, z)$ for novel downstream tasks with only a forward pass (Frans et al., 2024). Hilbert representations encode reward structure via optimal temporal geometry (Ventura et al., 23 Oct 2025).
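A hedged sketch of this pattern is shown below: a permutation-invariant set encoder maps samples $(s_i, r(s_i))$ to a latent $z$ through a variational bottleneck, and the policy conditions on $z$. Mean pooling stands in for the transformer encoder and all dimensions are arbitrary, so this illustrates the interface rather than the FRE architecture of Frans et al. (2024).

```python
import torch
import torch.nn as nn


class RewardSetEncoder(nn.Module):
    """Encode a set of (state, reward) samples into a task latent z."""

    def __init__(self, state_dim: int, latent_dim: int, hidden: int = 128):
        super().__init__()
        self.point = nn.Sequential(nn.Linear(state_dim + 1, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden))
        self.head = nn.Linear(hidden, 2 * latent_dim)     # mean and log-variance

    def forward(self, states, rewards):
        # states: (N, state_dim); rewards: (N,)
        x = torch.cat([states, rewards.unsqueeze(-1)], dim=-1)
        pooled = self.point(x).mean(dim=0)                # permutation-invariant pooling
        mu, logvar = self.head(pooled).chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize


class LatentConditionedPolicy(nn.Module):
    """pi(a | s, z): an MLP over the concatenated state and task latent."""

    def __init__(self, state_dim: int, latent_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def forward(self, state, z):
        return torch.tanh(self.net(torch.cat([state, z], dim=-1)))


# Zero-shot deployment: encode a few reward evaluations, then act in one forward pass.
enc = RewardSetEncoder(state_dim=4, latent_dim=8)
pi = LatentConditionedPolicy(state_dim=4, latent_dim=8, action_dim=2)
z = enc(torch.randn(32, 4), torch.randn(32))
action = pi(torch.randn(4), z)
```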

Compositional Representations

Methods exploit known value decompositions, notably successor features (SF), universal successor features (USF), and forward-backward (FB) representations. SF models decompose $Q_r^\pi(s,a) = \psi^\pi(s,a)^\top w$ for rewards of the form $r(s) = \phi(s)^\top w$ and learn $\psi^\pi$ reward-independently (Touati et al., 2022, Jeen et al., 2023). FB approaches construct two joint encoders for successor measures and elementary state features (Touati et al., 2022, Zheng et al., 17 Oct 2025). Proto Successor Measures (PSM) solve for policy-independent low-rank decompositions (Ventura et al., 23 Oct 2025).
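Concretely, test-time task inference in the SF family can be as simple as a least-squares fit of $w$ from a few reward evaluations followed by generalized policy improvement (GPI) over pre-trained $\psi^{\pi_k}$ heads. The sketch below uses random placeholders for the pre-trained quantities and is not tied to any single cited implementation.

```python
import numpy as np

def infer_task_weights(phi_samples: np.ndarray, r_samples: np.ndarray) -> np.ndarray:
    """Least-squares fit of w in r(s) ~= phi(s)^T w from calibration samples."""
    w, *_ = np.linalg.lstsq(phi_samples, r_samples, rcond=None)
    return w

def gpi_action(psi_heads: np.ndarray, w: np.ndarray) -> int:
    """Generalized policy improvement: act greedily over all pre-trained policies.

    psi_heads has shape (n_policies, n_actions, feat_dim) and holds
    psi^{pi_k}(s, a) evaluated at the current state s.
    """
    q_values = psi_heads @ w                         # (n_policies, n_actions)
    return int(np.unravel_index(q_values.argmax(), q_values.shape)[1])

# Example with random placeholders for the pre-trained quantities.
rng = np.random.default_rng(0)
phi_samples = rng.normal(size=(64, 16))              # phi(s_j) at calibration states
true_w = rng.normal(size=16)
r_samples = phi_samples @ true_w                     # observed rewards r(s_j)
w_hat = infer_task_weights(phi_samples, r_samples)
action = gpi_action(rng.normal(size=(5, 4, 16)), w_hat)
```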

Expressivity and robustness can be enhanced via attention-based architectures, diffusion models for multimodal policy extraction, and behavioral regularization anchoring policy distributions to dataset statistics (Zheng et al., 17 Oct 2025).

Operator Approaches

Operator deep Q-learning directly approximates the map from arbitrary reward functions to value functions using attention-weighted operator networks, generalizing Bellman’s fixed-point to a functional operator setting (Tang et al., 2022).
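As a loose illustration of the operator view (not the architecture of Tang et al., 2022), the sketch below lets a query state-action pair attend over the sample points of a reward function and mix the sampled reward values into a value estimate; the modules and shapes are assumptions.

```python
import torch
import torch.nn as nn


class RewardToValueOperator(nn.Module):
    """Illustrative attention operator: V_hat(s, a; {(s_i, a_i, r_i)})."""

    def __init__(self, sa_dim: int, hidden: int = 64):
        super().__init__()
        self.query = nn.Linear(sa_dim, hidden)
        self.key = nn.Linear(sa_dim, hidden)
        self.value = nn.Linear(1, 1)                 # transform each sampled reward

    def forward(self, sa_query, sa_samples, r_samples):
        # sa_query: (sa_dim,)  sa_samples: (N, sa_dim)  r_samples: (N,)
        q = self.query(sa_query)                                   # (hidden,)
        k = self.key(sa_samples)                                   # (N, hidden)
        attn = torch.softmax(k @ q / k.shape[-1] ** 0.5, dim=0)    # (N,)
        v = self.value(r_samples.unsqueeze(-1)).squeeze(-1)        # (N,)
        return (attn * v).sum()                                    # scalar value estimate


op = RewardToValueOperator(sa_dim=6)
value = op(torch.randn(6), torch.randn(128, 6), torch.randn(128))
```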

Task Parameterization and Function Encoding

Function encoders represent reward or transition functions via learned nonlinear basis expansions, efficiently mapping calibration datasets to low-dimensional task codes that slot into standard RL backbones without further training (Ingebrand et al., 2024).
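A hedged numerical sketch of this idea, with a random projection standing in for the learned basis: the calibration set is projected onto the basis by least squares, and the resulting coefficient vector is the low-dimensional task code that conditions a downstream policy network. All names and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
PROJ = rng.normal(size=(8, 16))      # stand-in for learned basis parameters (K = 16)

def basis(states: np.ndarray) -> np.ndarray:
    """Placeholder for the K learned nonlinear basis functions evaluated at states."""
    return np.tanh(states @ PROJ)    # (N, K)

def encode_task(calib_states: np.ndarray, calib_rewards: np.ndarray) -> np.ndarray:
    """Map a calibration dataset {(s_j, r(s_j))} to a K-dimensional task code."""
    G = basis(calib_states)                                  # (N, K)
    code, *_ = np.linalg.lstsq(G, calib_rewards, rcond=None)
    return code                                              # coefficients in the learned basis

# The resulting code conditions a standard RL backbone, e.g. by concatenation
# with the observation; no further training happens at test time.
calib_states = rng.normal(size=(32, 8))
calib_rewards = np.sin(calib_states).sum(axis=1)             # stand-in for r(s_j)
task_code = encode_task(calib_states, calib_rewards)
policy_input = np.concatenate([rng.normal(size=8), task_code])   # shape (8 + 16,)
```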

3. Statistical and Theoretical Guarantees

Theoretical analyses have established guarantees in several regimes:

  • Linearity and orthonormality: If reward or transition functions lie in the linear span of the training data and the basis functions are (near-)orthonormal, task codes and interpolated policies generalize smoothly, and no gradient steps are needed at test time; the identity is spelled out after this list (Ingebrand et al., 2024).
  • Zero-shot bounds for SF/USF: Suboptimality in compositional methods is decomposed into linearity error (reward representation), inference coverage error (codebook density), and TD approximation error (Ventura et al., 23 Oct 2025).
  • Direct zero-shot optimization: The true zero-shot RL loss $\ell_\beta$ is analytically minimized for explicit priors, including white-noise, Dirichlet, and sparse goal rewards, linking to VISR and revealing specialization dynamics under dense Gaussian priors (Ollivier, 15 Feb 2025).
  • Non-collapse in TD-JEPA: Proper initialization and rapid predictor re-optimization preserve covariance and avoid representation collapse; MC and TD losses coincide with low-rank successor measure reconstruction (Bagatella et al., 1 Oct 2025).
  • Data-efficient and robust extensions: Behavioral regularization and expectile regression (IQL-style) mitigate extrapolation errors from out-of-distribution actions and stabilize policy extraction, especially in limited-data regimes (Zheng et al., 17 Oct 2025, Jeen et al., 2023).
  • Partial observability: Memory-based recurrent architectures (FB-M, GRU) recover shared guarantees by encoding temporal context, resolving state and task misidentification in POMDPs (Jeen et al., 18 Jun 2025).
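To spell out the first point, assume the learned basis $\{\varphi_k\}_{k=1}^K$ is orthonormal under the data distribution $\mu$ and the test reward lies in its span; then the task code is recovered by inner products, estimated directly from calibration samples, with no optimization at test time. This is a standard linear-algebra identity stated under those assumptions, not a formula reproduced verbatim from the cited papers.

$$
r = \sum_{k=1}^{K} w_k\,\varphi_k,
\qquad
w_k = \langle r, \varphi_k \rangle_\mu
    = \mathbb{E}_{s \sim \mu}\!\left[ r(s)\,\varphi_k(s) \right]
    \approx \frac{1}{N} \sum_{j=1}^{N} r(s_j)\,\varphi_k(s_j),
\quad s_j \sim \mu .
$$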

4. Empirical Evaluation: Benchmarks, Limitations, Impact

Empirical results span classic continuous control (Walker, Cheetah, Quadruped), navigation (AntMaze), manipulation, high-dimensional humanoid motion, and domain-general gridworlds (Garcin et al., 2024, Tirinzoni et al., 15 Apr 2025, Ingebrand et al., 2024).

  • Function Encoders (FE) demonstrate state-of-the-art data efficiency and stability, outperforming Transformer baselines and matching or exceeding oracle policies in transition function inference (Half-Cheetah), runner identification (multi-agent tag), and multi-goal Pacman (Ingebrand et al., 2024).
  • FB and FB-M consistently reach 81–85% of supervised RL or oracle performance in zero-shot fashion given good replay buffers, with memory-based architectures closing the gap under partial observability (Touati et al., 2022, Jeen et al., 18 Jun 2025).
  • Behavioral Foundation Models (FB-CPR) enable whole-body humanoid control, expressing human-like behaviors across reward optimization, motion tracking, and goal-reaching without fine-tuning, judged competitive with (and sometimes preferred to) offline RL top-lines (Tirinzoni et al., 15 Apr 2025).
  • BREEZE achieves high robustness and expressivity, outperforming prior offline zero-shot RL and demonstrating resilience to limited data via attention-enhanced FB architectures and task-conditioned diffusion policies (Zheng et al., 17 Oct 2025).
  • FRE, Operator Deep Q-Learning, Hypernetworks: FRE emerges as superior to goal-conditioned baselines and prior zero-shot methods on AntMaze, ExORL, and Kitchen. Operator-based models achieve 95–105% of oracle TD3 policy returns on test rewards, strongly surpassing successor features (Tang et al., 2022). Hypernetworks generate near-optimal policies for unseen (context-parameterized) MDPs, close to fully trained oracles (Rezaei-Shoshtari et al., 2022).

Zero-shot RL methods have shown utility in domains with limited or no simulator/data access, such as building control and high-dimensional embodied agents (Jeen et al., 2022, Tirinzoni et al., 15 Apr 2025). The PEARL algorithm matches oracle performance via a unified model-based pipeline that combines a short commissioning period, variance-based exploration, and probabilistic system identification, reducing emissions while maintaining thermal comfort (Jeen et al., 2022).

5. Challenges, Limitations, and Remedies

Several issues temper the utility of zero-shot RL:

  • Out-of-distribution actions: Value and successor measure overestimation for unseen actions in small or homogeneous datasets can degrade zero-shot generalization (Jeen et al., 2023, Zheng et al., 17 Oct 2025).
  • Representational limitations: MLP-based models may lack capacity for multimodal policies or fine-grained state coverage, motivating the use of attention and diffusion architectures (Zheng et al., 17 Oct 2025).
  • Partial observability and context inference: State/task misidentification in POMDPs degrades performance, requiring recurrent memory networks or joint context-policy models (Jeen et al., 18 Jun 2025, Ndir et al., 2024).
  • Task priors and specialization: Choices of reward prior (white noise, dense Gaussian) drive specialization or narrow skill learning; mixtures with goal priors and variance penalization can broaden coverage (Ollivier, 15 Feb 2025).
  • Environment design and generalization gap: Overfitting to training levels manifests as high mutual information between the policy and instance identity; value-loss-based sampling and data-regularized environment design (DRED) reduce it, achieving strong zero-shot transfer and extrapolation (Garcin et al., 2024).
  • Data constraints: Real-world data scarcity and homogeneity require conservative regularization (VC-FB, MC-FB, IQL-style expectiles), ensuring robust performance without regret on high-quality datasets (Jeen et al., 2023, Zheng et al., 17 Oct 2025, Jeen, 22 Aug 2025).
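For the last point, the IQL-style expectile objective can be stated in a few lines: an asymmetric squared loss with $\tau > 0.5$ up-weights positive TD errors, so the learned value tracks an upper expectile of in-dataset targets instead of bootstrapping through out-of-distribution actions. The snippet is a generic sketch of the loss, not any cited paper's training loop.

```python
import torch

def expectile_loss(td_error: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Asymmetric squared loss used in IQL-style value learning.

    td_error = target - value. With tau > 0.5, positive errors (target above
    the current value) are weighted more heavily, so the learned value tracks
    an upper expectile of targets computed from dataset actions only.
    """
    weight = torch.abs(tau - (td_error < 0).float())   # tau if positive, 1 - tau if negative
    return (weight * td_error.pow(2)).mean()

# Example usage on a batch of TD errors computed from in-dataset actions.
loss = expectile_loss(torch.randn(256), tau=0.7)
```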

6. Extensions and Future Directions

Key active directions and open problems include:

  • Scaling to rich observations: Integrating image, video, language, and multimodal task specifications into zero-shot RL pipelines, toward large-scale behavioral foundation models (Ventura et al., 23 Oct 2025, Tirinzoni et al., 15 Apr 2025).
  • Exploration and coverage: Incorporating explicit exploration or diversity objectives into pre-training to guarantee representation coverage across all relevant behaviors (Ventura et al., 23 Oct 2025).
  • Hybrid and meta approaches: Merging compositional zero-shot mechanisms with direct end-to-end pipelines for improved sample efficiency and adaptation (Ventura et al., 23 Oct 2025).
  • Online/continual learning: Extending zero-shot RL to continual, active, or meta-learning paradigms—both in context-inference (joint context-policy optimization) and in reward specification (Ndir et al., 2024, Bagatella et al., 1 Oct 2025).
  • Multi-agent, adversarial, online deployment: Expanding zero-shot methods to multi-agent, competitive, and adversarial settings, including real-world systems with feedback, uncertainty, and limited data availability (Jeen, 22 Aug 2025).
  • Benchmarking and evaluation: Developing dedicated zero-shot RL benchmarks stressing adversarial or high-frequency reward patterns, and formalizing generalization-gap and representation mutual information analyses (Ventura et al., 23 Oct 2025, Garcin et al., 2024).

Zero-shot RL synthesizes ideas from unsupervised RL, multitask and transfer RL, representation learning, function approximation, and operator theory. It draws foundational links to domain adaptation (e.g., DARLA for disentangled visual factors (Higgins et al., 2017)), hierarchical task generalization via analogy-making (Oh et al., 2017), and contextual RL leveraging hypernetworks and context inference (Rezaei-Shoshtari et al., 2022, Ndir et al., 2024). The design of foundation-scale behavioral models and RL agents that can serve arbitrary tasks without fine-tuning positions zero-shot RL as a cornerstone methodology for next-generation generalist intelligence.


References: (Ventura et al., 23 Oct 2025, Ingebrand et al., 2024, Touati et al., 2022, Zheng et al., 17 Oct 2025, Jeen et al., 2023, Garcin et al., 2024, Jeen et al., 18 Jun 2025, Tang et al., 2022, Frans et al., 2024, Ollivier, 15 Feb 2025, Tirinzoni et al., 15 Apr 2025, Rezaei-Shoshtari et al., 2022, Higgins et al., 2017, Oh et al., 2017, Jeen, 22 Aug 2025, Ndir et al., 2024, Bagatella et al., 1 Oct 2025, Jeen et al., 2022).
