RL-Zero: Zero-Shot Reinforcement Learning

Updated 19 December 2025
  • RL-Zero is a framework for zero-shot reinforcement learning that decouples reward acquisition from policy inference, enabling rapid generalization to new tasks.
  • It integrates methods from classical control, such as successor features and forward-backward representations, with modern language-to-behavior transfer techniques.
  • RL-Zero approaches achieve near-optimal performance by leveraging unsupervised pretraining and structured representations, paving the way for universal, instruction-following agents.

RL-Zero refers to a family of methodologies and theoretical frameworks for zero-shot reinforcement learning (RL) in which an agent, typically a neural network model, can rapidly or even instantly infer effective policies for new tasks or instructions without explicit in-domain supervision or supervised fine-tuning. The fundamental objective is to achieve, after pretraining on reward-free data or with only sparse outcome-based signals, a form of "controllability" in policy formation: the ability to generalize to novel goals, reward functions, or complex user prompts at test time. Under the RL-Zero umbrella lies a spectrum of approaches, from classical control with guaranteed zero-shot adaptation via structured representations, to modern scaling laws in language-model reasoning, to end-to-end language-to-behavior frameworks in both discrete and continuous domains.

1. Formal Problem Setting and Principal Algorithms

The RL-Zero paradigm aims to decouple reward maximization from reward signal acquisition. In the canonical setting, the RL-Zero agent is pretrained in a reward-free or unsupervised phase, typically by modeling state transitions, learning representation priors, or using self-supervised objectives. At deployment, it is assumed to (a) receive a new reward function, goal signal, or instruction, and (b) infer a (near-)optimal policy instantly—i.e., without further learning, adaptation steps, or reward-based updates.
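As a compact formalization of this setting (a hedged restatement; the symbols $\Theta$, $\mathcal{I}$, $\mathcal{R}$, and $\varepsilon$ are introduced here for illustration rather than taken from any single paper), pretraining observes only the reward-free dynamics and must produce artifacts from which any admissible test reward is mapped, without further updates, to a near-optimal policy:

$$(\mathcal{S}, \mathcal{A}, P, \gamma) \;\xrightarrow{\text{reward-free pretraining}}\; \Theta, \qquad r \;\mapsto\; \pi_r = \mathcal{I}(\Theta, r) \ \text{at test time},$$

$$\text{such that } \sup_{r \in \mathcal{R}} \Big( \max_{\pi} J_r(\pi) - J_r(\pi_r) \Big) \le \varepsilon, \qquad J_r(\pi) = \mathbb{E}_\pi\!\Big[ \textstyle\sum_{t \ge 0} \gamma^t r(s_t, a_t) \Big].$$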

Classic RL-Zero Frameworks

  • Successor Features (SFs): The agent models a feature map $\phi : \mathcal{S} \to \mathbb{R}^d$, then for each policy $\pi_z$ indexed by $z \in \mathbb{R}^d$ defines successor features $\psi^\pi(s,a) = \mathbb{E}\big[\sum_{t \geq 0} \gamma^t \phi(s_{t+1})\big]$. Given a new reward $r(s) = \phi(s)^\top z$, the agent forms $\pi_z(s) = \arg\max_a \psi(s,a;z)^\top z$ and achieves zero-shot optimality when $r$ is in the span of $\phi$ (Touati et al., 2022).
  • Forward-Backward (FB) Representations: Here, the agent models the successor measure as $M^{\pi_z}(s,a,ds') \approx F(s,a,z)^\top B(s')\,\rho(ds')$ and learns $F, B$ jointly, enabling optimal zero-shot inference across reward families (including arbitrary downstream tasks) and reaching roughly 85% of offline supervised performance across control tasks (Touati et al., 2022). A minimal numerical sketch of SF-style zero-shot inference follows this list.
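The sketch below illustrates the SF-style zero-shot step described above: infer the task vector $z$ by least squares from a handful of reward samples, then act greedily with respect to $\psi(s,a;z)^\top z$. The arrays `phi` and `psi` are random placeholders standing in for pretrained artifacts, and the tabular layout is an assumption made only to keep the example self-contained.

```python
# Minimal sketch of successor-feature (SF) zero-shot policy inference.
# Assumptions: a tabular toy problem with |S| states, |A| actions, d features;
# phi and psi stand in for quantities learned during reward-free pretraining.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, d = 20, 4, 8

# Pretrained artifacts (placeholders here): state features phi(s) and
# universal successor features psi(s, a; z). In practice psi is a learned
# network conditioned on z; a fixed tensor keeps the sketch self-contained.
phi = rng.normal(size=(n_states, d))                       # phi(s)
psi = rng.normal(size=(n_states, n_actions, d))            # psi(s, a; z), z-dependence dropped

def infer_task_vector(reward_states, reward_values):
    """Regress the task vector z from labelled reward samples r(s) ~ phi(s)^T z."""
    features = phi[reward_states]                          # (m, d)
    z, *_ = np.linalg.lstsq(features, reward_values, rcond=None)
    return z

def zero_shot_policy(z):
    """Greedy policy pi_z(s) = argmax_a psi(s, a; z)^T z, with no further learning."""
    q_values = psi @ z                                     # (n_states, n_actions)
    return q_values.argmax(axis=1)

# Example: a handful of reward samples from the new task suffices to act.
reward_states = np.array([0, 3, 7, 11, 15])
reward_values = phi[reward_states] @ rng.normal(size=d)    # synthetic linear reward
policy = zero_shot_policy(infer_task_vector(reward_states, reward_values))
print(policy[:10])
```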

Direct Language-to-Policy Transfer

  • RLZero: The agent receives a prompt $e^\ell$ (a text description of the task), "imagines" a sequence of video frames using a generative video model, "projects" these frames onto previously observed states via retrieval in a joint embedding space, and "imitates" the imagined trajectory through closed-form policy inference using a pretrained successor-measure model (a "behavior foundation model"). The imitation step minimizes the KL divergence between the imagined and candidate visitation distributions, yielding an instant policy for the open-ended prompt (Sikchi et al., 7 Dec 2024). A retrieval-based sketch of the projection step follows.
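A hedged sketch of the "project" step under simplifying assumptions: imagined frames and dataset states are both embedded in a shared space (here random placeholder vectors rather than outputs of a pretrained multimodal encoder), and each frame is mapped to its nearest observed state by cosine similarity.

```python
# Hedged sketch of RLZero's "project" step: map imagined frames onto the
# nearest states the agent has actually visited, via cosine similarity in a
# shared embedding space. The embeddings here are random placeholders; in the
# paper they come from a pretrained multimodal encoder.
import numpy as np

rng = np.random.default_rng(1)
n_dataset_states, n_imagined_frames, embed_dim = 500, 16, 64

dataset_state_emb = rng.normal(size=(n_dataset_states, embed_dim))    # observed states
imagined_frame_emb = rng.normal(size=(n_imagined_frames, embed_dim))  # generated frames

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def project_to_dataset(frames, states):
    """Return, for each imagined frame, the index of the closest observed state."""
    sims = l2_normalize(frames) @ l2_normalize(states).T   # cosine similarities
    return sims.argmax(axis=1)

projected_state_ids = project_to_dataset(imagined_frame_emb, dataset_state_emb)
# The projected state sequence defines a target visitation distribution that the
# behavior foundation model then imitates in closed form (not shown here).
print(projected_state_ids)
```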

LLMs and Zero-RL

  • Zero-RL (a.k.a. R1-Zero) for Reasoning: In recent scaling work, RL-Zero is operationalized by directly applying policy-gradient RL (typically PPO or GRPO) to pretrained LLMs, with rewards derived from deterministic verifiers (e.g., math answer correctness) and without upstream supervised fine-tuning. The LLM is thus “taught to reason” solely by outcome reward, often observing dramatic gains in chain-of-thought (CoT) length and “self-reflective” behaviors (Zeng et al., 24 Mar 2025, Liu et al., 26 Mar 2025).
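A minimal sketch of such a verifier reward, assuming the common convention that the final answer is reported in a `\boxed{...}` span; the extraction pattern and exact-match criterion are illustrative, not a specific paper's implementation.

```python
# Minimal sketch of a rule-based verifier reward for math-style zero-RL.
# Reward is 1.0 for an exact match of the boxed answer, else 0.0.
import re

def extract_boxed_answer(completion: str) -> str | None:
    """Return the content of the last \\boxed{...} span in the completion, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifier_reward(completion: str, ground_truth: str) -> float:
    answer = extract_boxed_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

# Example rollout scoring; a format or length penalty can be subtracted here
# to discourage reward hacking via artificially long or malformed outputs.
print(verifier_reward(r"... so the result is \boxed{42}", "42"))   # 1.0
print(verifier_reward("the result is 42", "42"))                   # 0.0
```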

2. Key Components: Representation, Reward, and Policy Structure

Representation Learning

  • RL-Zero agents rely critically on the quality and coverage of their learned representations. In classic control, the ability to linearly span downstream rewards from $\phi(s)$ or $B(s)$ is essential. In the video-language paradigm, pretrained multimodal embeddings (e.g., SigLIP, InternVideo2) provide semantic grounding, and effective retrieval/reconstruction is crucial for accurate "projection" (Sikchi et al., 7 Dec 2024).

Reward Signal and Policy Optimization

  • Unsupervised or Self-Supervised Phase: Common in classical RL-Zero and control, this phase is entirely reward-free, relying on transition modeling, auxiliary contrastive losses, or bisimulation criteria (Mazoure et al., 2021).
  • Rule-Based or Verifier Signals: In LLM zero-RL, the reward is supplied by a deterministic or learned verifier, often a binary $\{0,1\}$ signal for correct/incorrect answers. Auxiliary format or length penalties may be added to prevent reward hacking (e.g., artificially elongated outputs) (Zeng et al., 24 Mar 2025, Zeng et al., 29 Oct 2025).
  • Policy Gradient with Group Baselines: Optimizers such as GRPO or Dr. GRPO use group-wise or leave-one-out baselines for advantage estimation across sampled rollouts. The precise normalization significantly affects emergent behaviors; incorrect normalization introduces length or response-level biases (Liu et al., 26 Mar 2025). See the group-baseline sketch after this list.
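The sketch below illustrates the normalization choice discussed above: rewards for a group of rollouts sampled from the same prompt are centered by the group mean, and dividing by the group standard deviation (GRPO-style) versus omitting it (in the spirit of the Dr. GRPO correction) is exactly the kind of detail that introduces or removes bias. The function name and group size are illustrative.

```python
# Hedged sketch of group-baseline advantage estimation: sample a group of
# rollouts per prompt, score them with the verifier, and subtract the group
# mean. The optional std division is the normalization choice under discussion.
import numpy as np

def group_advantages(rewards: np.ndarray, normalize_std: bool = False) -> np.ndarray:
    """rewards: shape (group_size,), verifier rewards for rollouts of one prompt."""
    adv = rewards - rewards.mean()
    if normalize_std:
        adv = adv / (rewards.std() + 1e-8)   # GRPO-style; can re-weight easy vs. hard prompts
    return adv

rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])  # binary verifier scores
print(group_advantages(rewards))                    # mean baseline only
print(group_advantages(rewards, normalize_std=True))
```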

3. Empirical Findings and Benchmark Results

Classic RL-Zero in Control

Method                      Zero-Shot Perf. (% of Offline TD3)    Key Finding
Forward-Backward (FB)       85                                    Consistent across tasks and replay buffers; robust on maze, walker, cheetah, quadruped
SF + Laplacian features     73                                    Strong when $\phi$ are Laplacian eigenfunctions
SF + AE, ICM, APS, etc.     41 or lower                           Unreliable generalization

FB consistently outperforms traditional SF approaches unless strong priors (e.g., Laplacian features) are available (Touati et al., 2022).

LLM Zero-RL for Reasoning

  • Accuracy Gains: Llama3-8B: 39.7% → 79.2% (GSM8K); Mistral-Small-24B: 78.6% → 92.0% (GSM8K); Qwen2.5 models generally benefit less due to strong base performance (Zeng et al., 24 Mar 2025).
  • Cognitive Behaviors: RL-Zero elicits "Verification" and "Enumeration" (the "aha moment") even in non-Qwen small models, though these behaviors emerge only after sufficient RL steps (Zeng et al., 24 Mar 2025).
  • Failure Modes: Strict format rewards collapse exploration; data-difficulty mismatch causes optimization collapse (e.g., Mistral-7B fails on “Hard” GSM8K + MATH subsets) (Zeng et al., 24 Mar 2025).
  • Response Length vs. Reasoning: Increased CoT length does not always correspond to true cognitive behaviors; “clip ratio” and average valid response length must be monitored (Zeng et al., 24 Mar 2025).

RLZero in Open-Ended Continuous Domains

  • RLZero wins 83.2% of head-to-head evaluations against the best baseline (offline RL with embedding rewards) and 80% for cross-embodiment video imitation, with no in-domain language-behavior labels (Sikchi et al., 7 Dec 2024).

General-Domain Zero-RL

  • Multi-task RL-Zero (combining verifiable and generative reward domains) achieves substantial gains in both math reasoning and open-ended benchmarks (e.g., MATH-500 accuracy: 92.4% for Qwen3-14B, outperforming instruct-tuned comparators), facilitated by smooth length penalties to prevent output hacking (Zeng et al., 29 Oct 2025).

4. Common Limitations, Remedies, and Comparisons

Zero-Reward Barrier and Curriculum Design

If the base model never samples a correct answer, binary-outcome RL collapses—the gradient is zero for all trajectories and learning halts. Sophisticated credit assignment, diversity encouragement, or reward-shaping methods (VinePPO, step-level MC baselines) are ineffective in “dead” regimes. A minimal curriculum (introducing easier instances mixed with unsolved ones) breaks the barrier, allowing RL-Zero protocols to bootstrap non-trivial performance without algorithmic modification (Prakash et al., 4 Oct 2025).
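A hedged sketch of such a minimal curriculum, with illustrative thresholds and mixing ratios: while the recent success rate on the target distribution is near zero, a fraction of easier instances is mixed into each batch so that some rollouts receive non-zero reward and the policy gradient stops vanishing.

```python
# Hedged sketch of a minimal curriculum for escaping the zero-reward barrier.
# Thresholds, pool names, and mixing ratios are illustrative assumptions.
import random

def build_training_batch(hard_pool, easy_pool, recent_success_rate,
                         batch_size=32, easy_fraction=0.5, threshold=0.01):
    """Return a batch of prompts, blending in easy instances while success is ~0."""
    if recent_success_rate < threshold:
        n_easy = int(batch_size * easy_fraction)
    else:
        n_easy = 0   # curriculum fades out once the model starts succeeding
    batch = random.sample(easy_pool, n_easy) + random.sample(hard_pool, batch_size - n_easy)
    random.shuffle(batch)
    return batch

easy_pool = [f"easy-{i}" for i in range(100)]
hard_pool = [f"hard-{i}" for i in range(100)]
print(build_training_batch(hard_pool, easy_pool, recent_success_rate=0.0)[:5])
```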

Distillation vs. Zero-RL

Distillation with as few as 920 chain-of-thought examples (from a strong teacher such as DeepSeek R1) reliably surpasses zero-RL on downstream reasoning benchmarks. Distilled models exhibit 2–4× higher incidence of multi-perspective thinking and metacognitive behaviors compared to RL-Zero, with higher output diversity and flexibility of reasoning (e.g., more logical connectors, anthropomorphic tokens), even when such tokens are banned in sampling. RL-Zero on smaller models (≤32B) does not produce similar advanced cognitive patterns and is less data-efficient (Hu et al., 27 May 2025).

Reward Hacking and Biases in Policy Optimization

Policy-gradient optimizers such as GRPO introduce length bias; Dr. GRPO corrects this and avoids overgeneration of long, incoherent outputs without affecting expected accuracy (Liu et al., 26 Mar 2025). Auxiliary penalties (e.g., smooth length control) and well-matched data difficulty are required to maintain output diversity and chain-of-thought richness, especially in generative and open-ended tasks (Zeng et al., 29 Oct 2025).
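The following sketch illustrates the length-bias mechanism under stated assumptions: averaging per-token loss terms over each response's own length (GRPO-style) implicitly re-weights long responses, whereas aggregating with a fixed normalizer (in the spirit of the Dr. GRPO correction) removes the length-dependent scaling. The token losses are placeholders, not a full policy-gradient implementation.

```python
# Hedged sketch of length-dependent vs. length-independent loss aggregation.
# Token "losses" below stand in for -advantage * log-prob terms per token.
import numpy as np

def length_normalized_loss(token_losses_per_response):
    # Mean over each response's own length, then mean over the group (GRPO-style).
    return float(np.mean([np.mean(t) for t in token_losses_per_response]))

def fixed_normalizer_loss(token_losses_per_response, max_len=1024):
    # Sum over tokens with a constant normalizer, removing length-dependent scaling.
    return float(np.mean([np.sum(t) / max_len for t in token_losses_per_response]))

group = [np.full(10, 0.2), np.full(400, 0.2)]   # a short and a long rollout
print(length_normalized_loss(group), fixed_normalizer_loss(group))
```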

5. Extensions: Representation-First Zero-Shot RL and Beyond

Self-Supervised Representation Learning for Generalization

Cross-Trajectory Representation Learning (CTRL) demonstrates that shaping the encoder with a self-supervised clustering and prediction loss, without propagating gradients from reward or value functions, yields better generalization to new task settings, even in visually rich RL environments. The CTRL-induced "pseudo-bisimulation" partitions agent behaviors by functional similarity, outperforming purely reward-driven baselines. The broader principle is to decouple abstract state representation from reward signals so as to enable robust out-of-distribution zero-shot RL (Mazoure et al., 2021). An illustrative sketch of this decoupling follows.
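An illustrative sketch of the decoupling principle, under stated assumptions and not the paper's exact architecture or loss: the encoder is shaped by a prototype-based clustering/prediction objective on paired trajectory views, while value-function gradients are stopped before they reach the encoder.

```python
# Schematic of reward-decoupled representation learning (assumed, CTRL-inspired):
# a self-supervised clustering/prediction loss trains the encoder; the value
# head consumes detached representations so reward gradients never shape them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, obs_dim=32, rep_dim=16, n_prototypes=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, rep_dim))
        self.prototypes = nn.Linear(rep_dim, n_prototypes, bias=False)  # cluster heads

    def forward(self, obs):
        return F.normalize(self.net(obs), dim=-1)

encoder = Encoder()
value_head = nn.Linear(16, 1)

obs_a = torch.randn(64, 32)                       # trajectory-segment view
obs_b = obs_a + 0.05 * torch.randn_like(obs_a)    # paired/augmented view

z_a, z_b = encoder(obs_a), encoder(obs_b)
# Self-supervised objective: paired views should fall in the same cluster.
logits_a, logits_b = encoder.prototypes(z_a), encoder.prototypes(z_b)
targets = logits_b.softmax(dim=-1).detach()
repr_loss = F.cross_entropy(logits_a, targets)

# Value loss uses the representation but does NOT shape it (gradient is stopped).
value_loss = F.mse_loss(value_head(z_a.detach()).squeeze(-1), torch.zeros(64))
(repr_loss + value_loss).backward()
```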

Scalable Zero-Shot Foundation Models

RLZero points to a pathway where an unsupervised “behavior foundation model” can generalize arbitrarily: closed-form policy computation from reward or instruction, zero-shot adaptation across embodiment, and unsupervised transfer from exogenous data modalities (e.g., cross-agent videos) (Sikchi et al., 7 Dec 2024).

Controllability and Universal Instruction Following

FB representations and RLZero both instantiate the notion of “controllable agents”: agents that, after pretraining, can follow arbitrary instructions (reward functions, language prompts, video snippets) without further gradient steps or reward engineering, at least for environments sufficiently explored during the unsupervised phase (Touati et al., 2022, Sikchi et al., 7 Dec 2024).

6. Outlook: Practical Recommendations and Open Challenges

  • Leverage structured representation learning (e.g., FB, Laplacian SFs) for offline zero-shot RL in low- and mid-dimensional state spaces.
  • In LLMs, always prefer curriculum-based data design and unbiased policy-gradient formulations; avoid length and format biases.
  • For reasoning tasks, if base models are weak, small-scale distillation or domain-specific pretraining is more efficient than pure RL-Zero.
  • RLZero and FB architectures highlight a route to universal, truly zero-shot controllable agents, but bottlenecks remain in representation coverage, robustness under distribution shift, and evaluation of open-ended instruction following.
  • Theoretical advances are needed to clarify which priors and representation learning schemes guarantee universal zero-shot coverage for broad classes of reward or instruction spaces. Benchmarking protocols should emphasize cold-start regimes and out-of-distribution generalization (Touati et al., 2022, Prakash et al., 4 Oct 2025).

RL-Zero thus represents both a unifying concept for zero-shot policy inference and a diverse set of practical algorithms, serving as a blueprint for the next generation of generalist, instruction-following RL agents across both discrete and continuous domains.
