C-BET: Change-Based Exploration Transfer
- C-BET is an exploration-driven transfer learning framework that augments reinforcement learning with intrinsic rewards derived from state change events.
- It employs pseudocounts over states and their transitions to generate robust intrinsic rewards, even in sparse-reward environments.
- C-BET effectively transfers exploratory policies to new tasks, demonstrating enhanced exploratory coverage and improved performance in complex domains.
Change-Based Exploration Transfer (C-BET) is an exploration-driven transfer learning framework for reinforcement learning (RL) that augments standard learning protocols with intrinsic rewards derived from counts over “change events” in state transitions. C-BET seeks to address challenges posed by sparse-reward environments, leverages both agent-centric and environment-centric novelty, and provides a structured mechanism for transferring exploration policies to new tasks. Originally introduced in the context of model-free RL and later adapted for world-model agents such as DreamerV3, C-BET has demonstrated domain- and architecture-dependent efficacy for both tabula rasa learning and transfer setups, particularly in challenging, compositional environments (Parisi et al., 2021, Ferrao et al., 26 Mar 2025).
1. Theoretical Motivation and Intrinsic Reward Definition
C-BET formalizes exploration as the pursuit of rarely encountered transitions, operationalized by assigning pseudocounts to both state visitations and "change events" $c_t$, where a change event encodes the difference between consecutive states, $c_t = \text{change}(s_t, s_{t+1})$. The intrinsic reward at each step is

$$ r^{\text{int}}_t = \frac{1}{N(s_{t+1}) + N(c_t)}, $$

where $N(s_{t+1})$ is a hash-based visitation count of the next state and $N(c_t)$ is a count of the particular change $c_t$. This formulation combines agent-centric novelty (state counts) with environment-centric interestingness (change counts), reflecting both the agent's uncertainty over its own experience and the inherent novelty in the environment's transitions.
To prevent intrinsic rewards from vanishing as counts grow unbounded, C-BET randomly resets each count at every step with probability $1 - \gamma^{\text{int}}$ (where $\gamma^{\text{int}}$ is the intrinsic-reward discount). This mechanism produces nontrivial intrinsic rewards even deep into training (Parisi et al., 2021, Ferrao et al., 26 Mar 2025).
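As a concrete illustration, the following is a minimal sketch of this count-and-reset bonus, assuming raw array observations and Python's built-in `hash` over bytes as a stand-in for the hash-based encodings described later (class and parameter names are hypothetical, not the reference implementation):

```python
import numpy as np
from collections import defaultdict

class CBETIntrinsicReward:
    """Sketch of the C-BET bonus r_int = 1 / (N(s') + N(c)) with random count resets."""

    def __init__(self, gamma_int=0.99, seed=0):
        self.state_counts = defaultdict(int)   # N(s): hashed state-visitation counts
        self.change_counts = defaultdict(int)  # N(c): hashed change-event counts
        self.reset_prob = 1.0 - gamma_int      # assumed reset probability tied to the intrinsic discount
        self.rng = np.random.default_rng(seed)

    def __call__(self, obs, next_obs):
        obs = np.asarray(obs, dtype=np.int64)
        next_obs = np.asarray(next_obs, dtype=np.int64)
        change = next_obs - obs                # change event: difference of consecutive observations

        s_key = hash(next_obs.tobytes())       # stand-in for the hash-based state encoding
        c_key = hash(change.tobytes())
        self.state_counts[s_key] += 1
        self.change_counts[c_key] += 1

        reward = 1.0 / (self.state_counts[s_key] + self.change_counts[c_key])
        self._maybe_reset()
        return reward

    def _maybe_reset(self):
        # Each stored count is independently reset with probability (1 - gamma_int),
        # so the bonus never vanishes late in training (illustrative, unoptimized).
        for table in (self.state_counts, self.change_counts):
            for key in list(table):
                if self.rng.random() < self.reset_prob:
                    table[key] = 0
```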
2. Algorithmic Protocol and Integration into RL Agents
C-BET operates in both model-free (e.g., IMPALA) and model-based settings (e.g., DreamerV3). The procedure splits into two phases:
Tabula Rasa (Vanilla RL):
- Agents are trained directly on a combination of extrinsic and intrinsic rewards,

$$ r_t = r^{\text{ext}}_t + \beta\, r^{\text{int}}_t, $$

where $\beta$ tunes intrinsic-reward strength (selected by grid search).
Transfer Learning:
- Pre-training: The agent is trained on source environments with only the intrinsic reward for a prescribed number of steps.
- Transfer: The pre-trained exploration policy (e.g., the exploration-policy logits in IMPALA or the exploration actor in DreamerV3) is frozen, and a new policy/value head (or agent) is trained with extrinsic rewards in the target environment. The final task policy is calculated as

$$ \pi(a \mid s) = \operatorname{softmax}\big(h_{\text{explore}}(s) + h_{\text{task}}(s)\big), $$

or, for DreamerV3, $\pi(a \mid z_t) = \operatorname{softmax}\big(h_{\text{explore}}(z_t) + h_{\text{task}}(z_t)\big)$ using latent state encodings $z_t$.
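A minimal PyTorch-style sketch of this additive combination during transfer, assuming hypothetical `explore_head` and `task_head` modules that map features (or DreamerV3 latents) to action logits:

```python
import torch
import torch.nn.functional as F

def transfer_policy(explore_head, task_head, features):
    """Combine frozen exploration logits with trainable task logits (sketch)."""
    with torch.no_grad():                  # the pre-trained exploration head stays frozen
        explore_logits = explore_head(features)
    task_logits = task_head(features)      # only the task head receives gradients
    return F.softmax(explore_logits + task_logits, dim=-1)
```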
Pseudocode from (Ferrao et al., 26 Mar 2025):
- Initialize counts and agent(s).
- At each step: execute the action, observe the next state and extrinsic reward, update the counts, compute $r^{\text{int}}_t$, compute $r_t = r^{\text{ext}}_t + \beta\, r^{\text{int}}_t$, and update the agent on $r_t$.
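A schematic rendering of this loop, assuming a Gym-style `env`, a generic `agent` with `act`/`update` methods, and the count module sketched earlier (all names hypothetical):

```python
def cbet_step(env, agent, intrinsic_fn, obs, beta=0.005):
    """One interaction step: mix extrinsic and intrinsic rewards, then update (sketch)."""
    action = agent.act(obs)
    next_obs, r_ext, done, info = env.step(action)

    r_int = intrinsic_fn(obs, next_obs)    # count-based bonus over state and change counts
    r_total = r_ext + beta * r_int         # beta tunes intrinsic-reward strength

    agent.update(obs, action, r_total, next_obs, done)
    return next_obs, done
```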
3. Network Architectures, State and Change Encodings
In MiniGrid tasks, C-BET uses deep convolutional policy and value networks, typically with 3 convolutional layers followed by fully connected and LSTM layers for sequential processing, mirroring the IMPALA architecture. The count-based state key is derived via hash functions applied to (potentially egocentric or panoramic) state representations. The "change event" is constructed by concatenating multiple egocentric observations and computing a pixel-wise difference, or, in the Habitat environment, via a ternary SimHash encoding for robust uniqueness (Parisi et al., 2021).
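The SimHash-style encoding can be sketched as a fixed random projection followed by a three-level quantization; the deadzone threshold below is an assumed detail for illustration, not the reference implementation's exact scheme:

```python
import numpy as np

class TernarySimHash:
    """Illustrative ternary SimHash encoder for states and change events."""

    def __init__(self, input_dim, code_bits=128, deadzone=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((code_bits, input_dim))  # fixed random projection
        self.deadzone = deadzone

    def encode(self, x):
        y = self.proj @ np.asarray(x, dtype=np.float64).ravel()
        # Quantize each projection to {-1, 0, +1}; values near zero map to 0.
        code = np.where(np.abs(y) < self.deadzone, 0, np.sign(y)).astype(np.int8)
        return code.tobytes()              # hashable key for the count tables
```

A change event can then be keyed by encoding the difference of consecutive observations and feeding the resulting bytes into the count tables.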
For DreamerV3 world-model agents, two full models are instantiated: an intrinsic-only and an extrinsic-only agent, each with its own world model and policy. After pre-training and fine-tuning, decision making ensembles both policies additively at the softmax layer. No additional architectural modules are introduced; transfer is implemented by combining output logits (Ferrao et al., 26 Mar 2025).
4. Experimental Evaluation and Key Findings
C-BET has been evaluated in both MiniGrid (procedurally generated grid-based navigation) and Crafter (open-world survival) environments:
| Environment | Agent | Tabula Rasa: C-BET Impact | Transfer: C-BET Impact |
|---|---|---|---|
| MiniGrid | DreamerV3 | Returns ↓, Variance ↑ | Initial lag, late acceleration |
| MiniGrid | IMPALA | Slight gain | C-BET consistently superior |
| Crafter | DreamerV3 | Marked improvement | C-BET consistently superior |
| Crafter | IMPALA | Small gain | C-BET consistently superior |
Notable results:
- In Crafter, DreamerV3+C-BET attains higher returns than DreamerV3 alone, demonstrating its value in deep, multi-step, compositional tasks.
- In MiniGrid, DreamerV3+C-BET performance drops compared to standard DreamerV3 (higher variance, lower returns), likely because the novelty signal distracts from goal-focused exploration.
- Transfer experiments reveal that C-BET pre-training does not guarantee rapid task performance “out of the box”; the effect is environment-dependent.
- C-BET outperforms all tested baselines—including count-only, RIDE, curiosity, and RND—on measures of exploratory coverage and unique interactions, especially in transfer to previously unseen environments (Parisi et al., 2021).
5. Design Insights, Limitations, and Practical Considerations
Effectiveness is Environment-Dependent:
C-BET is effective in tasks requiring deep exploration, where novelty-driven behavior is aligned with discovering subgoals or acquiring diverse skills (e.g., crafting in Crafter). In narrowly defined tasks (e.g., MiniGrid Unlock), the same novelty signal can distract and degrade policy learning by drawing the agent's attention away from the immediate objective (Ferrao et al., 26 Mar 2025).
Resource Overhead:
Applying C-BET to world-model agents such as DreamerV3 doubles VRAM usage, since two complete models are instantiated and maintained throughout training.
Hyperparameter Sensitivity:
The intrinsic/extrinsic balance parameter $\beta$ demands environment- and architecture-specific tuning, typically via grid search:
| Algorithm | MiniGrid ($\beta$) | Crafter ($\beta$) |
|---|---|---|
| IMPALA | 0.0025 | 0.005 |
| DreamerV3 | 0.0025 | 0.001 |
Reward Formulation and Reset Mechanisms:
Experimentally, mixing the state and change pseudocounts inside a single reward denominator produces better results than separate or multiplicative bonuses. Random count resets are critical to prevent the bonus from vanishing over time (Parisi et al., 2021).
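To make the distinction concrete, the mixed-denominator form contrasts with the alternatives roughly as follows (the separate and multiplicative variants are paraphrases of the ablated designs, not exact reproductions):

```python
def mixed_denominator(n_state, n_change):
    return 1.0 / (n_state + n_change)      # the C-BET formulation

def separate_bonuses(n_state, n_change):
    return 1.0 / n_state + 1.0 / n_change  # ablation-style: independent bonuses, summed

def multiplicative_bonus(n_state, n_change):
    return 1.0 / (n_state * n_change)      # ablation-style: multiplicatively coupled counts
```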
Transfer Mechanism:
In all evaluated forms, C-BET transfers only the pre-trained policy logits (frozen), not the value head; no fine-tuning of the exploration policy head is performed during downstream transfer.
Implementation:
- Source code: https://github.com/sparisi/cbet/
- Framework: TorchBeast + IMPALA, large-scale parallel training (e.g., 40 actors for MiniGrid).
- Extrinsic and intrinsic discount factors are set separately (values per the reference implementation).
6. Recommendations, Failure Modes, and Future Directions
- Use intrinsic-strength scheduling: start with a high weighting for intrinsic rewards to encourage broad exploration, and anneal it toward zero for improved exploitation as learning progresses (see the sketch after this list).
- Consider “fractional transfer” (transferring only subsets of world-model parameters) to mitigate VRAM overhead in world-model agents.
- Attempt to match the intrinsic reward design to the structure of the downstream task (e.g., count-based novelty in hierarchical environments, prediction-error-based curiosity for highly stochastic dynamics).
- Reduce or omit C-BET in low-variance, goal-directed tasks to prevent harmful distractions.
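A minimal linear annealing schedule for the intrinsic weight, as suggested in the first recommendation above (the starting value is chosen for illustration only):

```python
def beta_schedule(step, total_steps, beta_start=0.005, beta_end=0.0):
    """Linearly anneal the intrinsic-reward weight from beta_start to beta_end (sketch)."""
    frac = min(step / total_steps, 1.0)
    return beta_start + frac * (beta_end - beta_start)
```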
In summary, C-BET operationalizes a structured, count-based novelty bonus over state transitions and enables transfer of exploratory behaviors to new tasks with minimal architectural changes. Its ability to improve exploration coverage and transfer performance surpasses baseline intrinsic-motivation approaches—especially in compositional, multitask, or under-explored domains—but care is required in narrowly defined, goal-oriented environments, and when integrating with resource-intensive world-model agents (Parisi et al., 2021, Ferrao et al., 26 Mar 2025).