Count-Based Soft Q-Learning (CBSQL)

Updated 3 July 2026

The paper introduces CBSQL, a modification of Soft Q-Learning that uses state-dependent inverse-temperature parameters based on visit counts.
It replaces the fixed β value with an adaptive schedule, enabling high exploration in early training and focused exploitation later on.
Empirical results in tabular and deep RL settings show that CBSQL improves convergence and stability compared to standard SQL and DQN baselines.

Count-Based Soft Q-Learning (CBSQL) is an algorithmic modification of Soft Q-Learning (SQL) in the maximum entropy reinforcement learning (MaxEnt RL) paradigm. CBSQL replaces the conventional constant inverse-temperature parameter with a state-dependent schedule that adapts dynamically according to state visit counts or density-model pseudo-counts, resulting in an adaptive tradeoff between exploration and exploitation throughout the learning process (Hu et al., 2021).

1. MaxEnt RL, Soft Q-Learning, and the β Parameter

Traditional reinforcement learning optimizes the expected discounted reward:

$\pi^* = \arg\max_\pi\, \mathbb{E}_{s\sim p_\pi^\gamma}\left[\mathbb{E}_{a\sim\pi(\cdot|s)}[r(s,a)]\right]$

Maximum entropy RL augments this with an entropy bonus, formulated as:

$\pi^* = \arg\max_\pi\, \mathbb{E}_{s\sim p_\pi^\gamma}\left[\,\mathbb{E}_{a\sim\pi(\cdot|s)}[r(s,a)] + \frac{1}{\beta} H[\pi(\cdot|s)]\,\right]$

where $H[\pi(\cdot|s)] = -\sum_a \pi(a|s) \log \pi(a|s)$, and $\beta > 0$ serves as an inverse temperature. As $\beta \rightarrow \infty$, the entropy term vanishes and standard RL is recovered; $\beta \rightarrow 0$ yields a uniform policy.

SQL applies a “soft” Bellman operator:

$\mathcal{B}_\beta\, Q(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'}\left[\,\frac{1}{\beta} \log \sum_{a'} \exp(\beta Q(s',a'))\,\right]$

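The scaled log-sum-exp is a soft maximum, closely related to the mellowmax operator (which uses a log-mean-exp instead). As $\beta \to \infty$ it approaches the greedy backup $\max_{a'} Q(s',a')$. As $\beta \to 0$ it approaches the uniform average of $Q(s',\cdot)$ plus an action-independent offset $\frac{1}{\beta}\log|\mathcal{A}|$, where $|\mathcal{A}|$ is the number of actions. The result is an entropy-regularized update.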
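As a minimal sketch of the soft backup term above, the following computes $\frac{1}{\beta}\log\sum_{a'}\exp(\beta Q(s',a'))$ with the usual max-shift for numerical stability. The function name `soft_value` and the example Q-values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def soft_value(q_values, beta):
    """Return (1/beta) * log(sum_a exp(beta * Q(s, a))) for one state.

    The maximum is subtracted before exponentiating so that large
    beta * Q values do not overflow.
    """
    z = beta * np.asarray(q_values, dtype=float)
    m = z.max()
    return (m + np.log(np.exp(z - m).sum())) / beta

# Hypothetical Q-values for a single state with three actions.
q = np.array([1.0, 2.0, 3.0])
for beta in (0.01, 1.0, 100.0):
    print(f"beta={beta:>6}: soft value = {soft_value(q, beta):.4f}")
```

For large `beta` the result is close to `max(q)`; for small `beta` it is close to `mean(q) + log(len(q)) / beta`, which matches the two limits discussed above.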

2. Motivation for State-Dependent Temperature Scheduling

Empirical and theoretical considerations indicate that the optimal entropy-regularization strategy is nonstationary and state-dependent. Early in training, Q-estimates are noisy; a high temperature (low $\beta$) promotes valuable exploration. As learning progresses and Q-estimates for frequently visited states become accurate, reducing the temperature (raising $\beta$) shrinks policy entropy, encouraging exploitation. This “confidence” in Q-values is inherently state-specific, tracking the effective evidence available. A fixed or globally annealed $\beta$ cannot properly capture this local learning progress, leading to undesirable tradeoffs in nonstationary or heterogeneous domains (Hu et al., 2021).

3. Count-Based Temperature Scheduling: Derivation and Properties

CBSQL introduces a statewise inverse-temperature parameter $\beta_t(s)$. With $N_t(s)$ the effective visit count for state $s$ at update $t$, the statewise parameter is

[display equation not recoverable from the source text: $\beta_t(s)$ is an increasing function of $N_t(s)$, scaled by $c$]

with $c$ a tunable constant. For tabular domains, $N_t(s)$ is the Q-update count for $s$. In large or continuous state spaces, CBSQL employs pseudo-counts $\hat{N}_t(s)$ from a density model $\rho$:

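$\hat{N}_t(s) = \frac{\rho_t(s)\,\bigl(1 - \rho'_t(s)\bigr)}{\rho'_t(s) - \rho_t(s)}$

Here, taking the standard density-model pseudo-count form, $\rho_t(s)$ is the density model's probability of $s$ before it is observed at update $t$, and $\rho'_t(s)$ is its probability immediately after one additional observation of $s$.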
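The following sketch illustrates how a pseudo-count can be read off a density model. It assumes the standard pseudo-count form $\rho(1-\rho')/(\rho'-\rho)$, where $\rho$ and $\rho'$ are the model's probabilities of a state before and after one more observation of it; whether CBSQL uses exactly this form is an assumption here. The helper names are hypothetical. The check uses a plain empirical-frequency model, for which the pseudo-count should recover the true count.

```python
def pseudo_count(rho, rho_next):
    """Pseudo-count from density-model probabilities.

    rho      -- probability of the state before observing it once more
    rho_next -- probability of the state after that extra observation
    Assumes a learning-positive model, i.e. 0 < rho < rho_next < 1.
    """
    if not 0.0 < rho < rho_next < 1.0:
        raise ValueError("expected 0 < rho < rho_next < 1")
    return rho * (1.0 - rho_next) / (rho_next - rho)

def empirical_probs(n_state, n_total):
    """Empirical-frequency density before/after one more visit (toy model)."""
    return n_state / n_total, (n_state + 1) / (n_total + 1)

# A state seen 3 times out of 10 observations in total.
rho, rho_next = empirical_probs(3, 10)
recovered = pseudo_count(rho, rho_next)
print(recovered)
```

With a learned density model such as CTS, the same function would be applied to the model's probabilities instead of empirical frequencies.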

4. Algorithmic Formulation and Soft Bellman Backup

With adaptive temperatures, the soft Bellman backup becomes:

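$\mathcal{B}_t\, Q(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'}\left[\,\frac{1}{\beta_t(s')} \log \sum_{a'} \exp\bigl(\beta_t(s')\, Q(s',a')\bigr)\,\right]$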

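[An equivalent form of this backup follows here in the source; neither the quantity it is expressed in nor the expression itself is recoverable from the source text.]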

The paper's pseudocode integrates count-based temperature scheduling into deep SQL, including experience replay, density model updates, and periodic target network synchronization. It differs from standard deep SQL in the steps that compute the (pseudo-)count and update the statewise $\beta$ before constructing the soft Q-target.
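A tabular version of this idea can be sketched as follows. This is not the paper's pseudocode: the schedule `beta = c * (count + 1)` is a hypothetical stand-in, since the exact schedule is not given in this text. Using the next state's count for the backup temperature, and the names `cbsql_update`, `c`, and `lr`, are likewise assumptions made for illustration.

```python
import numpy as np

def soft_value(q_row, beta):
    """Stable (1/beta) * log-sum-exp of beta * Q(s, .)."""
    z = beta * np.asarray(q_row, dtype=float)
    m = z.max()
    return (m + np.log(np.exp(z - m).sum())) / beta

def cbsql_update(Q, counts, s, a, r, s_next, done, c=1.0, gamma=0.99, lr=0.1):
    """One tabular soft Q-update with a count-dependent inverse temperature.

    ASSUMPTION: beta grows linearly with the visit count of the next state;
    the +1 keeps beta strictly positive for unvisited states.
    """
    beta_next = c * (counts[s_next] + 1.0)
    target = r if done else r + gamma * soft_value(Q[s_next], beta_next)
    Q[s, a] += lr * (target - Q[s, a])
    counts[s] += 1  # one more Q-update recorded for state s
    return target

# Two states, two actions, all estimates initially zero.
Q = np.zeros((2, 2))
counts = np.zeros(2)
t1 = cbsql_update(Q, counts, s=0, a=1, r=1.0, s_next=1, done=True)
t2 = cbsql_update(Q, counts, s=0, a=0, r=0.0, s_next=1, done=False)
print(Q, counts, t1, t2)
```

In a deep variant, `counts[s_next]` would be replaced by a pseudo-count from the density model and `Q[s_next]` by the target network's output.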

5. Hyperparameters, Theoretical Considerations, and Practicalities

Major hyperparameters include:

$c$ (temperature-growth coefficient): controls the rate at which the statewise $\beta$ increases with the visit count. The typical working value is not recoverable from the source text.
Density model parameters for pseudo-count computation (e.g., settings of the context-tree switching (CTS) model such as context depths), which regulate how rapidly pseudo-counts accumulate.
Standard deep RL settings (learning rate, replay buffer size, batch size, discount factor $\gamma$, $\epsilon$-greedy schedule, network architecture) follow DQN/SQL practices.

For any fixed $\beta$, the soft Bellman operator is a contraction (in sup-norm), ensuring a unique fixed point. As the statewise $\beta_t(s)$ increases over time, the CBSQL operator interpolates between an entropy-dominated regime and the pure max-operator, resulting in stable early exploration and convergence to greedy optimality for recurrent states (Hu et al., 2021).
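The contraction property for a fixed $\beta$ can be checked numerically on a small random MDP. The sketch below is only an illustration: the MDP size, the seed, and the names `soft_bellman`, `P`, and `R` are arbitrary assumptions. It applies the soft Bellman operator to two random Q-tables and compares their sup-norm distance before and after, then iterates the operator to approach its fixed point.

```python
import numpy as np

def soft_bellman(Q, R, P, beta, gamma):
    """Soft Bellman operator for a tabular MDP.

    Q, R: arrays of shape (S, A); P: transition tensor of shape (S, A, S).
    """
    z = beta * Q
    m = z.max(axis=1, keepdims=True)
    V = (m[:, 0] + np.log(np.exp(z - m).sum(axis=1))) / beta  # soft state values, shape (S,)
    return R + gamma * (P @ V)                                 # expectation over s'

rng = np.random.default_rng(0)
S, A, gamma, beta = 5, 2, 0.9, 2.0
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # normalize rows into probabilities
R = rng.normal(size=(S, A))

# Sup-norm distance between two arbitrary Q-tables, before and after one backup.
Q1, Q2 = rng.normal(size=(2, S, A))
before = np.abs(Q1 - Q2).max()
after = np.abs(soft_bellman(Q1, R, P, beta, gamma)
               - soft_bellman(Q2, R, P, beta, gamma)).max()
ratio = after / before

# Repeated application converges toward the unique fixed point.
Q = np.zeros((S, A))
for _ in range(500):
    Q = soft_bellman(Q, R, P, beta, gamma)
residual = np.abs(soft_bellman(Q, R, P, beta, gamma) - Q).max()
print(ratio, residual)
```

The ratio should not exceed `gamma`, and the residual after many iterations should be near machine precision.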

6. Empirical Evaluations and Comparative Performance

CBSQL was assessed in both tabular and deep RL settings:

Noisy Chain Toy Domain:

Five states linearly arranged, two actions per state, stochastic rewards.
Only the “go right” action at the terminal state yields positive reward.
CBSQL with true counts converged more rapidly and reliably to a near-optimal policy (by ≈100 episodes) than all fixed-$\beta$ SQL baselines and standard Q-learning.

Atari 2600 (Deep RL):

Six games: Breakout, Freeway, Pong, Q*bert, Seaquest, Space Invaders.
Preprocessing: grayscale 84×84 images, four-frame stack, reward clipping (two parameter values listed here are not recoverable from the source text).
Training: $\epsilon$-greedy annealed to 0.1, Adam optimizer, batch size 32 (the frame budget and replay buffer size are not recoverable from the source text).
Baselines: DQN, SQL with $\beta = 100$ and $\beta = 1000$.

Summary performance (mean $\pm$ std over three seeds; metric: average reward over last 100 test episodes):

Game	DQN	SQL(100)	SQL(1000)	CBSQL
Breakout	5.9 ± 5.9	5.9 ± 4.5	5.1 ± 4.7	8.2 ± 6.1
Freeway	21.0 ± 1.5	14.6 ± 8.5	22.6 ± 4.7	25.8 ± 4.9
Pong	1.9 ± 2.6	17.8 ± 2.2	16.3 ± 2.7	17.6 ± 2.0
Q*bert	568.4 ± 1101	828 ± 1412	564.5 ± 1098	875.3 ± 1255
Seaquest	13.5 ± 24.1	4.0 ± 60.2	17.2 ± 24.0	84.6 ± 60.2
Space Invaders	132.7 ± 113	158.9 ± 128.5	132.3 ± 118.4	138.9 ± 112.8

CBSQL achieved the highest mean score in four of the six games (Breakout, Freeway, Q*bert, Seaquest). SQL with $\beta = 100$ was marginally ahead on Pong (17.8 vs 17.6) and ahead on Space Invaders (158.9 vs 138.9), and most differences lie within one standard deviation. Integration with Rainbow’s extensions (double-DQN, prioritized replay, dueling nets, noisy nets, distributional, multi-step) yielded further performance increases (e.g., Breakout: CBSQL+Rainbow 39.9 vs Rainbow-DQN 10.2 after 500K frames).
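The per-game comparison can be read directly off the table. The short sketch below transcribes the reported mean scores (standard deviations omitted) and finds the best-scoring method for each game; the variable names are illustrative.

```python
# Mean scores transcribed from the summary table above.
scores = {
    "Breakout":       {"DQN": 5.9,   "SQL(100)": 5.9,   "SQL(1000)": 5.1,   "CBSQL": 8.2},
    "Freeway":        {"DQN": 21.0,  "SQL(100)": 14.6,  "SQL(1000)": 22.6,  "CBSQL": 25.8},
    "Pong":           {"DQN": 1.9,   "SQL(100)": 17.8,  "SQL(1000)": 16.3,  "CBSQL": 17.6},
    "Q*bert":         {"DQN": 568.4, "SQL(100)": 828.0, "SQL(1000)": 564.5, "CBSQL": 875.3},
    "Seaquest":       {"DQN": 13.5,  "SQL(100)": 4.0,   "SQL(1000)": 17.2,  "CBSQL": 84.6},
    "Space Invaders": {"DQN": 132.7, "SQL(100)": 158.9, "SQL(1000)": 132.3, "CBSQL": 138.9},
}

# Method with the highest mean score in each game.
best = {game: max(row, key=row.get) for game, row in scores.items()}
cbsql_wins = sum(method == "CBSQL" for method in best.values())

for game, method in best.items():
    print(f"{game}: {method}")
print("CBSQL best in", cbsql_wins, "of", len(scores), "games")
```

Because only three seeds are reported and the standard deviations are large, this ranking by mean is a coarse summary rather than a significance test.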

7. Empirical Observations: Ablation, Stability, and Exploration-Exploitation Tradeoff

Temperature decays as the visit count grows, providing high initial exploration (high temperature at low counts) and automatic decay as learning accrues. Empirical learning curves indicate that constant $\beta$ values set too high impede early learning, while CBSQL’s adaptive scheme finds an effective balance. An early high temperature (low $\beta$) stabilizes soft Q-updates by tempering the influence of noisy value estimates, and a later low temperature (high $\beta$) focuses updates on exploitation. CBSQL stabilizes learning trajectories while eliminating the need for extra schedule tuning or domain-specific parameters (Hu et al., 2021).
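The link between inverse temperature and exploration can be made concrete with the Boltzmann policy $\pi(a|s) \propto \exp(\beta Q(s,a))$. The sketch below uses hypothetical Q-values and an arbitrary grid of $\beta$ values to show the policy entropy falling from $\log|\mathcal{A}|$ toward zero as $\beta$ grows. How $\beta$ maps to visit counts is deliberately left out, since the exact schedule is not given in this text.

```python
import numpy as np

def boltzmann_policy(q_values, beta):
    """Softmax policy pi(a) proportional to exp(beta * Q(a)), computed stably."""
    z = beta * np.asarray(q_values, dtype=float)
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    """Shannon entropy in nats, skipping zero-probability actions."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

q = [1.0, 2.0, 3.0]            # hypothetical Q-values for one state
betas = [0.0, 0.1, 1.0, 10.0]  # low beta = high temperature
entropies = [entropy(boltzmann_policy(q, b)) for b in betas]

for b, h in zip(betas, entropies):
    print(f"beta={b:>5}: entropy={h:.4f}")
```

At `beta = 0` the policy is uniform and the entropy equals `log(3)`; as `beta` increases the policy concentrates on the highest-valued action.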


References

Hu et al. (2021). Count-Based Temperature Scheduling for Maximum Entropy Reinforcement Learning.
