
Count-Based Soft Q-Learning (CBSQL)

Updated 3 July 2026
  • The paper introduces CBSQL, a modification of Soft Q-Learning that uses state-dependent inverse-temperature parameters based on visit counts.
  • It replaces the fixed β value with an adaptive schedule, enabling high exploration in early training and focused exploitation later on.
  • Empirical results in tabular and deep RL settings show that CBSQL improves convergence and stability compared to standard SQL and DQN baselines.

Count-Based Soft Q-Learning (CBSQL) is an algorithmic modification of Soft Q-Learning (SQL) in the maximum entropy reinforcement learning (MaxEnt RL) paradigm. CBSQL replaces the conventional constant inverse-temperature parameter with a state-dependent schedule that adapts dynamically according to state visit counts or density-model pseudo-counts, resulting in an adaptive tradeoff between exploration and exploitation throughout the learning process (Hu et al., 2021).

1. MaxEnt RL, Soft Q-Learning, and the β Parameter

Traditional reinforcement learning optimizes the expected discounted reward:

$$\pi^* = \arg\max_\pi\, \mathbb{E}_{s\sim p_\pi^\gamma}\left[\mathbb{E}_{a\sim\pi(\cdot|s)}[r(s,a)]\right]$$

Maximum entropy RL augments this with an entropy bonus, formulated as:

$$\pi^* = \arg\max_\pi\, \mathbb{E}_{s\sim p_\pi^\gamma}\left[\,\mathbb{E}_{a\sim\pi(\cdot|s)}[r(s,a)] + \frac{1}{\beta} H[\pi(\cdot|s)]\,\right]$$

where $H[\pi(\cdot|s)] = -\sum_a \pi(a|s) \log \pi(a|s)$ is the policy entropy and $\beta > 0$ serves as an inverse temperature. As $\beta \rightarrow \infty$, the entropy term vanishes and standard RL is recovered; $\beta \rightarrow 0$ yields a uniform policy.
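The two limits can be checked numerically. The sketch below assumes the standard MaxEnt result, not stated above, that for fixed Q-values the entropy-regularized policy at a state is the softmax $\pi(a|s) \propto \exp(\beta Q(s,a))$. The Q-values and the function names are hypothetical.

```python
import numpy as np

def boltzmann_policy(q_values, beta):
    """Return pi(a) proportional to exp(beta * Q(a)) for a single state."""
    z = beta * np.asarray(q_values, dtype=float)
    z -= z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    """Shannon entropy H[p] = -sum_a p(a) log p(a), treating 0 log 0 as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-(p[nz] * np.log(p[nz])).sum())

q = [1.0, 2.0, 3.0]  # hypothetical Q-values for one state
for beta in (1e-3, 1.0, 1e3):
    pi = boltzmann_policy(q, beta)
    print(beta, np.round(pi, 3), entropy(pi))
```

For small $\beta$ the printed policy is close to uniform with entropy near $\log 3$; for large $\beta$ it concentrates on the highest-valued action and the entropy approaches zero.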

SQL applies a “soft” Bellman operator:

$$\mathcal{B}_\beta\, Q(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'}\left[\,\frac{1}{\beta} \log \sum_{a'} \exp(\beta Q(s',a'))\,\right]$$

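The scaled log-sum-exp is a soft maximum: as $\beta \to \infty$ it approaches the greedy backup $\max_{a'} Q(s',a')$. It equals the mellowmax operator plus $\frac{1}{\beta}\log|\mathcal{A}|$; in the mellowmax form, $\beta \to 0$ gives the average of $Q(s',\cdot)$, i.e. the backup of a uniformly random policy. The operator thus mediates between greedy and random value backups, giving an entropy-regularized update.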
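A minimal numerical sketch of the soft value inside the backup, assuming a single state with hypothetical Q-values. The mellowmax variant (averaging instead of summing inside the logarithm) is included for comparison; it differs from the log-sum-exp form only by the term $\frac{1}{\beta}\log|\mathcal{A}|$.

```python
import numpy as np

def soft_value(q_values, beta):
    """(1/beta) * log sum_a exp(beta * Q(a)), computed in a numerically stable way."""
    q = np.asarray(q_values, dtype=float)
    m = q.max()
    return float(m + np.log(np.exp(beta * (q - m)).sum()) / beta)

def mellowmax(q_values, beta):
    """(1/beta) * log mean_a exp(beta * Q(a)): soft_value minus log|A| / beta."""
    q = np.asarray(q_values, dtype=float)
    return soft_value(q, beta) - np.log(q.size) / beta

q = [1.0, 2.0, 3.0]  # hypothetical Q-values for one state
for beta in (1e-4, 1.0, 1e3):
    print(beta, soft_value(q, beta), mellowmax(q, beta))
```

Both functions approach the maximum Q-value for large $\beta$, while for small $\beta$ the mellowmax form approaches the mean of the Q-values.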

2. Motivation for State-Dependent Temperature Scheduling

Empirical and theoretical considerations indicate that the optimal entropy-regularization strategy is nonstationary and state-dependent. Early in training, Q-estimates are noisy; a high temperature (low $\beta$) promotes valuable exploration. As learning progresses and Q-estimates for frequently visited states become accurate, reducing the temperature (raising $\beta$) shrinks policy entropy, encouraging exploitation. This “confidence” in Q-values is inherently state-specific, tracking the effective evidence available. A fixed or globally annealed $\beta$ cannot properly capture this local learning progress, leading to undesirable tradeoffs in nonstationary or heterogeneous domains (Hu et al., 2021).

3. Count-Based Temperature Scheduling: Derivation and Properties

CBSQL introduces a statewise inverse-temperature parameter. With $n_t(s)$ the effective visit count for state $s$ at update $t$, the statewise parameter $\beta_t(s)$ is set as an increasing function of $n_t(s)$:

[equation missing in source]

with $c$ a tunable constant. For tabular domains, $n_t(s)$ is the Q-update count for $s$. In large or continuous state spaces, CBSQL employs pseudo-counts from a density model $\rho$:

$$\hat{N}_n(s) = \frac{\rho_n(s)\,\bigl(1 - \rho'_n(s)\bigr)}{\rho'_n(s) - \rho_n(s)}$$

where $\rho_n(s)$ is the model probability of $s$ after $n$ updates and $\rho'_n(s)$ is its probability after an additional update on $s$. This pseudo-count increases monotonically and behaves like a visit count.
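As a sanity check, the sketch below assumes the standard density-model pseudo-count $\rho(1-\rho')/(\rho'-\rho)$, with $\rho$ the model probability of a state before an extra update on it and $\rho'$ the probability after. The function name and the empirical-frequency model used for the check are illustrative assumptions, not part of the source.

```python
def pseudo_count(rho, rho_next):
    """Pseudo-count from the density rho(s) and the post-update density rho'(s)."""
    if not 0.0 < rho < rho_next < 1.0:
        raise ValueError("expected 0 < rho < rho_next < 1")
    return rho * (1.0 - rho_next) / (rho_next - rho)

# Check with an empirical-frequency model: a state seen N times in n steps
# has rho = N / n, and rho' = (N + 1) / (n + 1) after one more visit.
N, n = 3, 10
rho = N / n
rho_next = (N + 1) / (n + 1)
print(pseudo_count(rho, rho_next))  # approximately 3.0, the true count N
```

With an exact frequency model the pseudo-count recovers the true visit count up to rounding, which is why it can stand in for $n_t(s)$ when states are never revisited exactly.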

The corresponding temperature parameter is $1/\beta_t(s)$, where $\beta_t(s)$ is the statewise inverse temperature. As the visit count $n_t(s) \to \infty$ for recurrent states, $\beta_t(s) \to \infty$, $1/\beta_t(s) \to 0$, and $\frac{1}{\beta_t(s)} \log \sum_{a} \exp(\beta_t(s)\, Q(s,a)) \to \max_{a} Q(s,a)$, yielding greedy backups in well-known states. For novel or rarely visited states, the (pseudo-)count is small, maintaining high temperature and exploration.

4. Algorithmic Formulation and Soft Bellman Backup

With adaptive temperatures, the soft Bellman backup becomes:

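$$\mathcal{B}_t\, Q(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'}\left[\,\frac{1}{\beta_t(s')} \log \sum_{a'} \exp\bigl(\beta_t(s')\, Q(s',a')\bigr)\,\right]$$

or equivalently in terms of the temperature $\tau_t(s') = 1/\beta_t(s')$:

$$\mathcal{B}_t\, Q(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'}\left[\,\tau_t(s') \log \sum_{a'} \exp\bigl(Q(s',a')/\tau_t(s')\bigr)\,\right]$$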

The paper's pseudocode outlines the integration of count-based temperature scheduling into deep SQL, including experience replay, density model updates, and periodic target network synchronization. The key differentiators are the lines that compute the (pseudo-)count and update the inverse temperature $\beta$ accordingly before constructing the soft Q-target.
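A minimal tabular sketch of one such update follows; it is not the paper's pseudocode. Several choices are assumptions made for illustration: the schedule form `c * (n + 1)`, taking the inverse temperature from the next state's count, the learning rate, and all function names. The deep variant with replay, a density model, and target networks is omitted.

```python
import numpy as np

def soft_value(q_row, beta):
    """(1/beta) * log sum_a exp(beta * Q(a)) for one state's Q-values."""
    m = q_row.max()
    return m + np.log(np.exp(beta * (q_row - m)).sum()) / beta

def cbsql_update(Q, counts, s, a, r, s_next, done, schedule, gamma=0.99, lr=0.1):
    """One tabular soft Q-update with a count-based inverse temperature.

    `schedule` maps a visit count to beta; its form is an assumption here.
    """
    counts[s] += 1  # one more Q-update for the state being updated
    if done:
        target = r
    else:
        beta = schedule(counts[s_next])  # assumed: beta from the next state's count
        target = r + gamma * soft_value(Q[s_next], beta)
    Q[s, a] += lr * (target - Q[s, a])
    return Q[s, a]

# Hypothetical schedule: beta grows with the count; the +1 keeps beta > 0 at n = 0.
schedule = lambda n, c=0.5: c * (n + 1)

Q = np.zeros((5, 2))             # five states, two actions
counts = np.zeros(5, dtype=int)  # per-state update counts
cbsql_update(Q, counts, s=3, a=1, r=1.0, s_next=4, done=False, schedule=schedule)
print(Q[3, 1], counts)
```

Passing the schedule as a callable keeps the update rule independent of the particular count-to-temperature mapping, so a pseudo-count-based schedule could be swapped in without changing the update itself.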

5. Hyperparameters, Theoretical Considerations, and Practicalities

Major hyperparameters include:

  • $c$ (temperature-growth coefficient): controls the rate at which $\beta_t(s)$ increases. Typical working value is […].
  • Density model parameters for pseudo-count computation (e.g., context-tree switching (CTS) parameters and context depths), which regulate how rapidly pseudo-counts accumulate.
  • Standard deep RL settings (learning rate, replay buffer size, batch size, discount factor $\gamma$, $\epsilon$-greedy schedule, network architecture) follow DQN/SQL practices.

For any fixed $\beta$, the soft Bellman operator is a $\gamma$-contraction in the sup-norm, ensuring a unique fixed point. As the statewise $\beta_t(s)$ increases over time, the CBSQL operator interpolates between an entropy-dominated regime and the pure max-operator, resulting in stable early exploration and convergence to greedy optimality for recurrent states (Hu et al., 2021).
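The contraction property for a fixed $\beta$ can be illustrated numerically. The sketch below builds a small random MDP (sizes, seed, and parameter values are arbitrary assumptions) and compares the sup-norm distance between two Q-tables before and after one application of the soft Bellman operator.

```python
import numpy as np

def soft_bellman(Q, R, P, beta, gamma):
    """Apply the soft Bellman operator to a tabular Q of shape (S, A).

    R has shape (S, A); P has shape (S, A, S), with P[s, a] a distribution over s'.
    """
    m = Q.max(axis=1)
    V = m + np.log(np.exp(beta * (Q - m[:, None])).sum(axis=1)) / beta
    return R + gamma * (P @ V)  # (S, A, S) @ (S,) -> (S, A)

rng = np.random.default_rng(0)
S, A, gamma, beta = 6, 3, 0.9, 5.0
R = rng.normal(size=(S, A))
P = rng.random(size=(S, A, S))
P /= P.sum(axis=2, keepdims=True)  # normalize rows into transition distributions

Q1 = rng.normal(size=(S, A))
Q2 = rng.normal(size=(S, A))
before = np.abs(Q1 - Q2).max()
after = np.abs(soft_bellman(Q1, R, P, beta, gamma)
               - soft_bellman(Q2, R, P, beta, gamma)).max()
print(before, after, after <= gamma * before)
```

The distance shrinks by at least the factor $\gamma$ because the log-sum-exp is a non-expansion in the sup-norm and the expectation over next states cannot increase the gap.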

6. Empirical Evaluations and Comparative Performance

CBSQL was assessed in both tabular and deep RL settings:

Noisy Chain Toy Domain:

  • Five states linearly arranged, two actions per state, stochastic rewards.
  • Only the “go right” action at the terminal state yields positive reward.
  • CBSQL with true counts converged more rapidly and reliably to a near-optimal policy (by ≈100 episodes) than all fixed-$\beta$ SQL baselines and standard Q-learning.

Atari 2600 (Deep RL):

  • Six games: Breakout, Freeway, Pong, Q*bert, Seaquest, Space Invaders.
  • Preprocessing: grayscale 84×84 images, four-frame stack, reward clipping […], […].
  • Training: […] frames, $\epsilon$-greedy annealed to 0.1, Adam optimizer, replay buffer size […], batch size 32.
  • Baselines: DQN, SQL with $\beta = 100$ and $\beta = 1000$.

Summary performance (mean ± std over three seeds; metric: average reward over last 100 test episodes):

Game             DQN              SQL (β = 100)    SQL (β = 1000)   CBSQL
Breakout         5.9 ± 5.9        5.9 ± 4.5        5.1 ± 4.7        8.2 ± 6.1
Freeway          21.0 ± 1.5       14.6 ± 8.5       22.6 ± 4.7       25.8 ± 4.9
Pong             1.9 ± 2.6        17.8 ± 2.2       16.3 ± 2.7       17.6 ± 2.0
Q*bert           568.4 ± 1101     828 ± 1412       564.5 ± 1098     875.3 ± 1255
Seaquest         13.5 ± 24.1      4.0 ± 60.2       17.2 ± 24.0      84.6 ± 60.2
Space Invaders   132.7 ± 113      158.9 ± 128.5    132.3 ± 118.4    138.9 ± 112.8

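CBSQL achieved the highest mean score in four of the six games (Breakout, Freeway, Q*bert, Seaquest) and was close to the best fixed-$\beta$ SQL baseline on Pong; SQL with $\beta = 100$ scored higher on Space Invaders. The standard deviations are large relative to most of these gaps. Integration with Rainbow’s extensions (double DQN, prioritized replay, dueling networks, noisy nets, distributional RL, multi-step returns) yielded further performance increases (e.g., Breakout: CBSQL+Rainbow 39.9 vs Rainbow-DQN 10.2 after 500K frames).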

7. Empirical Observations: Ablation, Stability, and Exploration-Exploitation Tradeoff

The temperature $1/\beta_t(s)$ decays as the visit count $n_t(s)$ grows, providing high initial exploration (high temperature while $n_t(s)$ is small) and automatic decay as learning accrues. Empirical learning curves indicate that constant $\beta$ values set too high impede early learning, while CBSQL’s adaptive scheme finds an effective balance. An early high temperature stabilizes soft Q-updates by tempering the influence of noisy value estimates, and a later low temperature focuses updates on exploitation. CBSQL stabilizes learning trajectories while eliminating the need for extra schedule tuning or domain-specific parameters (Hu et al., 2021).
