Count-Based Soft Q-Learning (CBSQL)
- The paper introduces CBSQL, a modification of Soft Q-Learning that uses state-dependent inverse-temperature parameters based on visit counts.
- It replaces the fixed β value with an adaptive schedule, enabling high exploration in early training and focused exploitation later on.
- Empirical results in tabular and deep RL settings show that CBSQL improves convergence and stability compared to standard SQL and DQN baselines.
Count-Based Soft Q-Learning (CBSQL) is an algorithmic modification of Soft Q-Learning (SQL) in the maximum entropy reinforcement learning (MaxEnt RL) paradigm. CBSQL replaces the conventional constant inverse-temperature parameter with a state-dependent schedule that adapts dynamically according to state visit counts or density-model pseudo-counts, resulting in an adaptive tradeoff between exploration and exploitation throughout the learning process (Hu et al., 2021).
1. MaxEnt RL, Soft Q-Learning, and the β Parameter
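Traditional reinforcement learning optimizes the expected discounted reward:
$$\max_\pi \; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\right].$$
Maximum entropy RL augments this with an entropy bonus, formulated as:
$$\max_\pi \; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) + \frac{1}{\beta}\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right)\right],$$
where $\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\sum_a \pi(a \mid s) \log \pi(a \mid s)$ is the policy entropy, and $\beta$ serves as an inverse temperature. As $\beta \to \infty$, the entropy term vanishes and standard RL is recovered; $\beta \to 0$ yields a uniform policy.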
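SQL applies a “soft” Bellman operator:
$$Q(s,a) \leftarrow r(s,a) + \gamma\, \mathbb{E}_{s'}\!\left[\frac{1}{\beta} \log \frac{1}{|\mathcal{A}|} \sum_{a'} \exp\big(\beta\, Q(s',a')\big)\right].$$
The log–sum–exp term, normalized by the number of actions $|\mathcal{A}|$, is the mellowmax operator. It mediates between greedy ($\beta \to \infty$) and random ($\beta \to 0$) value backups, giving an entropy-regularized update.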
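As a minimal sketch of the two limits, the snippet below evaluates a mellowmax backup numerically. It assumes the action-averaged form $\frac{1}{\beta}\log\frac{1}{|\mathcal{A}|}\sum_{a'}\exp(\beta Q(s',a'))$; the function names `mellowmax` and `soft_target` are illustrative and do not come from the paper.

```python
import numpy as np

def mellowmax(q, beta):
    """(1/beta) * log(mean(exp(beta * q))), computed stably for beta > 0."""
    q = np.asarray(q, dtype=float)
    m = q.max()
    # Subtracting the max before exponentiating avoids overflow for large beta.
    return float(m + np.log(np.mean(np.exp(beta * (q - m)))) / beta)

def soft_target(reward, gamma, next_q, beta):
    """Soft Bellman target r + gamma * mellowmax_beta(Q(s', .))."""
    return reward + gamma * mellowmax(next_q, beta)

q_next = np.array([1.0, 2.0, 4.0])
near_mean = mellowmax(q_next, 1e-6)  # small beta: close to the mean of q_next
near_max = mellowmax(q_next, 1e3)    # large beta: close to the max of q_next
```

Intermediate values of `beta` give a backup strictly between the mean and the max of the next-state Q-values.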
2. Motivation for State-Dependent Temperature Scheduling
Empirical and theoretical considerations indicate that the optimal entropy-regularization strategy is nonstationary and state-dependent. Early in training, Q-estimates are noisy; a high temperature (low $\beta$) promotes valuable exploration. As learning progresses and Q-estimates for frequently visited states become accurate, reducing the temperature (raising $\beta$) shrinks policy entropy, encouraging exploitation. This “confidence” in Q-values is inherently state-specific, tracking the effective evidence available. A fixed or globally annealed $\beta$ cannot properly capture this local learning progress, leading to undesirable tradeoffs in nonstationary or heterogeneous domains (Hu et al., 2021).
3. Count-Based Temperature Scheduling: Derivation and Properties
CBSQL introduces a statewise inverse-temperature parameter $\beta_t(s)$. With $N_t(s)$ the effective visit count for state $s$ at update $t$, $\beta_t(s)$ grows with $N_t(s)$ at a rate set by a tunable constant $c$. For tabular domains, $N_t(s)$ is the Q-update count for $s$. In large or continuous state spaces, CBSQL employs pseudo-counts from a density model $\rho$:
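$$\hat{N}_t(s) = \frac{\rho_t(s)\,\big(1 - \rho'_t(s)\big)}{\rho'_t(s) - \rho_t(s)},$$
where $\rho_t(s)$ is the model probability of $s$ after $t$ updates and $\rho'_t(s)$ is its probability after an additional update on $s$. This pseudo-count increases monotonically and behaves like a visit count.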
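The corresponding temperature parameter is $\tau_t(s) = 1/\beta_t(s)$. As $N_t(s) \to \infty$ for recurrent states, $\beta_t(s) \to \infty$, $\tau_t(s) \to 0$, and the soft backup approaches $\max_{a} Q(s,a)$, yielding greedy backups in well-known states. For novel or rarely visited states, the pseudo-count is small, maintaining high temperature and exploration.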
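The following sketch computes a pseudo-count from two density values and maps a count to a temperature. The pseudo-count uses the standard density-ratio form; the square-root schedule $\beta = c\sqrt{N}$ is an assumption made only for illustration (any increasing function of the count fits the description above), and all function names are hypothetical.

```python
import math

def pseudo_count(rho, rho_next):
    """Pseudo-count from the density of s before (rho) and after (rho_next)
    one additional model update on s."""
    if not 0.0 < rho < rho_next < 1.0:
        raise ValueError("expected 0 < rho < rho_next < 1")
    return rho * (1.0 - rho_next) / (rho_next - rho)

def inverse_temperature(count, c=1.0):
    # ASSUMPTION: square-root growth, chosen for illustration only.
    return c * math.sqrt(count)

def temperature(count, c=1.0):
    """tau = 1 / beta; infinite for a state that has never been counted."""
    beta = inverse_temperature(count, c)
    return math.inf if beta == 0.0 else 1.0 / beta

# Sanity check with an empirical-frequency density: a state seen 3 times in
# 10 steps has rho = 3/10 and, after one more visit, rho_next = 4/11.
n_hat = pseudo_count(3 / 10, 4 / 11)  # recovers a count of about 3
```

With an empirical-frequency density model the pseudo-count reduces to the true visit count, which is why it can stand in for $N_t(s)$ in large state spaces.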
4. Algorithmic Formulation and Soft Bellman Backup
With adaptive temperatures, the soft Bellman backup becomes:
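$$Q(s,a) \leftarrow r(s,a) + \gamma\, \mathbb{E}_{s'}\!\left[\frac{1}{\beta_t(s')} \log \frac{1}{|\mathcal{A}|} \sum_{a'} \exp\big(\beta_t(s')\, Q(s',a')\big)\right],$$
or equivalently in terms of $\tau_t(s') = 1/\beta_t(s')$:
$$Q(s,a) \leftarrow r(s,a) + \gamma\, \mathbb{E}_{s'}\!\left[\tau_t(s') \log \frac{1}{|\mathcal{A}|} \sum_{a'} \exp\!\left(\frac{Q(s',a')}{\tau_t(s')}\right)\right].$$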
The paper's pseudocode integrates count-based temperature scheduling into deep SQL, including experience replay, density model updates, and periodic target network synchronization. The key differences from standard SQL are the lines that compute the (pseudo-)count and update $\beta_t$ accordingly before constructing the soft Q-target.
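A tabular analogue of that loop body can be sketched as follows. This is not the paper's pseudocode: it omits replay, the density model, and target networks. It assumes a square-root schedule $\beta = c\sqrt{N}$ evaluated at the next state and an action-averaged (mellowmax) backup; `cbsql_update` and its parameters are hypothetical names.

```python
import numpy as np

def mellowmax(q, beta):
    """(1/beta) * log(mean(exp(beta * q))); the plain mean when beta == 0."""
    q = np.asarray(q, dtype=float)
    if beta == 0.0:
        return float(q.mean())  # beta -> 0 limit: uniform-random backup
    m = q.max()
    return float(m + np.log(np.mean(np.exp(beta * (q - m)))) / beta)

def cbsql_update(Q, counts, s, a, r, s_next, done, c=1.0, gamma=0.99, lr=0.1):
    """One tabular update. Q has shape [n_states, n_actions], counts [n_states]."""
    # ASSUMPTION: square-root schedule, using the count of the next state.
    beta = c * np.sqrt(counts[s_next])
    soft_value = 0.0 if done else mellowmax(Q[s_next], beta)
    target = r + gamma * soft_value
    Q[s, a] += lr * (target - Q[s, a])
    counts[s] += 1  # N(s): number of Q-updates performed for state s
    return target

Q = np.zeros((2, 2))
Q[1] = [0.0, 2.0]
counts = np.zeros(2)
# Unvisited next state: beta = 0, so the backup averages Q[1].
first_target = cbsql_update(Q, counts, 0, 0, 1.0, 1, False, gamma=0.5, lr=1.0)
```

Once `counts[s_next]` is large, the same call backs up a value close to `Q[s_next].max()`, which is the greedy limit described in Section 3.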
5. Hyperparameters, Theoretical Considerations, and Practicalities
Major hyperparameters include:
- $c$ (temperature-growth coefficient): Controls the rate at which $\beta_t(s)$ increases with the count.
- Density model parameters for pseudo-count computation (e.g., context-tree switching (CTS) model settings, context depths), which regulate how rapidly pseudo-counts accumulate.
- Standard deep RL settings (learning rate $\alpha$, replay buffer size, batch size, discount factor $\gamma$, $\epsilon$-greedy schedule, network architecture) follow DQN/SQL practices.
For any fixed $\beta$, the soft Bellman operator is a contraction (in sup-norm), ensuring a unique fixed point. As the statewise $\beta_t(s)$ increases over time, the CBSQL operator interpolates between an entropy-dominated regime and the pure max-operator, resulting in stable early exploration and convergence to greedy optimality for recurrent states (Hu et al., 2021).
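A brief derivation sketch of the contraction property for fixed $\beta$. The notation $\mathrm{mm}_\beta$ for the action-averaged log-sum-exp and $\mathcal{T}_\beta$ for the soft Bellman operator is introduced here for convenience. The argument relies only on $\mathrm{mm}_\beta$ being monotone and commuting with the addition of a constant:

```latex
\begin{aligned}
&\text{Let } \delta = \|Q_1 - Q_2\|_\infty, \text{ so that } Q_1(s',a') \le Q_2(s',a') + \delta \text{ for all } (s',a'). \\
&\mathrm{mm}_\beta\big(Q_1(s',\cdot)\big)
  \le \mathrm{mm}_\beta\big(Q_2(s',\cdot) + \delta\big)
  = \mathrm{mm}_\beta\big(Q_2(s',\cdot)\big) + \delta, \\
&\text{and symmetrically with } Q_1, Q_2 \text{ exchanged, hence }
  \big|\mathrm{mm}_\beta(Q_1(s',\cdot)) - \mathrm{mm}_\beta(Q_2(s',\cdot))\big| \le \delta. \\
&\big|(\mathcal{T}_\beta Q_1)(s,a) - (\mathcal{T}_\beta Q_2)(s,a)\big|
  = \gamma \,\Big|\mathbb{E}_{s'}\big[\mathrm{mm}_\beta(Q_1(s',\cdot)) - \mathrm{mm}_\beta(Q_2(s',\cdot))\big]\Big|
  \le \gamma\,\delta .
\end{aligned}
```

Taking the supremum over $(s,a)$ gives $\|\mathcal{T}_\beta Q_1 - \mathcal{T}_\beta Q_2\|_\infty \le \gamma \|Q_1 - Q_2\|_\infty$, so for $\gamma < 1$ the fixed point is unique by the Banach fixed-point theorem.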
6. Empirical Evaluations and Comparative Performance
CBSQL was assessed in both tabular and deep RL settings:
Noisy Chain Toy Domain:
- Five states linearly arranged, two actions per state, stochastic rewards.
- Only the “go right” action at the terminal state yields positive reward.
- CBSQL with true counts converged more rapidly and reliably to a near-optimal policy (by ≈100 episodes) than all fixed-$\beta$ SQL baselines and standard Q-learning.
Atari 2600 (Deep RL):
- Six games: Breakout, Freeway, Pong, Q*bert, Seaquest, Space Invaders.
- Preprocessing: grayscale 84×84 images, four-frame stack, reward clipping.
- Training: $\epsilon$-greedy exploration annealed to 0.1, Adam optimizer, experience replay, batch size 32.
- Baselines: DQN, SQL with $\beta = 100$ and $\beta = 1000$.
Summary performance (mean ± std over three seeds; metric: average reward over last 100 test episodes):
| Game | DQN | SQL(100) | SQL(1000) | CBSQL |
|---|---|---|---|---|
| Breakout | 5.9 ± 5.9 | 5.9 ± 4.5 | 5.1 ± 4.7 | 8.2 ± 6.1 |
| Freeway | 21.0 ± 1.5 | 14.6 ± 8.5 | 22.6 ± 4.7 | 25.8 ± 4.9 |
| Pong | 1.9 ± 2.6 | 17.8 ± 2.2 | 16.3 ± 2.7 | 17.6 ± 2.0 |
| Q*bert | 568.4 ± 1101 | 828 ± 1412 | 564.5 ± 1098 | 875.3 ± 1255 |
| Seaquest | 13.5 ± 24.1 | 4.0 ± 60.2 | 17.2 ± 24.0 | 84.6 ± 60.2 |
| Space Invaders | 132.7 ± 113 | 158.9 ± 128.5 | 132.3 ± 118.4 | 138.9 ± 112.8 |
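CBSQL achieved the highest mean score in four of the six games (Breakout, Freeway, Q*bert, Seaquest). On Pong it was within 0.2 of the best fixed-$\beta$ SQL baseline, and on Space Invaders it ranked second behind SQL(100). The standard deviations are large relative to most of these gaps, so the ranking should be read with that in mind. Integration with Rainbow’s extensions (double-DQN, prioritized replay, dueling nets, noisy nets, distributional, multi-step) yielded further performance increases (e.g., Breakout: CBSQL+Rainbow 39.9 vs Rainbow-DQN 10.2 after 500K frames).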
7. Empirical Observations: Ablation, Stability, and Exploration-Exploitation Tradeoff
The temperature $\tau_t(s) = 1/\beta_t(s)$ decays as the visit count grows, providing high initial exploration ($\tau_t(s) \to \infty$ as $N_t(s) \to 0$) and automatic decay as learning accrues. Empirical learning curves indicate that constant $\beta$ values set too high impede early learning, while CBSQL’s adaptive scheme finds an effective balance. An early high $\tau$ stabilizes soft Q-updates by tempering the influence of noisy value estimates, and a later low $\tau$ focuses updates on exploitation. CBSQL stabilizes learning trajectories while eliminating the need for extra schedule tuning or domain-specific parameters (Hu et al., 2021).