
Stable Discrete SAC: Robust RL for Discrete Actions

Updated 10 April 2026
  • Stable Discrete SAC (SDSAC) is a reinforcement learning framework that extends the SAC method to discrete action spaces with enhanced stability and bias reduction.
  • It employs double-average Q-learning, explicit entropy penalties, and a Q-clip mechanism to mitigate issues such as Q-underestimation and entropy instability.
  • Empirical evaluations show that SDSAC achieves state-of-the-art performance and improved sample efficiency on benchmarks like Atari games and MOBA environments.

Stable Discrete Soft Actor-Critic (SDSAC) refers to a family of reinforcement learning algorithms that generalize the Soft Actor-Critic (SAC) framework to discrete action spaces, with algorithmic enhancements to robustly address instability and bias issues inherent in direct SAC analogues. The methodology integrates entropy-regularized policy optimization, double Q-learning adaptations, policy parameterization for categorical actions, and targeted stabilization mechanisms such as Q-clip and entropy-penalty, yielding state-of-the-art performance and sample efficiency in challenging discrete benchmarks such as the Atari suite and large-scale MOBA environments (Christodoulou, 2019, Zhou et al., 2022, Zhang et al., 2024).

1. Discrete SAC: Motivation and Instability in Vanilla Formulations

While SAC was originally proposed for continuous action spaces, its extension to discrete domains uncovered unique challenges. In particular, direct adoption of "clipped double Q-learning" (taking the min between target Q-networks) in the discrete setting led to excessive underestimation of Q-values; policies optimized under such pessimistic critics often perform suboptimally or even diverge. Additionally, the entropy bonus, central to the SAC paradigm, when left unconstrained in the discrete Bellman backup, introduced large oscillations in policy entropy, further undermining stability (Zhou et al., 2022).

These phenomena are summarized as:

  • Q-underestimation: Policy improvement step chases low Q-values due to over-pessimism from clipped minima, collapsing learning.
  • Entropy instability: Uncoupled entropy terms in the Bellman targets cause large, unpredictable swings in policy stochasticity and learning curves (Zhou et al., 2022, Christodoulou, 2019).

Empirical investigations established that without algorithmic modifications, vanilla discrete SAC is often less robust than DQN, C51, or Rainbow, especially in high-dimensional, stochastic environments (Zhou et al., 2022, Christodoulou, 2019).

2. Core Algorithmic Modifications for Stability

SDSAC introduces three primary modifications to the discrete-action SAC framework to ensure stable propagation of value estimates and policy gradients:

2.1 Double-Average Q-Learning

Rather than employing the clipped minimum of two target Q-networks as in continuous SAC, SDSAC averages their outputs:

\bar{Q}_{\mathrm{tgt}}(s',a') = \tfrac{1}{2}\left( Q_{\bar\psi_1}(s',a') + Q_{\bar\psi_2}(s',a') \right)

Averaging mitigates the chronic underestimation bias of the min operator and produces more optimistic, yet bounded, targets. Ablations indicate that this adjustment alone prevents most policy collapses observed with clipped-min (Zhou et al., 2022).
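
The contrast between the two target rules is easy to see concretely. Below is a minimal PyTorch sketch, assuming each target critic maps a batch of states to per-action Q-values; the function and variable names are illustrative, not taken from the cited papers.

```python
import torch

def averaged_target(q_tgt1: torch.Tensor, q_tgt2: torch.Tensor) -> torch.Tensor:
    """Double-average target: the mean of the two target critics' outputs.

    Both inputs are [batch, n_actions] tensors of Q-values at the next state.
    """
    return 0.5 * (q_tgt1 + q_tgt2)

def clipped_min_target(q_tgt1: torch.Tensor, q_tgt2: torch.Tensor) -> torch.Tensor:
    """Clipped double-Q target: the continuous-SAC default that SDSAC replaces."""
    return torch.minimum(q_tgt1, q_tgt2)

# Toy illustration: wherever the critics disagree, the min is systematically
# pessimistic, while the average stays centred between the two estimates.
q1 = torch.tensor([[1.0, 2.0, 0.5]])
q2 = torch.tensor([[0.8, 2.4, 0.1]])
print(averaged_target(q1, q2))     # tensor([[0.9000, 2.2000, 0.3000]])
print(clipped_min_target(q1, q2))  # tensor([[0.8000, 2.0000, 0.1000]])
```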

2.2 Explicit Entropy Penalty in the Bellman Target

To control entropy dynamics, an additional entropy penalty is incorporated directly into the Bellman backup:

y = r - \lambda\, H(\pi(\cdot|s')) + \gamma \sum_{a'} \pi_\theta(a'|s') \left[ \bar{Q}_{\mathrm{tgt}}(s',a') - \alpha \log \pi_\theta(a'|s') \right]

Here H(\pi(\cdot|s')) denotes the policy entropy at s', and \lambda is a tunable weight. This decouples entropy control from the implicit regularizer, producing smoother Q-value evolution and less variance in policy entropy (Zhou et al., 2022).
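
As a concrete rendering of this backup, the sketch below computes y for a batch of transitions in the same notation; the terminal-state masking via (1 - done) is standard practice rather than part of the formula above, and the hyperparameter values are placeholders, not the papers' settings.

```python
import torch

def sdsac_target(r: torch.Tensor, done: torch.Tensor,
                 probs_next: torch.Tensor, log_probs_next: torch.Tensor,
                 q_avg_next: torch.Tensor,
                 gamma: float = 0.99, alpha: float = 0.2,
                 lam: float = 0.1) -> torch.Tensor:
    """Entropy-penalized soft Bellman target y.

    r, done: [batch]; the remaining tensors are [batch, n_actions].
    q_avg_next is the double-averaged target Q at the next state.
    """
    # H(pi(.|s')): exact categorical entropy, no sampling required.
    entropy = -(probs_next * log_probs_next).sum(dim=-1)
    # Exact expectation over a' of [Q_tgt(s', a') - alpha * log pi(a'|s')].
    soft_value = (probs_next * (q_avg_next - alpha * log_probs_next)).sum(dim=-1)
    return r - lam * entropy + gamma * (1.0 - done) * soft_value
```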

2.3 Q-Clip Mechanism

A Q-clip bound restricts the per-update change in Q-values:

\delta = y - Q_{\psi_k}(s,a)

\tilde{y} = Q_{\psi_k}(s,a) + \mathrm{clip}(\delta, -\Delta, +\Delta)

With \Delta > 0 set small, the clipped target \tilde{y} ensures the temporal-difference error per update is bounded, dramatically reducing the risk of Q-value explosion or instability (Zhou et al., 2022).
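
A minimal sketch of the clipping step, under the same assumptions as the previous snippets; delta_max stands in for \Delta, whose published value is not reproduced here.

```python
import torch
import torch.nn.functional as F

def qclip_target(y: torch.Tensor, q_current: torch.Tensor,
                 delta_max: float = 1.0) -> torch.Tensor:
    """Bound the per-update TD error by clipping the target around Q_{psi_k}.

    y, q_current: [batch] tensors (raw targets and current Q for sampled actions).
    """
    delta = y - q_current
    clipped = q_current + torch.clamp(delta, -delta_max, delta_max)
    return clipped.detach()  # targets never carry gradients

# Usage inside the critic update, where q_pred is the live prediction:
# loss_k = F.mse_loss(q_pred, qclip_target(y, q_pred))
```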

3. Objective Functions, Policy Parameterization, and Update Rules

The maximum-entropy RL objective is retained, adapted for discrete action spaces and categorical policy heads.

Discrete Policy

Policies are parameterized as categorical distributions:

\pi_\theta(a|s) = \frac{\exp f_\theta(s,a)}{\sum_{a''} \exp f_\theta(s,a'')}

with f_\theta(s,a) a learned preference (logit) over actions. The policy loss used is:

J_\pi(\theta) = \mathbb{E}_{s \sim \mathcal{D}} \left[ \sum_a \pi_\theta(a|s) \left( \alpha \log \pi_\theta(a|s) - Q_\psi(s,a) \right) \right]

The policy gradient employs the score-function estimator (REINFORCE with baseline), since discrete actions preclude reparameterization:

\nabla_\theta J_\pi(\theta) = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \left( \alpha \log \pi_\theta(a|s) - Q_\psi(s,a) + b(s) \right) \right]

where b(s) is a baseline, typically the expected Q-value under the old policy, b(s) = \sum_a \pi_{\mathrm{old}}(a|s)\, Q_\psi(s,a), used for variance reduction (Zhang et al., 2024).
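
A hedged sketch of this estimator for a categorical policy head follows; the baseline choice (the expected Q under the current policy) matches the description above, while shapes and names are illustrative.

```python
import torch

def policy_loss_score_function(logits: torch.Tensor, q_values: torch.Tensor,
                               alpha: float = 0.2) -> torch.Tensor:
    """Score-function (REINFORCE-with-baseline) policy loss.

    logits, q_values: [batch, n_actions]; alpha is the temperature.
    """
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()            # a ~ pi_theta(.|s)
    log_prob = dist.log_prob(actions)  # log pi_theta(a|s), differentiable

    with torch.no_grad():
        # Baseline b(s): expected Q under the current policy.
        baseline = (dist.probs * q_values).sum(dim=-1)
        q_a = q_values.gather(1, actions.unsqueeze(-1)).squeeze(-1)
        # Per-sample signal: alpha * log pi(a|s) - (Q(s,a) - b(s)).
        signal = alpha * log_prob - (q_a - baseline)

    # Surrogate whose gradient is the score-function estimator above.
    return (log_prob * signal).mean()
```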

Soft Bellman Backup

The Q-value update for each critic Q_{\psi_k}, k \in \{1, 2\}, is given by the mean-squared error to the clipped target \tilde{y} above. Policy and temperature parameters are updated via stochastic gradients on their respective objectives.

Temperature / Entropy Tuning

The temperature \alpha is either:

  • Fixed (with an entropy bonus \lambda annealed to zero over late training epochs), or
  • Adaptively optimized with

J(\alpha) = \mathbb{E}_{s \sim \mathcal{D}} \left[ \sum_a \pi_\theta(a|s) \left( -\alpha \left( \log \pi_\theta(a|s) + \bar{H} \right) \right) \right]

where \bar{H} is a target entropy (often set as a fixed fraction of \log |\mathcal{A}|, the entropy of the uniform distribution), and the update is by gradient descent (Zhang et al., 2024, Christodoulou, 2019).
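
A sketch of the adaptive variant, assuming a log-parameterized temperature (a common implementation convention, not necessarily the papers' exact choice):

```python
import torch

def temperature_loss(log_alpha: torch.Tensor, probs: torch.Tensor,
                     log_probs: torch.Tensor,
                     target_entropy: float) -> torch.Tensor:
    """Adaptive temperature objective, using exact action expectations.

    probs, log_probs: [batch, n_actions] from the current policy.
    Minimizing this raises alpha when policy entropy falls below the target
    and lowers it when entropy overshoots.
    """
    entropy = -(probs * log_probs).sum(dim=-1)  # H(pi(.|s)) per state
    return (log_alpha.exp() * (entropy - target_entropy).detach()).mean()

# Illustrative setup (values are placeholders):
# n_actions = 18
# target_entropy = 0.98 * torch.log(torch.tensor(float(n_actions))).item()
# log_alpha = torch.zeros(1, requires_grad=True)
# opt_alpha = torch.optim.Adam([log_alpha], lr=3e-4)
```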

4. Training Details, Hyperparameters, and Network Architectures

The following summarizes key practical details, reported hyperparameters, and implementation guidance:

Component summary (per (Zhou et al., 2022, Zhang et al., 2024, Christodoulou, 2019)):

  • Q-networks: two critics, each outputting one value per action (|\mathcal{A}| values per state); optimized with AdamW.
  • Policy net: a linear |\mathcal{A}|-way softmax head atop the shared backbone.
  • Backbone: Impala-CNN at ×4 width; 2 linear layers for the Q-heads, 1 for the policy head.
  • Entropy bonus: \lambda linearly annealed to zero over the course of training updates.
  • Target update: Polyak averaging of the online critics into the target critics.
  • Batch size, buffer: batch size 64 with experience replay.
  • Discount, n-step: discount \gamma annealed over training; multi-step (n-step) returns.
  • Replay ratio (RR): RR=2 (baseline, high efficiency); RR=4 or RR=8 (scaling).
  • Reward clipping: applied, per the standard Atari convention.

Policy evaluation uses action sampling; training uses exact action expectations for critic/temperature objectives to minimize variance (Zhang et al., 2024, Christodoulou, 2019).
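
For orientation, the settings that survive from the table above can be collected into one configuration object; the sketch below uses explicit placeholders wherever the sources' exact values are not reproduced here, and is not a faithful replication of any cited setup.

```python
from dataclasses import dataclass

@dataclass
class SDSACConfig:
    """Recoverable settings from Section 4; unlisted defaults are placeholders."""
    n_actions: int = 18              # |A|; 18 for the full Atari action set
    batch_size: int = 64             # as reported above
    replay_ratio: int = 2            # RR=2 baseline; 4 or 8 for scaling runs
    backbone: str = "impala_cnn_x4"  # Impala-CNN at x4 width
    gamma: float = 0.99              # placeholder; annealed in (Zhang et al., 2024)
    alpha: float = 0.2               # temperature (placeholder if learned instead)
    entropy_penalty: float = 0.1     # lambda; linearly annealed to zero
    q_clip_delta: float = 1.0        # Delta; placeholder
    polyak_tau: float = 0.005        # placeholder smoothing coefficient
```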

5. Empirical Evaluation and Ablation Insights

SDSAC evaluations span Atari 2600 (28 games) and high-dimensional, macro-action MOBA environments (Zhou et al., 2022, Zhang et al., 2024). Key findings across multiple studies:

  • Performance: SDSAC matches or outperforms Rainbow, DQN, and C51 baselines, with sizable improvements over vanilla discrete SAC in select games (e.g., Pong, Breakout, Q*bert) (Zhou et al., 2022).
  • Sample efficiency: On Atari, untuned SDSAC achieves near state-of-the-art performance within a limited interaction budget, with strong IQM scores at RR=2 and further gains at RR=8 for high-throughput variants (Zhang et al., 2024).
  • Variance ablations: Removing the variance-reduction baseline or Q-clip causes severe performance drops, including catastrophic value divergence in a substantial fraction of runs (Zhang et al., 2024, Zhou et al., 2022).
  • Entropy penalty removal: Leads to markedly slower learning and more pronounced oscillations in value targets.
  • Double-average ablation: Reverting to clipped-min reduces scores in several games; the underestimation bias is empirically confirmed (Zhou et al., 2022).
  • Scaling and efficiency: At RR=2, SDSAC reaches super-human IQM in a fraction of the runtime of Rainbow-based agents at RR=8 (Zhang et al., 2024).

6. Theoretical Guarantees and Broader Implications

  • Contraction properties: The soft Bellman operator remains a \gamma-contraction in sup-norm under the discrete-action setup, guaranteeing existence and uniqueness of fixed points (Christodoulou, 2019); a compact statement is sketched after this list.
  • Bias-variance tradeoff: Double-average Q-learning corrects the pessimism of clipped-min, while entropy penalty and Q-clip jointly decouple bias from variance amplification, yielding fast and stable convergence (Zhou et al., 2022).
  • Variance reduction: Score-function gradient estimators benefit critically from baseline subtraction in high-dimensional, discrete policies; ablations confirm necessity for non-trivial performance (Zhang et al., 2024).
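
A compact statement of the contraction property referenced in the first bullet, in standard notation (the textbook argument, restated rather than derived anew):

```latex
% Soft Bellman operator for a fixed categorical policy \pi:
\[
(\mathcal{T}^{\pi} Q)(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'}\!\left[
  \sum_{a'} \pi(a'|s')\bigl( Q(s',a') - \alpha \log \pi(a'|s') \bigr) \right].
\]
% The entropy term is independent of Q, so it cancels in the difference:
\[
\| \mathcal{T}^{\pi} Q_1 - \mathcal{T}^{\pi} Q_2 \|_{\infty}
  \le \gamma\, \| Q_1 - Q_2 \|_{\infty},
\]
% hence \mathcal{T}^{\pi} is a \gamma-contraction in sup-norm and, by the
% Banach fixed-point theorem, has a unique fixed point Q^{\pi}.
```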

A plausible implication is that, due to these algorithmic innovations, SDSAC can be reliably adopted as a default policy-gradient method in discrete-action domains historically dominated by purely value-based algorithms.

7. Variants, Extensions, and Practical Considerations

Several implementation-level variants have been proposed, distinguished by temperature tuning strategy (fixed \alpha or learned, sometimes replaced by a modular entropy bonus \lambda), policy parameterization details, and specific regularization strengths (Zhang et al., 2024, Christodoulou, 2019, Zhou et al., 2022). Integration into Rainbow-style backbones (e.g., SAC-BBF) extends efficacy to heavily engineered agent architectures, allowing controlled replay ratios and super-human IQM with reduced training wall-clock time (Zhang et al., 2024).

Adoption in high-dimensional, multi-agent environments such as MOBA further validates the scalability of the approach. Detailed ablations from recent work confirm the indispensability of the three core mechanisms: double-average Q-learning, explicit entropy penalty, and Q-clip.

In summary, Stable Discrete SAC constitutes an empirically validated, theoretically principled framework for entropy-regularized, off-policy learning in discrete action spaces, robust against the instability and estimation bias that impede naive SAC analogues (Zhou et al., 2022, Zhang et al., 2024, Christodoulou, 2019).
