
Soft Actor Critic Algorithm

Updated 29 January 2026
  • Soft Actor Critic (SAC) is a deep reinforcement learning algorithm defined by a maximum-entropy objective, enabling effective exploration and robust policy improvement.
  • It employs twin Q critics and automatic temperature adaptation to balance bias control with high sample efficiency and stable learning.
  • In discrete-action variants, SAC modifies the network outputs and Bellman backups so that expectations over actions are computed exactly, significantly reducing estimator variance.

Soft Actor Critic (SAC) is a deep reinforcement learning (RL) algorithm developed to achieve high sample efficiency, effective exploration, and robust policy improvement by maximizing entropy in addition to expected return. SAC has substantially influenced model-free off-policy RL for both continuous and discrete action domains, and its core principles have served as a foundation for numerous subsequent advances. The canonical formulation combines a maximum-entropy objective with a stochastic actor and twin Q-function critics, together with distinctive algorithmic mechanisms such as automatic temperature adaptation and reparameterized policy gradients (Haarnoja et al., 2018). Discrete-action SAC variants extend these principles, with architectural and backup modifications tailored to finite action spaces (Christodoulou, 2019).

1. Maximum Entropy Objective and Policy Iteration

SAC is defined by the maximization of a stochastic policy's expected cumulative reward regularized by its entropy:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty}\gamma^{t}\big(r(s_t,a_t) + \alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\big)\right]$$

where $\gamma$ is the discount factor, $\alpha > 0$ is the entropy (temperature) coefficient, and $\mathcal{H}(\pi(\cdot\mid s)) = -\mathbb{E}_{a\sim\pi}[\log\pi(a\mid s)]$ (Haarnoja et al., 2018; Christodoulou, 2019).
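As a concrete illustration of the objective, the following minimal NumPy sketch evaluates the entropy-regularized return for a short toy trajectory; the rewards, action probabilities, $\gamma$, and $\alpha$ below are illustrative values, not taken from the cited papers.

```python
# Minimal sketch: entropy-regularized (soft) return of a toy trajectory.
import numpy as np

gamma, alpha = 0.99, 0.2                        # discount and temperature (illustrative)
rewards = [1.0, 0.0, 2.0]                       # r(s_t, a_t) along the trajectory
action_probs = [np.array([0.7, 0.3]),           # pi(.|s_t) at each visited state
                np.array([0.5, 0.5]),
                np.array([0.9, 0.1])]

soft_return = 0.0
for t, (r, p) in enumerate(zip(rewards, action_probs)):
    entropy = -np.sum(p * np.log(p))            # H(pi(.|s_t))
    soft_return += gamma**t * (r + alpha * entropy)

print(soft_return)                              # return augmented by the entropy bonus
```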

In the continuous-action case, the policy is typically parameterized as a squashed Gaussian. Policy evaluation employs a “soft” Bellman backup:

$$Q^{\pi}(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}\big[V^{\pi}(s')\big], \qquad V^{\pi}(s) = \mathbb{E}_{a\sim\pi}\big[Q^{\pi}(s,a) - \alpha\log\pi(a\mid s)\big]$$

Policy improvement minimizes the reverse KL divergence between the policy and a Boltzmann distribution over Q-values:

$$\pi_{\text{new}}(\cdot\mid s) = \arg\min_{\pi'\in\Pi} D_{\mathrm{KL}}\!\left(\pi'(\cdot\mid s)\,\Big\|\,\frac{\exp\!\big(Q^{\pi}(s,\cdot)/\alpha\big)}{Z^{\pi}(s)}\right)$$

All steps generalize to discrete action spaces, with expectations over actions replaced by sums (Christodoulou, 2019).
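For a finite action set, one step of soft policy iteration can be written in a few lines. The NumPy sketch below computes the soft state value and the Boltzmann distribution that minimizes the KL objective for a single state; the Q-values, current policy, and $\alpha$ are illustrative assumptions.

```python
# Sketch: one soft policy-iteration step for a single state, |A| = 3.
import numpy as np

alpha = 0.2
q = np.array([1.0, 0.5, -0.3])                  # Q^pi(s, .) (illustrative)
pi = np.array([0.5, 0.3, 0.2])                  # current policy pi(.|s)

# Soft policy evaluation: V(s) = E_{a~pi}[Q(s, a) - alpha * log pi(a|s)]
v = np.sum(pi * (q - alpha * np.log(pi)))

# Soft policy improvement: the KL minimizer (over all distributions) is the
# Boltzmann distribution exp(Q/alpha) / Z, i.e. a softmax with temperature alpha.
logits = q / alpha
pi_new = np.exp(logits - logits.max())
pi_new /= pi_new.sum()

print(v, pi_new)
```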

2. Twin-Q Critics, Policy Updates, and Temperature Adaptation

SAC employs an off-policy actor-critic design with two independent Q-networks, $Q_{\theta_1}$ and $Q_{\theta_2}$ (Haarnoja et al., 2018; Christodoulou, 2019). The soft Bellman target is

$$y = r + \gamma\left[\min_{j=1,2} Q_{\bar{\theta}_j}(s',a') - \alpha\log\pi_\phi(a'\mid s')\right],\qquad a'\sim\pi_\phi(\cdot\mid s')$$

This clipped double-Q arrangement controls overestimation bias. The corresponding critic loss is

$$\mathcal{L}_Q(\theta_i) = \mathbb{E}_{(s,a,r,s')\sim D}\left[\tfrac{1}{2}\big(Q_{\theta_i}(s,a) - y\big)^{2}\right]$$
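A minimal PyTorch sketch of this target and critic loss follows; the network handles (`q1`, `q2`, `q1_targ`, `q2_targ`) and the `policy.sample` interface are assumptions for illustration, not a reference implementation.

```python
# Sketch: clipped double-Q target and critic loss for continuous actions.
import torch
import torch.nn.functional as F

def critic_loss(batch, q1, q2, q1_targ, q2_targ, policy, alpha, gamma):
    s, a, r, s_next, done = batch                           # replay-buffer tensors
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)           # a' ~ pi(.|s'), log pi(a'|s')
        q_next = torch.min(q1_targ(s_next, a_next),
                           q2_targ(s_next, a_next))         # min over the twin targets
        y = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)
    # 0.5 * squared error per critic, matching the loss above.
    return 0.5 * F.mse_loss(q1(s, a), y) + 0.5 * F.mse_loss(q2(s, a), y)
```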

For the actor (policy), the canonical SAC formulation uses the reparameterization trick for stable, low-variance gradients in continuous domains:

$$\mathcal{L}_\pi(\phi) = \mathbb{E}_{s\sim D,\;\epsilon\sim\mathcal{N}}\left[\alpha\log\pi_\phi\big(f_\phi(\epsilon;s)\mid s\big) - \min_{i=1,2} Q_{\theta_i}\big(s, f_\phi(\epsilon;s)\big)\right]$$

In discrete SAC, the policy $\pi_\phi(s)\in\Delta^{|A|}$ is parameterized via a softmax, and the update becomes

$$J_\pi(\phi) = \mathbb{E}_{s\sim D}\left[\sum_{a}\pi_\phi(a\mid s)\big(\alpha\log\pi_\phi(a\mid s) - Q_{\theta_1}(s,a)\big)\right]$$

No reparameterization is required, and action expectations are computed exactly by summation (Christodoulou, 2019).
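Both actor losses can be sketched directly from these expressions (PyTorch); the `policy.rsample` interface, the `[batch, |A|]` output shapes, and the small constant inside the logarithm are illustrative assumptions.

```python
# Sketch: reparameterized continuous actor loss and exact-expectation discrete actor loss.
import torch

def actor_loss_continuous(s, policy, q1, q2, alpha):
    a, logp = policy.rsample(s)                 # a = f_phi(eps; s) with its log-probability
    q = torch.min(q1(s, a), q2(s, a))
    return (alpha * logp - q).mean()

def actor_loss_discrete(s, policy, q1, alpha):
    probs = policy(s)                           # [batch, |A|] softmax probabilities
    log_probs = torch.log(probs + 1e-8)
    q = q1(s)                                   # [batch, |A|] all-action Q-values
    # Exact expectation: sum_a pi(a|s) * (alpha * log pi(a|s) - Q(s, a))
    return (probs * (alpha * log_probs - q)).sum(dim=-1).mean()
```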

The temperature $\alpha$ can be adaptively tuned by minimizing

$$J(\alpha) = \mathbb{E}_{s\sim D,\;a\sim\pi_\phi}\big[-\alpha\big(\log\pi_\phi(a\mid s) + \bar{H}\big)\big]$$

with target entropy $\bar{H}$, supporting an automatic exploration-exploitation tradeoff (Haarnoja et al., 2018; Christodoulou, 2019).
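In practice this is commonly implemented by optimizing $\log\alpha$ so that $\alpha$ stays positive; the PyTorch sketch below follows that convention, with `target_entropy` standing in for $\bar{H}$ and `logp` denoting $\log\pi_\phi(a\mid s)$ for sampled (or exactly expected) actions.

```python
# Sketch: automatic temperature adaptation via a learnable log(alpha).
import torch

log_alpha = torch.zeros(1, requires_grad=True)  # alpha = exp(log_alpha) > 0

def alpha_loss(logp, log_alpha, target_entropy):
    # J(alpha) = E[-alpha * (log pi(a|s) + H_bar)]; logp is detached so only alpha is updated.
    return -(log_alpha.exp() * (logp + target_entropy).detach()).mean()
```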

3. Discrete-Action SAC: Architectural and Backup Adjustments

In discrete action domains, direct sampling of actions is replaced by explicit sums. Key architectural and computational modifications are:

  • The Q-network outputs a vector over the available actions, $Q_\theta(s) \in \mathbb{R}^{|A|}$.
  • The policy outputs a categorical probability vector via softmax, $\pi_\phi(s) \in [0,1]^{|A|}$.
  • All value, entropy, and expectation computations use exact sums rather than Monte Carlo, reducing estimator variance.
  • Bellman backup and policy expectation are implemented as matrix-vector operations (Christodoulou, 2019).

In pseudocode, update steps involve:

  • Policy evaluation using $V(s') = \pi_\phi(s')^{\top}\big[\min_i \bar{Q}_{\theta_i}(s') - \alpha\log\pi_\phi(s')\big]$
  • Critic updates with target $y = r + \gamma V(s')$
  • Policy update via exact expectation, with no reparameterization trick (Christodoulou, 2019); a vectorized sketch of these steps appears below.
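A minimal PyTorch sketch of the vectorized backup, assuming target critics and a policy that return `[batch, |A|]` tensors (names are illustrative):

```python
# Sketch: discrete soft Bellman target computed as batched vector operations.
import torch

def discrete_targets(r, s_next, done, q1_targ, q2_targ, policy, alpha, gamma):
    with torch.no_grad():
        probs = policy(s_next)                                # pi(.|s'), [batch, |A|]
        log_probs = torch.log(probs + 1e-8)
        q_min = torch.min(q1_targ(s_next), q2_targ(s_next))   # element-wise min of the twins
        # V(s') = pi(s')^T [ min_i Q_i(s') - alpha * log pi(s') ]
        v_next = (probs * (q_min - alpha * log_probs)).sum(dim=-1)
        return r + gamma * (1.0 - done) * v_next              # y = r + gamma * V(s')
```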

This architecture eliminates reparameterization and action sampling noise, significantly lowering variance in update targets.

4. Empirical Performance: Atari Benchmarks and Key Hyperparameters

SAC-Discrete, with minimal hyperparameter tuning, was benchmarked on 20 Atari games and compared to the tuned Rainbow baseline:

  • Network: three convolutional layers ([32, 64, 64] channels, kernel sizes [8, 4, 3], strides [4, 2, 1]) followed by two fully connected layers ([512, |A|]); see the sketch after this list.
  • Batch size: 64; replay buffer size: $10^6$; $\gamma = 0.99$; Adam learning rate: $3\times10^{-4}$.
  • Target smoothing coefficient: $\tau = 1/8000$; reward clipping to $[-1, 1]$; initial random actions: 20,000 steps; target entropy $\bar{H} = 0.98\times(-\log(1/|A|))$.
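A PyTorch sketch of the reported architecture is given below; the input of four stacked 84x84 frames is a common Atari preprocessing convention assumed here rather than stated above, and the same torso can serve as either a Q-network or a policy head.

```python
# Sketch: conv torso ([32, 64, 64] channels, kernels [8, 4, 3], strides [4, 2, 1])
# followed by fully connected layers [512, |A|].
import torch.nn as nn

def make_network(num_actions: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 spatial map for 84x84 inputs
        nn.Linear(512, num_actions),             # Q-values or policy logits over |A|
    )
```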

SAC-Discrete outperformed Rainbow in 10 of 20 games, with a median relative performance of $-1\%$, a maximum of $+4330\%$, and a minimum of $-99\%$. Even without hyperparameter tuning, it matched Rainbow’s sample efficiency in the low-data regime, with the robust sample efficiency and stability attributed to the combination of twin critics and entropy regularization (Christodoulou, 2019).

5. Failure Modes and Advances in Discrete SAC

Two prominent issues arise in vanilla discrete SAC:

  • Q-value underestimation: The use of “min” in the Q-update target, combined with the sum over actions, introduces downward bias via Jensen’s inequality. This can collapse the Q-functions and result in near-uniform policies and unstable training (a numeric illustration follows this list).
  • Performance instability: As policy updates incorporate the policy's own logits, vanishing Q-values induce high-variance training and poor convergence in sparsely sampled states (Zhou et al., 2022).
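A small NumPy check of the first point: taking the element-wise minimum of two noisy critics inside the expectation over actions never exceeds, and is typically strictly below, the minimum of their expected values, so the per-action min pushes targets downward.

```python
# Numeric illustration of the Jensen-style downward bias.
import numpy as np

rng = np.random.default_rng(0)
pi = np.full(4, 0.25)                           # uniform policy over 4 actions
q_true = np.zeros(4)                            # true Q-values are all zero

q1 = q_true + rng.normal(scale=1.0, size=4)     # two independent noisy critics
q2 = q_true + rng.normal(scale=1.0, size=4)

inside = np.sum(pi * np.minimum(q1, q2))        # E_a[ min_j Q_j(s, a) ]
outside = min(np.sum(pi * q1), np.sum(pi * q2)) # min_j E_a[ Q_j(s, a) ]
print(inside, outside)                          # inside <= outside always holds
```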

Stable Discrete SAC (SDSAC) proposes:

  • Entropy-penalty modification (using a penalty, not a bonus) in the actor objective.
  • Double-average Q-learning: Targets use the average of the twin Q-networks instead of their minimum, mitigating the downward bias (see the sketch after this list).
  • Q-clip mechanism: TD-targets are clipped towards a recent average to prevent rare bootstrap-induced outliers. Together, these modifications provide 200–400% higher final returns on 13 of 18 Atari games and significantly improved learning stability (Zhou et al., 2022).
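The PyTorch sketch below illustrates a double-average target with clipping toward a recent average, in the spirit of these modifications; the exact formulation in Zhou et al. (2022) may differ, and `q_running_avg` and `clip_range` are illustrative assumptions.

```python
# Hedged sketch: averaged twin-Q target with clipping toward a running average.
import torch

def averaged_clipped_target(r, s_next, done, q1_targ, q2_targ, policy,
                            alpha, gamma, q_running_avg, clip_range=1.0):
    with torch.no_grad():
        probs = policy(s_next)
        log_probs = torch.log(probs + 1e-8)
        q_avg = 0.5 * (q1_targ(s_next) + q2_targ(s_next))    # average instead of min
        v_next = (probs * (q_avg - alpha * log_probs)).sum(dim=-1)
        y = r + gamma * (1.0 - done) * v_next
        # Clip the TD target toward a recent average to damp rare bootstrap outliers.
        return torch.clamp(y, q_running_avg - clip_range, q_running_avg + clip_range)
```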

6. Implementation Details and Algorithmic Pseudocode

A practitioner-implementable version of SAC-Discrete consists of:

  • Replay buffer $D$; two Q-networks $Q_{\theta_1}$, $Q_{\theta_2}$; target Q-networks; policy $\pi_\phi: S \to \Delta^{|A|}$; temperature $\alpha$ (learned or fixed).
  • For each interaction step: sample action $a_t \sim \pi_\phi(\cdot\mid s_t)$, observe $(s_t, a_t, r_t, s_{t+1})$, and store it in $D$.
  • For each gradient step:

    1. Sample a minibatch from $D$.
    2. Compute $V(s')$ and the Q-targets exactly as described above.
    3. Critic update: minimize $J_Q(\theta_i)$ over the samples for $i = 1, 2$.
    4. Policy update: minimize $J_\pi(\phi)$ using a vectorized sum over all actions.
    5. Temperature $\alpha$ update, if adaptive.
    6. Polyak average for target Q update (Christodoulou, 2019).

Distinctive aspects of this workflow are the reliance on all-actions exact computation rather than stochastic estimation and the avoidance of reparameterization.
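Step 6 is the standard Polyak (exponential moving average) update of the target critics; a minimal PyTorch sketch, with `tau` playing the role of the target-smoothing coefficient listed above:

```python
# Sketch: Polyak averaging of target-network parameters.
import torch

@torch.no_grad()
def polyak_update(net, target_net, tau):
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)    # theta_targ <- (1 - tau) * theta_targ + tau * theta
```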

7. Significance, Limitations, and Extensions

SAC and its discrete variants established state-of-the-art performance in model-free RL for both continuous and discrete domains, powered by the maximum-entropy principle and off-policy data utilization (Haarnoja et al., 2018, Christodoulou, 2019). The exact action-expectation infrastructure of SAC-Discrete yields robust, low-variance updates applicable to high-dimensional discrete control.

However, in discrete settings, special attention must be paid to the interaction between double Q-learning and entropy bonuses to avoid underestimation and instability (Zhou et al., 2022). Methods such as entropy-penalty actor losses, double-average critics, and target clipping are empirically validated to circumvent these pathologies.

The continued evolution of SAC-inspired algorithms reflects ongoing interest in trustworthy, stable, off-policy RL under both continuous and discrete action modes.


References:

  • Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML 2018.
  • Haarnoja, T., et al. (2018). Soft Actor-Critic Algorithms and Applications. arXiv preprint.
  • Christodoulou, P. (2019). Soft Actor-Critic for Discrete Action Settings. arXiv preprint.
  • Zhou, H., et al. (2022). Revisiting Discrete Soft Actor-Critic. arXiv preprint.