Soft Actor-Critic Algorithm

Updated 24 December 2025
  • Soft Actor-Critic is a model-free, off-policy deep RL algorithm that maximizes both return and policy entropy for enhanced exploration and robust performance.
  • It leverages soft policy iteration with twin Q-networks and automatic temperature tuning, ensuring stable training across diverse continuous control tasks.
  • Empirical evaluations on benchmarks like Hopper and Minitaur confirm SAC’s superior sample efficiency, low variance, and resilience to hyperparameter variations.

Soft Actor-Critic (SAC) is a model-free, off-policy deep reinforcement learning algorithm formulated within the maximum entropy RL framework. SAC simultaneously seeks to maximize both expected return and the entropy of the policy at each time step, enabling more robust exploration and superior sample efficiency relative to traditional actor-critic methods. SAC achieves state-of-the-art performance on challenging continuous control benchmarks and demonstrates stability across a range of hyperparameters and random seeds, making it a reference method for continuous and, via extensions, discrete domains (Haarnoja et al., 2018).

1. Maximum Entropy Framework and Objective

SAC operates on a Markov decision process $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ with continuous state and action spaces. Standard RL maximizes the expected discounted return:

$$J_{\text{std}}(\pi) = \mathbb{E}_{\rho_\pi}\left[ \sum_{t=0}^\infty \gamma^t r(s_t, a_t) \right]$$

SAC's maximum entropy formulation augments this objective with an entropy bonus at every step:

$$J(\pi) = \mathbb{E}_{\rho_\pi}\left[ \sum_{t=0}^\infty \gamma^t \left( r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot|s_t)) \right) \right]$$

where $\mathcal{H}(\pi(\cdot|s)) = -\int_{\mathcal{A}} \pi(a|s) \log \pi(a|s)\, \mathrm{d}a$ and $\alpha > 0$ is the temperature parameter trading off reward and stochasticity. Maximizing entropy (via $\alpha$) encourages temporally coherent, diverse exploratory behavior, which is crucial for sample efficiency and robustness in complex domains (Haarnoja et al., 2018).
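
To make the entropy bonus concrete, the following minimal sketch (a PyTorch-based illustration; the mean, standard deviation, reward, and temperature values are invented for the example) computes the per-step maximum-entropy term $r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))$ for a diagonal Gaussian policy.

```python
import torch
from torch.distributions import Normal

# Illustrative diagonal Gaussian policy over a 2-D action space (values are made up).
mean = torch.tensor([0.0, 0.5])
std = torch.tensor([0.3, 0.3])
policy = Normal(mean, std)

alpha = 0.2                  # temperature (illustrative value)
reward = torch.tensor(1.0)   # environment reward r(s, a) (illustrative value)

# Differential entropy of the diagonal Gaussian, summed over action dimensions.
entropy = policy.entropy().sum()

# Per-step maximum-entropy objective term: r(s, a) + alpha * H(pi(.|s)).
soft_reward = reward + alpha * entropy
print(float(entropy), float(soft_reward))
```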

2. Soft Policy Iteration: Evaluation and Improvement

SAC is grounded in the soft policy iteration paradigm:

  • Soft policy evaluation: For a fixed stochastic policy $\pi$, the "soft" Q-value and value function are:

$$Q^\pi(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s'\sim P}\left[ V^\pi(s') \right], \qquad V^\pi(s) = \mathbb{E}_{a\sim\pi}\left[ Q^\pi(s, a) - \alpha \log \pi(a|s) \right]$$

The soft Bellman operator:

$$(\mathcal{T}^\pi Q)(s, a) = r(s,a) + \gamma\, \mathbb{E}_{s'\sim P}\, \mathbb{E}_{a'\sim\pi}\left[ Q(s', a') - \alpha \log \pi(a'|s') \right]$$

iterated until convergence yields $Q^\pi$.

  • Soft policy improvement: Given $Q^\pi$, the policy $\pi$ is updated by minimizing the KL divergence to the Boltzmann distribution induced by $Q^\pi$:

$$\pi_{\text{new}}(\cdot|s) = \arg\min_{\pi' \in \Pi} D_{\mathrm{KL}}\!\left( \pi'(\cdot|s) \,\Big\|\, \frac{\exp\left(Q^\pi(s, \cdot)/\alpha\right)}{Z(s)} \right)$$

In the parametric setting (Gaussian policy), the actor is updated via stochastic gradients on the loss:

$$J_\pi(\theta_\pi) = \mathbb{E}_{s\sim\mathcal{D},\, a\sim\pi_{\theta_\pi}}\left[ \alpha \log \pi_{\theta_\pi}(a|s) - \min_{i} Q_{\phi_i}(s, a) \right]$$

with actions $a = f_{\theta_\pi}(\epsilon; s)$, $\epsilon \sim \mathcal{N}(0, I)$, sampled by reparameterization (Haarnoja et al., 2018).

  • Critic (Q-Function) update: Two Q-networks with "double Q-learning" are fitted to targets

$$y = r + \gamma\left[ \min_{j=1,2} Q_{\bar\phi_j}(s', a') - \alpha \log \pi(a'|s') \right], \qquad a' \sim \pi(\cdot|s')$$

using a mean-squared-error (MSE) loss. A code sketch of the actor and critic updates follows this list.
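
A minimal PyTorch-style sketch of the actor and critic losses defined above is given below. It is an illustration, not the reference implementation: the `policy.sample(obs)` interface (returning a reparameterized action and its log-probability) is an assumed convention, and the `(1 - done)` termination mask is a common practical detail not spelled out in the equations.

```python
import torch

def critic_target(reward, next_obs, done, policy, q1_targ, q2_targ, alpha, gamma=0.99):
    """Soft Bellman target: y = r + gamma * (min_j Q_targ_j(s', a') - alpha * log pi(a'|s'))."""
    with torch.no_grad():
        next_action, next_logp = policy.sample(next_obs)   # a' ~ pi(.|s'), log pi(a'|s')
        q_next = torch.min(q1_targ(next_obs, next_action),
                           q2_targ(next_obs, next_action))
        # (1 - done) masks bootstrapping at terminal transitions (common practical detail).
        return reward + gamma * (1.0 - done) * (q_next - alpha * next_logp)

def actor_loss(obs, policy, q1, q2, alpha):
    """J_pi = E[alpha * log pi(a|s) - min_i Q_i(s, a)], with reparameterized actions."""
    action, logp = policy.sample(obs)                      # a = f_theta(eps; s), differentiable
    q_min = torch.min(q1(obs, action), q2(obs, action))
    return (alpha * logp - q_min).mean()

def critic_loss(obs, action, target, q1, q2):
    """MSE of both Q-networks against the shared soft Bellman target y."""
    return ((q1(obs, action) - target) ** 2).mean() + ((q2(obs, action) - target) ** 2).mean()
```

The `torch.min` over the two critics implements the clipped double-Q estimate that appears as $\min_i Q_{\phi_i}$ in the actor loss and $\min_{j=1,2} Q_{\bar\phi_j}$ in the target.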

3. Temperature Parameter and Its Automatic Tuning

SAC introduces an automatic scheme for updating $\alpha$ such that the average policy entropy matches a target $\bar{\mathcal{H}}$. The constrained maximization problem:

$$\max_\pi\ \mathbb{E}_{\rho_\pi}\left[ \sum_t r(s_t, a_t) \right] \quad \text{subject to} \quad \mathbb{E}\left[ -\log \pi(a_t|s_t) \right] \geq \bar{\mathcal{H}}$$

yields a dual framework where the temperature is tuned via dual gradient descent:

$$J(\alpha) = \mathbb{E}_{s,\, a\sim\pi}\left[ -\alpha\left( \log \pi(a|s) + \bar{\mathcal{H}} \right) \right], \qquad \alpha \leftarrow \alpha - \lambda_\alpha \nabla_\alpha J(\alpha)$$

This auto-tuning eliminates the need for task-specific entropy-regularization schedules and stabilizes training (Haarnoja et al., 2018).
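
A minimal sketch of this dual update, assuming PyTorch, is shown below. Parameterizing $\log \alpha$ (to keep $\alpha > 0$) and setting the target entropy to $-\dim(\mathcal{A})$ are common implementation choices rather than requirements stated above; the variable names are illustrative.

```python
import torch

action_dim = 6
target_entropy = -float(action_dim)             # common heuristic: H_bar = -dim(A)
log_alpha = torch.zeros(1, requires_grad=True)  # optimize log(alpha) so alpha stays positive
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_temperature(logp_batch):
    """One dual-gradient step on J(alpha) = E[-alpha * (log pi(a|s) + H_bar)]."""
    alpha = log_alpha.exp()
    alpha_loss = -(alpha * (logp_batch.detach() + target_entropy)).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().detach()             # current temperature for the actor/critic losses
```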

4. Algorithm Details and Practical Implementation

SAC operates entirely off-policy using stochastic gradient updates over replay buffer transitions, with sample-efficient reuse of data. Implementation recommendations include:

  • Architecture: two Q-networks, a Gaussian policy network, and target Q-networks; each network uses two hidden layers of 256 ReLU units.
  • Action bounding: actions are sampled as $u \sim \mathcal{N}(\mu, \sigma)$, $a = \tanh(u)$, with the log-density corrected by $-\sum_i \log(1 - \tanh(u_i)^2)$ (see the sketch after this list).
  • Optimization: Adam optimizer with learning rates $\lambda_Q = \lambda_\pi = \lambda_\alpha = 3 \times 10^{-4}$.
  • Replay buffer: size $1 \times 10^6$ transitions, batch size $256$.
  • Polyak averaging: $\tau = 0.005$ for the target networks.
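
The action-bounding step from the list above can be sketched as follows (PyTorch assumed; the small `eps` constant is a numerical-stability convenience, not part of the stated correction formula).

```python
import torch
from torch.distributions import Normal

def squashed_gaussian_sample(mean, log_std, eps=1e-6):
    """Sample a = tanh(u), u ~ N(mean, std), with the tanh log-density correction."""
    std = log_std.exp()
    dist = Normal(mean, std)
    u = dist.rsample()                          # reparameterized: u = mean + std * noise
    a = torch.tanh(u)
    # log pi(a|s) = log N(u; mean, std) - sum_i log(1 - tanh(u_i)^2)
    log_prob = dist.log_prob(u).sum(dim=-1)
    log_prob -= torch.log(1.0 - a.pow(2) + eps).sum(dim=-1)
    return a, log_prob
```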

Table: Core Steps of the SAC Algorithm

| Step | Update | Objective / Formula |
|------|--------|---------------------|
| Critic | $\phi_i \leftarrow \phi_i - \lambda_Q \nabla_{\phi_i} J_Q$ | $J_Q(\phi_i) = \mathbb{E}\left[ \tfrac{1}{2} \left( Q_{\phi_i}(s,a) - y \right)^2 \right]$ |
| Actor | $\theta_\pi \leftarrow \theta_\pi - \lambda_\pi \nabla_{\theta_\pi} J_\pi$ | $J_\pi(\theta_\pi) = \mathbb{E}\left[ \alpha \log \pi(a \mid s) - \min_i Q_{\phi_i}(s,a) \right]$ |
| Temperature | $\alpha \leftarrow \alpha - \lambda_\alpha \nabla_\alpha J(\alpha)$ | $J(\alpha) = \mathbb{E}\left[ -\alpha \left( \log \pi(a \mid s) + \bar{\mathcal{H}} \right) \right]$ |
| Targets | $\bar\phi_i \leftarrow \tau \phi_i + (1 - \tau)\bar\phi_i$ | Polyak averaging |
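
As a small illustration of the last table row, the Polyak (exponential moving average) target update might look like the following sketch (PyTorch assumed; `online_net` and `target_net` are hypothetical module names).

```python
import torch

def polyak_update(online_net, target_net, tau=0.005):
    """In-place Polyak averaging: phi_bar <- tau * phi + (1 - tau) * phi_bar."""
    with torch.no_grad():
        for p, p_targ in zip(online_net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
```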

SAC's off-policy construction enables a high update-to-data ratio, with robust sample reuse and low sensitivity to buffer staleness (Haarnoja et al., 2018).

5. Empirical Evaluation and Benchmarks

SAC achieves superior sample efficiency and final performance on continuous control domains, including:

  • Simulated tasks: OpenAI Gym/rllab environments—Hopper, Walker2d, HalfCheetah, Ant, Humanoid. SAC matches or outperforms DDPG, PPO, Soft Q-Learning, and TD3 in both learning speed and asymptotic return, with notably reduced performance variance across seeds. The agent is less sensitive to hyperparameters compared to alternatives.
  • Real-world robotics: Demonstrated on Minitaur quadruped locomotion (learns in ~2 hours, ~160k steps, generalizes to unseen terrains), and on dexterous hand manipulation with raw RGB vision (achieves robust valve rotation in ∼300k steps, and non-vision variant in ∼3 hours). Notably, no per-task hyperparameter tuning was needed, facilitating direct transfer to physical platforms (Haarnoja et al., 2018).

6. Robustness and Practical Significance

Due to the maximum entropy framework, twin Q-networks, automatic temperature adaptation, and purely off-policy optimization, SAC exhibits:

  • Stability: Low variance across random seeds and robustness to hyperparameter changes.
  • Exploration: Entropy regularization induces broad state-action visitation, especially critical in sparse reward or over-parameterized domains.
  • Sample efficiency: High learning efficiency is maintained without the drawbacks of on-policy data constraints, making SAC particularly effective for real-world robotics and any domain where data acquisition is expensive or slow.

In summary, SAC provides an algorithmic foundation for high-performance, robust, and efficient RL in continuous control, validated empirically on both simulated and real physical agents, and serves as an extensible basis for subsequent research in maximum entropy RL and deep actor-critic architectures (Haarnoja et al., 2018).
