Soft Actor-Critic Algorithm
- Soft Actor-Critic is a model-free, off-policy deep RL algorithm that maximizes both return and policy entropy for enhanced exploration and robust performance.
- It leverages soft policy iteration with twin Q-networks and automatic temperature tuning, ensuring stable training across diverse continuous control tasks.
- Empirical evaluations on benchmarks like Hopper and Minitaur confirm SAC’s superior sample efficiency, low variance, and resilience to hyperparameter variations.
Soft Actor-Critic (SAC) is a model-free, off-policy deep reinforcement learning algorithm formulated within the maximum entropy RL framework. SAC simultaneously seeks to maximize both expected return and the entropy of the policy at each time step, enabling more robust exploration and superior sample efficiency relative to traditional actor-critic methods. SAC achieves state-of-the-art performance on challenging continuous control benchmarks and demonstrates stability across a range of hyperparameters and random seeds, making it a reference method for continuous and, via extensions, discrete domains (Haarnoja et al., 2018).
1. Maximum Entropy Framework and Objective
SAC operates on a Markov decision process with continuous state and action spaces. Standard RL maximizes the expected discounted return:
$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[\gamma^t\, r(s_t, a_t)\right].$$
SAC's maximum entropy formulation augments this objective with an entropy bonus at every step:
$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right],$$
where $\mathcal{H}\big(\pi(\cdot \mid s_t)\big) = -\mathbb{E}_{a \sim \pi}\left[\log \pi(a \mid s_t)\right]$ and $\alpha$ is the temperature parameter trading off reward and stochasticity. Maximizing entropy (via the $\alpha$-weighted bonus) encourages temporally coherent, diverse exploratory behavior, which is crucial for sample efficiency and robustness in complex domains (Haarnoja et al., 2018).
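As a concrete illustration, the following minimal Python sketch computes the entropy-augmented discounted return of a single sampled trajectory; the function name, the fixed temperature `alpha = 0.2`, and the single-sample entropy estimate $-\log \pi(a_t \mid s_t)$ are assumptions for exposition, not part of the paper's code.

```python
def max_entropy_return(rewards, log_probs, alpha=0.2, gamma=0.99):
    """Discounted sum of r(s_t, a_t) + alpha * H(pi(.|s_t)), with the entropy
    bonus estimated from the sampled action as -log pi(a_t | s_t)."""
    total = 0.0
    for t, (r, logp) in enumerate(zip(rewards, log_probs)):
        total += (gamma ** t) * (r + alpha * (-logp))
    return total

# Example: a three-step trajectory with an assumed fixed temperature alpha = 0.2.
print(max_entropy_return([1.0, 0.5, 2.0], [-1.2, -0.8, -1.5]))
```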
2. Soft Policy Iteration: Evaluation and Improvement
SAC is grounded in the soft policy iteration paradigm:
- Soft policy evaluation: For a fixed stochastic policy $\pi$, the "soft" Q-value and value function are:
$$Q^\pi(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[V^\pi(s_{t+1})\right], \qquad V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q^\pi(s_t, a_t) - \alpha \log \pi(a_t \mid s_t)\right].$$
The soft Bellman operator
$$\mathcal{T}^\pi Q(s_t, a_t) \triangleq r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[V(s_{t+1})\right],$$
iterated until convergence, yields $Q^\pi$.
- Soft policy improvement: Given $Q^{\pi_{\text{old}}}$, the policy is updated by minimizing the KL divergence to the Boltzmann distribution induced by $Q^{\pi_{\text{old}}}$:
$$\pi_{\text{new}} = \arg\min_{\pi' \in \Pi} D_{\mathrm{KL}}\!\left(\pi'(\cdot \mid s_t) \,\Big\|\, \frac{\exp\big(\tfrac{1}{\alpha} Q^{\pi_{\text{old}}}(s_t, \cdot)\big)}{Z^{\pi_{\text{old}}}(s_t)}\right).$$
In the parametric setting (Gaussian policy), the actor is updated via stochastic gradients on the loss:
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}}\left[\alpha \log \pi_\phi\big(f_\phi(\epsilon_t; s_t) \mid s_t\big) - Q_\theta\big(s_t, f_\phi(\epsilon_t; s_t)\big)\right],$$
with actions $a_t = f_\phi(\epsilon_t; s_t)$ obtained by the reparameterization trick (Haarnoja et al., 2018); a code sketch of this update follows the list below.
- Critic (Q-function) update: Two Q-networks with clipped "double Q-learning" (taking the minimum of the two target estimates) are fitted to the targets
$$y = r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi_\phi}\left[\min_{i=1,2} Q_{\bar\theta_i}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1})\right]$$
using an MSE loss on $Q_{\theta_i}(s_t, a_t)$, as in the sketch below.
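The following PyTorch sketch shows one way to compute this soft Bellman target and critic loss; the names `q1_target`, `q2_target`, `policy.sample`, and the batch layout are assumptions for exposition, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def critic_loss(q1, q2, q1_target, q2_target, policy, batch, alpha, gamma=0.99):
    s, a, r, s_next, done = batch  # tensors sampled from the replay buffer (assumed layout)
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)
        # Soft value of the next state: min of the twin target Qs minus the entropy penalty.
        q_next = torch.min(q1_target(s_next, a_next), q2_target(s_next, a_next))
        y = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)
    # Both critics regress onto the same soft Bellman target with an MSE loss.
    return F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
```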
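Similarly, a minimal sketch of the reparameterized actor update, assuming `policy.rsample` returns a differentiable action together with its log-probability (the minimum over both critics is used, as in common SAC implementations):

```python
import torch

def actor_loss(q1, q2, policy, states, alpha):
    a, logp = policy.rsample(states)             # a = f_phi(eps; s), differentiable in phi
    q = torch.min(q1(states, a), q2(states, a))
    # Minimizing alpha * log pi - Q implements the KL projection step above.
    return (alpha * logp - q).mean()
```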
3. Temperature Parameter and Its Automatic Tuning
SAC introduces an automatic scheme for updating $\alpha$ such that the average policy entropy matches a target $\bar{\mathcal{H}}$. The constrained maximization problem:
$$\max_\pi\, \mathbb{E}_{\rho_\pi}\left[\sum_t r(s_t, a_t)\right] \quad \text{s.t.} \quad \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[-\log \pi(a_t \mid s_t)\right] \geq \bar{\mathcal{H}} \;\; \forall t$$
yields a dual formulation in which the temperature is tuned via dual gradient descent on:
$$J(\alpha) = \mathbb{E}_{a_t \sim \pi}\left[-\alpha \log \pi(a_t \mid s_t) - \alpha \bar{\mathcal{H}}\right].$$
This auto-tuning eliminates the need for task-specific entropy-regularization schedules and stabilizes training (Haarnoja et al., 2018).
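A minimal PyTorch sketch of this dual update, assuming $\alpha$ is parameterized through `log_alpha` (to keep it positive) and `target_entropy` is set by the user, e.g. to the common heuristic $-\dim(\mathcal{A})$; names and the learning rate are illustrative, not prescribed by the paper.

```python
import torch

log_alpha = torch.zeros(1, requires_grad=True)           # alpha = exp(log_alpha) > 0
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def temperature_update(log_probs, target_entropy):
    # J(alpha) = E[-alpha * log pi(a|s) - alpha * target_entropy]
    alpha_loss = -(log_alpha.exp() * (log_probs.detach() + target_entropy)).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()                         # current temperature
```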
4. Algorithm Details and Practical Implementation
SAC operates entirely off-policy using stochastic gradient updates over replay buffer transitions, with sample-efficient reuse of data. Implementation recommendations include:
- Architecture: Two Q-networks, a Gaussian policy network, and target Q-networks; each uses two hidden layers of 256 ReLU units.
- Action bounding: Unbounded Gaussian samples $u \sim \mu_\phi(\cdot \mid s)$ are squashed as $a = \tanh(u)$, with the log-density corrected by the change-of-variables term $\log \pi(a \mid s) = \log \mu_\phi(u \mid s) - \sum_i \log\big(1 - \tanh^2(u_i)\big)$ (see the sketch after this list).
- Optimization: Adam optimizer with learning rate $3 \times 10^{-4}$ for all networks.
- Replay Buffer: Size $10^6$ transitions, batch size $256$.
- Polyak averaging: target networks updated with smoothing coefficient $\tau = 0.005$.
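A short sketch of the squashed-Gaussian sampling step with this log-density correction (function and variable names are illustrative; the small epsilon guarding the logarithm is a common implementation detail, not specified in the paper):

```python
import torch
from torch.distributions import Normal

def squashed_sample(mean, log_std):
    dist = Normal(mean, log_std.exp())
    u = dist.rsample()                        # reparameterized pre-squash sample
    a = torch.tanh(u)                         # bounded action in (-1, 1)
    # log pi(a|s) = log mu(u|s) - sum_i log(1 - tanh(u_i)^2)
    log_prob = dist.log_prob(u) - torch.log(1.0 - a.pow(2) + 1e-6)
    return a, log_prob.sum(dim=-1, keepdim=True)
```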
Table: Core Steps of the SAC Algorithm
| Step | Update | Objective / Formula |
|---|---|---|
| Critic | Minimize soft Bellman residual (twin Q-networks) | $J_Q(\theta_i) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\big[\tfrac{1}{2}\big(Q_{\theta_i}(s_t, a_t) - y\big)^2\big]$ |
| Actor | Minimize KL to the Boltzmann distribution (reparameterized gradient) | $J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_\phi}\big[\alpha \log \pi_\phi(a_t \mid s_t) - \min_{i} Q_{\theta_i}(s_t, a_t)\big]$ |
| Temperature | Dual gradient descent toward target entropy $\bar{\mathcal{H}}$ | $J(\alpha) = \mathbb{E}_{a_t \sim \pi}\big[-\alpha \log \pi(a_t \mid s_t) - \alpha \bar{\mathcal{H}}\big]$ |
| Targets | Polyak averaging | $\bar\theta_i \leftarrow \tau \theta_i + (1 - \tau)\,\bar\theta_i$ |
SAC's off-policy construction enables a high update-to-data ratio, with robust sample reuse and low sensitivity to buffer staleness (Haarnoja et al., 2018).
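Tying the pieces together, the following high-level sketch outlines one SAC gradient step using the loss helpers sketched above; the `nets`/`optimizers` containers and the `buffer.sample` interface are assumptions for exposition, not the authors' reference code.

```python
import torch

def sac_update(buffer, nets, optimizers, alpha, tau=0.005, batch_size=256):
    batch = buffer.sample(batch_size)              # (s, a, r, s_next, done), assumed layout

    # 1. Critic step: regress both Q-networks onto the soft Bellman target.
    q_loss = critic_loss(nets.q1, nets.q2, nets.q1_targ, nets.q2_targ,
                         nets.policy, batch, alpha)
    optimizers.critic.zero_grad()
    q_loss.backward()
    optimizers.critic.step()

    # 2. Actor step: reparameterized gradient on the KL-projection objective.
    pi_loss = actor_loss(nets.q1, nets.q2, nets.policy, batch[0], alpha)
    optimizers.actor.zero_grad()
    pi_loss.backward()
    optimizers.actor.step()

    # 3. Target networks: Polyak averaging, theta_bar <- tau*theta + (1 - tau)*theta_bar.
    with torch.no_grad():
        for q, q_targ in ((nets.q1, nets.q1_targ), (nets.q2, nets.q2_targ)):
            for p, p_targ in zip(q.parameters(), q_targ.parameters()):
                p_targ.mul_(1.0 - tau).add_(tau * p)
```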
5. Empirical Evaluation and Benchmarks
SAC achieves superior sample efficiency and final performance on continuous control domains, including:
- Simulated tasks: OpenAI Gym/rllab environments—Hopper, Walker2d, HalfCheetah, Ant, Humanoid. SAC matches or outperforms DDPG, PPO, Soft Q-Learning, and TD3 in both learning speed and asymptotic return, with notably reduced performance variance across seeds. The agent is less sensitive to hyperparameters compared to alternatives.
- Real-world robotics: Demonstrated on Minitaur quadruped locomotion (learns in ~2 hours, ~160k steps, and generalizes to unseen terrains) and on dexterous hand manipulation from raw RGB vision (achieves robust valve rotation in ~300k steps; a non-vision variant learns in ~3 hours). Notably, no per-task hyperparameter tuning was needed, facilitating direct transfer to physical platforms (Haarnoja et al., 2018).
6. Robustness and Practical Significance
Due to the maximum entropy framework, twin Q-networks, automatic temperature adaptation, and purely off-policy optimization, SAC exhibits:
- Stability: Low variance across random seeds and robustness to hyperparameter changes.
- Exploration: Entropy regularization induces broad state-action visitation, especially critical in sparse reward or over-parameterized domains.
- Sample efficiency: High learning efficiency is maintained without the drawbacks of on-policy data constraints, making SAC particularly effective for real-world robotics and any domain where data acquisition is expensive or slow.
In summary, SAC provides an algorithmic foundation for high-performance, robust, and efficient RL in continuous control, validated empirically on both simulated and real physical agents, and serves as an extensible basis for subsequent research in maximum entropy RL and deep actor-critic architectures (Haarnoja et al., 2018).