
T-SAC: Sequence-Aware Soft Actor-Critic

Updated 1 January 2026
  • T-SAC is a reinforcement learning algorithm that incorporates temporal sequence modeling using GRUs or transformers to capture historical context.
  • It improves performance, robustness, and generalization in complex sequential decision-making tasks, notably for engine control in hybrid vehicles.
  • By replacing standard feedforward networks with sequence-aware architectures, T-SAC enables efficient handling of long-horizon dependencies and dynamic system variations.

T-SAC (Sequence-Aware Soft Actor-Critic) refers to a class of algorithms that enhance the Soft Actor-Critic (SAC) framework for reinforcement learning (RL) by equipping both actor and critic networks with explicit mechanisms for temporal sequence modeling. The explicit goal is to improve performance, robustness, and generalization in challenging real-world sequential decision-making tasks—most notably, engine control optimization in hybrid electric vehicle (HEV) powertrains—by leveraging either recurrent neural networks or attention-based models to capture temporal dependencies in the policy and value functions (Jaleel et al., 6 Aug 2025).

1. Problem Setting and Motivation

T-SAC targets control problems that exhibit strong temporal dependencies, as is typical in electrified powertrain management for series-hybrid electric vehicles. These systems require coordinated decision-making over extended horizons to optimize fuel consumption and maintain desirable battery state-of-charge (SOC) across variable operating conditions. The high-dimensional, continuous control space (e.g., engine speed and torque) compounds the challenge, and the returns at any step depend non-trivially on the sequence of prior states and actions. Standard SAC algorithms (using feedforward networks as encoders) fail to retain adequate temporal context, treating each observation as i.i.d., and thus often underperform in such settings.

In T-SAC, the SAC algorithm is augmented such that both the policy ("actor") and value function ("critic") networks are sequence-aware: specifically, by integrating Gated Recurrent Units (GRUs) or Decision Transformers (DTs). This enables the networks to encode and reason over relevant historical and predictive context (Jaleel et al., 6 Aug 2025).

2. Markov Decision Process Formulation and Losses

The engine control problem is posed as an infinite-horizon Markov Decision Process with:

  • State $s_t$: battery SOC, cumulative distance $D_t$, electric machine power demand $P_{EM,t}$
  • Action $a_t$: engine speed $\omega_{eng,t}$ and torque $T_{eng,t}$
  • Transition $p(s_{t+1}\mid s_t,a_t)$: energy-balance Simulink model
  • Reward $r_t$: a fuel term $r^{fuel}_t = -w_{fuel}\cdot \mathrm{FuelRate}(a_t)\cdot SOC_{init}^2$ plus an SOC term that penalizes or encourages deviations from the [15%, 85%] band using piecewise weights (a sketch of such a reward follows this list)
  • Inputs/outputs normalized to $[-1,1]$
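
As an illustration of this reward, the sketch below combines the fuel term with a piecewise SOC shaping term. The weight values, function names, and the exact piecewise scheme are assumptions for illustration only; they are not specified in the summary.

```python
# Hypothetical reward sketch for the MDP above (weights are placeholders).
W_FUEL = 1.0                      # assumed fuel-penalty weight
W_SOC = 10.0                      # assumed SOC-excursion weight
SOC_LOW, SOC_HIGH = 0.15, 0.85    # target SOC band from the text (as fractions)

def reward(fuel_rate: float, soc: float, soc_init: float) -> float:
    # Fuel term: r_fuel = -w_fuel * FuelRate(a_t) * SOC_init^2
    r_fuel = -W_FUEL * fuel_rate * soc_init ** 2
    # Piecewise SOC term: penalize excursions outside [15%, 85%].
    if soc < SOC_LOW:
        r_soc = -W_SOC * (SOC_LOW - soc)
    elif soc > SOC_HIGH:
        r_soc = -W_SOC * (soc - SOC_HIGH)
    else:
        r_soc = 0.0
    return r_fuel + r_soc
```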

Baseline SAC maximizes expected cumulative reward plus entropy:

$$J(\pi) = \sum_t \mathbb{E}_{(s_t,a_t)\sim\pi}\big[\,r(s_t,a_t) + \alpha\, H(\pi(\cdot\mid s_t))\,\big]$$

The actor and critic are trained via standard policy and Q-function losses (including double Q for overestimation bias) and a temperature loss that adapts α\alpha:

$$J_\pi(\theta) = \mathbb{E}_{s_t,\,a_t\sim\pi_\theta}\big[\,\alpha \log \pi_\theta(a_t\mid s_t) - Q_\phi(s_t,a_t)\,\big]$$

$$J_Q(\phi_i) = \mathbb{E}_{(s_t,a_t,r_t,s_{t+1}) \sim D}\Big[\big(Q_{\phi_i}(s_t,a_t) - y_t\big)^2\Big]$$

$$y_t = r_t + \gamma\,\mathbb{E}_{a_{t+1}\sim\pi_\theta}\Big[\min_{j=1,2} Q_{\bar\phi_j}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\theta(a_{t+1}\mid s_{t+1})\Big]$$
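
In code, the target $y_t$ and the two losses are typically computed as in the PyTorch sketch below. The `actor.sample` interface, the twin critics `q1`/`q2`, and the target critics are assumed module interfaces, not the paper's implementation.

```python
import torch

def critic_targets(actor, target_q1, target_q2, r, s_next, alpha, gamma=0.99):
    """Compute y_t with the double-Q minimum and the entropy bonus."""
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)            # a_{t+1} ~ pi_theta, with log-prob
        q_next = torch.min(target_q1(s_next, a_next),
                           target_q2(s_next, a_next))       # min over target critics
        return r + gamma * (q_next - alpha * logp_next)     # y_t

def sac_losses(actor, q1, q2, y, s, a, alpha):
    """Return (critic loss, actor loss) for one sampled batch."""
    q_loss = ((q1(s, a) - y) ** 2).mean() + ((q2(s, a) - y) ** 2).mean()
    a_new, logp = actor.sample(s)                           # fresh actions for the policy loss
    pi_loss = (alpha * logp - torch.min(q1(s, a_new), q2(s, a_new))).mean()
    return q_loss, pi_loss
```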

3. Sequence-Aware Architectures

T-SAC augments the SAC framework by replacing feedforward actor and critic mappings with temporal sequence modeling modules:

GRU-Augmented Actor and Critic

  • Actor: 2-layer GRU with hidden state $h_t$; the input is the concatenation $[s_t, a_{t-1}]$ (a minimal sketch follows this list). At each step:
    • $h_t = \mathrm{GRUCell}([s_t, a_{t-1}],\, h_{t-1})$
    • $\mu_t, \log \sigma_t = \mathrm{Linear}(h_t)$
    • Action sampled $a_t \sim \mathcal{N}(\mu_t, \sigma_t)$, then tanh-squashed
  • Critic: 2-layer GRU; input $[s_i, a_i]$ for each time index $i$ in a sequence; outputs $Q_i$ for each $i$.
  • Loss: the critic loss is the sum of mean-squared errors across all steps in the sampled sequence.
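
A minimal PyTorch sketch of such a GRU actor is shown below, assuming 128 hidden units per layer and a tanh-squashed Gaussian head; the class name, the log-std clamp, and other details are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class GRUActor(nn.Module):
    """Sequence-aware actor: 2-layer GRU over [s_t, a_{t-1}] with a Gaussian head."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRU(state_dim + action_dim, hidden,
                          num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 2 * action_dim)        # mu_t and log sigma_t

    def forward(self, states, prev_actions, h0=None):
        x = torch.cat([states, prev_actions], dim=-1)        # concatenate [s_t, a_{t-1}]
        out, h_n = self.gru(x, h0)                           # (batch, seq, hidden)
        mu, log_std = self.head(out).chunk(2, dim=-1)
        std = log_std.clamp(-5, 2).exp()                     # assumed clamp range
        raw = torch.distributions.Normal(mu, std).rsample()  # reparameterized sample
        return torch.tanh(raw), h_n                          # tanh-squashed action in [-1, 1]
```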

Decision Transformer (DT) Actor and Critic

  • Tokens: each token consists of $(R_j, s_j, a_j)$, i.e., return-to-go, state, action.
  • Embedding layer: projects tokens to $d_{model} = 128$ with positional encoding.
  • Transformer: single-layer, 4-head causal self-attention; outputs action parameters.
  • Critic: predicts the next return $R_{t+1}$ given tokens up to the current time; the loss is the MSE between the forecast and the target return.

Sequence length $k$ is a key hyperparameter: GRUs use $k=10$ (effective for short-term dependencies), while DTs use $k=100$ (for long-range context). A minimal sketch of a DT-style actor appears below.
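
For illustration, a DT-style actor along these lines might look like the following sketch: (return-to-go, state, action) tokens are embedded into $d_{model}=128$, passed through a single 4-head causal self-attention layer, and actions are read off the state-token positions. The module names and the per-token positional encoding are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class DTActor(nn.Module):
    """Decision-Transformer-style actor over (R_j, s_j, a_j) token triples."""
    def __init__(self, state_dim: int, action_dim: int, d_model: int = 128, max_len: int = 100):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(action_dim, d_model)
        self.pos = nn.Embedding(3 * max_len, d_model)              # per-token positional encoding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)  # single layer
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, rtg, states, actions):
        B, k, _ = states.shape
        # Interleave (R_j, s_j, a_j) along the sequence axis -> length 3k.
        tokens = torch.stack([self.embed_rtg(rtg),
                              self.embed_state(states),
                              self.embed_action(actions)], dim=2).reshape(B, 3 * k, -1)
        tokens = tokens + self.pos(torch.arange(3 * k, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(3 * k).to(tokens.device)
        h = self.encoder(tokens, mask=mask)                        # causal self-attention
        return torch.tanh(self.head(h[:, 1::3]))                   # actions from state tokens
```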

4. Algorithmic Workflow and Training Considerations

High-level training pseudocode for T-SAC:

  1. Initialize networks ($\pi_\theta$, $Q_{\phi_1}$, $Q_{\phi_2}$), their targets, and the replay buffer $D$.
  2. At each environment step $t$:
    • For GRU/DT: if $t < k$, pad the start of the sequence.
    • Extract the sequence $\tau_t = \{ s_{t-k:t},\, a_{t-k:t-1},\, [R_j]\ \text{for DT}\}$.
    • Compute the policy action $a_t$ via the GRU or transformer given $\tau_t$.
    • Step the environment, observe $r_t$, $s_{t+1}$, and store the trajectory in $D$.
  3. Each update:
    • Sample $N$ sequences (GRU/DT) or transitions (FFN) from $D$ (see the sampling sketch after this list).
    • Compute critic loss over sequence (GRU/DT) or per transition (FFN).
    • Compute actor loss and temperature loss.
    • Apply gradient update; softly update target Q.
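
The sequence-sampling step in point 3 could look like the sketch below, where `buffer` is assumed to hold whole trajectories as lists of (s, a, r, s_next) tuples and sequences shorter than $k$ are padded by repeating the first transition; the padding scheme is not specified in the summary.

```python
import random

def sample_sequences(buffer, batch_size: int = 64, k: int = 10):
    """Sample N length-k subsequences from stored trajectories (illustrative)."""
    batch = []
    for _ in range(batch_size):
        traj = random.choice(buffer)                 # pick a stored episode
        end = random.randint(1, len(traj))           # random end index within the episode
        seq = traj[max(0, end - k):end]
        if len(seq) < k:                             # pad the start when t < k
            seq = [seq[0]] * (k - len(seq)) + seq    # assumed: repeat first transition
        batch.append(seq)
    return batch
```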

Primary hyperparameters:

| Hyperparameter | FFN | GRU | DT |
|---|---|---|---|
| Learning rate (actor/critic) | $1\times10^{-4}$ | $1\times10^{-4}$ | $1\times10^{-4}$ |
| Batch size $N$ | 64 | 64 | 64 |
| Sequence length $k$ | – | 10 | 100 |
| Actor/critic architecture | [128, 128] MLP | 2-layer GRU (128) | 1-layer Transformer (128) |
| Entropy $\alpha$ auto-tuned | Yes | Yes | Yes |
| Gradient clipping | None | 0.25 | None |
| Training frequency | every 5 env steps | every 25 env steps | every 50 env steps |
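
For reference, these settings can be collected into a single config object; the field names below are illustrative, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TSACConfig:
    lr: float = 1e-4                   # actor/critic learning rate (all variants)
    batch_size: int = 64               # N
    seq_len: Optional[int] = None      # k: None for FFN, 10 for GRU, 100 for DT
    width: int = 128                   # [128, 128] MLP, GRU(128), or d_model
    auto_alpha: bool = True            # entropy temperature auto-tuned
    grad_clip: Optional[float] = None  # 0.25 for GRU; None for FFN and DT
    update_every: int = 5              # env steps between updates (5 / 25 / 50)

ffn_cfg = TSACConfig()
gru_cfg = TSACConfig(seq_len=10, grad_clip=0.25, update_every=25)
dt_cfg = TSACConfig(seq_len=100, update_every=50)
```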

5. Empirical Results and Comparative Performance

Performance is benchmarked on Highway Fuel Economy Test (HFET), US06, and Heavy Heavy-Duty Diesel Truck (HHDDT) cycles using a MATLAB/Simulink forward model for the series HEV system. Dynamic Programming (DP) provides a strong baseline for minimum fuel penalty.

Main Performance Table

| Cycle | Metric | DP | FFN SAC | ΔFFN (%) | GRU SAC | ΔGRU (%) | DT SAC | ΔDT (%) |
|---|---|---|---|---|---|---|---|---|
| HFET | MPG | 23.71 | 20.73 | –12.6 | 21.07 | –11.1 | 21.68 | –8.5 |
| HFET | SOC$_f$ | 15.55% | 15.81% | +1.7 | 15.10% | –2.9 | 15.38% | –1.1 |
| US06 | MPG | 4.63 | 4.27 | –7.7 | 4.43 | –4.2 | 4.04 | –12.7 |
| US06 | SOC$_f$ | 16.44% | 14.67% | –10.7 | 15.63% | –4.9 | 17.58% | +6.9 |
| HHDDT | MPG | 21.83 | 18.82 | –13.8 | 19.04 | –12.8 | 20.75 | –4.9 |
| HHDDT | SOC$_f$ | 16.45% | 17.29% | +5.1 | 15.59% | –5.2 | 15.23% | –7.4 |
  • On HFET, DT-SAC is within 1.8% of the DP baseline in fuel consumption; GRU-SAC is within 3.16%, and FFN-SAC within 3.43%.
  • On held-out US06 and HHDDT cycles, sequence-aware agents (GRU, DT) exhibited superior generalization, with DT-SAC showing the best alignment with the DP reference.
  • Ablation studies confirm that sequence-aware architectures train faster and are more robust to variations in SOC, drive cycle length, and input sequence parameters.

6. Significance of Sequence Modeling Approaches

Sequence modeling is essential in engine control tasks due to the propagation of state (especially SOC) and the stochasticity in power demand. Key distinctions:

  • Feedforward networks: State/action evaluation is i.i.d.; fails to leverage sequential dependencies in system dynamics.
  • GRUs: Efficiently encode short- and mid-term temporal dependencies via hidden-state recurrence; performance saturates with longer context windows.
  • Decision Transformers: utilize self-attention across long sequences, enabling the agent to capture complex, potentially long-horizon cause-effect relations ("return-to-go" context modeling). DTs incur higher sample and computational demands, and can introduce output variability ("jitter") over long sequences.

Potential architectural extensions include prioritized/sequential replay, n-step returns in transformer-critics, LSTM or Temporal Convolutional Networks, world models for model-based SAC, and policy distillation for real-time embedded inference.

7. Extensions, Applications, and Outlook

T-SAC's modular sequence-aware enhancements to the classical SAC agent are directly applicable to any RL domain that exhibits strong temporal dependencies and variable outcome propagation, beyond powertrain control—for example, in robotics, process control, and networked systems. Efficacy is contingent on matching sequence model complexity (e.g., GRU vs DT) to scenario-specific memory requirements and hardware constraints.

Given T-SAC’s robust empirical generalization under unseen drive cycles and stochastic battery initializations, a plausible implication is its potential utility in production-ready HEV controllers subject to regulatory fuel economy constraints and hardware-in-the-loop adaptation. Ongoing research directions include exploring memory-augmented networks, cross-modal context fusion, and distillation for embedded real-time deployment (Jaleel et al., 6 Aug 2025).
