
T-SAC: Sequence-Aware Soft Actor-Critic

Updated 1 January 2026
  • T-SAC is a reinforcement learning algorithm that incorporates temporal sequence modeling using GRUs or transformers to capture historical context.
  • It improves performance, robustness, and generalization in complex sequential decision-making tasks, notably for engine control in hybrid vehicles.
  • By replacing standard feedforward networks with sequence-aware architectures, T-SAC enables efficient handling of long-horizon dependencies and dynamic system variations.

T-SAC (Sequence-Aware Soft Actor-Critic) refers to a class of algorithms that enhance the Soft Actor-Critic (SAC) framework for reinforcement learning (RL) by equipping both actor and critic networks with explicit mechanisms for temporal sequence modeling. The explicit goal is to improve performance, robustness, and generalization in challenging real-world sequential decision-making tasks—most notably, engine control optimization in hybrid electric vehicle (HEV) powertrains—by leveraging either recurrent neural networks or attention-based models to capture temporal dependencies in the policy and value functions (Jaleel et al., 6 Aug 2025).

1. Problem Setting and Motivation

T-SAC targets control problems that exhibit strong temporal dependencies, as is typical in electrified powertrain management for series-hybrid electric vehicles. These systems require coordinated decision-making over extended horizons to optimize fuel consumption and maintain desirable battery state-of-charge (SOC) across variable operating conditions. The high-dimensional, continuous control space (e.g., engine speed and torque) compounds the challenge, and the returns at any step depend non-trivially on the sequence of prior states and actions. Standard SAC algorithms (using feedforward networks as encoders) fail to retain adequate temporal context, treating each observation as i.i.d., and thus often underperform in such settings.

In T-SAC, the SAC algorithm is augmented such that both the policy ("actor") and value function ("critic") networks are sequence-aware: specifically, by integrating Gated Recurrent Units (GRUs) or Decision Transformers (DTs). This enables the networks to encode and reason over relevant historical and predictive context (Jaleel et al., 6 Aug 2025).

2. Markov Decision Process Formulation and Losses

The engine control problem is posed as an infinite-horizon Markov Decision Process with:

  • State $s_t$: battery SOC, cumulative distance $D_t$, electric machine power demand $P_{EM,t}$
  • Action $a_t$: engine speed $\omega_{eng,t}$ and torque $T_{eng,t}$
  • Transition $p(s_{t+1}\mid s_t,a_t)$: energy-balance Simulink model
  • Reward $r_t$: a fuel term $r^{fuel}_t = -w_{fuel}\cdot \mathrm{FuelRate}(a_t)\cdot SOC_{init}^2$ plus an SOC term that penalizes or encourages deviations from the [15%, 85%] band using piecewise weights (a sketch of such a reward follows this list)
  • Inputs/outputs normalized to $[-1,1]$
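
As an illustration of this reward, the sketch below combines the fuel term with a piecewise SOC shaping term. The weight values, function names, and the exact piecewise scheme are assumptions for illustration only; they are not specified in the summary.

```python
# Hypothetical reward sketch for the MDP above (weights are placeholders).
W_FUEL = 1.0                      # assumed fuel-penalty weight
W_SOC = 10.0                      # assumed SOC-excursion weight
SOC_LOW, SOC_HIGH = 0.15, 0.85    # target SOC band from the text (as fractions)

def reward(fuel_rate: float, soc: float, soc_init: float) -> float:
    # Fuel term: r_fuel = -w_fuel * FuelRate(a_t) * SOC_init^2
    r_fuel = -W_FUEL * fuel_rate * soc_init ** 2
    # Piecewise SOC term: penalize excursions outside [15%, 85%].
    if soc < SOC_LOW:
        r_soc = -W_SOC * (SOC_LOW - soc)
    elif soc > SOC_HIGH:
        r_soc = -W_SOC * (soc - SOC_HIGH)
    else:
        r_soc = 0.0
    return r_fuel + r_soc
```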

Baseline SAC maximizes expected cumulative reward plus entropy:

$$J(\pi) = \sum_t \mathbb{E}_{(s_t,a_t)\sim\pi}\big[\,r(s_t,a_t) + \alpha\, H(\pi(\cdot\mid s_t))\,\big]$$

The actor and critic are trained via standard policy and Q-function losses (including double Q for overestimation bias) and a temperature loss that adapts α\alpha:

$$J_\pi(\theta) = \mathbb{E}_{s_t,\,a_t\sim\pi_\theta}\big[\,\alpha \log \pi_\theta(a_t\mid s_t) - Q_\phi(s_t,a_t)\,\big]$$

$$J_Q(\phi_i) = \mathbb{E}_{(s_t,a_t,r_t,s_{t+1}) \sim D}\Big[\big(Q_{\phi_i}(s_t,a_t) - y_t\big)^2\Big]$$

$$y_t = r_t + \gamma\,\mathbb{E}_{a_{t+1}\sim\pi_\theta}\Big[\min_{j=1,2} Q_{\bar\phi_j}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\theta(a_{t+1}\mid s_{t+1})\Big]$$
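
In code, the target $y_t$ and the two losses are typically computed as in the PyTorch sketch below. The `actor.sample` interface, the twin critics `q1`/`q2`, and the target critics are assumed module interfaces, not the paper's implementation.

```python
import torch

def critic_targets(actor, target_q1, target_q2, r, s_next, alpha, gamma=0.99):
    """Compute y_t with the double-Q minimum and the entropy bonus."""
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)            # a_{t+1} ~ pi_theta, with log-prob
        q_next = torch.min(target_q1(s_next, a_next),
                           target_q2(s_next, a_next))       # min over target critics
        return r + gamma * (q_next - alpha * logp_next)     # y_t

def sac_losses(actor, q1, q2, y, s, a, alpha):
    """Return (critic loss, actor loss) for one sampled batch."""
    q_loss = ((q1(s, a) - y) ** 2).mean() + ((q2(s, a) - y) ** 2).mean()
    a_new, logp = actor.sample(s)                           # fresh actions for the policy loss
    pi_loss = (alpha * logp - torch.min(q1(s, a_new), q2(s, a_new))).mean()
    return q_loss, pi_loss
```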

3. Sequence-Aware Architectures

T-SAC augments the SAC framework by replacing feedforward actor and critic mappings with temporal sequence modeling modules:

GRU-Augmented Actor and Critic

  • Actor: 2-layer GRU with hidden state $h_t$; the input is the concatenation $[s_t, a_{t-1}]$ (a minimal sketch follows this list). At each step:
    • $h_t = \mathrm{GRUCell}([s_t, a_{t-1}],\, h_{t-1})$
    • $\mu_t, \log \sigma_t = \mathrm{Linear}(h_t)$
    • Action sampled $a_t \sim \mathcal{N}(\mu_t, \sigma_t)$, then tanh-squashed
  • Critic: 2-layer GRU; input $[s_i, a_i]$ for each time index $i$ in a sequence; outputs $Q_i$ for each $i$.
  • Loss: the critic loss is the sum of mean-squared errors across all steps in the sampled sequence.
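
A minimal PyTorch sketch of such a GRU actor is shown below, assuming 128 hidden units per layer and a tanh-squashed Gaussian head; the class name, the log-std clamp, and other details are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class GRUActor(nn.Module):
    """Sequence-aware actor: 2-layer GRU over [s_t, a_{t-1}] with a Gaussian head."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRU(state_dim + action_dim, hidden,
                          num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 2 * action_dim)        # mu_t and log sigma_t

    def forward(self, states, prev_actions, h0=None):
        x = torch.cat([states, prev_actions], dim=-1)        # concatenate [s_t, a_{t-1}]
        out, h_n = self.gru(x, h0)                           # (batch, seq, hidden)
        mu, log_std = self.head(out).chunk(2, dim=-1)
        std = log_std.clamp(-5, 2).exp()                     # assumed clamp range
        raw = torch.distributions.Normal(mu, std).rsample()  # reparameterized sample
        return torch.tanh(raw), h_n                          # tanh-squashed action in [-1, 1]
```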

Decision Transformer (DT) Actor and Critic

  • Tokens: each token consists of $(R_j, s_j, a_j)$, i.e., return-to-go, state, action.
  • Embedding layer: projects tokens to $d_{model} = 128$ with positional encoding.
  • Transformer: single-layer, 4-head causal self-attention; outputs action parameters.
  • Critic: predicts the next return $R_{t+1}$ given tokens up to the current time; the loss is the MSE between the forecast and the target return.

Sequence length $k$ is a key hyperparameter: GRUs use $k=10$ (effective for short-term dependencies), while DTs use $k=100$ (for long-range context). A minimal sketch of a DT-style actor appears below.
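
For illustration, a DT-style actor along these lines might look like the following sketch: (return-to-go, state, action) tokens are embedded into $d_{model}=128$, passed through a single 4-head causal self-attention layer, and actions are read off the state-token positions. The module names and the per-token positional encoding are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class DTActor(nn.Module):
    """Decision-Transformer-style actor over (R_j, s_j, a_j) token triples."""
    def __init__(self, state_dim: int, action_dim: int, d_model: int = 128, max_len: int = 100):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(action_dim, d_model)
        self.pos = nn.Embedding(3 * max_len, d_model)              # per-token positional encoding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)  # single layer
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, rtg, states, actions):
        B, k, _ = states.shape
        # Interleave (R_j, s_j, a_j) along the sequence axis -> length 3k.
        tokens = torch.stack([self.embed_rtg(rtg),
                              self.embed_state(states),
                              self.embed_action(actions)], dim=2).reshape(B, 3 * k, -1)
        tokens = tokens + self.pos(torch.arange(3 * k, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(3 * k).to(tokens.device)
        h = self.encoder(tokens, mask=mask)                        # causal self-attention
        return torch.tanh(self.head(h[:, 1::3]))                   # actions from state tokens
```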

4. Algorithmic Workflow and Training Considerations

High-level training pseudocode for T-SAC:

  1. Initialize networks ($\pi_\theta$, $Q_{\phi_1}$, $Q_{\phi_2}$), their targets, and the replay buffer $D$.
  2. At each environment step $t$:
    • For GRU/DT: if $t < k$, pad the start of the sequence.
    • Extract the sequence $\tau_t = \{ s_{t-k:t},\, a_{t-k:t-1},\, [R_j]\ \text{for DT}\}$.
    • Compute the policy action $a_t$ via the GRU or transformer given $\tau_t$.
    • Step the environment, observe $r_t$, $s_{t+1}$, and store the trajectory in $D$.
  3. Each update:
    • Sample $N$ sequences (GRU/DT) or transitions (FFN) from $D$ (see the sampling sketch after this list).
    • Compute critic loss over sequence (GRU/DT) or per transition (FFN).
    • Compute actor loss and temperature loss.
    • Apply gradient update; softly update target Q.
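
The sequence-sampling step in point 3 could look like the sketch below, where `buffer` is assumed to hold whole trajectories as lists of (s, a, r, s_next) tuples and sequences shorter than $k$ are padded by repeating the first transition; the padding scheme is not specified in the summary.

```python
import random

def sample_sequences(buffer, batch_size: int = 64, k: int = 10):
    """Sample N length-k subsequences from stored trajectories (illustrative)."""
    batch = []
    for _ in range(batch_size):
        traj = random.choice(buffer)                 # pick a stored episode
        end = random.randint(1, len(traj))           # random end index within the episode
        seq = traj[max(0, end - k):end]
        if len(seq) < k:                             # pad the start when t < k
            seq = [seq[0]] * (k - len(seq)) + seq    # assumed: repeat first transition
        batch.append(seq)
    return batch
```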

Primary hyperparameters:

| Hyperparameter | FFN | GRU | DT |
|---|---|---|---|
| Learning rate (actor/critic) | $1\times10^{-4}$ | $1\times10^{-4}$ | $1\times10^{-4}$ |
| Batch size $N$ | 64 | 64 | 64 |
| Sequence length $k$ | – | 10 | 100 |
| Actor/critic architecture | [128, 128] MLP | 2-layer GRU (128) | 1-layer Transformer (128) |
| Entropy $\alpha$ auto-tuned | Yes | Yes | Yes |
| Gradient clipping | None | 0.25 | None |
| Training frequency | every 5 env steps | every 25 env steps | every 50 env steps |
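
For reference, these settings can be collected into a single config object; the field names below are illustrative, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TSACConfig:
    lr: float = 1e-4                   # actor/critic learning rate (all variants)
    batch_size: int = 64               # N
    seq_len: Optional[int] = None      # k: None for FFN, 10 for GRU, 100 for DT
    width: int = 128                   # [128, 128] MLP, GRU(128), or d_model
    auto_alpha: bool = True            # entropy temperature auto-tuned
    grad_clip: Optional[float] = None  # 0.25 for GRU; None for FFN and DT
    update_every: int = 5              # env steps between updates (5 / 25 / 50)

ffn_cfg = TSACConfig()
gru_cfg = TSACConfig(seq_len=10, grad_clip=0.25, update_every=25)
dt_cfg = TSACConfig(seq_len=100, update_every=50)
```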

5. Empirical Results and Comparative Performance

Performance is benchmarked on Highway Fuel Economy Test (HFET), US06, and Heavy Heavy-Duty Diesel Truck (HHDDT) cycles using a MATLAB/Simulink forward model for the series HEV system. Dynamic Programming (DP) provides a strong baseline for minimum fuel penalty.

Main Performance Table

| Cycle | Metric | DP | FFN SAC | ΔFFN (%) | GRU SAC | ΔGRU (%) | DT SAC | ΔDT (%) |
|---|---|---|---|---|---|---|---|---|
| HFET | MPG | 23.71 | 20.73 | –12.6 | 21.07 | –11.1 | 21.68 | –8.5 |
| HFET | SOC$_f$ | 15.55% | 15.81% | +1.7 | 15.10% | –2.9 | 15.38% | –1.1 |
| US06 | MPG | 4.63 | 4.27 | –7.7 | 4.43 | –4.2 | 4.04 | –12.7 |
| US06 | SOC$_f$ | 16.44% | 14.67% | –10.7 | 15.63% | –4.9 | 17.58% | +6.9 |
| HHDDT | MPG | 21.83 | 18.82 | –13.8 | 19.04 | –12.8 | 20.75 | –4.9 |
| HHDDT | SOC$_f$ | 16.45% | 17.29% | +5.1 | 15.59% | –5.2 | 15.23% | –7.4 |
  • On HFET, DT-SAC is within 1.8% of the DP baseline in fuel consumption; GRU-SAC is within 3.16%, and FFN-SAC within 3.43%.
  • On held-out US06 and HHDDT cycles, sequence-aware agents (GRU, DT) exhibited superior generalization, with DT-SAC showing the best alignment with the DP reference.
  • Ablation studies confirm that sequence-aware architectures train faster and are more robust to variations in SOC, drive cycle length, and input sequence parameters.

6. Significance of Sequence Modeling Approaches

Sequence modeling is essential in engine control tasks due to the propagation of state (especially SOC) and the stochasticity in power demand. Key distinctions:

  • Feedforward networks: State/action evaluation is i.i.d.; fails to leverage sequential dependencies in system dynamics.
  • GRUs: Efficiently encode short- and mid-term temporal dependencies via hidden-state recurrence; performance saturates with longer context windows.
  • Decision Transformers: utilize self-attention across long sequences, enabling the agent to capture complex, potentially long-horizon cause-effect relations ("return-to-go" context modeling). DTs incur higher sample and computational demands, and can introduce output variability ("jitter") over long sequences.

Potential architectural extensions include prioritized/sequential replay, n-step returns in transformer-critics, LSTM or Temporal Convolutional Networks, world models for model-based SAC, and policy distillation for real-time embedded inference.

7. Extensions, Applications, and Outlook

T-SAC's modular sequence-aware enhancements to the classical SAC agent are directly applicable to any RL domain that exhibits strong temporal dependencies and variable outcome propagation, beyond powertrain control—for example, in robotics, process control, and networked systems. Efficacy is contingent on matching sequence model complexity (e.g., GRU vs DT) to scenario-specific memory requirements and hardware constraints.

Given T-SAC’s robust empirical generalization under unseen drive cycles and stochastic battery initializations, a plausible implication is its potential utility in production-ready HEV controllers subject to regulatory fuel economy constraints and hardware-in-the-loop adaptation. Ongoing research directions include exploring memory-augmented networks, cross-modal context fusion, and distillation for embedded real-time deployment (Jaleel et al., 6 Aug 2025).
