T-SAC: Sequence-Aware Soft Actor-Critic
- T-SAC is a reinforcement learning algorithm that incorporates temporal sequence modeling, using GRUs or transformers to capture historical context.
- It improves performance, robustness, and generalization in complex sequential decision-making tasks, notably for engine control in hybrid vehicles.
- By replacing standard feedforward networks with sequence-aware architectures, T-SAC enables efficient handling of long-horizon dependencies and dynamic system variations.
T-SAC (Sequence-Aware Soft Actor-Critic) refers to a class of algorithms that enhance the Soft Actor-Critic (SAC) framework for reinforcement learning (RL) by equipping both actor and critic networks with explicit mechanisms for temporal sequence modeling. The explicit goal is to improve performance, robustness, and generalization in challenging real-world sequential decision-making tasks—most notably, engine control optimization in hybrid electric vehicle (HEV) powertrains—by leveraging either recurrent neural networks or attention-based models to capture temporal dependencies in the policy and value functions (Jaleel et al., 6 Aug 2025).
1. Problem Setting and Motivation
T-SAC targets control problems that exhibit strong temporal dependencies, as is typical in electrified powertrain management for series-hybrid electric vehicles. These systems require coordinated decision-making over extended horizons to optimize fuel consumption and maintain desirable battery state-of-charge (SOC) across variable operating conditions. The high-dimensional, continuous control space (e.g., engine speed and torque) compounds the challenge, and the returns at any step depend non-trivially on the sequence of prior states and actions. Standard SAC algorithms (using feedforward networks as encoders) fail to retain adequate temporal context, treating each observation as i.i.d., and thus often underperform in such settings.
In T-SAC, the SAC algorithm is augmented such that both the policy ("actor") and value function ("critic") networks are sequence-aware: specifically, by integrating Gated Recurrent Units (GRUs) or Decision Transformers (DTs). This enables the networks to encode and reason over relevant historical and predictive context (Jaleel et al., 6 Aug 2025).
2. Markov Decision Process Formulation and Losses
The engine control problem is posed as an infinite-horizon Markov Decision Process with:
- State $s_t$: SOC, cumulative distance traveled, electric machine power demand
- Action $a_t$: engine speed and torque
- Transition $p(s_{t+1} \mid s_t, a_t)$: energy-balance Simulink model
- Reward $r_t$: penalizes fuel rate and penalizes/encourages SOC based on deviations from [15%, 85%] using piecewise weights (see the sketch after this list)
- Inputs/outputs normalized to $[-1, 1]$
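To make the piecewise SOC shaping concrete, here is a minimal Python sketch; the weight values `w_fuel` and `w_soc` and the exact functional form are illustrative assumptions, not the paper's calibrated reward:

```python
def reward(fuel_rate: float, soc: float,
           w_fuel: float = 1.0, w_soc: float = 10.0) -> float:
    """Hypothetical piecewise reward: penalize fuel use and SOC
    excursions outside the [0.15, 0.85] band (weights illustrative)."""
    r = -w_fuel * fuel_rate          # always penalize fuel consumption
    if soc < 0.15:                   # heavy penalty below the lower bound
        r -= w_soc * (0.15 - soc)
    elif soc > 0.85:                 # heavy penalty above the upper bound
        r -= w_soc * (soc - 0.85)
    return r
```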
Baseline SAC maximizes expected cumulative reward plus entropy:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t} \gamma^{t}\big(r(s_t, a_t) + \alpha\,\mathcal{H}(\pi(\cdot \mid s_t))\big)\right]$$

The actor and critic are trained via standard policy and Q-function losses (including double Q-learning to mitigate overestimation bias) and a temperature loss that adapts $\alpha$ toward a target entropy $\bar{\mathcal{H}}$:

$$J(\alpha) = \mathbb{E}_{a_t \sim \pi}\big[-\alpha \log \pi(a_t \mid s_t) - \alpha\,\bar{\mathcal{H}}\big]$$
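A condensed PyTorch sketch of these standard SAC losses (the actor/critic classes and the replay batch are assumed; this mirrors the generic SAC update, not code released with the paper):

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, actor, q1, q2, q1_targ, q2_targ,
               log_alpha, target_entropy, gamma=0.99):
    """Standard SAC losses: clipped double-Q critic, entropy-regularized
    actor, and temperature loss adapting alpha (illustrative sketch)."""
    s, a, r, s2, done = batch
    alpha = log_alpha.exp()

    # Critic target: r + gamma * (min Q_targ(s', a') - alpha * log pi(a'|s'))
    with torch.no_grad():
        a2, logp2 = actor.sample(s2)
        q_next = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        target = r + gamma * (1 - done) * (q_next - alpha * logp2)
    critic_loss = F.mse_loss(q1(s, a), target) + F.mse_loss(q2(s, a), target)

    # Actor: maximize min-Q plus entropy bonus
    a_new, logp = actor.sample(s)
    q_new = torch.min(q1(s, a_new), q2(s, a_new))
    actor_loss = (alpha.detach() * logp - q_new).mean()

    # Temperature: drive policy entropy toward the target entropy
    alpha_loss = -(log_alpha * (logp + target_entropy).detach()).mean()
    return critic_loss, actor_loss, alpha_loss
```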
3. Sequence-Aware Architectures
T-SAC augments the SAC framework by replacing feedforward actor and critic mappings with temporal sequence modeling modules:
GRU-Augmented Actor and Critic
- Actor: 2-layer GRU (hidden size 128, per the hyperparameter table below); the input at each step is the concatenated observation vector.
- At each step, the recurrent hidden state is updated as $h_t = \mathrm{GRU}(s_t, h_{t-1})$.
- Action sampled from a tanh-squashed Gaussian parameterized by $h_t$.
- Critic: 2-layer GRU; input $(s_k, a_k)$ for each time index $k$ in a sequence; outputs $Q(s_k, a_k)$ for each $k$.
- Loss: critic loss is the sum of mean-squared errors across all steps in the sampled sequence (a minimal actor sketch follows this list).
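A minimal PyTorch sketch of the GRU-augmented actor, assuming observation dimension `obs_dim` and action dimension `act_dim` (layer sizes follow the hyperparameter table below; everything else is illustrative):

```python
import math
import torch
import torch.nn as nn

class GRUActor(nn.Module):
    """Sequence-aware SAC actor: a 2-layer GRU encodes the observation
    history; a Gaussian head produces tanh-squashed actions (sketch)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden, num_layers=2, batch_first=True)
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, obs_seq, h0=None):
        # obs_seq: (batch, seq_len, obs_dim); use the final step's output
        out, hN = self.gru(obs_seq, h0)
        feat = out[:, -1]                      # context at the current step
        mu, log_std = self.mu(feat), self.log_std(feat).clamp(-20, 2)
        std = log_std.exp()
        eps = torch.randn_like(mu)
        action = torch.tanh(mu + std * eps)    # reparameterized, squashed
        # Gaussian log-prob with the standard tanh change-of-variables term
        logp = (-0.5 * eps.pow(2) - log_std - 0.5 * math.log(2 * math.pi)).sum(-1)
        logp -= torch.log(1 - action.pow(2) + 1e-6).sum(-1)
        return action, logp, hN
```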
Decision Transformer (DT) Actor and Critic
- Tokens: Each token consists of the triple $(\hat{R}_t, s_t, a_t)$, i.e., return-to-go, state, action.
- Embedding layer: Projects tokens to a 128-dimensional embedding with positional encoding.
- Transformer: single-layer, 4-head causal self-attention; outputs action parameters.
- Critic: predicts next return given tokens up to current time; loss is MSE between forecast and target return.
Sequence length is a key hyperparameter: GRUs use $N = 10$ (effective for short-term dependencies), while DTs use $N = 100$ (for long-range context); a DT actor sketch follows below.
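A compact PyTorch sketch of the Decision-Transformer-style actor (token layout per the description above; dimensions and head count follow the hyperparameter table; the class and its internals are illustrative assumptions):

```python
import torch
import torch.nn as nn

class DTActor(nn.Module):
    """DT-style actor: embeds (return-to-go, state, action) tokens,
    applies causal self-attention, and regresses actions (sketch)."""
    def __init__(self, obs_dim: int, act_dim: int, d_model: int = 128,
                 n_heads: int = 4, max_len: int = 100):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_obs = nn.Linear(obs_dim, d_model)
        self.embed_act = nn.Linear(act_dim, d_model)
        self.pos = nn.Embedding(3 * max_len, d_model)   # positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(d_model, act_dim)

    def forward(self, rtg, obs, act):
        # rtg: (B, T, 1), obs: (B, T, obs_dim), act: (B, T, act_dim)
        B, T, _ = obs.shape
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_obs(obs), self.embed_act(act)],
            dim=2).reshape(B, 3 * T, -1)                # interleave R, s, a
        tokens = tokens + self.pos(torch.arange(3 * T, device=obs.device))
        # Causal mask: each token attends only to earlier positions
        mask = torch.triu(torch.full((3 * T, 3 * T), float('-inf'),
                                     device=obs.device), diagonal=1)
        h = self.encoder(tokens, mask=mask)
        return torch.tanh(self.head(h[:, 1::3]))        # act from state tokens
```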
4. Algorithmic Workflow and Training Considerations
High-level training pseudocode for T-SAC (a Python rendering follows the list):
- Initialize networks ($\pi_\theta$, $Q_{\phi_1}$, $Q_{\phi_2}$), target networks, and replay buffer $\mathcal{D}$.
- At each env step $t$:
- For GRU/DT: if $t < N$, pad the sequence start.
- Extract the most recent length-$N$ sequence from the ongoing trajectory.
- Compute policy action via GRU or transformer given the extracted sequence.
- Step environment, observe $r_t$, $s_{t+1}$, store the transition in $\mathcal{D}$.
- Each update:
- Sample sequences (GRU/DT) or transitions (FFN) from $\mathcal{D}$.
- Compute critic loss over sequence (GRU/DT) or per transition (FFN).
- Compute actor loss and temperature loss.
- Apply gradient update; softly update target Q.
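A skeletal Python rendering of this loop, assuming the actor and update function above, a gymnasium-style environment API, and hypothetical buffer helpers `store`/`sample_sequences` (hyperparameter values follow the table below):

```python
from collections import deque

def train_tsac(env, actor, update_fn, buffer, seq_len=10,
               update_every=25, batch_size=64, total_steps=100_000):
    """Sketch of the T-SAC training loop with sequence-based replay."""
    history = deque(maxlen=seq_len)        # rolling window of observations
    obs, _ = env.reset()
    for t in range(total_steps):
        history.append(obs)
        # Pad the sequence start until enough history has accumulated
        seq = [history[0]] * (seq_len - len(history)) + list(history)
        action = actor.act(seq)            # GRU/DT consumes the sequence
        obs2, reward, terminated, truncated, _ = env.step(action)
        buffer.store(obs, action, reward, obs2, terminated)
        obs = obs2
        if terminated or truncated:
            obs, _ = env.reset()
            history.clear()
        if t % update_every == 0 and len(buffer) >= batch_size:
            batch = buffer.sample_sequences(batch_size, seq_len)
            update_fn(batch)               # critic, actor, alpha + soft target
```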
Primary hyperparameters:
| Hyperparameter | FFN | GRU | DT |
|---|---|---|---|
| Learning rate (actor/critic) | | | |
| Batch size | 64 | 64 | 64 |
| Sequence length | – | 10 | 100 |
| Actor/critic architecture | [128,128] | 2-layer GRU(128) | 1-layer Transformer (128) |
| Entropy (auto-tuned) | Yes | Yes | Yes |
| Gradient clipping | None | $0.25$ | None |
| Training frequency (env steps) | every 5 | every 25 | every 50 |
5. Empirical Results and Comparative Performance
Performance is benchmarked on Highway Fuel Economy Test (HFET), US06, and Heavy Heavy-Duty Diesel Truck (HHDDT) cycles using a MATLAB/Simulink forward model for the series HEV system. Dynamic Programming (DP) provides a strong baseline for minimum fuel penalty.
Main Performance Table
| Cycle | Metric | DP | FFN SAC | ΔFFN % | GRU SAC | ΔGRU % | DT SAC | ΔDT % |
|---|---|---|---|---|---|---|---|---|
| HFET | MPG | 23.71 | 20.73 | –12.6 | 21.07 | –11.1 | 21.68 | –8.5 |
| HFET | SOC | 15.55% | 15.81% | +1.7 | 15.10% | –2.9 | 15.38% | –1.1 |
| US06 | MPG | 4.63 | 4.27 | –7.7 | 4.43 | –4.2 | 4.04 | –12.7 |
| US06 | SOC | 16.44% | 14.67% | –10.7 | 15.63% | –4.9 | 17.58% | +6.9 |
| HHDDT | MPG | 21.83 | 18.82 | –13.8 | 19.04 | –12.8 | 20.75 | –4.9 |
| HHDDT | SOC | 16.45% | 17.29% | +5.1 | 15.59% | –5.2 | 15.23% | –7.4 |
- On HFET, DT-SAC is within 1.8% fuel consumption of the DP baseline. GRU-SAC is within 3.16%, and FFN within 3.43%.
- On held-out US06 and HHDDT cycles, sequence-aware agents (GRU, DT) exhibited superior generalization, with DT-SAC showing the best alignment with the DP reference.
- Ablation studies confirm that sequence-aware architectures train faster and are more robust to variations in SOC, drive cycle length, and input sequence parameters.
6. Significance of Sequence Modeling Approaches
Sequence modeling is essential in engine control tasks due to the propagation of state (especially SOC) and the stochasticity in power demand. Key distinctions:
- Feedforward networks: State/action evaluation is effectively i.i.d.; this fails to leverage sequential dependencies in system dynamics.
- GRUs: Efficiently encode short- and mid-term temporal dependencies via hidden-state recurrence; performance saturates with longer context windows.
- Decision Transformers: Utilize self-attention across long sequences, enabling the agent to capture complex, potentially long-horizon cause-effect relations ("return-to-go" context modeling). DTs incur higher sample complexity and computational cost, and can introduce output variability ("jitter") over long sequences.
Potential architectural extensions include prioritized/sequential replay, n-step returns in transformer critics, LSTMs or temporal convolutional networks, world models for model-based SAC, and policy distillation for real-time embedded inference.
7. Extensions, Applications, and Outlook
T-SAC's modular sequence-aware enhancements to the classical SAC agent are directly applicable to any RL domain that exhibits strong temporal dependencies and variable outcome propagation, beyond powertrain control—for example, in robotics, process control, and networked systems. Efficacy is contingent on matching sequence model complexity (e.g., GRU vs DT) to scenario-specific memory requirements and hardware constraints.
Given T-SAC’s robust empirical generalization under unseen drive cycles and stochastic battery initializations, a plausible implication is its potential utility in production-ready HEV controllers subject to regulatory fuel economy constraints and hardware-in-the-loop adaptation. Ongoing research directions include exploring memory-augmented networks, cross-modal context fusion, and distillation for embedded real-time deployment (Jaleel et al., 6 Aug 2025).