
AutoDRL-IDS: Carbon-Aware DRL Intrusion Detection

Updated 30 November 2025
  • AutoDRL-IDS is a DRL-based intrusion detection system that uses LSTM sequence encoding and policy learning for adaptive, real-time cyberattack detection.
  • It combines supervised pretraining with techniques like DQN to achieve high detection accuracy and low false alarm rates in resource-constrained edge deployments.
  • The system employs a multi-objective reward design that balances traditional detection metrics with energy efficiency and reduced carbon emissions for sustainable operation.

AutoDRL-IDS denotes a class of Deep Reinforcement Learning-Intrusion Detection Systems that utilize supervised learning and sequence modeling (typically via LSTM encoders) for efficient, adaptive cyberattack detection under real-world constraints, notably in settings such as IoT edge gateways and cyber-physical infrastructures. The defining feature is the integration of temporally-aware feature encoding with policy learning, frequently enhanced by multi-objective optimization criteria that explicitly balance classical detection accuracy and sustainability metrics such as energy consumption and carbon emissions (Jamshidi et al., 23 Nov 2025, Al-Mehdhar et al., 2024).

1. Core Model Architecture

AutoDRL-IDS is fundamentally structured as a two-stage system: a feature-sequence encoder (typically a unidirectional LSTM) followed by a DRL agent, commonly instantiated as a Deep Q-Network (DQN) or an actor-critic method. The canonical LSTM encoder processes $d$-dimensional traffic features $\mathbf{f}_t \in \mathbb{R}^d$, propagating its hidden state $h_t \in \mathbb{R}^H$ through the cell dynamics:

$$
\begin{aligned}
i_t &= \sigma(W_i \mathbf{f}_t + U_i h_{t-1} + b_i) \\
f_t' &= \sigma(W_f \mathbf{f}_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o \mathbf{f}_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c \mathbf{f}_t + U_c h_{t-1} + b_c) \\
c_t &= f_t' \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $\sigma$ is the sigmoid activation, $\odot$ denotes element-wise multiplication, and $f_t'$ is the forget gate (primed to distinguish it from the input features $\mathbf{f}_t$). The hidden state $h_t$ serves both as an embedding for the supervised classifier and as a summary input to the DRL agent.

The DQN policy receives a state vector $s_t$ encompassing selected raw traffic metrics (e.g., packet rate, SYN/ACK counts), the LSTM hidden state $h_t$, and optionally an anomaly score $A_s = \|x_t - \hat{x}_t\|^2$ derived from an auxiliary autoencoder. The action space typically comprises discrete mitigation strategies (e.g., rate-limiting, packet dropping, IP blacklisting), suitable for real-time control on edge platforms.
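
The gate equations above can be sketched in a few lines of NumPy. All dimensions, weights, and the window length below are illustrative placeholders, not values from the cited papers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(f_t, h_prev, c_prev, params):
    """One LSTM cell step following the gate equations above.

    f_t: (d,) traffic-feature vector; h_prev, c_prev: (H,) states.
    params: weight matrices W_* (H, d), U_* (H, H), biases b_* (H,).
    """
    i = sigmoid(params["W_i"] @ f_t + params["U_i"] @ h_prev + params["b_i"])  # input gate
    f = sigmoid(params["W_f"] @ f_t + params["U_f"] @ h_prev + params["b_f"])  # forget gate
    o = sigmoid(params["W_o"] @ f_t + params["U_o"] @ h_prev + params["b_o"])  # output gate
    c_tilde = np.tanh(params["W_c"] @ f_t + params["U_c"] @ h_prev + params["b_c"])
    c = f * c_prev + i * c_tilde       # new cell state
    h = o * np.tanh(c)                 # new hidden state: the embedding fed to the DQN
    return h, c

# Toy dimensions (illustrative): d = 8 traffic features, H = 16 hidden units
rng = np.random.default_rng(0)
d, H = 8, 16
params = {
    f"{kind}_{g}": 0.1 * rng.standard_normal(
        (H, d) if kind == "W" else ((H, H) if kind == "U" else H))
    for g in "ifoc" for kind in ("W", "U", "b")
}
h, c = np.zeros(H), np.zeros(H)
for _ in range(5):                     # encode a 5-step window of traffic features
    h, c = lstm_step(rng.standard_normal(d), h, c, params)
```

The final `h` plays the role of $h_t$ in the state vector $s_t$ handed to the DRL agent.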

2. Supervised Pretraining and DRL Integration

Supervised pretraining of the LSTM encoder is conducted on labeled datasets (e.g., Bot-IoT), optimizing a binary cross-entropy loss: $\mathcal{L}_{\text{sup}} = -\frac{1}{N}\sum_t \left[ y_t \log \hat{y}_t + (1-y_t)\log(1-\hat{y}_t) \right]$, where $\hat{y}_t = \sigma(v^\top h_t + b)$. After convergence, the LSTM weights are either frozen or lightly fine-tuned during DRL policy learning.
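
As a minimal illustration, the pretraining objective can be written directly from the formulas above; the embeddings, labels, and head parameters below are synthetic stand-ins:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(H_states, y, v, b, eps=1e-12):
    """L_sup = -(1/N) * sum_t [y_t log y_hat_t + (1 - y_t) log(1 - y_hat_t)]."""
    y_hat = sigmoid(H_states @ v + b)          # y_hat_t = sigma(v^T h_t + b)
    y_hat = np.clip(y_hat, eps, 1 - eps)       # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
H_states = rng.standard_normal((32, 16))       # 32 LSTM embeddings h_t of size H = 16
y = rng.integers(0, 2, 32).astype(float)       # binary attack/benign labels
loss = bce_loss(H_states, y, v=np.zeros(16), b=0.0)
```

With an uninitialized head ($v = 0$, $b = 0$) every prediction is $0.5$, so the loss starts at $\log 2$, a useful sanity check before training.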

At runtime, feature extraction and encoding are executed online, and the DQN agent selects the optimal action for each time step given the augmented state by maximizing $Q(s_t, a; \theta_Q)$. Bellman updates are applied over experience replay buffers, with a target network for stability: $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta_Q^-)$

$\theta_Q \leftarrow \theta_Q - \eta \nabla_{\theta_Q}\left[ Q(s_t, a_t; \theta_Q) - y_t \right]^2$

Experiments typically use the Adam optimizer, batch sizes of 64, and replay buffers of up to 50,000 samples.
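
The replay-based Bellman update with a periodic target-network sync can be sketched as follows, using a linear Q-function as a lightweight stand-in for the DQN; all dimensions, transitions, and hyperparameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions, state_dim = 4, 10                   # e.g. {allow, rate-limit, drop, blacklist}
gamma, eta = 0.99, 1e-2
theta = 0.1 * rng.standard_normal((n_actions, state_dim))   # online network
theta_target = theta.copy()                                  # target network theta_Q^-

def q(th, s):
    return th @ s                               # Q-values for all actions at state s

# Synthetic replay buffer of (s, a, r, s') transitions
replay = [(rng.standard_normal(state_dim), int(rng.integers(n_actions)),
           float(rng.uniform(-1, 1)), rng.standard_normal(state_dim))
          for _ in range(64)]

for step in range(200):
    s, a, r, s_next = replay[rng.integers(len(replay))]      # sample a transition
    y = r + gamma * np.max(q(theta_target, s_next))          # Bellman target y_t
    td_error = q(theta, s)[a] - y
    theta[a] -= eta * 2.0 * td_error * s                     # gradient of (Q - y)^2
    if step % 50 == 0:
        theta_target = theta.copy()                          # periodic target sync
```

A real deployment replaces the linear `q` with the DQN head on top of the LSTM embedding and samples mini-batches rather than single transitions.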

3. Carbon-Aware and Multi-Objective Reward Design

AutoDRL-IDS incorporates a multi-objective reward function to enforce trade-offs between detection efficacy and sustainability. The reward at time $t$ is: $R_t = \alpha \cdot DR_t - \beta \cdot FPR_t - \lambda_L L_t - \delta E_t - \epsilon M_t - \zeta C_t$ with $DR_t$ (detection reward), $FPR_t$ (false positives), $L_t$ (latency), $E_t$ (energy cost), $M_t$ (memory utilization ratio), and $C_t$ (carbon emission) defined as:

  • $E_t = P_t \Delta t$ (energy, J)
  • $C_t = \kappa_t E_t$, with $\kappa_t$ the carbon intensity (gCO$_2$ per J)

A simplified formulation used in practice is: $R_t = \alpha_{\text{detect}} R_{\text{detect},t} + \beta_{\text{energy}} (-E_t) - \gamma C_t$. This structure allows practitioners to tune policy sensitivity according to the security and sustainability constraints of the deployment context (Jamshidi et al., 23 Nov 2025).
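
The reward terms above can be made concrete with a small helper; the coefficient values here are placeholders to be tuned per deployment, not the papers' weights:

```python
def reward(dr, fpr, latency_s, energy_j, mem_ratio, carbon_g,
           alpha=1.0, beta=0.5, lam=0.1, delta=0.01, eps=0.05, zeta=0.02):
    """R_t = alpha*DR_t - beta*FPR_t - lam*L_t - delta*E_t - eps*M_t - zeta*C_t."""
    return (alpha * dr - beta * fpr - lam * latency_s
            - delta * energy_j - eps * mem_ratio - zeta * carbon_g)

def energy_and_carbon(power_w, dt_s, kappa_g_per_j):
    """E_t = P_t * dt (J); C_t = kappa_t * E_t (gCO2)."""
    e = power_w * dt_s
    return e, kappa_g_per_j * e

# Example step: 4 W draw over a 0.5 s window, assumed grid carbon intensity
e, c = energy_and_carbon(power_w=4.0, dt_s=0.5, kappa_g_per_j=1e-4)
r = reward(dr=1.0, fpr=0.0, latency_s=0.5, energy_j=e, mem_ratio=0.5, carbon_g=c)
```

Raising `zeta` (or `delta`) pushes the learned policy toward cheaper mitigation actions at some cost in detection reward, which is exactly the trade-off the multi-objective design exposes.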

4. Application Domains and Deployment Results

AutoDRL-IDS implementations have demonstrated efficacy across domains. In EV charging station cyber-defense, the system couples a DRL adversary (LSTM or Transformer policy gradient agent) that synthesizes stealthy state-of-charge (SoC) attacks with a robust PPO-based IDS. A hierarchical adversarial training architecture is used: the attack generator produces adversarial datasets by maximizing its own utility against a fixed baseline IDS, while the IDS is trained to minimize detection loss on these challenging samples, yielding a saddle-point optimization. This scheme achieved near-perfect detector generalization: on real-world traces (536 taxis, 24 days), Transformer-based IDS models maintained $>99.5\%$ accuracy and $<0.5\%$ false-alarm rates, including under attacks not seen in training (Al-Mehdhar et al., 2024).

On Raspberry Pi 4/ESP32 gateways for IoT edge DDoS detection, the supervised LSTM-DQN AutoDRL-IDS reached $94.0\%$ accuracy, $91.7\%$ precision, $92.0\%$ recall, and $91.3\%$ F1, with a mean response time of $0.50$ s and significant reductions in energy ($\sim$20–30\%) and carbon ($\sim$10–15\%) versus traditional RL-IDS, at $0.03$ kgCO$_2$ per five-minute detection window (Jamshidi et al., 23 Nov 2025).

| Domain | Architecture | Accuracy | F1 | Notable Metrics |
| --- | --- | --- | --- | --- |
| EV charging cyberdefense | Transformer–PPO | 0.999 | 0.999 | $<0.5\%$ FAR, robust to unseen attacks |
| IoT edge gateway | LSTM–DQN | 0.94 | 0.913 | 0.03 kgCO$_2$/5 min, $<0.5$ s latency |

5. Comparative Analysis and Limitations

Relative to unsupervised DRL-IDS (e.g., DeepEdgeIDS), the supervised AutoDRL-IDS variant provides the highest-precision detection for attack types represented in labeled datasets. The LSTM-based temporal encoding enhances the ability to recognize evolving traffic patterns and supports smoother DRL policy updates. The introduction of carbon-aware rewards represents a distinct advance by explicitly incentivizing sustainability in operational deployments, crucial for edge platforms with limited energy budgets.

Limitations include reduced adaptability to zero-day attacks compared to unsupervised or self-supervised hybrids and reliance on labeled data, which may hinder transferability to new environments. Periodic DRL updates can incur resource overhead even in static regimes—a plausible implication is the benefit of event-triggered or adaptive DRL retraining strategies. Model compression (pruning, quantization) and federated DRL distribution are suggested avenues for future work to mitigate resource and scalability constraints (Jamshidi et al., 23 Nov 2025).

6. Implementation and Replication Details

Experimental configurations adopt standard open-source toolchains. LSTM pretraining uses the Adam optimizer ($\text{lr}=10^{-3}$, batch size $=64$, epochs $\approx$ 10–20). DRL training leverages DQN agents with $\varepsilon$-greedy exploration (annealed from 1.0 to 0.1), discount factor $\gamma=0.99$, a replay buffer of up to 50k samples, and periodic target-network updates. For adversarial EV-charging scenarios, training uses $N=500$ adversary episodes and $M\approx 200$ IDS episodes, with $T=48$ time slots per episode; the IDS is trained using SGD with learning rate $\alpha \in \{4\times10^{-3}, 4\times10^{-4}, 4\times10^{-5}\}$ (best at $4\times10^{-5}$) and PPO clipping $\epsilon=0.2$ (Al-Mehdhar et al., 2024).
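
The $\varepsilon$-greedy schedule can be sketched as a linear anneal from 1.0 to 0.1; the annealing horizon below is an assumed value for illustration:

```python
import numpy as np

def epsilon(step, eps_start=1.0, eps_end=0.1, anneal_steps=10_000):
    """Linear anneal of exploration rate from eps_start to eps_end."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, step, rng):
    """Epsilon-greedy action selection over the mitigation-action Q-values."""
    if rng.random() < epsilon(step):
        return int(rng.integers(len(q_values)))   # explore: random mitigation action
    return int(np.argmax(q_values))               # exploit: greedy action

rng = np.random.default_rng(0)
a = select_action(np.array([0.1, 0.9, 0.3]), step=10_000, rng=rng)
```

After the anneal completes, the agent keeps a residual 10% exploration rate, which is what lets the deployed policy keep probing for drifted traffic patterns.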

Synthetic dataset balancing (e.g., via ADASYN) is applied to ensure class parity. Hardware evaluations on Raspberry Pi 4 and ESP32 confirm real-time operation under resource and energy constraints for edge IoT use cases, with CPU utilization $\sim$25\% and memory $\sim$50\% (Jamshidi et al., 23 Nov 2025).

7. Broader Impact and Future Directions

AutoDRL-IDS extends the practical frontier of DRL for cyber-physical anomaly detection by jointly optimizing for detection quality and environmental sustainability—addressing a growing need in resource-constrained IoT infrastructures and cyber-physical systems. Carbon-aware objectives, lightweight LSTM encoding, and DQN policy architectures make it a candidate for broad adoption in green security frameworks.

Potential enhancements include incorporation of meta-learning or semi-supervised DRL to elevate adaptivity to evolving or unseen threats, federated training schemes for distributed IoT deployments, and rigorous model compression for extreme edge/embedded platforms. A plausible implication is that advances in reward shaping and multi-agent adversarial training will strengthen robustness and generalization of IDS policies against increasingly sophisticated attack strategies.

References:

  • "Charging Ahead: A Hierarchical Adversarial Framework for Counteracting Advanced Cyber Threats in EV Charging Stations" (Al-Mehdhar et al., 2024)
  • "Carbon-Aware Intrusion Detection: A Comparative Study of Supervised and Unsupervised DRL for Sustainable IoT Edge Gateways" (Jamshidi et al., 23 Nov 2025)
