DeepEdgeIDS: Unsupervised DRL for IoT Edge Security

Updated 30 November 2025

DeepEdgeIDS is an unsupervised deep reinforcement learning IDS that combines an autoencoder for anomaly detection with a deep Q-network for dynamic DDoS mitigation on IoT edge gateways.
It employs a multi-objective reward function that balances detection accuracy, response latency, energy consumption, and carbon emissions to rapidly adapt to novel attacks.
Experimental results on edge hardware show superior detection rates, reduced latency, and explicit carbon-aware operation compared to supervised DRL alternatives.

DeepEdgeIDS is an unsupervised Deep Reinforcement Learning (DRL)-based Intrusion Detection System (IDS) designed for sustainable and adaptive Distributed Denial-of-Service (DDoS) mitigation in Internet of Things (IoT) edge gateways. Combining an Autoencoder (AE) for anomaly detection with a Deep Q-Network (DQN) for real-time, carbon-aware mitigation policy selection, DeepEdgeIDS addresses the limitations of traditional IDS—including poor adaptability to zero-day threats, dependence on static signatures or labels, and neglect of sustainability metrics such as energy consumption and carbon footprint. Experimental results demonstrate that DeepEdgeIDS attains higher detection accuracy, faster adaptation to novel attacks, and explicit carbon-aware operation when compared to supervised DRL alternatives (Jamshidi et al., 23 Nov 2025).

1. System Architecture and Operational Pipeline

DeepEdgeIDS processes raw network traffic at the edge gateway by employing an unsupervised AE to encode each traffic flow $x_t$ into a latent vector $h_t$ using

$h_t = f^{\text{encoder}}(x_t), \quad \hat{x}_t = f^{\text{decoder}}(h_t).$

The AE reconstruction error serves as the anomaly score, $A_s = L_{AE} = \|x_t - \hat{x}_t\|^2$. Flows with $A_s > \tau_a$ (where $\tau_a$ is set at the 95th percentile of scores on benign validation traffic) are flagged as suspicious. These flows are subsequently processed by the DQN policy network, whose input state is

$s_t = \{\text{P}_{\text{rate}}, \text{SYN}_{\text{count}}, \text{ACK}_{\text{count}}, A_s, h_t\}.$

The DQN provides action values $Q(s_t,a)$ over four discrete mitigation actions:

$a_1$ = rate limit,
$a_2$ = SYN-throttling,
$a_3$ = drop packets,
$a_4$ = blacklist IPs.

Selected actions are enforced at the gateway via dynamic rule updates (e.g., iptables). Experience replay (buffer size: 50,000; batch size: 64) and $\epsilon$-greedy exploration support stable off-policy learning.
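The replay and exploration mechanics described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class and function names (`ReplayBuffer`, `epsilon_greedy`) and the action labels are hypothetical, and only the buffer size, batch size, and four-action layout come from the text.

```python
import random
from collections import deque

# Hypothetical labels for the four mitigation actions a_1..a_4.
ACTIONS = ["rate_limit", "syn_throttle", "drop_packets", "blacklist_ips"]


class ReplayBuffer:
    """Fixed-size FIFO store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity=50_000):
        # deque with maxlen silently evicts the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        # Uniform sampling without replacement; capped at the current size.
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0.0` the selection is purely greedy, so `ACTIONS[epsilon_greedy([0.1, 0.9, 0.3, 0.2], 0.0)]` picks the SYN-throttling action in this toy example.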

2. Autoencoder-based Anomaly Detection

The AE component is optimized to minimize reconstruction loss:

$L_{AE} = \|x_t - \hat{x}_t\|^2.$

During live operation, $A_s$ is computed for each traffic flow. An event is deemed anomalous (i.e., DDoS suspected) if $A_s > \tau_a$. This unsupervised anomaly detection scheme enables robust identification of zero-day patterns and non-stationary attack behaviors without dependence on labeled data.
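As a rough sketch of the scoring and thresholding step, assuming the AE's reconstruction $\hat{x}_t$ is already available as a list of floats: the helper names below are hypothetical, and the linear-interpolation percentile is one common convention, since the source does not say which percentile definition was used.

```python
def anomaly_score(x, x_hat):
    """Squared L2 reconstruction error A_s = ||x - x_hat||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in [0, 100])."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence is undefined")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def fit_threshold(benign_scores, q=95.0):
    """tau_a: the q-th percentile of anomaly scores on benign validation traffic."""
    return percentile(benign_scores, q)


def is_suspicious(x, x_hat, tau_a):
    """Flag a flow when its anomaly score exceeds the threshold."""
    return anomaly_score(x, x_hat) > tau_a
```

By construction, roughly 5% of benign validation flows exceed a 95th-percentile threshold, so the threshold choice directly sets the expected false-alarm rate of the AE stage on benign traffic.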

3. Markov Decision Process Formulation and Reward Design

The core DRL mechanism of DeepEdgeIDS is modeled as a Markov Decision Process (MDP) defined by the state space, action space, and multi-objective reward function below; the policy is optimized against that reward:

State space: $s_t = \{\text{P}_{\text{rate}}, \text{SYN}_{\text{count}}, \text{ACK}_{\text{count}}, A_s, h_t\}$.
Action space: the four discrete mitigation actions $a_1, \dots, a_4$ listed in Section 1.

The multi-objective reward function is defined as:

$R(s_t, a_t) = \alpha \cdot DR_t - \beta \cdot FPR_t - \lambda_L \cdot L_{\text{resp}, t} - \delta \cdot E_{\text{overhead}, t} - \epsilon \cdot M_{\text{util}, t} - \zeta \cdot C_{\text{emission}, t},$

where:

$DR_t$ / $FPR_t$: detection and false positive rates,
$L_{\text{resp}, t}$: response latency,
$E_{\text{overhead}, t} = P_t \Delta t$: energy consumed,
$M_{\text{util}, t} = M_{\text{active}} / M_{\text{total}}$: memory utilization,
$C_{\text{emission}, t} \approx \kappa_t E_{\text{overhead}, t}$: carbon footprint, with $\kappa_t$ as local grid carbon intensity.

This formulation ensures continuous adaptation toward high accuracy, low resource usage, minimal latency, and explicit minimization of carbon impact at the IoT edge.
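The reward above translates almost term by term into code. In the sketch below the weight values are placeholders chosen only for illustration (the source does not report $\alpha, \beta, \lambda_L, \delta, \epsilon, \zeta$), and $\kappa_t$ is assumed to be expressed per joule so that it multiplies $E_{\text{overhead}, t}$ directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Placeholder values: the source does not report the actual weights.
    alpha: float = 1.0      # detection rate
    beta: float = 1.0       # false positive rate
    lambda_l: float = 0.1   # response latency
    delta: float = 0.01     # energy overhead
    epsilon: float = 0.1    # memory utilization (not the exploration rate)
    zeta: float = 0.01      # carbon emission


def reward(dr, fpr, latency_s, power_w, dt_s, mem_active, mem_total,
           kappa, w=RewardWeights()):
    """Multi-objective reward R(s_t, a_t) with derived energy, memory, and carbon terms."""
    energy_j = power_w * dt_s            # E_overhead = P_t * dt
    mem_util = mem_active / mem_total    # M_util = M_active / M_total
    carbon = kappa * energy_j            # C_emission ~ kappa_t * E_overhead
    return (w.alpha * dr
            - w.beta * fpr
            - w.lambda_l * latency_s
            - w.delta * energy_j
            - w.epsilon * mem_util
            - w.zeta * carbon)
```

Because $\kappa_t$ enters only through the last term, the same action earns a lower reward when the grid is more carbon-intensive, which is how the policy becomes carbon-aware without any change to the state definition.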

4. DRL Algorithmic Structure and Network Implementation

DeepEdgeIDS employs a DQN policy trained toward the one-step Bellman target:

$Q(s_t, a_t) \leftarrow R(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a'),$

with hyperparameters:

Discount $\gamma = 0.99$,
Learning rate $\eta \approx 1 \times 10^{-3}$ (Adam optimizer),
Exploration rate $\epsilon$ annealed from 1.0 to 0.01 (not to be confused with the reward weight $\epsilon$ in Section 3),
Experience replay buffer: 50,000 transitions; batch size: 64,
Convergence typically achieved after $2.5 \times 10^5$ steps.
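A small sketch of how these hyperparameters enter training, under two labeled assumptions: the annealing schedule is taken to be linear, and its horizon is set equal to the reported convergence point of $2.5 \times 10^5$ steps. The source specifies only the start and end values of $\epsilon$, so both choices are illustrative.

```python
GAMMA = 0.99


def td_target(reward, next_q_values, gamma=GAMMA, terminal=False):
    """One-step Bellman target: R + gamma * max_a' Q(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def td_loss(q_sa, target):
    """Squared TD error that the Q-network is trained to minimize."""
    return (target - q_sa) ** 2


def linear_epsilon(step, start=1.0, end=0.01, decay_steps=250_000):
    """Assumed linear annealing of the exploration rate, clamped at `end`."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return start + frac * (end - start)
```

For example, with a reward of 1.0 and next-state Q-values `[0.0, 2.0, 1.0, -1.0]`, the target is $1.0 + 0.99 \cdot 2.0 = 2.98$; halfway through the assumed schedule the exploration rate is about 0.505.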

The network architecture comprises:

Input: dimension matching $s_t$ (≈8 Bot-IoT features + latent dimension),
Two fully connected hidden layers (e.g., 128 units, ReLU),
Output: $|A| = 4$ linear Q-values.
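To make the shape of this network concrete, here is a dependency-free forward pass. It is only a sketch: the state dimension of 16 (8 features plus a latent size of 8) is a hypothetical choice, since the latent dimension is not stated, and the He-style random initialization is an assumption as well.

```python
import random


def init_layer(n_in, n_out, rng):
    """Random weights (He-style scale, an assumption) and zero biases."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, relu):
    """Affine map followed by an optional ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def build_q_network(state_dim, hidden=128, n_actions=4, seed=0):
    """Two hidden layers of `hidden` units and a linear output of n_actions Q-values."""
    rng = random.Random(seed)
    sizes = [state_dim, hidden, hidden, n_actions]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def q_values(network, state):
    """Forward pass: ReLU on hidden layers, linear on the output layer."""
    x = state
    for i, layer in enumerate(network):
        x = dense(x, layer, relu=(i < len(network) - 1))
    return x


def parameter_count(network):
    return sum(len(w) * len(w[0]) + len(b) for w, b in network)
```

With a 16-dimensional state this gives $(16 \cdot 128 + 128) + (128 \cdot 128 + 128) + (128 \cdot 4 + 4) = 19{,}204$ parameters, small enough that inference on a Raspberry Pi-class gateway is plausible.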

5. Experimental Setup and Evaluation Methodology

Experiments were conducted on edge hardware (Raspberry Pi 4, 1.5 GHz quad-core, 8 GB RAM) with ESP32 IoT sensor nodes. The Bot-IoT dataset, originally 80 features, was reduced to 8 via variance analysis, Pearson correlation filtering (threshold 0.85), and recursive feature elimination (retaining 95% of predictive power). The threshold $\tau_a$ was set using the AE reconstruction error distribution on benign validation traffic.
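The correlation-filtering stage of that pipeline might look like the sketch below. It assumes the 0.85 figure is a cutoff on the absolute Pearson coefficient and uses a simple greedy keep-first rule; the source does not describe the exact procedure, and the variance and recursive-elimination stages are not shown.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def drop_correlated(columns, threshold=0.85):
    """Greedily keep a feature only if |r| <= threshold against every kept feature."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy columns with illustrative names; "bytes" duplicates "pkts" up to scale.
toy = {
    "pkts": [1, 2, 3, 4, 5],
    "bytes": [2, 4, 6, 8, 10],
    "dur": [5, 1, 4, 2, 3],
}
selected = drop_correlated(toy)
```

In the toy data, `bytes` is perfectly correlated with `pkts` and is dropped, while `dur` (correlation of -0.3 with `pkts`) is retained. The greedy rule depends on column order, so a real pipeline would usually rank features first.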

Performance metrics and resource utilization were evaluated in both training and live deployment regimes. Statistical significance was assessed via ANOVA, highlighting differences in latency (F = 75.62, p < 0.05, $\eta^2 = 0.577$) and detection probability (F = 67.89, p < 0.05).
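For readers unfamiliar with the reported statistics, the following sketch shows how a one-way ANOVA F statistic and the effect size $\eta^2$ are derived from grouped measurements. The study's group structure and raw samples are not given, so the function is generic and the inputs in any usage are toy values; the p-value is omitted because it requires the F distribution, which the Python standard library does not provide.

```python
def one_way_anova(groups):
    """Return (F statistic, eta squared) for a list of sample groups."""
    k = len(groups)
    if k < 2:
        raise ValueError("need at least two groups")
    all_values = [v for g in groups for v in g]
    n = len(all_values)
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation explained by group membership.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Residual variation inside each group.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    eta_sq = ss_between / (ss_between + ss_within)
    return f_stat, eta_sq
```

On the toy groups `[[1, 2, 3], [4, 5, 6]]` this gives F = 13.5 and $\eta^2 \approx 0.771$. An $\eta^2$ of 0.577, as reported for latency, means that about 58% of the variance in that metric is attributable to the grouping factor.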

6. Quantitative Results

DeepEdgeIDS achieves superior metrics relative to supervised DRL alternatives:

Metric	DeepEdgeIDS	AutoDRL-IDS
Accuracy	98.0%	94.0%
Precision	92.4%	—
Recall	97.6%	—
F1-score	94.9%	—
Live accuracy	97.0%	—
Live F1-score	93.0%	—
Missed packets/hr	2,640	12,740
Peak inference latency	9.2 ms	—
Response under DDoS	0.65 s	0.50 s
CPU usage (normal)	12.1%	—
CPU usage (attack)	39.6%	—
Peak energy (attack)	138.4 J	—

Under sustained DDoS, DeepEdgeIDS maintains detection probability of 97.6% versus 92.0% for AutoDRL-IDS. The trade-off is a ~20% higher peak CPU workload and approximately 8 J extra energy usage under peak attack, with carbon emissions $C_t = \kappa_t E_t$ computed from the local grid carbon intensity $\kappa_t$.
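Two of the reported figures can be turned into quick worked numbers. The sketch below computes the relative reduction in missed packets from the table and converts the 138.4 J peak energy into emissions via $C_t = \kappa_t E_t$. The carbon intensity of 400 gCO2/kWh is a hypothetical value chosen only for illustration; the source does not state the $\kappa_t$ it used.

```python
JOULES_PER_KWH = 3.6e6


def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


def carbon_grams(energy_j, kappa_g_per_kwh):
    """C = kappa * E, with E converted from joules to kWh."""
    return kappa_g_per_kwh * energy_j / JOULES_PER_KWH


# Missed packets per hour: 12,740 (AutoDRL-IDS) vs 2,640 (DeepEdgeIDS).
missed_reduction = relative_reduction(12_740, 2_640)

# Peak energy under attack, with an assumed (hypothetical) grid intensity.
peak_carbon_g = carbon_grams(138.4, kappa_g_per_kwh=400.0)

print(f"missed-packet reduction: {missed_reduction:.1%}")
print(f"peak emissions at 400 g/kWh: {peak_carbon_g:.4f} g CO2")
```

The missed-packet reduction works out to about 79.3%, and at the assumed intensity the 138.4 J peak corresponds to roughly 0.015 g of CO2. The absolute number is tiny for a single gateway, so the carbon term matters mainly in aggregate across large deployments or over long operating periods.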

7. Comparative Adaptability and Sustainability Guarantees

DeepEdgeIDS’s unsupervised AE–DQN hybrid design enables faster adaptation to zero-day attacks through direct encoding of anomaly scores and latent representations into the DRL policy. The source attributes robust policy convergence under nonstationary traffic patterns to Bellman contraction and Lyapunov stability arguments. These properties yield significantly lower missed detection rates and faster adaptation relative to supervised DRL approaches.

The system’s explicit carbon-aware reward structure (the $-\zeta \cdot C_{\text{emission}, t}$ term) and multi-objective Lagrangian optimization enforce a Pareto-optimal balance among detection rate, energy usage, and carbon emissions. DeepEdgeIDS achieves 40% lower response latency (p < 0.05) at the cost of higher computational effort, but with demonstrably bounded carbon and energy footprints, establishing it as a sustainable solution for real-time IoT edge IDS deployment (Jamshidi et al., 23 Nov 2025).

References

Jamshidi et al. (2025). Carbon-Aware Intrusion Detection: A Comparative Study of Supervised and Unsupervised DRL for Sustainable IoT Edge Gateways.
