
Dynamic Reward Scaling for Anomaly Detection

Updated 22 November 2025
  • The paper presents a unified DRSMT framework that integrates VAE-based noise filtering, LSTM-DQN sequential decision making, and dynamic reward scaling for enhanced anomaly detection.
  • The methodology leverages generative modeling and adaptive reward shaping to balance exploration and exploitation in processing multivariate sensor data.
  • Empirical results on the SMD and WADI datasets show improved F1 scores and precision over baselines, underscoring the framework's efficiency in sparsely labeled scenarios.

Dynamic Reward Scaling for Multivariate Time Series Anomaly Detection (DRSMT) is a unified deep reinforcement learning framework that addresses the challenges of detecting anomalies in high-dimensional, multi-sensor time series data. The method is specifically designed for environments where labeled anomalies are scarce, sensor dependencies are subtle, and scalable, accurate monitoring is required. DRSMT integrates generative modeling, sequential decision processes, adaptive reward shaping, and selective data labeling to deliver state-of-the-art performance on industrial and critical infrastructure datasets (Golchin et al., 15 Nov 2025).

1. Framework Components and Workflow

The DRSMT architecture consists of four core modules: a Variational Autoencoder (VAE), an LSTM-based Deep Q-Network (DQN), a dynamic reward scaling mechanism, and an active learning loop. The interaction of these components proceeds as follows:

  1. At each time step $t$, a sliding window $s_t \in \mathbb{R}^{N_\mathrm{STEPS} \times d}$ containing synchronized measurements from $d$ sensors is extracted.
  2. The window is flattened and processed through the VAE encoder–decoder, producing a reconstruction $\hat{x}_t$ and reconstruction loss $R_2(s_t) = \|x_t - \hat{x}_t\|^2$.
  3. Simultaneously, the window (with an action-indicator bit) is input to the LSTM-DQN, which outputs action-values $Q(s_t, 0)$ and $Q(s_t, 1)$.
  4. An action $a_t \in \{0~\mathrm{(normal)},\, 1~\mathrm{(anomaly)}\}$ is selected using the $\varepsilon$-greedy policy.
  5. The environment provides the true label $y_t \in \{0,1\}$, and an extrinsic classification reward $R_1(s_t, a_t)$ is computed as:

$$R_1(s_t, a_t) = \begin{cases} +10 & a_t = 1,\ y_t = 1\ (\mathrm{TP}) \\ +1 & a_t = 0,\ y_t = 0\ (\mathrm{TN}) \\ -1 & a_t = 1,\ y_t = 0\ (\mathrm{FP}) \\ -10 & a_t = 0,\ y_t = 1\ (\mathrm{FN}) \end{cases}$$

  6. The dynamic reward module scales the VAE penalty with a time-dependent coefficient $\lambda(t)$, giving the total reward:

$$R_\mathrm{total}(s_t, a_t) = R_1(s_t, a_t) + \lambda(t)\, R_2(s_t)$$

  7. The tuple $(s_t, a_t, R_\mathrm{total}, s_{t+1})$ is stored in replay memory, and the DQN is updated with standard Bellman backups.
  8. After each episode, $\lambda(t)$ is adjusted to optimize the exploration–exploitation trade-off. The active learning loop periodically selects the $K_\mathrm{AL}$ most uncertain windows for human labeling and $K_\mathrm{LP}$ for pseudo-labeling, with the results injected into the replay buffer (Golchin et al., 15 Nov 2025). A minimal code sketch of the reward computation in steps 5–6 follows this list.
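The reward logic of steps 5–6 can be written compactly. The following is a minimal sketch in Python, not the authors' released code: the constants come from the reward table above, `lambda_t` is supplied by the dynamic scaling module of Section 4, and the function names are illustrative.

```python
import numpy as np

# Extrinsic classification reward R1, matching the TP/TN/FP/FN table above.
def extrinsic_reward(action: int, label: int) -> float:
    if action == 1 and label == 1:   # true positive
        return 10.0
    if action == 0 and label == 0:   # true negative
        return 1.0
    if action == 1 and label == 0:   # false positive
        return -1.0
    return -10.0                     # false negative (action == 0, label == 1)

# Total reward: extrinsic term plus the VAE reconstruction penalty R2,
# scaled by the time-dependent coefficient lambda_t.
def total_reward(action: int, label: int, window: np.ndarray,
                 reconstruction: np.ndarray, lambda_t: float) -> float:
    r2 = float(np.sum((window - reconstruction) ** 2))  # R2(s_t) = ||x_t - x_hat_t||^2
    return extrinsic_reward(action, label) + lambda_t * r2
```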

2. Variational Autoencoder Design

The VAE receives a flattened window $x \in \mathbb{R}^{n \cdot d}$:

  • Encoder: Two fully connected layers (128 ReLU units each) output mean $\mu(x) \in \mathbb{R}^k$ and log-variance $\log \sigma^2(x) \in \mathbb{R}^k$.
  • Reparameterization: $z = \mu(x) + \sigma(x) \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
  • Decoder: Mirror of the encoder layers, outputs $\hat{x} \in \mathbb{R}^{n \cdot d}$.
  • Loss (negative ELBO: mean squared reconstruction penalty plus KL divergence):

$$\mathcal{L}_\mathrm{VAE}(\theta, \phi; x) = \|x - \hat{x}\|^2 + \frac{1}{2}\sum_{i=1}^k \left( \mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1 \right)$$

By compressing high-dimensional time windows into a latent manifold, the VAE acts as a noise filter and models cross-sensor dependencies critical for anomaly detection (Golchin et al., 15 Nov 2025).
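For concreteness, the encoder–decoder described above can be sketched in PyTorch. This is an illustrative implementation consistent with the stated layer sizes (two 128-unit ReLU layers and a mirrored decoder), not the authors' released code; the input dimension $n \cdot d$ and latent dimension $k$ are assumed to be configuration parameters.

```python
import torch
import torch.nn as nn

class WindowVAE(nn.Module):
    """Illustrative VAE over a flattened window x in R^(n*d)."""
    def __init__(self, input_dim: int, latent_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(               # mirror of the encoder
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    """Squared reconstruction error plus KL divergence to N(0, I)."""
    recon = ((x - x_hat) ** 2).sum(dim=-1)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1)
    return (recon + kl).mean()
```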

3. LSTM Deep Q-Network Approach

  • State: $s_t = [x_{t-N_\mathrm{STEPS}+1}, \dots, x_t] \in \mathbb{R}^{N_\mathrm{STEPS} \times d}$, with optional action-conditioning ($s_t^a$).
  • Action: $a_t \in \{0, 1\}$, with “0” as “normal” and “1” as “anomaly”.
  • Reward: Total reward as defined above, mixing extrinsic and intrinsic signals.
  • Architecture: LSTM layer (64 hidden units), fully connected output producing $Q(s_t, 0)$ and $Q(s_t, 1)$.
  • Training:
    • Discount factor: $\gamma = 0.99$.
    • Adam optimizer with learning rate $\alpha_\mathrm{DQN} = 10^{-4}$.
    • Replay buffer: $10^6$ samples, batch size 128.
    • Target network sync every 1000 steps.
    • $\varepsilon$-greedy strategy ($\varepsilon$ decayed from 1.0 to 0.1 over $10^4$ steps).
    • Bellman update:

    $$y^{(i)} = r^{(i)} + \gamma \max_{a'} Q_\mathrm{target}(s'^{(i)}, a')$$

    with mean squared error loss over the minibatch (Golchin et al., 15 Nov 2025). A minimal code sketch of the Q-network and update follows.
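This is an illustrative PyTorch implementation consistent with the hyperparameters above, not the authors' code; terminal-state masking and the replay-buffer plumbing are omitted for brevity, and the minibatch layout is an assumption.

```python
import torch
import torch.nn as nn

class LSTMQNetwork(nn.Module):
    """Illustrative LSTM-DQN: maps a window of d-dimensional readings to Q(s,0), Q(s,1)."""
    def __init__(self, n_sensors: int, hidden_dim: int = 64, n_actions: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_sensors, hidden_size=hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, windows: torch.Tensor) -> torch.Tensor:
        # windows: (batch, N_STEPS, d); the final LSTM hidden state feeds the Q-value head.
        _, (h_n, _) = self.lstm(windows)
        return self.head(h_n[-1])                       # (batch, n_actions)

def dqn_loss(q_net: nn.Module, target_net: nn.Module, batch, gamma: float = 0.99):
    """One Bellman backup over a replay minibatch (s, a, r_total, s')."""
    s, a, r, s_next = batch                             # a: long tensor of 0/1 actions
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                               # target network held fixed
        y = r + gamma * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, y)
```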

4. Dynamic Reward Scaling and Exploration–Exploitation

DRSMT leverages dynamic reward scaling to balance exploration (via VAE reconstruction error) and exploitation (via classification accuracy):

  • Dynamic $\lambda(t)$ Update:

$$\lambda_{t+1} = \mathrm{clip}\left( \lambda_t + \alpha_\lambda \left( R_\mathrm{target} - R^\mathrm{episode} \right),\ \lambda_\mathrm{min},\ \lambda_\mathrm{max} \right)$$

    • If $R^\mathrm{episode} < R_\mathrm{target}$, increase $\lambda$ (prioritize exploration).
    • If $R^\mathrm{episode} \geq R_\mathrm{target}$, decrease $\lambda$ (focus on exploitation).

  • Exploration–Exploitation:
    • Early training: a high $\lambda$ places a stronger emphasis on the intrinsic (VAE) reward for broad exploration.
    • Later training: as $\lambda$ decays, the system exploits extrinsic classification performance, refining the anomaly boundary.

This proportional control on the reward signal adapts exploration to the relative abundance or scarcity of anomalies, a crucial property in sparse anomaly regimes (Golchin et al., 15 Nov 2025); similar mechanisms have been analyzed theoretically in related work (Golchin et al., 25 Aug 2025).
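A minimal sketch of this proportional update, assuming the target return, step size $\alpha_\lambda$, and clipping bounds are configuration values (the defaults below are illustrative, not taken from the paper):

```python
def update_lambda(lambda_t: float, episode_return: float, target_return: float,
                  alpha_lambda: float = 0.01,
                  lambda_min: float = 0.0, lambda_max: float = 1.0) -> float:
    """Proportional-control update of the reward-scaling coefficient lambda.

    lambda increases when the episode return falls short of the target
    (more weight on the intrinsic VAE reward, i.e. exploration) and
    decreases when the target is met or exceeded (exploitation).
    The step size and clipping bounds here are illustrative defaults.
    """
    new_lambda = lambda_t + alpha_lambda * (target_return - episode_return)
    return max(lambda_min, min(new_lambda, lambda_max))
```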

5. Active Learning and Label Budget Efficiency

The active learning strategy targets labeling efficiency under limited human supervision:

  1. Uncertainty Scoring: For every unlabeled window $s$, compute the margin $\mathrm{Margin}(s) = |Q(s,0) - Q(s,1)|$.
  2. Sample Selection: Rank windows by margin; select the $K_\mathrm{AL}$ lowest-margin (most uncertain) windows for human labeling.
  3. Pseudo-Labeling: The next $K_\mathrm{LP}$ windows are labeled using label spreading:

$$P(y_i \mid x_i) = \frac{\sum_{j \in \mathcal{L}} w_{ij}\, P(y_j \mid x_j)}{\sum_j w_{ij}}$$

with similarity weights $w_{ij} = \exp(-\|x_i - x_j\|^2 / \sigma^2)$.

  4. All newly labeled samples are stored in the replay buffer.

The heuristic configuration is a 5% labeling budget for $K_\mathrm{AL}$ and 5% for $K_\mathrm{LP}$ per episode, which keeps the supervision budget modest while accelerating convergence and improving robustness (Golchin et al., 15 Nov 2025).
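A minimal sketch of the margin-based selection step, assuming the Q-network returns a `(num_windows, 2)` tensor of action-values for a batch of unlabeled windows; the label-spreading step itself is omitted:

```python
import torch

def select_for_labeling(q_net, unlabeled_windows: torch.Tensor, k_al: int, k_lp: int):
    """Rank unlabeled windows by the Q-value margin |Q(s,0) - Q(s,1)| and split the
    most uncertain ones between human labeling (k_al) and pseudo-labeling (k_lp)."""
    with torch.no_grad():
        q = q_net(unlabeled_windows)        # (num_windows, 2)
    margins = (q[:, 0] - q[:, 1]).abs()
    order = torch.argsort(margins)          # smallest margin = most uncertain
    human_idx = order[:k_al]                # queried for human labels
    pseudo_idx = order[k_al:k_al + k_lp]    # labeled via label spreading
    return human_idx, pseudo_idx
```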

6. Empirical Results and Benchmark Comparisons

Experiments on the SMD and WADI multivariate datasets illustrate DRSMT’s performance gains:

Model   Metric      SMD      WADI
CARLA   F1          0.5114   0.2953
DRSMT   Precision   0.9608   0.1971
DRSMT   Recall      0.5733   0.7539
DRSMT   F1          0.7181   0.3125
DRSMT   AU-PR       0.5712   0.1290
  • Datasets:
    • SMD: 708,405 training points, 708,420 test points, 4.16% anomalies, 38 sensors.
    • WADI: 784,568 training points, 172,801 test points, 5.77% anomalies, 123 sensors.
  • Baselines: LSTM-VAE, OmniAnomaly, MTAD-GAT, AnomalyTransformer, TS2Vec, DCDetector, TimesNet, Random, CARLA.
  • Metrics: Precision, recall, F1, AU-PR.
  • Performance: DRSMT achieves F1 scores of 0.7181 (SMD) and 0.3125 (WADI), outperforming the previous best baseline, CARLA (F1 0.5114 and 0.2953, respectively).

Ablation studies confirm that omitting dynamic $\lambda(t)$ or active learning substantially degrades accuracy and convergence rate (Golchin et al., 15 Nov 2025).

7. Relation to Prior Work and Theoretical Context

DRSMT builds on the principle of intrinsic–extrinsic reward shaping, extending dynamic reward scaling techniques from recent reinforcement learning approaches to time series anomaly detection. For related approaches in univariate/multivariate settings with similar dynamic control of intrinsic (reconstruction) and extrinsic (classification) rewards, see DRTA (Golchin et al., 25 Aug 2025). Both frameworks use a proportional-control-style adjustment of the reward-scaling coefficient to balance exploration and exploitation, combined with VAE-based generative modeling and active learning for label efficiency.

DRSMT distinguishes itself through its explicit LSTM-DQN sequential decision module, its integration of dynamic reward scaling with a temporally adaptive $\lambda(t)$, and its empirical evaluation on complex, high-dimensional industrial datasets. The benefit is robust, scalable anomaly detection under limited labeled data and in the presence of complex sensor dependencies (Golchin et al., 15 Nov 2025).


References:

  • "Dynamic Reward Scaling for Multivariate Time Series Anomaly Detection: A VAE-Enhanced Reinforcement Learning Approach" (Golchin et al., 15 Nov 2025)
  • "DRTA: Dynamic Reward Scaling for Reinforcement Learning in Time Series Anomaly Detection" (Golchin et al., 25 Aug 2025)