
Dynamic Reward Scaling for Anomaly Detection

Updated 22 November 2025
  • The paper presents a unified DRSMT framework that integrates VAE-based noise filtering, LSTM-DQN sequential decision making, and dynamic reward scaling for enhanced anomaly detection.
  • The methodology leverages generative modeling and adaptive reward shaping to balance exploration and exploitation in processing multivariate sensor data.
  • Empirical results on the SMD and WADI datasets show improved F1 scores and precision over baselines, underscoring the framework's efficiency in sparsely labeled scenarios.

Dynamic Reward Scaling for Multivariate Time Series Anomaly Detection (DRSMT) is a unified deep reinforcement learning framework that addresses the challenges of detecting anomalies in high-dimensional, multi-sensor time series data. The method is specifically designed for environments where labeled anomalies are scarce, sensor dependencies are subtle, and scalable, accurate monitoring is required. DRSMT integrates generative modeling, sequential decision processes, adaptive reward shaping, and selective data labeling to deliver state-of-the-art performance on industrial and critical infrastructure datasets (Golchin et al., 15 Nov 2025).

1. Framework Components and Workflow

The DRSMT architecture consists of four core modules: a Variational Autoencoder (VAE), an LSTM-based Deep Q-Network (DQN), a dynamic reward scaling mechanism, and an active learning loop. The interaction of these components proceeds as follows:

  1. At each time step $t$, a sliding window $s_t \in \mathbb{R}^{N_\mathrm{STEPS} \times d}$ containing synchronized measurements from $d$ sensors is extracted.
  2. The window is flattened and processed through the VAE encoder–decoder, producing a reconstruction $\hat{x}_t$ and reconstruction loss $R_2(s_t) = \|x_t - \hat{x}_t\|^2$.
  3. Simultaneously, the window (with an action-indicator bit) is input to the LSTM-DQN, which outputs action-values $Q(s_t, 0)$ and $Q(s_t, 1)$.
  4. An action $a_t \in \{0~\mathrm{(normal)},\, 1~\mathrm{(anomaly)}\}$ is selected using the $\varepsilon$-greedy policy.
  5. The environment provides the true label $y_t \in \{0,1\}$, and an extrinsic classification reward $R_1(s_t, a_t)$ is computed as:

$$R_1(s_t, a_t) = \begin{cases} +10 & a_t = 1,\ y_t = 1\ (\mathrm{TP}) \\ +1 & a_t = 0,\ y_t = 0\ (\mathrm{TN}) \\ -1 & a_t = 1,\ y_t = 0\ (\mathrm{FP}) \\ -10 & a_t = 0,\ y_t = 1\ (\mathrm{FN}) \end{cases}$$

  6. The dynamic reward module scales the VAE penalty with a time-dependent coefficient $\lambda(t)$, giving the total reward:

$$R_\mathrm{total}(s_t, a_t) = R_1(s_t, a_t) + \lambda(t)\, R_2(s_t)$$

  7. The tuple $(s_t, a_t, R_\mathrm{total}, s_{t+1})$ is stored in replay memory, and the DQN is updated with standard Bellman backups.
  8. After each episode, $\lambda(t)$ is adjusted to optimize the exploration–exploitation trade-off. The active learning loop periodically selects the $K_\mathrm{AL}$ most uncertain windows for human labeling and $K_\mathrm{LP}$ for pseudo-labeling, with the results injected into the replay buffer (Golchin et al., 15 Nov 2025). A minimal code sketch of the reward computation in steps 5–6 follows this list.
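The reward logic of steps 5–6 can be written compactly. The following is a minimal sketch in Python, not the authors' released code: the constants come from the reward table above, `lambda_t` is supplied by the dynamic scaling module of Section 4, and the function names are illustrative.

```python
import numpy as np

# Extrinsic classification reward R1, matching the TP/TN/FP/FN table above.
def extrinsic_reward(action: int, label: int) -> float:
    if action == 1 and label == 1:   # true positive
        return 10.0
    if action == 0 and label == 0:   # true negative
        return 1.0
    if action == 1 and label == 0:   # false positive
        return -1.0
    return -10.0                     # false negative (action == 0, label == 1)

# Total reward: extrinsic term plus the VAE reconstruction penalty R2,
# scaled by the time-dependent coefficient lambda_t.
def total_reward(action: int, label: int, window: np.ndarray,
                 reconstruction: np.ndarray, lambda_t: float) -> float:
    r2 = float(np.sum((window - reconstruction) ** 2))  # R2(s_t) = ||x_t - x_hat_t||^2
    return extrinsic_reward(action, label) + lambda_t * r2
```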

2. Variational Autoencoder Design

The VAE receives a flattened window $x \in \mathbb{R}^{n \cdot d}$:

  • Encoder: Two fully connected layers (128 ReLU units each) output mean $\mu(x) \in \mathbb{R}^k$ and log-variance $\log \sigma^2(x) \in \mathbb{R}^k$.
  • Reparameterization: $z = \mu(x) + \sigma(x) \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
  • Decoder: Mirror of the encoder layers, outputs $\hat{x} \in \mathbb{R}^{n \cdot d}$.
  • Loss (negative ELBO: mean squared reconstruction penalty plus KL divergence):

$$\mathcal{L}_\mathrm{VAE}(\theta, \phi; x) = \|x - \hat{x}\|^2 + \frac{1}{2}\sum_{i=1}^k \left( \mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1 \right)$$

By compressing high-dimensional time windows into a latent manifold, the VAE acts as a noise filter and models cross-sensor dependencies critical for anomaly detection (Golchin et al., 15 Nov 2025).
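For concreteness, the encoder–decoder described above can be sketched in PyTorch. This is an illustrative implementation consistent with the stated layer sizes (two 128-unit ReLU layers and a mirrored decoder), not the authors' released code; the input dimension $n \cdot d$ and latent dimension $k$ are assumed to be configuration parameters.

```python
import torch
import torch.nn as nn

class WindowVAE(nn.Module):
    """Illustrative VAE over a flattened window x in R^(n*d)."""
    def __init__(self, input_dim: int, latent_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(               # mirror of the encoder
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    """Squared reconstruction error plus KL divergence to N(0, I)."""
    recon = ((x - x_hat) ** 2).sum(dim=-1)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1)
    return (recon + kl).mean()
```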

3. LSTM Deep Q-Network Approach

  • State: $s_t = [x_{t-N_\mathrm{STEPS}+1}, \dots, x_t] \in \mathbb{R}^{N_\mathrm{STEPS} \times d}$, with optional action-conditioning ($s_t^a$).
  • Action: $a_t \in \{0, 1\}$, with “0” as “normal” and “1” as “anomaly”.
  • Reward: Total reward as defined above, mixing extrinsic and intrinsic signals.
  • Architecture: LSTM layer (64 hidden units), fully connected output producing $Q(s_t, 0)$ and $Q(s_t, 1)$.
  • Training:
    • Discount factor: $\gamma = 0.99$.
    • Adam optimizer with learning rate $\alpha_\mathrm{DQN} = 10^{-4}$.
    • Replay buffer: $10^6$ samples, batch size 128.
    • Target network sync every 1000 steps.
    • $\varepsilon$-greedy strategy ($\varepsilon$ decayed from 1.0 to 0.1 over $10^4$ steps).
    • Bellman update:

    $$y^{(i)} = r^{(i)} + \gamma \max_{a'} Q_\mathrm{target}(s'^{(i)}, a')$$

    with mean squared error loss over the minibatch (Golchin et al., 15 Nov 2025). A minimal code sketch of the Q-network and update follows.
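This is an illustrative PyTorch implementation consistent with the hyperparameters above, not the authors' code; terminal-state masking and the replay-buffer plumbing are omitted for brevity, and the minibatch layout is an assumption.

```python
import torch
import torch.nn as nn

class LSTMQNetwork(nn.Module):
    """Illustrative LSTM-DQN: maps a window of d-dimensional readings to Q(s,0), Q(s,1)."""
    def __init__(self, n_sensors: int, hidden_dim: int = 64, n_actions: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_sensors, hidden_size=hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, windows: torch.Tensor) -> torch.Tensor:
        # windows: (batch, N_STEPS, d); the final LSTM hidden state feeds the Q-value head.
        _, (h_n, _) = self.lstm(windows)
        return self.head(h_n[-1])                       # (batch, n_actions)

def dqn_loss(q_net: nn.Module, target_net: nn.Module, batch, gamma: float = 0.99):
    """One Bellman backup over a replay minibatch (s, a, r_total, s')."""
    s, a, r, s_next = batch                             # a: long tensor of 0/1 actions
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                               # target network held fixed
        y = r + gamma * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, y)
```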

4. Dynamic Reward Scaling and Exploration–Exploitation

DRSMT leverages dynamic reward scaling to balance exploration (via VAE reconstruction error) and exploitation (via classification accuracy):

  • Dynamic $\lambda(t)$ Update:

$$\lambda_{t+1} = \mathrm{clip}\left( \lambda_t + \alpha_\lambda \left( R_\mathrm{target} - R^\mathrm{episode} \right),\ \lambda_\mathrm{min},\ \lambda_\mathrm{max} \right)$$

    • If $R^\mathrm{episode} < R_\mathrm{target}$, increase $\lambda$ (prioritize exploration).
    • If $R^\mathrm{episode} \geq R_\mathrm{target}$, decrease $\lambda$ (focus on exploitation).

  • Exploration–Exploitation:
    • Early training: a high $\lambda$ places a stronger emphasis on the intrinsic (VAE) reward for broad exploration.
    • Later training: as $\lambda$ decays, the system exploits extrinsic classification performance, refining the anomaly boundary.

This proportional control on the reward signal adapts exploration to the relative abundance or scarcity of anomalies, a crucial property in sparse anomaly regimes (Golchin et al., 15 Nov 2025); similar mechanisms have been analyzed theoretically in related work (Golchin et al., 25 Aug 2025).
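A minimal sketch of this proportional update, assuming the target return, step size $\alpha_\lambda$, and clipping bounds are configuration values (the defaults below are illustrative, not taken from the paper):

```python
def update_lambda(lambda_t: float, episode_return: float, target_return: float,
                  alpha_lambda: float = 0.01,
                  lambda_min: float = 0.0, lambda_max: float = 1.0) -> float:
    """Proportional-control update of the reward-scaling coefficient lambda.

    lambda increases when the episode return falls short of the target
    (more weight on the intrinsic VAE reward, i.e. exploration) and
    decreases when the target is met or exceeded (exploitation).
    The step size and clipping bounds here are illustrative defaults.
    """
    new_lambda = lambda_t + alpha_lambda * (target_return - episode_return)
    return max(lambda_min, min(new_lambda, lambda_max))
```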

5. Active Learning and Label Budget Efficiency

The active learning strategy targets labeling efficiency under limited human supervision:

  1. Uncertainty Scoring: For every unlabeled window $s$, compute the margin $\mathrm{Margin}(s) = |Q(s,0) - Q(s,1)|$.
  2. Sample Selection: Rank windows by margin; select the $K_\mathrm{AL}$ lowest-margin (most uncertain) windows for human labeling.
  3. Pseudo-Labeling: The next $K_\mathrm{LP}$ windows are labeled using label spreading:

$$P(y_i \mid x_i) = \frac{\sum_{j \in \mathcal{L}} w_{ij}\, P(y_j \mid x_j)}{\sum_j w_{ij}}$$

with similarity weights $w_{ij} = \exp(-\|x_i - x_j\|^2 / \sigma^2)$.

  4. All newly labeled samples are stored in the replay buffer.

The heuristic configuration is a 5% labeling budget for $K_\mathrm{AL}$ and 5% for $K_\mathrm{LP}$ per episode, which keeps the supervision budget modest while accelerating convergence and improving robustness (Golchin et al., 15 Nov 2025).
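A minimal sketch of the margin-based selection step, assuming the Q-network returns a `(num_windows, 2)` tensor of action-values for a batch of unlabeled windows; the label-spreading step itself is omitted:

```python
import torch

def select_for_labeling(q_net, unlabeled_windows: torch.Tensor, k_al: int, k_lp: int):
    """Rank unlabeled windows by the Q-value margin |Q(s,0) - Q(s,1)| and split the
    most uncertain ones between human labeling (k_al) and pseudo-labeling (k_lp)."""
    with torch.no_grad():
        q = q_net(unlabeled_windows)        # (num_windows, 2)
    margins = (q[:, 0] - q[:, 1]).abs()
    order = torch.argsort(margins)          # smallest margin = most uncertain
    human_idx = order[:k_al]                # queried for human labels
    pseudo_idx = order[k_al:k_al + k_lp]    # labeled via label spreading
    return human_idx, pseudo_idx
```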

6. Empirical Results and Benchmark Comparisons

Experiments on the SMD and WADI multivariate datasets illustrate DRSMT’s performance gains:

Model   Metric      SMD      WADI
CARLA   F1          0.5114   0.2953
DRSMT   Precision   0.9608   0.1971
DRSMT   Recall      0.5733   0.7539
DRSMT   F1          0.7181   0.3125
DRSMT   AU-PR       0.5712   0.1290
  • Datasets:
    • SMD: 708,405 training points, 708,420 test points, 4.16% anomalies, 38 sensors.
    • WADI: 784,568 training points, 172,801 test points, 5.77% anomalies, 123 sensors.
  • Baselines: LSTM-VAE, OmniAnomaly, MTAD-GAT, AnomalyTransformer, TS2Vec, DCDetector, TimesNet, Random, CARLA.
  • Metrics: Precision, recall, F1, AU-PR.
  • Performance: DRSMT achieves F1 scores of 0.7181 (SMD) and 0.3125 (WADI), outperforming the previous best baseline, CARLA (F1 0.5114 and 0.2953, respectively).

Ablation studies confirm that omitting dynamic $\lambda(t)$ or active learning substantially degrades accuracy and convergence rate (Golchin et al., 15 Nov 2025).

7. Relation to Prior Work and Theoretical Context

DRSMT builds on the principle of intrinsic–extrinsic reward shaping, extending dynamic reward scaling techniques from recent reinforcement learning approaches to time series anomaly detection. For related approaches in univariate/multivariate settings with similar dynamic control of intrinsic (reconstruction) and extrinsic (classification) rewards, see DRTA (Golchin et al., 25 Aug 2025). Both frameworks use a proportional-control-style adjustment of the reward-scaling coefficient to balance exploration and exploitation, combined with VAE-based generative modeling and active learning for label efficiency.

DRSMT distinguishes itself through its explicit LSTM-DQN sequential decision module, its integration of dynamic reward scaling with a temporally adaptive $\lambda(t)$, and its empirical evaluation on complex, high-dimensional industrial datasets. The benefit is robust, scalable anomaly detection under limited labeled data and in the presence of complex sensor dependencies (Golchin et al., 15 Nov 2025).


References:

  • "Dynamic Reward Scaling for Multivariate Time Series Anomaly Detection: A VAE-Enhanced Reinforcement Learning Approach" (Golchin et al., 15 Nov 2025)
  • "DRTA: Dynamic Reward Scaling for Reinforcement Learning in Time Series Anomaly Detection" (Golchin et al., 25 Aug 2025)